2406.07550
Report |
An Image is Worth 32 Tokens for Reconstruction and Generation |
Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen |
Recent advancements in generative models have highlighted the crucial role of
image tokenization in the efficient synthesis of high-resolution images.
Tokenization, which transforms images into latent representations, reduces
computational demands compared to directly processing pixels and enhances the
effectiveness and efficiency of the generation process. Prior methods, such as
VQGAN, typically utilize 2D latent grids with fixed downsampling factors.
However, these 2D tokenizations face challenges in managing the inherent
redundancies present in images, where adjacent regions frequently display
similarities. To overcome this issue, we introduce Transformer-based
1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images
into 1D latent sequences. TiTok provides a more compact latent representation,
yielding substantially more efficient and effective representations than
conventional techniques. For example, a 256 x 256 x 3 image can be reduced to
just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens
obtained by prior methods. Despite its compact nature, TiTok achieves
performance competitive with state-of-the-art approaches. Specifically, using the
same generator framework, TiTok attains 1.97 gFID, outperforming the MaskGIT
baseline significantly by 4.21 on the ImageNet 256 x 256 benchmark. The advantages
of TiTok become even more significant at higher resolution. On the
ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art
diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image
tokens by 64x, leading to a 410x faster generation process. Our best-performing
variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still
generating high-quality samples 74x faster. |
This paper introduces TiTok, a novel 1D tokenization method that represents images as compact 1D latent sequences for efficient image reconstruction and generation, breaking away from traditional 2D grid-based representations. |
Existing 2D tokenization methods struggle to handle redundancies in images, limiting their ability to create highly compressed representations. TiTok overcomes this by leveraging the inherent redundancy in images to achieve significantly more compact and efficient representations. |
TiTok utilizes a Vision Transformer (ViT) encoder and decoder with a vector quantizer. It encodes image patches concatenated with latent tokens into a 1D sequence, which is then quantized. A ViT decoder reconstructs the image from these quantized tokens and mask tokens. |
As few as 32 tokens can effectively represent an image, achieving comparable reconstruction performance to 2D methods using 256 tokens.
Scaling up the tokenizer model size allows for even more compact representations without sacrificing performance.
1D tokenization leads to faster and better generative training, achieving competitive FID scores with significantly reduced training and inference time. |
The current implementation primarily focuses on VQ tokenization and a Masked Transformer generator; exploring other tokenizer/generator combinations is left for future work.
While the paper demonstrates results on image data, extending the applicability of 1D tokenization to other modalities like video is a potential future direction. |
image tokenization, 1d representation, image generation, vision transformer, vector quantization |
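The token counts quoted in the TiTok abstract follow from simple grid arithmetic. The following is a minimal sketch, not TiTok's implementation: the downsampling factors of 16 and 8 are assumptions (the abstract gives only the resulting counts of 256 and 1024), and `grid_token_count` and `quantize` are hypothetical helpers. `quantize` illustrates a generic nearest-codebook vector quantizer with a toy codebook.

```python
def grid_token_count(height, width, downsample):
    """Number of tokens in a 2D latent grid with a fixed downsampling factor."""
    return (height // downsample) * (width // downsample)


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Squared Euclidean distance is an assumption; it is a common choice for VQ.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


# 2D grids for a 256 x 256 image (factors 16 and 8 are assumed here).
tokens_f16 = grid_token_count(256, 256, 16)  # 16 * 16 grid
tokens_f8 = grid_token_count(256, 256, 8)    # 32 * 32 grid

# A 1D tokenizer fixes the sequence length directly, independent of any grid.
num_latent_tokens = 32

# Toy codebook with 2-dimensional entries (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.9, 1.2), (-0.8, 0.7), (0.1, -0.1)]
token_ids = quantize(latents, codebook)
```

Under these assumptions the 2D grids yield 256 and 1024 tokens, so a 32-token sequence is 8x and 32x shorter respectively.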
2406.07547
Report |
Zero-shot Image Editing with Reference Imitation |
Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, Hengshuang Zhao |
Image editing serves as a practical yet challenging task considering the
diverse demands from users, where one of the hardest parts is to precisely
describe what the edited image should look like. In this work, we present a new
form of editing, termed imitative editing, to help users exercise their
creativity more conveniently. Concretely, to edit an image region of interest,
users are free to directly draw inspiration from some in-the-wild references
(e.g., related pictures they come across online), without having to cope with
the fit between the reference and the source. Such a design requires the system
to automatically figure out what to expect from the reference to perform the
editing. For this purpose, we propose a generative training framework, dubbed
MimicBrush, which randomly selects two frames from a video clip, masks some
regions of one frame, and learns to recover the masked regions using the
information from the other frame. That way, our model, developed from a
diffusion prior, is able to capture the semantic correspondence between
separate images in a self-supervised manner. We experimentally show the
effectiveness of our method under various test cases as well as its superiority
over existing alternatives. We also construct a benchmark to facilitate further
research. |
Introduces 'imitative editing', a new image editing paradigm where users simply provide a masked source image and an unmasked reference image, enabling editing by imitating corresponding parts from the reference. |
Addresses limitations of existing editing tools that rely heavily on textual descriptions or struggle with local component editing, providing a more convenient and intuitive editing experience. |
Presents MimicBrush, a framework trained on video frames using dual diffusion U-Nets. It learns to locate and imitate corresponding regions from a reference image to fill masked areas in a source image, ensuring harmonious blending. |
MimicBrush outperforms existing inpainting and composition methods qualitatively and quantitatively in terms of fidelity and harmonious blending.
A new benchmark for evaluating imitative editing is introduced, focusing on part composition and texture transfer tasks.
Ablation studies confirm the importance of video-based training, data augmentation, and the use of dual U-Net architecture for optimal performance. |
MimicBrush may struggle to identify the correct reference region when it is too small or when multiple similar candidates exist in the reference.
Future work will focus on improving reference region localization and extending MimicBrush's capabilities to handle multiple reference images. |
image editing, imitative editing, diffusion models, semantic correspondence, image composition |
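The self-supervised data recipe described in the MimicBrush abstract (pick two frames from a clip, mask regions of one, recover them from the other) can be sketched as follows. This is an illustrative sketch only: the function name, the patch-grid masking, and the `mask_ratio` parameter are assumptions, and the actual masking strategy is not specified in the text above.

```python
import random


def sample_training_pair(num_frames, grid_size, mask_ratio, rng):
    """Pick a (source, reference) frame pair and a patch mask for the source.

    Returns two distinct frame indices and a grid_size x grid_size boolean
    grid in which True marks a masked source patch.
    """
    # Two different frames from the same clip.
    source_idx, reference_idx = rng.sample(range(num_frames), 2)

    # Uniformly random patch mask (an assumption made for illustration).
    num_cells = grid_size * grid_size
    num_masked = int(round(mask_ratio * num_cells))
    masked_cells = set(rng.sample(range(num_cells), num_masked))

    mask = [
        [(row * grid_size + col) in masked_cells for col in range(grid_size)]
        for row in range(grid_size)
    ]
    return source_idx, reference_idx, mask


rng = random.Random(0)
source_idx, reference_idx, mask = sample_training_pair(
    num_frames=30, grid_size=4, mask_ratio=0.25, rng=rng
)
```

A model trained on such pairs would receive the masked source frame and the full reference frame, and be asked to reconstruct the masked patches.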
2406.07540
Report |
Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance |
Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou |
Recent controllable generation approaches such as FreeControl and Diffusion
Self-guidance bring fine-grained spatial and appearance control to
text-to-image (T2I) diffusion models without training auxiliary modules.
However, these methods optimize the latent embedding for each type of score
function with longer diffusion steps, making the generation process
time-consuming and limiting their flexibility and use. This work presents
Ctrl-X, a simple framework for T2I diffusion controlling structure and
appearance without additional training or guidance. Ctrl-X designs feed-forward
structure control to enable the structure alignment with a structure image and
semantic-aware appearance transfer to facilitate the appearance transfer from a
user-input image. Extensive qualitative and quantitative experiments illustrate
the superior performance of Ctrl-X on various condition inputs and model
checkpoints. In particular, Ctrl-X supports novel structure and appearance
control with arbitrary condition images of any modality, exhibits superior
image quality and appearance transfer compared to existing works, and provides
instant plug-and-play functionality to any T2I and text-to-video (T2V)
diffusion model. See our project page for an overview of the results:
https://genforce.github.io/ctrl-x |
Ctrl-X is a training-free and guidance-free framework for structure and appearance control of text-to-image and text-to-video diffusion models. |
Existing methods for controlling the structure and appearance of diffusion models often require extensive training or computationally expensive guidance techniques, limiting their flexibility and efficiency. |
Ctrl-X leverages feature injection and spatially-aware normalization in the attention layers of pretrained diffusion models to align generated images with user-provided structure and appearance images. Structure control is achieved through direct feature injection from a noisy structure latent, while appearance transfer utilizes self-attention correspondence to normalize output features with weighted feature statistics from a noisy appearance latent. |
Ctrl-X accurately preserves structure from various input types, including natural images, ControlNet-supported conditions, and in-the-wild conditions not possible with existing training-based methods.
Ctrl-X effectively transfers appearance from a given image, demonstrating superior performance compared to training-based and guidance-based methods, especially in challenging cases like cross-subject appearance transfer.
Being both training-free and guidance-free, Ctrl-X achieves competitive runtimes comparable to training-based methods while being significantly faster than other guidance-based and guidance-free approaches. |
The semantic-aware appearance transfer may struggle to capture target appearance when the subject is small due to the low resolution of the feature map.
While Ctrl-X inherits the same safeguards as the T2I and T2V models it builds upon, its accessibility could potentially be misused for malicious applications, raising ethical concerns regarding consent and artist credit. |
generative models, text-to-image synthesis, diffusion models, controllable image generation, appearance transfer |
2406.07537
Report |
Autoregressive Pretraining with Mamba in Vision |
Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie |
The vision community has started to build with the recently developed state
space model, Mamba, as the new backbone for a range of tasks. This paper shows
that Mamba's visual capability can be significantly enhanced through
autoregressive pretraining, a direction not previously explored.
Efficiency-wise, autoregressive pretraining capitalizes well on Mamba's
unidirectional recurrent structure, enabling faster overall training
compared to other training strategies like mask modeling. Performance-wise,
autoregressive pretraining equips the Mamba architecture with markedly higher
accuracy over its supervised-trained counterparts and, more importantly,
successfully unlocks its scaling potential to large and even huge model sizes.
For example, with autoregressive pretraining, a base-size Mamba attains 83.2%
ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our
huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet
accuracy (85.5% when finetuned with 384 x 384 inputs), notably surpassing
all other Mamba variants in vision. The code is available at
https://github.com/OliverRensu/ARM |
This paper introduces ARM, a novel autoregressive pretraining strategy tailored for Mamba architectures in computer vision, enhancing their visual capabilities, scalability, and benchmark performance. |
Prior Mamba architectures for vision, while promising, were limited in transferability and scalability and struggled to match the success of autoregressive pretraining in NLP. |
The paper introduces ARM, which leverages the inherent unidirectional nature of Mamba for efficient autoregressive pretraining, using clustered image patches as prediction units for enhanced performance. |
ARM significantly boosts ImageNet accuracy, with ARM-B achieving 83.2%, outperforming its supervised counterpart by 2.0% and previous Mamba variants.
ARM enables successful training of the largest vision Mamba model to date (ARM-H) reaching 85.0% accuracy on ImageNet.
ARM enhances robustness, with significant performance gains over supervised counterparts on out-of-domain ImageNet variants like ImageNet-A, ImageNet-R, and ImageNet-S. |
The study primarily focuses on image classification, leaving its application to other vision tasks for future work.
Exploring more complex pretraining strategies or incorporating additional data augmentations could further enhance ARM’s performance. |
autoregressive pretraining, vision mamba, self-supervised learning, image classification, computer vision |
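The ARM summary mentions clustered image patches as prediction units for autoregressive pretraining. A minimal sketch of how such next-cluster training pairs could be formed is given below. The grid size of 16, the cluster size of 4, the row-major cluster order, and both function names are illustrative assumptions, not details taken from the record above.

```python
def cluster_sequence(grid_size, cluster_size):
    """Group a grid_size x grid_size patch grid into square clusters.

    Returns clusters in row-major order; each cluster is a list of
    (row, col) patch coordinates.
    """
    if grid_size % cluster_size != 0:
        raise ValueError("grid_size must be divisible by cluster_size")
    clusters = []
    for top in range(0, grid_size, cluster_size):
        for left in range(0, grid_size, cluster_size):
            clusters.append([
                (top + dr, left + dc)
                for dr in range(cluster_size)
                for dc in range(cluster_size)
            ])
    return clusters


def next_cluster_pairs(clusters):
    """Build (context, target) pairs: cluster i is predicted from clusters 0..i-1."""
    return [(clusters[:i], clusters[i]) for i in range(1, len(clusters))]


clusters = cluster_sequence(grid_size=16, cluster_size=4)
pairs = next_cluster_pairs(clusters)
```

Because each target depends only on earlier clusters, the sequence can be consumed left to right, which is the property the abstract says suits Mamba's unidirectional recurrence.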
2406.07524
Report |
Simple and Effective Masked Diffusion Language Models |
Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov |
While diffusion models excel at generating high-quality images, prior work
reports a significant performance gap between diffusion and autoregressive (AR)
methods in language modeling. In this work, we show that simple masked discrete
diffusion is more performant than previously thought. We apply an effective
training recipe that improves the performance of masked diffusion models and
derive a simplified, Rao-Blackwellized objective that results in additional
improvements. Our objective has a simple form -- it is a mixture of classical
masked language modeling losses -- and can be used to train encoder-only
language models that admit efficient samplers, including ones that can generate
arbitrary lengths of text semi-autoregressively like a traditional language
model. On language modeling benchmarks, a range of masked diffusion models
trained with modern engineering practices achieves a new state-of-the-art among
diffusion models, and approaches AR perplexity. We release our code at:
https://github.com/kuleshov-group/mdlm |
The paper presents a well-engineered masked discrete diffusion language modeling (MDLM) framework that outperforms existing diffusion models on language modeling benchmarks, approaching the perplexity of autoregressive (AR) models. |
Diffusion models have the potential to improve long-term planning, controllable generation, and sampling speed in language modeling, but previous approaches exhibit a performance gap compared to AR models. |
The authors utilize a simplified, Rao-Blackwellized objective and a substitution-based parameterization of the reverse diffusion process, along with efficient samplers that support semi-autoregressive generation. |
MDLM achieves a new state-of-the-art among diffusion models on language modeling benchmarks, including One Billion Words and OpenWebText.
Simple engineering choices significantly improve the performance of MDLM and previously discounted baselines like D3PM.
The MDLM framework extends to non-language domains, achieving comparable or superior downstream performance to classical BERT-style training on DNA sequence modeling. |
MDLM perplexity remains slightly higher than AR models.
Future work includes exploring more sophisticated denoising network architectures and extending the framework to other discrete data domains. |
diffusion models, language modeling, rao-blackwellization, semi-autoregressive generation, dna sequence modeling |
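The MDLM abstract describes its objective as a mixture of classical masked language modeling losses. The sketch below shows one term of such a mixture at a single noise level `t`. The linear schedule (each token masked with probability `t`, giving a `1 / t` weight), the uniform toy "model", and all function names are assumptions made for illustration; the released code should be consulted for the actual objective.

```python
import math
import random


def mask_tokens(tokens, t, mask_id, rng):
    """Forward process: replace each token independently with mask_id w.p. t."""
    return [mask_id if rng.random() < t else tok for tok in tokens]


def weighted_mlm_loss(tokens, noised, log_probs, t, mask_id):
    """Masked-LM cross-entropy on masked positions, scaled by 1 / t.

    log_probs[i][v] is the model's log-probability of token v at position i.
    The 1 / t weight corresponds to the assumed schedule alpha_t = 1 - t.
    """
    cross_entropy = sum(
        -log_probs[i][tok]
        for i, (tok, z) in enumerate(zip(tokens, noised))
        if z == mask_id
    )
    return cross_entropy / t


vocab_size = 4
mask_id = vocab_size  # extra id reserved for the mask token
tokens = [0, 1, 2, 3, 1, 0]
t = 0.5

rng = random.Random(0)
noised = mask_tokens(tokens, t, mask_id, rng)
num_masked = sum(z == mask_id for z in noised)

# Toy "model": uniform over the vocabulary at every position.
uniform = [[math.log(1.0 / vocab_size)] * vocab_size for _ in tokens]
loss = weighted_mlm_loss(tokens, noised, uniform, t, mask_id)
```

Averaging this quantity over many sampled values of `t` gives a Monte Carlo estimate of the mixture; unmasked positions contribute nothing to the loss.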
2406.07520
Report |
Neural Gaffer: Relighting Any Object via Diffusion |
Haian Jin, Yuan Li, Fujun Luan, Yuanbo Xiangli, Sai Bi, Kai Zhang, Zexiang Xu, Jin Sun, Noah Snavely |
Single-image relighting is a challenging task that involves reasoning about
the complex interplay between geometry, materials, and lighting. Many prior
methods either support only specific categories of images, such as portraits,
or require special capture conditions, like using a flashlight. Alternatively,
some methods explicitly decompose a scene into intrinsic components, such as
normals and BRDFs, which can be inaccurate or under-expressive. In this work,
we propose a novel end-to-end 2D relighting diffusion model, called Neural
Gaffer, that takes a single image of any object and can synthesize an accurate,
high-quality relit image under any novel environmental lighting condition,
simply by conditioning an image generator on a target environment map, without
an explicit scene decomposition. Our method builds on a pre-trained diffusion
model, and fine-tunes it on a synthetic relighting dataset, revealing and
harnessing the inherent understanding of lighting present in the diffusion
model. We evaluate our model on both synthetic and in-the-wild Internet imagery
and demonstrate its advantages in terms of generalization and accuracy.
Moreover, by combining with other generative methods, our model enables many
downstream 2D tasks, such as text-based relighting and object insertion. Our
model can also operate as a strong relighting prior for 3D tasks, such as
relighting a radiance field. |
This paper introduces Neural Gaffer, an end-to-end 2D relighting diffusion model capable of relighting objects from arbitrary categories under novel environmental lighting conditions specified as HDR environment maps. |
Single-image relighting is challenging due to the complex interplay between geometry, materials, and lighting, with prior methods often limited to specific object categories or requiring special capture conditions. |
The method leverages a pre-trained diffusion model fine-tuned on a synthetic relighting dataset (RelitObjaverse) derived from Objaverse. Key innovations include rotating the target environment map to align with the target camera frame and a novel HDR-LDR conditioning strategy to effectively encode the full lighting energy spectrum. |
Neural Gaffer exhibits superior generalization and accuracy in single-image relighting compared to recent baselines, accurately reproducing highlights, shadows, and reflections.
The model effectively supports downstream 2D tasks such as text-based relighting and object insertion, demonstrating its versatility.
Neural Gaffer serves as a powerful relighting prior for 3D tasks, enabling high-quality relighting of neural radiance fields within minutes using a proposed two-stage pipeline. |
The model may exhibit minor inconsistencies in relighting results under changing lighting conditions due to its generative nature.
The reliance on a low-resolution backbone diffusion model limits the handling of higher image resolutions. |
relighting, diffusion models, neural radiance fields, image editing, computer vision |
2406.07516
Report |
Instant 3D Human Avatar Generation using Image Diffusion Models |
Nikos Kolotouros, Thiemo Alldieck, Enric Corona, Eduard Gabriel Bazavan, Cristian Sminchisescu |
We present AvatarPopUp, a method for fast, high quality 3D human avatar
generation from different input modalities, such as images and text prompts,
with control over the generated pose and shape. The common theme is the use of
diffusion-based image generation networks that are specialized for each
particular task, followed by a 3D lifting network. We purposefully decouple the
generation from the 3D modeling, which allows us to leverage powerful image
synthesis priors, trained on billions of text-image pairs. We fine-tune latent
diffusion networks with additional image conditioning to solve tasks such as
image generation and back-view prediction, and to support qualitatively
different multiple 3D hypotheses. Our partial fine-tuning approach allows us to
adapt the networks for each task without inducing catastrophic forgetting. In
our experiments, we demonstrate that our method produces accurate, high-quality
3D avatars with diverse appearance that respect the multimodal text, image, and
body control signals. Our approach can produce a 3D model in as few as 2
seconds, a four orders of magnitude speedup w.r.t. the vast majority of
existing methods, most of which solve only a subset of our tasks, and with
fewer controls, thus enabling applications that require the controlled 3D
generation of human avatars at scale. The project website can be found at
https://www.nikoskolot.com/avatarpopup/. |
Introduces AvatarPopUp, a method for instant generation of rigged full-body 3D human avatars from text, images, and/or body pose and shape. |
Existing text-to-3D human generation methods are optimization-based, taking minutes to hours per instance, while image-based methods lack control and diversity. AvatarPopUp closes this gap by enabling instant, controllable, and diverse 3D human avatar creation. |
AvatarPopUp decouples the generation process into two stages: (1) Text-to-image generation using fine-tuned Latent Diffusion models, conditioned on text prompts and optionally body pose and shape. (2) 3D lifting using a unimodal, feed-forward image-to-3D model trained on a smaller dataset, taking front and back views (generated or input) as input. |
AvatarPopUp generates high-quality, diverse 3D avatars consistent with text prompts and body controls in 2-10 seconds.
It achieves state-of-the-art performance in single-image 3D reconstruction, outperforming baselines in generating detailed back views and normals.
The method enables 3D virtual try-on applications, preserving identity while allowing garment editing with realistic wrinkles and details. |
Limitations inherent to pixel-aligned methods persist, with less detailed regions parallel to camera rays and potential artifacts in under-represented poses or clothing.
Future work includes exploring alternative 3D construction strategies beyond pixel-aligned features and expanding applications in various fields. |
3d human avatar generation, text-to-3d, image-to-3d, diffusion models, virtual try-on |
2406.07502
Report |
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions |
Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang |
Image description datasets play a crucial role in the advancement of various
applications such as image understanding, text-to-image generation, and
text-image retrieval. Currently, image description datasets primarily originate
from two sources. One source is the scraping of image-text pairs from the web.
Despite their abundance, these descriptions are often of low quality and noisy.
Another is human labeling. Descriptions in datasets such as COCO are generally
very short and lack details. Although detailed image descriptions can be annotated
by humans, the high annotation cost limits the feasibility. These limitations
underscore the need for more efficient and scalable methods to generate
accurate and detailed image descriptions. In this paper, we propose an
innovative framework termed Image Textualization (IT), which automatically
produces high-quality image descriptions by leveraging existing multi-modal
large language models (MLLMs) and multiple vision expert models in a
collaborative manner, which maximally converts the visual information into text.
To address the current lack of benchmarks for detailed descriptions, we propose
several benchmarks for comprehensive evaluation, which verify the quality of
image descriptions created by our framework. Furthermore, we show that
LLaVA-7B, benefiting from training on IT-curated descriptions, acquires improved
capability to generate richer image descriptions, substantially increasing the
length and detail of its output with less hallucination. |
The paper proposes Image Textualization (IT), a framework for automatically generating detailed and accurate image descriptions without human intervention. |
High-quality image descriptions are crucial for various applications, but existing datasets are limited by low quality (web-scraped) or high annotation cost (human-labeled). |
IT leverages MLLMs to generate a base description, uses vision expert models to extract fine-grained details and detect hallucinations, and finally employs LLMs to reconstruct a richer and more accurate description based on textual information. |
IT-generated descriptions outperform MLLM-generated descriptions in capturing comprehensive visual information and reducing hallucinations.
Fine-tuning MLLMs with IT-curated data significantly improves their ability to generate detailed and accurate descriptions, approaching the performance of more powerful MLLMs.
Evaluation benchmarks (DID-Bench, D2I-Bench, LIN-Bench) are proposed to assess the quality of detailed descriptions. |
Tuning larger MLLMs with IT-curated data was not explored due to computational limitations.
Future work could investigate incorporating additional vision experts and exploring different recaptioning strategies. |
image description generation, multi-modal large language models, vision expert models, hallucination detection, detailed image description datasets |
2406.07499
Report |
Trim 3D Gaussian Splatting for Accurate Geometry Representation |
Lue Fan, Yuxue Yang, Minxing Li, Hongsheng Li, Zhaoxiang Zhang |
In this paper, we introduce Trim 3D Gaussian Splatting (TrimGS) to
reconstruct accurate 3D geometry from images. Prior work on geometry
reconstruction from 3D Gaussians mainly focuses on exploring strong geometry
regularization. Instead, from a fresh perspective, we propose to obtain
accurate 3D geometry of a scene by Gaussian trimming, which selectively removes
the inaccurate geometry while preserving accurate structures. To achieve this,
we analyze the contributions of individual 3D Gaussians and propose a
contribution-based trimming strategy to remove the redundant or inaccurate
Gaussians. Furthermore, our experimental and theoretical analyses reveal that a
relatively small Gaussian scale is a non-negligible factor in representing and
optimizing the intricate details. Therefore, the proposed TrimGS maintains
relatively small Gaussian scales. In addition, TrimGS is also compatible with
the effective geometry regularization strategies in prior work. When
combined with the original 3DGS and the state-of-the-art 2DGS, TrimGS
consistently yields more accurate geometry and higher perceptual quality. Our
project page is https://trimgs.github.io |
Presents TrimGS, a novel technique for reconstructing accurate 3D geometry from images using a contribution-based Gaussian trimming strategy, complementing existing geometric regularization methods. |
Addresses the limitations of previous 3D Gaussian Splatting (3DGS) methods that rely heavily on geometric regularization, which often struggle to capture intricate geometric details. |
Introduces a novel contribution-based trimming strategy that selectively removes inaccurate or redundant Gaussians based on their contributions to the rendered images. It also proposes maintaining relatively small Gaussian scales during training to enhance detail representation and optimize high-frequency regions. |
TrimGS, when applied to both 3DGS and 2DGS, consistently produces more accurate geometry as measured by Chamfer Distance on the DTU dataset.
Analysis of raw point clouds (Gaussian centers) demonstrates the effectiveness of TrimGS in preserving geometric details.
TrimGS, particularly when combined with 2DGS, enhances the perceptual rendering quality, especially in high-frequency regions, mitigating the over-smoothness often observed in 2DGS. |
Despite emphasizing Gaussian trimming, TrimGS still relies on geometric regularization, which can slightly compromise rendering quality compared to the original 3DGS.
Future work will explore the challenge of simultaneously achieving high rendering fidelity and accurate geometry reconstruction. |
3d gaussian splatting, geometry reconstruction, novel view synthesis, gaussian trimming, perceptual rendering quality |
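The contribution-based trimming idea in the TrimGS record (rank Gaussians by how much they contribute to rendered images, then remove the weakest) can be sketched as below. This is an assumption-laden illustration: the per-view contribution scores, the mean aggregation, the `trim_fraction` parameter, and the function name are all hypothetical, and the record does not specify how contributions are computed.

```python
def trim_by_contribution(contributions, trim_fraction):
    """Return the indices of Gaussians kept after trimming.

    contributions[g][v] is a hypothetical per-view contribution score for
    Gaussian g in view v (for example, its accumulated blending weight).
    Gaussians are ranked by mean contribution and the lowest trim_fraction
    of them are dropped.
    """
    mean_scores = [sum(views) / len(views) for views in contributions]
    order = sorted(range(len(mean_scores)), key=lambda g: mean_scores[g])
    num_trimmed = int(trim_fraction * len(order))
    return sorted(order[num_trimmed:])


contributions = [
    [0.90, 0.80],  # Gaussian 0: contributes strongly in both views
    [0.01, 0.02],  # Gaussian 1: nearly invisible
    [0.40, 0.50],  # Gaussian 2: moderate contribution
    [0.05, 0.00],  # Gaussian 3: nearly invisible
]
kept = trim_by_contribution(contributions, trim_fraction=0.5)
```

With these toy scores, the two low-contribution Gaussians are removed and Gaussians 0 and 2 are kept.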
2406.07496
Report |
TextGrad: Automatic "Differentiation" via Text |
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou |
AI is undergoing a paradigm shift, with breakthroughs achieved by systems
orchestrating multiple large language models (LLMs) and other complex
components. As a result, developing principled and automated optimization
methods for compound AI systems is one of the most important new challenges.
Neural networks faced a similar challenge in their early days until
backpropagation and automatic differentiation transformed the field by making
optimization turn-key. Inspired by this, we introduce TextGrad, a powerful
framework performing automatic "differentiation" via text. TextGrad
backpropagates textual feedback provided by LLMs to improve individual
components of a compound AI system. In our framework, LLMs provide rich,
general, natural language suggestions to optimize variables in computation
graphs, ranging from code snippets to molecular structures. TextGrad follows
PyTorch's syntax and abstraction and is flexible and easy-to-use. It works
out-of-the-box for a variety of tasks, where the users only provide the
objective function without tuning components or prompts of the framework. We
showcase TextGrad's effectiveness and generality across a diverse range of
applications, from question answering and molecule optimization to radiotherapy
treatment planning. Without modifying the framework, TextGrad improves the
zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to
55%, yields a 20% relative performance gain in optimizing LeetCode-Hard
coding problem solutions, improves prompts for reasoning, designs new druglike
small molecules with desirable in silico binding, and designs radiation
oncology treatment plans with high specificity. TextGrad lays a foundation to
accelerate the development of the next generation of AI systems. |
Introduces TextGrad, a framework for automatic differentiation via text, which uses textual feedback from LLMs to optimize components of compound AI systems. |
Addresses the challenge of optimizing complex AI systems composed of multiple LLMs and tools, a task not easily addressed by traditional gradient-based methods. |
Represents AI systems as computation graphs where variables are connected by arbitrary functions (e.g., LLM calls, simulators). Employs LLMs to provide natural language feedback ('textual gradients') on how to modify variables to improve a downstream objective. These gradients are backpropagated through the graph to update variables. |
Improves the zero-shot accuracy of GPT-4o on the Google-Proof Question Answering benchmark from 51% to 55%.
Achieves a 20% relative performance gain in optimizing solutions to LeetCode-Hard coding problems compared to existing methods.
Enhances prompts for reasoning tasks, pushing GPT-3.5 performance closer to GPT-4. |
Current implementation primarily focuses on text data; extending to other data modalities is important future work.
Exploring more sophisticated optimization techniques inspired by the numerical optimization literature could further improve performance and stability. |
large language models, automatic differentiation, compound ai systems, optimization, textual feedback |
2406.07488
Report |
ReduceFormer: Attention with Tensor Reduction by Summation |
John Yang, Le An, Su Inn Park |
Transformers have excelled in many tasks including vision. However, efficient
deployment of transformer models in low-latency or high-throughput applications
is hindered by the computation in the attention mechanism which involves
expensive operations such as matrix multiplication and Softmax. To address
this, we introduce ReduceFormer, a family of models optimized for efficiency
in the spirit of attention. ReduceFormer leverages only simple operations
such as reduction and element-wise multiplication, leading to greatly
simplified architecture and improved inference performance, with up to 37%
reduction in latency and 44% improvement in throughput, while maintaining
competitive accuracy comparable to other recent methods. The proposed model
family is suitable for edge devices where compute resources and memory bandwidth
are limited, as well as for cloud computing where high throughput is sought
after. |
Introduces ReduceFormer, a family of efficient vision transformer models utilizing simple operations like reduction and element-wise multiplication to improve efficiency without significant accuracy loss. |
Addresses the computational and memory challenges of deploying transformer models in low-latency or high-throughput applications, particularly on resource-constrained edge devices. |
Replaces complex attention mechanisms with a combination of multi-scale local context learning and ReduceFormer Attention, which leverages global summation and element-wise operations to approximate global feature relationships. |
Achieves accuracy competitive with other state-of-the-art methods such as EfficientViT on the ImageNet-1K benchmark.
Demonstrates significant speedup over EfficientViT, with up to 37% lower latency on the NVIDIA DRIVE Orin SoC and up to 44% higher throughput on an L40 GPU.
Reduces memory footprint, making it suitable for deployment on edge devices with limited memory bandwidth. |
Current work focuses on image classification, leaving exploration of other vision tasks for future research.
Further optimization of ReduceFormer for specific hardware platforms could potentially yield additional performance gains. |
vision transformers, efficient deep learning, attention mechanisms, edge computing, computer vision |
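The ReduceFormer method description above (global summation plus element-wise operations in place of matrix multiplication and Softmax) can be illustrated with a minimal sketch. This is not the paper's exact layer: the function name `reduce_attention`, the ReLU on queries and keys, and the omission of any normalization term are all assumptions made for illustration. It only shows how a global context can be formed by a per-channel sum over tokens and then applied to every query element-wise, at a cost linear in the number of tokens.

```python
def relu(x):
    """Clamp negative values to zero."""
    return x if x > 0.0 else 0.0


def reduce_attention(q, k, v):
    """Hypothetical reduction-based attention.

    q, k, v: lists of N tokens, each token a list of C channel values.
    Returns a list of N tokens with C channels each.
    """
    n_channels = len(q[0])
    # Assumption: non-negative queries/keys, as in ReLU-style linear attention.
    q = [[relu(x) for x in tok] for tok in q]
    k = [[relu(x) for x in tok] for tok in k]
    # Global context: element-wise product of k and v, summed over all tokens.
    context = [
        sum(k[i][c] * v[i][c] for i in range(len(k)))
        for c in range(n_channels)
    ]
    # Each query is modulated element-wise by the shared global context.
    return [[tok[c] * context[c] for c in range(n_channels)] for tok in q]


if __name__ == "__main__":
    q = [[1.0, 2.0], [3.0, -1.0]]
    k = [[1.0, 0.0], [2.0, 1.0]]
    v = [[1.0, 1.0], [2.0, 3.0]]
    print(reduce_attention(q, k, v))
```

No N-by-N attention matrix is ever built here, which is the property the abstract credits for the latency and throughput gains. The actual model also includes multi-scale local context learning, which this sketch leaves out.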
2406.07480
Report |
Image Neural Field Diffusion Models |
Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi |
Diffusion models have shown an impressive ability to model complex data
distributions, with several key advantages over GANs, such as stable training,
better coverage of the training distribution's modes, and the ability to solve
inverse problems without extra training. However, most diffusion models learn
the distribution of fixed-resolution images. We propose to learn the
distribution of continuous images by training diffusion models on image neural
fields, which can be rendered at any resolution, and show its advantages over
fixed-resolution models. To achieve this, a key challenge is to obtain a latent
space that represents photorealistic image neural fields. We propose a simple
and effective method, inspired by several recent techniques but with key
changes to make the image neural fields photorealistic. Our method can be used
to convert existing latent diffusion autoencoders into image neural field
autoencoders. We show that image neural field diffusion models can be trained
using mixed-resolution image datasets, outperform fixed-resolution diffusion
models followed by super-resolution models, and can solve inverse problems with
conditions applied at different scales efficiently. |
This paper proposes Image Neural Field Diffusion models (INFD), which learn the distribution of continuous images via neural fields, enabling resolution-agnostic image generation and editing. |
Current diffusion models are limited to fixed-resolution image generation, requiring separate super-resolution models for high-resolution synthesis. INFD overcomes this by learning a continuous image representation, facilitating efficient high-resolution generation and multi-scale image editing. |
The method involves two stages: 1) Train an image neural field autoencoder that maps images to and from a latent space representing continuous image neural fields. 2) Train a diffusion model on this latent space to generate new images. A novel convolutional local image function (CLIF) renderer is introduced for efficient and photorealistic neural field rendering. |
INFD outperforms fixed-resolution diffusion models followed by super-resolution in terms of image quality and detail preservation.
The model effectively learns from mixed-resolution datasets, even with limited high-resolution training data.
INFD enables efficient solving of inverse problems with multi-scale conditions, such as layout-to-image generation. |
The method assumes scale-consistency in training data, limiting its performance on datasets with significant discrepancies between low and high-resolution images.
The current implementation relies on a fixed-resolution encoder; exploring efficient any-resolution encoders is left for future work. |
diffusion models, neural fields, image generation, super-resolution, image editing |
2406.07476
Report |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing |
In this paper, we present VideoLLaMA 2, a set of Video Large Language
Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio
understanding in video and audio-oriented tasks. Building upon its predecessor,
VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC)
connector, which effectively captures the intricate spatial and temporal
dynamics of video data. Additionally, we integrate an Audio Branch into the
model through joint training, thereby enriching the multimodal understanding
capabilities of the model by seamlessly incorporating audio cues. Comprehensive
evaluations on multiple-choice video question answering (MC-VQA), open-ended
video question answering (OE-VQA), and video captioning (VC) tasks demonstrate
that VideoLLaMA 2 consistently achieves competitive results among open-source
models and even gets close to some proprietary models on several benchmarks.
Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and
audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models.
These advancements underline VideoLLaMA 2's superior performance in multimodal
comprehension, setting a new standard for intelligent video analysis systems.
All models are public to facilitate further research. |
Introduces VideoLLaMA 2, a series of Video Large Language Models (Video-LLMs) that improve spatial-temporal modeling and audio understanding in video and audio-oriented tasks. |
Video understanding and generation is an important AI field, and existing Video-LLMs struggle to process temporal dynamics and integrate audio cues effectively. |
VideoLLaMA 2 builds on its predecessor with a new Spatial-Temporal Convolution (STC) connector for capturing spatial and temporal dynamics. It also integrates an Audio Branch through joint training to incorporate audio cues, enhancing multimodal understanding. |
VideoLLaMA 2 achieves competitive results against open-source models on MC-VQA, OE-VQA, and VC tasks, even approaching the performance of some proprietary models.
The model exhibits reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks.
VideoLLaMA 2 showcases a deeper understanding of multimodal content, excelling in tasks requiring interpretation of both visual and auditory information. |
The model shows limitations in video tasks heavily reliant on static visual information, suggesting a potential area for improvement.
Future work could explore the integration of other popular LLMs, like Gemma-IT, LLaMA3-Instruct, and Qwen2-Instruct, as the backbone. |
video language models, multimodal understanding, spatial-temporal modeling, audio-visual integration, video question answering |
2406.07472
Report |
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models |
Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee |
Existing dynamic scene generation methods mostly rely on distilling knowledge
from pre-trained 3D generative models, which are typically fine-tuned on
synthetic object datasets. As a result, the generated scenes are often
object-centric and lack photorealism. To address these limitations, we
introduce a novel pipeline designed for photorealistic text-to-4D scene
generation, discarding the dependency on multi-view generative models and
instead fully utilizing video generative models trained on diverse real-world
datasets. Our method begins by generating a reference video using the video
generation model. We then learn the canonical 3D representation of the video
using a freeze-time video, delicately generated from the reference video. To
handle inconsistencies in the freeze-time video, we jointly learn a per-frame
deformation to model these imperfections. We then learn the temporal
deformation based on the canonical representation to capture dynamic
interactions in the reference video. The pipeline facilitates the generation of
dynamic scenes with enhanced photorealism and structural integrity, viewable
from multiple perspectives, thereby setting a new standard in 4D scene
generation. |
Introduces 4Real, a novel pipeline for photorealistic text-to-4D scene generation that leverages video generative models trained on real-world datasets. |
Addresses limitations of existing 4D generation methods, which often produce object-centric and unrealistic scenes due to reliance on multi-view models trained on synthetic data. |
Generates a reference video and a freeze-time video using a video diffusion model. Reconstructs canonical 3D Gaussian Splats (3DGS) from the freeze-time video, modeling inconsistencies as per-frame deformations. Learns temporal deformation from the reference video using the reconstructed 3DGS and video score distillation sampling (SDS). |
Achieves text-driven dynamic scene generation with near-photorealistic appearance and realistic 3D motions.
Outperforms state-of-the-art object-centric text-to-4D generation methods in user studies across various realism and quality metrics.
Offers greater flexibility, diversity, and computational efficiency compared to methods relying solely on score distillation sampling. |
Inherits limitations from the underlying video generation model, such as resolution constraints and occasional artifacts.
Reconstruction can be challenging with complex scenes involving rapid movements or lighting changes. |
4d scene generation, text-to-video, deformable 3d gaussian splats, score distillation sampling, photorealistic rendering |
2406.07251
Report |
Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models |
Athanasios Tragakis, Marco Aversa, Chaitanya Kaul, Roderick Murray-Smith, Daniele Faccio |
In this work, we introduce Pixelsmith, a zero-shot text-to-image generative
framework to sample images at higher resolutions with a single GPU. We are the
first to show that it is possible to scale the output of a pre-trained
diffusion model by a factor of 1000, opening the road for gigapixel image
generation at no additional cost. Our cascading method uses the image generated
at the lowest resolution as a baseline to sample at higher resolutions. For the
guidance, we introduce the Slider, a tunable mechanism that fuses the overall
structure contained in the first-generated image with enhanced fine details. At
each inference step, we denoise patches rather than the entire latent space,
minimizing memory demands such that a single GPU can handle the process,
regardless of the image's resolution. Our experimental results show that
Pixelsmith not only achieves higher quality and diversity compared to existing
techniques, but also reduces sampling time and artifacts. The code for our work
is available at https://github.com/Thanos-DB/Pixelsmith. |
Introduces Pixelsmith, a zero-shot text-to-image framework that samples images at higher resolutions on a single GPU by scaling the output of a pre-trained diffusion model, opening the road to gigapixel image generation. |
Existing methods for high-resolution image generation are limited by computational resources, memory efficiency, and the introduction of artifacts. Pixelsmith addresses these limitations. |
The method uses a cascaded approach, generating a base image at a lower resolution and then upscaling it. It introduces a "Slider" mechanism to control the level of detail and a patch denoising process for memory efficiency. |
Pixelsmith achieves higher quality and diversity compared to existing techniques.
The method reduces sampling time and artifacts.
It allows for flexible scaling to ultra-high resolutions on limited hardware. |
Suppressing artifacts becomes increasingly difficult at higher resolutions.
Lack of appropriate metrics for evaluating high-resolution image generation. |
image generation, diffusion models, high-resolution, gigapixel, patch denoising |
2406.07209
Report |
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance |
X. Wang, Siming Fu, Qihan Huang, Wanggui He, Hao Jiang |
Recent advancements in text-to-image generation models have dramatically
enhanced the generation of photorealistic images from textual prompts, leading
to an increased interest in personalized text-to-image applications,
particularly in multi-subject scenarios. However, these advances are hindered
by two main challenges: firstly, the need to accurately maintain the details of
each referenced subject in accordance with the textual descriptions; and
secondly, the difficulty in achieving a cohesive representation of multiple
subjects in a single image without introducing inconsistencies. To address
these concerns, our research introduces the MS-Diffusion framework for
layout-guided zero-shot image personalization with multi-subjects. This
innovative approach integrates grounding tokens with the feature resampler to
maintain detail fidelity among subjects. With the layout guidance, MS-Diffusion
further improves the cross-attention to adapt to the multi-subject inputs,
ensuring that each subject condition acts on specific areas. The proposed
multi-subject cross-attention orchestrates harmonious inter-subject
compositions while preserving the control of texts. Comprehensive quantitative
and qualitative experiments affirm that this method surpasses existing models
in both image and text fidelity, promoting the development of personalized
text-to-image generation. |
Introduces MS-Diffusion, a novel layout-guided, zero-shot framework for multi-subject image personalization within diffusion models. |
Addresses limitations of existing personalized text-to-image generation models in maintaining subject detail fidelity and achieving cohesive multi-subject representation. |
Introduces a grounding resampler to extract and integrate subject features with grounding information (entities, bounding boxes) and a multi-subject cross-attention mechanism to confine subjects to specific areas guided by layout priors. |
Achieves superior image fidelity and detail retention in single-subject personalization.
Demonstrates robust multi-subject generation with natural interactions and distinct subject representation.
Exhibits high text fidelity, preserving text control capabilities while incorporating multiple subject references. |
Lacks precision in subject positioning due to box-based layout indication.
Explicit layout input requirement during inference limits complex scene generation. |
image personalization, multi-subject generation, diffusion models, zero-shot learning, layout guidance |
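The MS-Diffusion method description above says that layout priors confine each subject to a specific area in cross-attention. A minimal sketch of that idea, assuming normalized bounding boxes and a hard mask on attention logits, could look as follows. The helper names `layout_masks` and `mask_logits` and the hard-masking choice are hypothetical; the paper's actual mechanism may differ.

```python
def layout_masks(height, width, boxes):
    """Build one boolean mask per subject from its bounding box.

    boxes: list of (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    Returns masks[s][y][x], True where subject s is allowed to act.
    """
    masks = []
    for (x0, y0, x1, y1) in boxes:
        mask = []
        for y in range(height):
            cy = (y + 0.5) / height  # pixel-center coordinate
            row = []
            for x in range(width):
                cx = (x + 0.5) / width
                row.append(x0 <= cx < x1 and y0 <= cy < y1)
            mask.append(row)
        masks.append(mask)
    return masks


def mask_logits(logits, mask):
    """Set attention logits outside the subject's box to -inf."""
    return [
        [v if keep else float("-inf") for v, keep in zip(lrow, mrow)]
        for lrow, mrow in zip(logits, mask)
    ]


if __name__ == "__main__":
    # Two subjects: one on the left half, one on the right half.
    masks = layout_masks(4, 4, [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)])
    logits = [[0.0] * 4 for _ in range(4)]
    print(mask_logits(logits, masks[0])[0])
```

After a softmax, positions set to negative infinity receive zero weight, so each subject's features only influence pixels inside its box. The limitation listed above (imprecise positioning from box-based layouts) follows directly from this kind of rectangular mask.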
2406.07170
Report |
VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation |
Sidun Liu, Peng Qiao, Zongxin Ye, Wenyu Li, Yong Dou |
Neural Surface Reconstruction learns a Signed Distance Field (SDF) to
reconstruct the 3D model from multi-view images. Previous works adopt
voxel-based explicit representation to improve efficiency. However, they
ignored the gradient instability of interpolation in the voxel grid, leading to
degradation on convergence and smoothness. Besides, previous works entangled
the optimization of geometry and radiance, which leads to the deformation of
geometry to explain radiance, causing artifacts when reconstructing textured
planes.
In this work, we reveal that the instability of gradient comes from its
discontinuity during trilinear interpolation, and propose to use the
interpolated gradient instead of the original analytical gradient to eliminate
the discontinuity. Based on gradient interpolation, we propose VoxNeuS, a
lightweight method for computationally and memory-efficient neural surface
reconstruction. Thanks to the explicit representation, the gradients of the
regularization terms, i.e., the Eikonal and curvature losses, are
directly solved, avoiding computation and memory-access overhead.
Further, VoxNeuS adopts a geometry-radiance disentangled architecture to
handle the geometry deformation from radiance optimization.
The experimental results show that VoxNeuS achieves better reconstruction
quality than previous works. The entire training process takes 15 minutes and
less than 3 GB of memory on a single 2080ti GPU. |
Introduces VoxNeuS, an efficient and lightweight neural surface reconstruction method that enhances voxel-based neural surface reconstruction by using interpolated gradients. |
Existing voxel-based methods for neural surface reconstruction suffer from gradient instability during trilinear interpolation, leading to slow convergence and poor surface smoothness. Additionally, the entanglement of geometry and radiance optimization can cause artifacts. |
The core of VoxNeuS is the replacement of analytical SDF gradients with interpolated gradients, ensuring gradient continuity without computational overhead. Additionally, it uses a geometry-radiance disentangled architecture, directly applies SDF regularization on vertices, and employs progressive super-resolution of the SDF grid. |
Achieves lower Chamfer Distance on DTU than Voxurf and NeuS2 without requiring foreground masks.
Significantly faster than previous methods, completing training in 15 minutes on a single 2080ti GPU.
More memory efficient, requiring less than 3GB of memory during training. |
Relies on an independent color network (e.g., hash grid), which could be further optimized.
The disentangled architecture may require more iterations to converge compared to entangled approaches. |
neural surface reconstruction, voxel-based representation, gradient interpolation, geometry-radiance disentanglement, efficient 3d reconstruction |
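The gradient discontinuity that VoxNeuS targets can be seen in a one-dimensional analogue of trilinear interpolation. The sketch below is only an illustration under stated assumptions: it uses a 1D grid instead of a 3D voxel grid, and it estimates per-vertex gradients with central differences, which is a choice made here and not necessarily the paper's. The function names are hypothetical.

```python
import math


def _cell(grid, x):
    """Index of the grid cell containing x, clamped to the valid range."""
    return max(0, min(int(math.floor(x)), len(grid) - 2))


def lerp_value(grid, x):
    """Linearly interpolate grid values at continuous position x."""
    i = _cell(grid, x)
    t = x - i
    return (1.0 - t) * grid[i] + t * grid[i + 1]


def analytic_gradient(grid, x):
    """Derivative of the interpolant: constant per cell, jumps at vertices."""
    i = _cell(grid, x)
    return grid[i + 1] - grid[i]


def vertex_gradients(grid):
    """Assumed per-vertex gradient estimate (central differences)."""
    n = len(grid)
    grads = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        grads.append((grid[hi] - grid[lo]) / (hi - lo))
    return grads


def interpolated_gradient(grid, x):
    """Interpolate the vertex gradients; continuous across cell borders."""
    return lerp_value(vertex_gradients(grid), x)


if __name__ == "__main__":
    sdf = [0.0, 1.0, 4.0, 9.0, 16.0]  # samples of x**2 on an integer grid
    for x in (1.999, 2.001):
        print(x, analytic_gradient(sdf, x), interpolated_gradient(sdf, x))
```

Crossing the vertex at x = 2, the analytic gradient jumps from one cell's slope to the next, while the interpolated gradient changes smoothly. In 3D the same effect occurs across voxel faces, which is the instability the abstract attributes to trilinear interpolation.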
2406.07163
Report |
FaceGPT: Self-supervised Learning to Chat about 3D Human Faces |
Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski |
We introduce FaceGPT, a self-supervised learning framework for Large
Vision-Language Models (VLMs) to reason about 3D human faces from images and
text. Typical 3D face reconstruction methods are specialized algorithms that
lack semantic reasoning capabilities. FaceGPT overcomes this limitation by
embedding the parameters of a 3D morphable face model (3DMM) into the token
space of a VLM, enabling the generation of 3D faces from both textual and
visual inputs. FaceGPT is trained in a self-supervised manner as a model-based
autoencoder from in-the-wild images. In particular, the hidden state of LLM is
projected into 3DMM parameters and subsequently rendered as 2D face image to
guide the self-supervised learning process via image-based reconstruction.
Without relying on expensive 3D annotations of human faces, FaceGPT obtains a
detailed understanding about 3D human faces, while preserving the capacity to
understand general user instructions. Our experiments demonstrate that FaceGPT
not only achieves high-quality 3D face reconstructions but also retains the
ability for general-purpose visual instruction following. Furthermore, FaceGPT
learns fully self-supervised to generate 3D faces based on complex textual
inputs, which opens a new direction in human face analysis. |
Introduces FaceGPT, a self-supervised framework allowing Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text using a 3D Morphable Model (3DMM) embedded within the VLM's token space. |
Traditional 3D face reconstruction methods lack semantic reasoning, limiting their ability to understand faces from textual descriptions like humans can. FaceGPT bridges this gap by enabling VLMs to process both visual and textual information for 3D face understanding. |
The framework integrates a 3DMM into a VLM, enabling the generation of 3D faces from both text and images. It leverages a self-supervised, model-based autoencoder trained on in-the-wild images, using a differentiable renderer to reconstruct and learn from 2D face images, eliminating the need for expensive 3D annotations. |
Achieves high-quality 3D face reconstructions comparable to specialized methods while retaining general visual instruction following abilities.
Demonstrates the ability to generate 3D faces from complex textual descriptions, opening new avenues in human face analysis.
Exhibits strong performance in traditional 3D face reconstruction, visual instruction following, and text-based 3D face reconstruction tasks. |
Currently does not match the state-of-the-art performance of specialized 3D face reconstruction methods.
Specific to faces and requires a pre-existing 3D morphable model, limiting generalization to other objects. |
3d face reconstruction, vision-language models, self-supervised learning, 3d morphable model, text-to-3d face generation |
2406.07008
Report |
Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models |
Sooyeon Go, Kyungmook Choi, Minjung Shin, Youngjung Uh |
As pretrained text-to-image diffusion models have become a useful tool for
image synthesis, people want to specify the results in various ways. In this
paper, we introduce a method to produce results with the same structure of a
target image but painted with colors from a reference image, i.e., appearance
transfer, especially following the semantic correspondence between the result
and the reference. E.g., the result wing takes color from the reference wing,
not the reference head. Existing methods rely on the query-key similarity
within self-attention layer, usually producing defective results. To this end,
we propose to find semantic correspondences and explicitly rearrange the
features according to the semantic correspondences. Extensive experiments show
the superiority of our method in various aspects: preserving the structure of
the target and reflecting the color from the reference according to the
semantic correspondences, even when the two images are not aligned. |
This paper introduces a training-free method for appearance transfer in text-to-image diffusion models, which transfers local appearances from a reference image to a target image based on their semantic correspondences. |
Existing methods often fail to accurately transfer appearance between unaligned images or those with complex patterns because they rely on query-key similarity within self-attention layers, which doesn’t guarantee semantic correspondence. |
The method finds semantic correspondences between features of the target and reference images, rearranges the reference features accordingly, and injects them into the target features during the denoising process. It uses image-level segmentation masks to confine the correspondence to the region of interest and applies AdaIN to minimize color discrepancies. |
The method successfully transfers complex color patterns while preserving the target image's structure, even when the target and reference images are not aligned.
It outperforms existing methods in preserving the structure of the target image, as measured by IoU between object masks.
The method is robust to challenging cases, such as cross-category and cross-style appearance transfer, and can handle multiple objects with different appearances. |
The method relies on the performance of the inversion model and may struggle if the inversion is inaccurate.
It may not always find accurate semantic correspondences when the reference image lacks semantically matching parts with the target image. |
appearance transfer, diffusion models, semantic correspondence, image editing, text-to-image synthesis |
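The Eye-for-an-eye method description above (find correspondences, rearrange reference features, apply AdaIN) can be sketched in a few lines. This is a toy illustration, not the authors' implementation: matching by highest cosine similarity is an assumption about how correspondences are found, the helper names are hypothetical, and features are plain Python lists. The AdaIN step follows the standard definition, which matches the mean and standard deviation of the content to those of the style.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def match_indices(target_feats, reference_feats):
    """For each target location, index of the most similar reference location."""
    return [
        max(range(len(reference_feats)),
            key=lambda j: cosine(t, reference_feats[j]))
        for t in target_feats
    ]


def rearrange(reference_values, indices):
    """Gather reference appearance features in target order."""
    return [reference_values[j] for j in indices]


def adain(content, style, eps=1e-8):
    """Shift and scale one channel of content to the style's mean and std."""
    mu_c = sum(content) / len(content)
    mu_s = sum(style) / len(style)
    sd_c = math.sqrt(sum((x - mu_c) ** 2 for x in content) / len(content))
    sd_s = math.sqrt(sum((x - mu_s) ** 2 for x in style) / len(style))
    return [(x - mu_c) / (sd_c + eps) * sd_s + mu_s for x in content]


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.0, 1.0]]         # semantic features, target image
    reference = [[0.0, 2.0], [3.0, 0.1]]      # semantic features, reference image
    colors = ["reference-head", "reference-wing"]
    idx = match_indices(target, reference)
    print(idx, rearrange(colors, idx))
```

Because the matching is computed explicitly per location, the rearranged features follow semantics (wing to wing) even when the two images are not spatially aligned, which is the failure case the summary attributes to query-key similarity in self-attention.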
2406.06973
Report |
RWKV-CLIP: A Robust Vision-Language Representation Learner |
Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng |
Contrastive Language-Image Pre-training (CLIP) has significantly improved
performance in various vision-language tasks by expanding the dataset with
image-text pairs obtained from websites. This paper further explores CLIP from
the perspectives of data and model architecture. To address the prevalence of
noisy data and enhance the quality of large-scale image-text data crawled from
the internet, we introduce a diverse description generation framework that can
leverage Large Language Models (LLMs) to synthesize and refine content from
web-based texts, synthetic captions, and detection tags. Furthermore, we
propose RWKV-CLIP, the first RWKV-driven vision-language representation
learning model that combines the effective parallel training of transformers
with the efficient inference of RNNs. Comprehensive experiments across various
model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust
and efficient vision-language representation learner that achieves
state-of-the-art performance in several downstream tasks, including linear
probe, zero-shot classification, and zero-shot image-text retrieval. To
facilitate future research, the code and pre-trained models are released at
https://github.com/deepglint/RWKV-CLIP |
This paper explores CLIP from data and model architecture perspectives, proposing a diverse description generation framework using LLMs and introducing RWKV-CLIP, an RWKV-driven vision-language representation learning model. |
This work addresses challenges of noisy data in large-scale image-text pairs and limitations of Transformers in processing high-resolution images and long sequences. |
The authors develop a diverse description generation framework leveraging LLMs to synthesize and refine information from various sources. They also propose RWKV-CLIP, which combines the parallel training of Transformers with the efficient inference of RNNs. |
RWKV-CLIP achieves state-of-the-art performance in linear probe, surpassing previous models.
It significantly outperforms existing methods in zero-shot image-text retrieval on Flickr30k and MSCOCO.
The model demonstrates robustness and effectiveness in zero-shot classification across 11 datasets. |
The paper notes potential limitations in prompt template constraints affecting zero-shot classification.
Future work could explore further compatibility improvements between RWKV and Transformer architectures. |
vision-language representation learning, contrastive language-image pre-training (clip), rwkv, diverse description generation, zero-shot learning |
2406.06911
Report |
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising |
Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang |
Diffusion models have garnered significant interest from the community for
their great generative ability across various applications. However, their
typical multi-step sequential-denoising nature gives rise to high cumulative
latency and precludes parallel computation. To
address this, we introduce AsyncDiff, a universal and plug-and-play
acceleration scheme that enables model parallelism across multiple devices. Our
approach divides the cumbersome noise prediction model into multiple
components, assigning each to a different device. To break the dependency chain
between these components, it transforms the conventional sequential denoising
into an asynchronous process by exploiting the high similarity between hidden
states in consecutive diffusion steps. Consequently, each component can
compute in parallel on separate devices. The proposed strategy
significantly reduces inference latency while minimally impacting the
generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff
achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with
only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our
experiments also demonstrate that AsyncDiff can be readily applied to video
diffusion models with encouraging performances. The code is available at
https://github.com/czg1225/AsyncDiff. |
This paper proposes AsyncDiff, a universal and plug-and-play distributed acceleration scheme for diffusion models, enabling model parallelism across multiple devices. |
Diffusion models have high inference latency due to their multi-step sequential denoising process, hindering their widespread application. |
AsyncDiff divides the denoising model into components, each assigned to a different device. By exploiting hidden state similarity between consecutive steps, it transforms sequential denoising into an asynchronous process, allowing parallel computation. |
Achieves up to 4.0x speedup on Stable Diffusion v2.1 with minimal quality degradation on four NVIDIA A5000 GPUs.
Demonstrates effectiveness on both text-to-image and video diffusion models, significantly reducing latency while preserving quality.
Outperforms existing parallel acceleration methods in terms of speed, quality, and resource efficiency. |
Performance may be sub-optimal with limited communication bandwidth between devices.
Relies on pre-trained diffusion models, limiting quality improvements if the baseline model is inadequate. |
diffusion models, model parallelism, asynchronous denoising, inference acceleration, distributed computing |
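The dependency-breaking idea in the AsyncDiff method description above can be simulated on scalars. The sketch below is a toy model under stated assumptions: the "components" are simple Python functions, not network blocks, the update rule is a plain Euler-style step, and the warm-up phase and all names (`run_sequential`, `run_async`, `warmup`) are illustrative choices. In the asynchronous branch, each component reads the previous step's output of its predecessor, so no component waits on another within a step.

```python
def run_sequential(components, x, steps, step_size):
    """Baseline: every step runs the components one after another."""
    for _ in range(steps):
        h = x
        for f in components:
            h = f(h)
        x = x - step_size * h
    return x


def run_async(components, x, steps, step_size, warmup=1):
    """Toy asynchronous schedule using stale inputs from the previous step."""
    cache = [None] * len(components)
    for t in range(steps):
        if t < warmup:
            # Warm-up: run sequentially once to fill the caches.
            h = x
            for k, f in enumerate(components):
                h = f(h)
                cache[k] = h
        else:
            # Component 0 sees the current sample; component k > 0 sees the
            # cached (previous-step) output of component k - 1. These calls
            # are mutually independent and could run on separate devices.
            inputs = [x] + cache[:-1]
            cache = [f(inp) for f, inp in zip(components, inputs)]
        x = x - step_size * cache[-1]
    return x


if __name__ == "__main__":
    parts = [lambda h: 0.5 * h + 1.0, lambda h: 0.9 * h, lambda h: h - 0.2]
    print(run_sequential(parts, 4.0, 20, 0.05))
    print(run_async(parts, 4.0, 20, 0.05))
```

The two results stay close because the sample changes only slightly between consecutive steps, which mirrors the hidden-state similarity the abstract relies on. The sketch does not model communication cost, which the limitations above identify as a practical bottleneck.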
2406.06890
Report |
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation |
Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang |
Image diffusion distillation achieves high-fidelity generation with very few
sampling steps. However, applying these techniques directly to video diffusion
often results in unsatisfactory frame quality due to the limited visual quality
in public video datasets. This affects the performance of both teacher and
student video diffusion models. Our study aims to improve video diffusion
distillation while improving frame appearance using abundant high-quality image
data. We propose motion consistency model (MCM), a single-stage video diffusion
distillation method that disentangles motion and appearance learning.
Specifically, MCM includes a video consistency model that distills motion from
the video teacher model, and an image discriminator that enhances frame
appearance to match high-quality image data. This combination presents two
challenges: (1) conflicting frame learning objectives, as video distillation
learns from low-quality video frames while the image discriminator targets
high-quality images; and (2) training-inference discrepancies due to the
differing quality of video samples used during training and inference. To
address these challenges, we introduce disentangled motion distillation and
mixed trajectory distillation. The former applies the distillation objective
solely to the motion representation, while the latter mitigates
training-inference discrepancies by mixing distillation trajectories from both
the low- and high-quality video domains. Extensive experiments show that our
MCM achieves the state-of-the-art video diffusion distillation performance.
Additionally, our method can enhance frame quality in video diffusion models,
producing frames with high aesthetic scores or specific styles without
corresponding video data. |
Proposes MCM, a single-stage video diffusion distillation method that accelerates sampling and leverages an optional high-quality image dataset to enhance generated video frame quality. |
Existing video diffusion models often suffer from unsatisfactory frame quality due to limitations in publicly available video datasets. This hinders both teacher and student model performance. |
Combines a video Latent Consistency Model (LCM) for motion distillation with an image discriminator for appearance enhancement, addressing conflicting learning objectives and training-inference discrepancies through disentangled motion distillation and mixed trajectory distillation. |
Significantly improves video diffusion distillation performance compared to previous state-of-the-art methods.
Demonstrates superior adaptability to different image dataset distributions, resulting in higher fidelity and aesthetically pleasing video frames.
Effectively mitigates training-inference discrepancies through simulating inference-time ODE trajectories and mixing them with real video data during training. |
Model performance is sensitive to the quality, diversity, and distribution of training data.
Potential for misuse in creating deepfakes necessitates responsible deployment strategies. |
video diffusion, diffusion distillation, frame quality enhancement, text-to-video generation, motion consistency |
2406.06820
Report |
Adapters Strike Back |
Jan-Martin O. Steitz, Stefan Roth |
Adapters provide an efficient and lightweight mechanism for adapting trained
transformer models to a variety of different tasks. However, they have often
been found to be outperformed by other adaptation mechanisms, including
low-rank adaptation. In this paper, we provide an in-depth study of adapters,
their internal structure, as well as various implementation choices. We uncover
pitfalls for using adapters and suggest a concrete, improved adapter
architecture, called Adapter+, that not only outperforms previous adapter
implementations but surpasses a number of other, more complex adaptation
mechanisms in several challenging settings. Despite this, our suggested adapter
is highly robust and, unlike previous work, requires little to no manual
intervention when addressing a novel scenario. Adapter+ reaches
state-of-the-art average accuracy on the VTAB benchmark, even without a
per-task hyperparameter optimization. |
This paper presents Adapter+, an improved adapter configuration for adapting vision transformers (ViTs) to downstream tasks, showing that adapters can outperform other parameter-efficient fine-tuning methods. |
Fine-tuning large ViTs on multiple downstream tasks requires significant storage and risks overfitting on small datasets. Parameter-efficient tuning methods address these issues, and Adapter+ is proposed as an effective one. |
The study investigates the impact of adapter position, inner structure (normalization, scaling, initialization), and pre-processing on ViT adaptation using VTAB and FGVC benchmarks. |
The Post-Adapter position, with channel-wise scaling and Houlsby initialization, proves to be the most effective adapter configuration.
Adapter+ achieves state-of-the-art average accuracy on VTAB (77.6%) without per-task hyperparameter tuning and on FGVC (90.7%).
Adapter+ demonstrates a superior parameter-accuracy trade-off and robustness to domain shifts compared to LoRA, VPT, SSF, FacT, and other methods. |
The study primarily focuses on a ViT-B/16 architecture.
Future work could explore Adapter+'s performance on larger ViT models and with different pre-training strategies. |
vision transformer, transfer learning, parameter-efficient fine-tuning, adapter, vtab |
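The configuration summarized in this record (a bottleneck adapter with channel-wise scaling and a near-zero initialization) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration and not the authors' implementation: the class name `Adapter`, the GELU nonlinearity, the omission of biases, the clipped Gaussian with standard deviation 0.01 standing in for Houlsby initialization, and the residual addition standing in for the Post-Adapter placement.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gelu(v):
    """Exact GELU, computed from the Gaussian CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def near_zero_init(rows, cols, rng, std=1e-2):
    """Zero-mean Gaussian weights clipped to two standard deviations.

    Assumption: this is a simple stand-in for a near-identity initialization.
    """
    return [[max(-2 * std, min(2 * std, rng.gauss(0.0, std))) for _ in range(cols)]
            for _ in range(rows)]


class Adapter:
    """Bottleneck adapter: down-project, GELU, up-project, scale, add residual."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.W_down = near_zero_init(bottleneck, dim, rng)
        self.W_up = near_zero_init(dim, bottleneck, rng)
        # Channel-wise scaling: one (learnable) factor per feature channel.
        self.scale = [1.0] * dim

    def __call__(self, x):
        hidden = [gelu(v) for v in matvec(self.W_down, x)]
        delta = matvec(self.W_up, hidden)
        # Residual connection keeps the adapter close to identity at init.
        return [xi + s * d for xi, s, d in zip(x, self.scale, delta)]


adapter = Adapter(dim=8, bottleneck=2)
features = [0.5, -1.0, 0.25, 0.0, 1.0, -0.5, 0.75, -0.25]
adapted = adapter(features)
```

Because the weights start near zero, `adapted` stays very close to `features`, so inserting the adapter does not disturb the pretrained model before training begins.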
2406.06730
Report |
TRINS: Towards Multimodal Language Models that Can Read |
Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun |
Large multimodal language models have shown remarkable proficiency in
understanding and editing images. However, a majority of these visually-tuned
models struggle to comprehend the textual content embedded in images, primarily
due to the limitation of training data. In this work, we introduce TRINS: a
Text-Rich image INStruction dataset, with the objective of enhancing the
reading ability of the multimodal large language model. TRINS is built upon
LAION using hybrid data annotation strategies that include machine-assisted and
human-assisted annotation processes. It contains 39,153 text-rich images,
captions, and 102,437 questions. Specifically, we show that the number of words
per annotation in TRINS is significantly larger than that of related datasets,
providing new challenges. Furthermore, we introduce a simple and effective
architecture, called a Language-vision Reading Assistant (LaRA), which is good
at understanding textual content within images. LaRA outperforms existing
state-of-the-art multimodal large language models on the TRINS dataset, as well
as other classical benchmarks. Lastly, we conducted a comprehensive evaluation
with TRINS on various text-rich image understanding and generation tasks,
demonstrating its effectiveness. |
This paper introduces TRINS, a text-rich image instruction dataset, to improve multimodal language models' ability to understand and reason about text within images. |
Existing visually-tuned models struggle to comprehend text in images due to limitations in training data, hindering their ability to understand documents, posters, etc., and limiting human-agent collaboration. |
TRINS is built using a semi-automatic approach, leveraging CLIP and GPT-4 for annotation, resulting in 39k+ text-rich images with captions and 100k+ question-answer pairs. The authors also introduce LaRA, a language-vision reading assistant model. |
TRINS annotations are significantly more detailed than existing datasets, leading to improved performance in text-rich image understanding tasks.
LaRA, fine-tuned on TRINS, outperforms state-of-the-art models on text-rich image understanding, demonstrating the dataset's effectiveness.
Fine-tuning on TRINS does not degrade performance on general visual tasks, suggesting a broader benefit to multimodal understanding. |
The ability to extract text from images, while improved by OCR integration, remains a limitation for LaRA.
Generating images with extensive text remains challenging for existing text-to-image models, necessitating further research in text rendering. |
multimodal learning, computer vision, natural language processing, dataset, text recognition |
2406.06527
Report |
IllumiNeRF: 3D Relighting without Inverse Rendering |
Xiaoming Zhao, Pratul P. Srinivasan, Dor Verbin, Keunhong Park, Ricardo Martin Brualla, Philipp Henzler |
Existing methods for relightable view synthesis -- using a set of images of
an object under unknown lighting to recover a 3D representation that can be
rendered from novel viewpoints under a target illumination -- are based on
inverse rendering, and attempt to disentangle the object geometry, materials,
and lighting that explain the input images. Furthermore, this typically
involves optimization through differentiable Monte Carlo rendering, which is
brittle and computationally expensive. In this work, we propose a simpler
approach: we first relight each input image using an image diffusion model
conditioned on lighting and then reconstruct a Neural Radiance Field (NeRF)
with these relit images, from which we render novel views under the target
lighting. We demonstrate that this strategy is surprisingly competitive and
achieves state-of-the-art results on multiple relighting benchmarks. Please see
our project page at https://illuminerf.github.io/. |
This paper introduces a novel method for relightable 3D reconstruction that leverages a 2D Relighting Diffusion Model (RDM) and a latent NeRF model, departing from conventional inverse rendering techniques. |
Existing inverse rendering based methods for relightable 3D reconstruction are computationally expensive, brittle, and often produce implausible results under novel illumination. |
The proposed method first generates a set of plausible relit images from different viewpoints using an RDM conditioned on target lighting. These images are then used to train a latent NeRF model that learns a consistent 3D representation for novel view synthesis under the target lighting. |
Outperforms state-of-the-art inverse rendering methods on the synthetic TensoIR benchmark.
Achieves competitive results on the real-world Stanford-ORB benchmark.
Demonstrates the effectiveness of using a latent NeRF model to reconcile multiple plausible relighting solutions from the RDM. |
Relies on high-quality geometry estimated from input views, which can affect the accuracy of relighting, especially for specular reflections.
Not suitable for real-time relighting due to the need for generating new samples with the RDM and optimizing a NeRF for each target lighting condition. |
relightable view synthesis, diffusion models, neural radiance fields, inverse rendering, 3d reconstruction |
2406.06523
Report |
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing |
Ting-Hsuan Chen, Jiewen Chan, Hau-Shiang Shiu, Shih-Han Yen, Chang-Han Yeh, Yu-Lun Liu |
We propose a video editing framework, NaRCan, which integrates a hybrid
deformation field and diffusion prior to generate high-quality natural
canonical images to represent the input video. Our approach utilizes homography
to model global motion and employs multi-layer perceptrons (MLPs) to capture
local residual deformations, enhancing the model's ability to handle complex
video dynamics. By introducing a diffusion prior from the early stages of
training, our model ensures that the generated images retain a high-quality
natural appearance, making the produced canonical images suitable for various
downstream tasks in video editing, a capability not achieved by current
canonical-based methods. Furthermore, we incorporate low-rank adaptation (LoRA)
fine-tuning and introduce a noise and diffusion prior update scheduling
technique that accelerates the training process by 14 times. Extensive
experimental results show that our method outperforms existing approaches in
various video editing tasks and produces coherent and high-quality edited video
sequences. See our project page for video results at
https://koi953215.github.io/NaRCan_page/. |
NaRCan: a novel video editing framework that generates high-quality natural canonical images by integrating a hybrid deformation field and diffusion prior. |
Existing canonical-based video editing methods often produce unnatural or distorted canonical images, hindering their application in downstream tasks like text-guided editing. This work addresses this limitation by ensuring the generation of high-quality, natural canonical images. |
The method uses a hybrid deformation field combining homography and residual MLP to model video dynamics. It incorporates a diffusion prior from a LoRA fine-tuned diffusion model to enhance the naturalness of the generated canonical image. A noise and diffusion prior update scheduling technique accelerates the training process. |
NaRCan outperforms existing methods in generating natural canonical images, especially in scenes with complex motion.
The method demonstrates superior performance in text-guided video-to-video translation, achieving better prompt alignment, synthesis quality, and temporal consistency.
NaRCan effectively handles downstream tasks such as adding handwritten characters and dynamic video segmentation, benefiting from the high quality of its generated canonical images. |
LoRA fine-tuning for adapting the diffusion model to specific scenes is time-consuming.
In scenarios with extreme video scene changes, the diffusion prior may not always guarantee a high-quality natural canonical image. |
video editing, canonical image, diffusion model, lora, hybrid deformation field |
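The hybrid deformation field described in this record (a homography for global motion plus an MLP for local residual deformation) can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the function names, the single tanh hidden layer, the `(x, y, t)` MLP input, and the choice to feed the homography-warped coordinates into the MLP are all hypothetical and are not taken from the paper.

```python
import math


def apply_homography(H, x, y):
    """Map (x, y) through a 3x3 homography, including the homogeneous divide."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w


def residual_mlp(params, x, y, t):
    """One-hidden-layer tanh MLP that returns a local (dx, dy) offset."""
    W1, b1, W2, b2 = params
    inp = (x, y, t)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inp)) + b)
              for row, b in zip(W1, b1)]
    dx, dy = (sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2))
    return dx, dy


def warp_to_canonical(H, params, x, y, t):
    """Global homography first, then add the learned residual deformation."""
    gx, gy = apply_homography(H, x, y)
    dx, dy = residual_mlp(params, gx, gy, t)
    return gx + dx, gy + dy


# Toy parameters: a pure translation and an MLP whose weights are all zero.
TRANSLATE = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
ZERO_MLP = ([[0.0] * 3 for _ in range(2)], [0.0, 0.0],
            [[0.0] * 2 for _ in range(2)], [0.0, 0.0])

canonical_xy = warp_to_canonical(TRANSLATE, ZERO_MLP, 3.0, 4.0, 0.5)
```

With a zero residual the mapping reduces to the homography alone, which is why the homography can absorb global motion while the MLP only needs to model small local corrections.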
2406.06465
Report |
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction |
Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang |
Text-guided video prediction (TVP) involves predicting the motion of future
frames from the initial frame according to an instruction, which has wide
applications in virtual reality, robotics, and content creation. Previous TVP
methods make significant breakthroughs by adapting Stable Diffusion for this
task. However, they struggle with frame consistency and temporal stability
primarily due to the limited scale of video datasets. We observe that
pretrained Image2Video diffusion models possess good priors for video dynamics
but they lack textual control. Hence, transferring Image2Video models to
leverage their video dynamic priors while injecting instruction control to
generate controllable videos is both a meaningful and challenging task. To
achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to
predict future video states based on initial frames and text instructions. More
specifically, we design a dual query transformer (DQFormer) architecture, which
integrates the instructions and frames into the conditional embeddings for
future frame prediction. Additionally, we develop Long-Short Term Temporal
Adapters and Spatial Adapters that can quickly transfer general video diffusion
models to specific scenarios with minimal training costs. Experimental results
show that our method significantly outperforms state-of-the-art techniques on
four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and
UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and
SSv2 respectively, demonstrating its effectiveness in various domains. More
examples can be found at our website https://chenhsing.github.io/AID. |
This paper proposes AID, a novel approach that adapts a pretrained Image2Video diffusion model for text-guided video prediction by incorporating a Multi-Modal Large Language Model (MLLM) and a Dual Query Transformer (DQFormer) to effectively integrate textual and visual conditions. |
Existing text-guided video prediction models often struggle with frame consistency and temporal stability due to limitations in video dataset size. Leveraging pretrained Image2Video models with inherent video dynamic priors offers a promising solution. |
The study utilizes a pretrained SVD model as the foundation and introduces MLLM to predict video states from text instructions and initial frames. A DQFormer architecture is designed to integrate these multimodal conditions. Additionally, spatial and temporal adapters are employed for efficient model transfer to specific video prediction tasks. |
AID significantly outperforms state-of-the-art methods in text-guided video prediction across various datasets, including Something Something V2, Bridge Data, and Epic Kitchen-100.
The approach demonstrates superior performance in capturing video dynamics and adhering to textual instructions, leading to more coherent and contextually accurate video predictions.
Ablation studies confirm the effectiveness of individual components such as DQFormer, MLLM-aided prompting, and the use of spatial and temporal adapters. |
The current study primarily focuses on short-term video prediction; exploring longer-term prediction is an area for future work.
While the method effectively transfers to specific domains, investigating its generalization capability to entirely new and unseen scenarios is crucial. |
text-guided video prediction, video diffusion models, multimodal large language models, dqformer, video generation |
2406.06424
Report |
Margin-aware Preference Optimization for Aligning Diffusion Models without Reference |
Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong |
Modern alignment techniques based on human preferences, such as RLHF and DPO,
typically employ divergence regularization relative to the reference model to
ensure training stability. However, this often limits the flexibility of models
during alignment, especially when there is a clear distributional discrepancy
between the preference data and the reference model. In this paper, we focus on
the alignment of recent text-to-image diffusion models, such as Stable
Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a
significant problem in aligning these models due to the unstructured nature of
visual modalities: e.g., a preference for a particular stylistic aspect can
easily induce such a discrepancy. Motivated by this observation, we propose a
novel and memory-friendly preference alignment method for diffusion models that
does not depend on any reference model, coined margin-aware preference
optimization (MaPO). MaPO jointly maximizes the likelihood margin between the
preferred and dispreferred image sets and the likelihood of the preferred sets,
simultaneously learning general stylistic features and preferences. For
evaluation, we introduce two new pairwise preference datasets, which comprise
self-generated image pairs from SDXL, Pick-Style and Pick-Safety, simulating
diverse scenarios of reference mismatch. Our experiments validate that MaPO can
significantly improve alignment on Pick-Style and Pick-Safety and general
preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and
other existing methods. Our code, models, and datasets are publicly available
via https://mapo-t2i.github.io |
This paper proposes MaPO, a novel and memory-friendly preference alignment method for diffusion models, which eliminates the dependence on a reference model and addresses the issue of reference mismatch in existing alignment techniques. |
Reference mismatch, a distributional discrepancy between preference data and the reference model, limits the flexibility of current alignment methods for text-to-image diffusion models, especially in aligning stylistic features. |
MaPO jointly maximizes the likelihood margin between preferred and dispreferred image sets while maximizing the likelihood of preferred sets, effectively learning stylistic features and preferences simultaneously without relying on a reference model. The authors introduce two new pairwise preference datasets, Pick-Style and Pick-Safety, to evaluate alignment under different reference mismatch scenarios. |
MaPO effectively adapts the text-to-image diffusion model to desired styles and aligns it with human preferences, outperforming reference-model-based methods on Pick-Style and Pick-Safety datasets.
MaPO demonstrates superior performance in general preference alignment on Pick-a-Pic v2, surpassing 21 out of 25 state-of-the-art models in the Imgsys public benchmark.
MaPO exhibits computational efficiency, consuming 14.5% less training time compared to Diffusion-DPO, and enables larger batch sizes due to lower memory usage. |
The method and datasets might inherit biases present in the original SDXL checkpoint used for fine-tuning and curation.
While MaPO demonstrates effectiveness in mitigating unsafe content, it doesn't guarantee perfect screening, and user discretion is advised. Further investigation is needed to explore scenarios with different levels of reference mismatch. |
text-to-image generation, diffusion models, preference optimization, alignment, reference mismatch |
2406.06382
Report |
Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization |
Yi Gu, Zhendong Wang, Yueqin Yin, Yujia Xie, Mingyuan Zhou |
Aligning large language models with human preferences has emerged as a
critical focus in language modeling research. Yet, integrating preference
learning into Text-to-Image (T2I) generative models is still relatively
uncharted territory. The Diffusion-DPO technique made initial strides by
employing pairwise preference learning in diffusion models tailored for
specific text prompts. We introduce Diffusion-RPO, a new method designed to
align diffusion-based T2I models with human preferences more effectively. This
approach leverages both prompt-image pairs with identical prompts and those
with semantically related content across various modalities. Furthermore, we
have developed a new evaluation metric, style alignment, aimed at overcoming
the challenges of high costs, low reproducibility, and limited interpretability
prevalent in current evaluations of human preference alignment. Our findings
demonstrate that Diffusion-RPO outperforms established methods such as
Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions
1.5 and XL-1.0, achieving superior results in both automated evaluations of
human preferences and style alignment. Our code is available at
https://github.com/yigu1008/Diffusion-RPO |
This paper presents Diffusion-RPO, a novel approach for aligning Text-to-Image (T2I) models with human preferences by leveraging semantically related prompt-image pairs through contrastive weighting during the diffusion model sampling process. |
Aligning T2I models with human preferences is crucial for generating images that better meet user expectations and artistic intentions. |
Diffusion-RPO leverages both identical and semantically related prompt-image pairs to optimize the diffusion model's sampling steps. It employs contrastive weighting based on the similarity of prompts and images, measured using CLIP embeddings. |
Diffusion-RPO outperforms existing preference learning baselines (Diffusion-DPO, SFT) in aligning Stable Diffusion 1.5 and SDXL models with human preferences, as evidenced by higher scores on established reward models (HPS, Pick Score).
The paper introduces Style Alignment, a new evaluation task for image preference learning, and demonstrates that Diffusion-RPO excels in this task by effectively fine-tuning models to generate images consistent with specific artistic styles (Van Gogh, Sketch, Winter).
Ablation studies reveal the importance of the distance temperature parameter in balancing the focus on identical versus semantically related prompt-image pairs during optimization. |
The dataset used for training, while extensive, may not fully encapsulate the diverse spectrum of human preferences across all cultures and communities, potentially limiting the model's generalizability.
Future research could explore methods for collecting preference datasets that better represent a wider range of cultural backgrounds and artistic styles, leading to more inclusive and universally appealing T2I models. |
text-to-image synthesis, diffusion models, preference learning, style alignment, human-computer interaction |
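The contrastive weighting mentioned in this record (weights derived from embedding similarity and controlled by a distance temperature) can be illustrated with a small sketch. The exact weighting used by Diffusion-RPO is not given in this summary, so the softmax over negative cosine distance below is an assumption, as are the function names; the toy 2-D vectors stand in for CLIP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_weights(anchor, others, temperature):
    """Softmax over negative cosine distance, scaled by a distance temperature.

    Assumed form: closer embeddings receive larger weights, and a smaller
    temperature concentrates the weight on the closest pair.
    """
    logits = [-(1.0 - cosine_similarity(anchor, e)) / temperature for e in others]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


anchor = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # identical, related, unrelated
soft = contrastive_weights(anchor, candidates, temperature=1.0)
sharp = contrastive_weights(anchor, candidates, temperature=0.1)
```

Lowering the temperature shifts weight toward the identical pair, while raising it spreads weight over semantically related pairs, which matches the balancing role the ablation attributes to the distance temperature.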
2406.06367
Report |
MVGamba: Unify 3D Content Generation as State Space Sequence Modeling |
Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang |
Recent 3D large reconstruction models (LRMs) can generate high-quality 3D
content in sub-seconds by integrating multi-view diffusion models with scalable
multi-view reconstructors. Current works further leverage 3D Gaussian Splatting
as 3D representation for improved visual quality and rendering efficiency.
However, we observe that existing Gaussian reconstruction models often suffer
from multi-view inconsistency and blurred textures. We attribute this to the
compromise of multi-view information propagation in favor of adopting powerful
yet computationally intensive architectures (e.g., Transformers). To address
this issue, we introduce MVGamba, a general and lightweight Gaussian
reconstruction model featuring a multi-view Gaussian reconstructor based on the
RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal
context containing multi-view information for cross-view self-refinement while
generating a long sequence of Gaussians for fine-detail modeling with linear
complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba
unifies 3D generation tasks from a single image, sparse images, or text
prompts. Extensive experiments demonstrate that MVGamba outperforms
state-of-the-art baselines in all 3D content generation scenarios with
approximately only $0.1\times$ of the model size. |
MVGamba is a unified 3D generation framework that leverages a novel multi-view Gaussian reconstructor based on RNN-like State Space Models (SSM) to achieve high-quality 3D content generation with low computational cost. |
Existing Gaussian reconstruction models for 3D generation often compromise multi-view information propagation for computational efficiency, leading to multi-view inconsistency and blurred textures. MVGamba addresses this by efficiently integrating multi-view information while allowing for the generation of long sequences of Gaussians for detailed modeling. |
MVGamba uses a two-stage pipeline: 1) Off-the-shelf multi-view diffusion models generate multi-view images from a single image or text prompt. 2) An SSM-based multi-view reconstructor processes these images causally, expanding them into long sequences of Gaussian tokens and refining them across views. A lightweight Gaussian decoder then predicts the final Gaussian parameters for 3D content representation. |
MVGamba outperforms state-of-the-art baselines in image-to-3D, text-to-3D, and sparse-view reconstruction tasks.
MVGamba demonstrates robustness to inconsistencies in multi-view input, effectively handling noisy or inconsistent images generated by diffusion models.
The performance of MVGamba improves with increasing Gaussian sequence length, highlighting the benefit of its ability to model long sequences efficiently. |
The model's performance depends on the quality of input views generated by multi-view diffusion models, which still exhibit limitations.
Incorrect depth estimation in the front view can sometimes lead to generation failures, requiring manual input order adjustment as a current workaround. |
3d generation, gaussian splatting, state space models, multi-view reconstruction, diffusion models |
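The linear-complexity, causal propagation attributed to the RNN-like State Space Model in this record can be illustrated with a scalar linear recurrence. This is a generic sketch of a discrete SSM scan, not MVGamba's reconstructor: the function name `ssm_scan` and the fixed scalar parameters `a`, `b`, `c` are assumptions, and selective SSMs such as Mamba make these parameters input-dependent.

```python
def ssm_scan(xs, a, b, c, h0=0.0):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    One pass over the tokens, so the cost grows linearly with sequence length.
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state carries context from all earlier tokens
        ys.append(c * h)
    return ys


# Tokens from two views concatenated into one causal sequence (toy values).
view_1 = [1.0, 0.0]
view_2 = [0.0, 0.0]
outputs = ssm_scan(view_1 + view_2, a=0.5, b=1.0, c=1.0)
```

Although every token of `view_2` is zero, its outputs are non-zero because the hidden state still holds information from `view_1`, which is the kind of cross-view context propagation the record describes.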
2406.06258
Report |
Tuning-Free Visual Customization via View Iterative Self-Attention Control |
Xiaojie Li, Chenghao Gu, Shuzhao Xie, Yunpeng Bai, Weixiang Zhang, Zhi Wang |
Fine-tuning diffusion models enables a wide range of personalized generation
and editing applications on diverse visual modalities. While Low-Rank
Adaptation (LoRA) accelerates the fine-tuning process, it still requires
multiple reference images and time-consuming training, which constrains its
scalability for large-scale and real-time applications. In this paper, we
propose View Iterative Self-Attention Control (VisCtrl) to tackle this
challenge. Specifically, VisCtrl is a training-free method that injects the
appearance and structure of a user-specified subject into another subject in
the target image, unlike previous approaches that require fine-tuning the
model. Initially, we obtain the initial noise for both the reference and target
images through DDIM inversion. Then, during the denoising phase, features from
the reference image are injected into the target image via the self-attention
mechanism. Notably, by iteratively performing this feature injection process,
we ensure that the reference image features are gradually integrated into the
target image. This approach results in consistent and harmonious editing with
only one reference image in a few denoising steps. Moreover, benefiting from
our plug-and-play architecture design and the proposed Feature Gradual Sampling
strategy for multi-view editing, our method can be easily extended to edit in
complex visual domains. Extensive experiments show the efficacy of VisCtrl
across a spectrum of tasks, including personalized editing of images, videos,
and 3D scenes. |
This paper proposes View Iterative Self-Attention Control (VisCtrl), a training-free method for personalized visual editing using diffusion models. |
This method allows rapid personalized editing with only one reference image, overcoming limitations of existing model-based and attention-based methods that require extensive training or struggle with complex editing scenarios. |
VisCtrl uses DDIM inversion to obtain initial noise for both reference and target images. During denoising, it iteratively injects features from the reference image into the target image using self-attention, while preserving the target's structure using cross-attention. A Feature Gradual Sampling strategy is introduced for multi-view editing, enabling consistent feature injection across multiple frames or views. |
VisCtrl effectively personalizes images, videos, and 3D scenes with a single reference image.
The method outperforms existing baselines in terms of subject fidelity, background preservation, and structural consistency, as demonstrated by quantitative metrics (CLIP-I, LPIPS, SSIM) and qualitative comparisons.
Ablation studies confirm the benefits of Feature Gradual Sampling for multi-view editing and demonstrate control over the degree of subject personalization. |
The method's performance depends on the accuracy of the segmentation masks used to isolate objects for editing.
Potential biases in the pre-trained diffusion model may influence the generated results, although VisCtrl is designed to mitigate bias introduction. |
diffusion models, personalized visual editing, self-attention, training-free, multi-view editing |
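One common way to inject reference features through self-attention is to keep the target's queries and swap in the reference's keys and values. The sketch below shows that mechanism in pure Python; whether VisCtrl uses exactly this substitution is an assumption, and the function names and toy tensors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def inject_reference(target_queries, reference_keys, reference_values):
    """Target queries attend to reference keys/values (assumed injection rule)."""
    return attention(target_queries, reference_keys, reference_values)


target_q = [[1.0, 0.0], [0.0, 1.0]]
ref_k = [[1.0, 0.0], [0.0, 1.0]]
ref_v = [[0.0, 0.0], [1.0, 1.0]]
injected = inject_reference(target_q, ref_k, ref_v)
```

Each output row is a convex combination of the reference values, so the target's spatial layout (its queries) decides where reference appearance lands. Repeating the step across denoising iterations would correspond to the gradual integration the record describes.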
2406.06216
Report |
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis |
Xin Jin, Pengyi Jiao, Zheng-Peng Duan, Xingchao Yang, Chun-Le Guo, Bo Ren, Chongyi Li |
Volumetric rendering based methods, like NeRF, excel in HDR view synthesis
from RAW images, especially for nighttime scenes. However, they suffer from long
training times and cannot perform real-time rendering due to dense sampling
requirements. The advent of 3D Gaussian Splatting (3DGS) enables real-time
rendering and faster training. However, implementing RAW image-based view
synthesis directly using 3DGS is challenging due to its inherent drawbacks: 1)
in nighttime scenes, extremely low SNR leads to poor structure-from-motion
(SfM) estimation in distant views; 2) the limited representation capacity of
spherical harmonics (SH) function is unsuitable for RAW linear color space; and
3) inaccurate scene structure hampers downstream tasks such as refocusing. To
address these issues, we propose LE3D (Lighting Every darkness with 3DGS). Our
method proposes Cone Scatter Initialization to enrich the estimation of SfM,
and replaces SH with a Color MLP to represent the RAW linear color space.
Additionally, we introduce depth distortion and near-far regularizations to
improve the accuracy of scene structure for downstream tasks. These designs
enable LE3D to perform real-time novel view synthesis, HDR rendering,
refocusing, and tone-mapping changes. Compared to previous volumetric rendering
based methods, LE3D reduces training time to 1% and improves rendering speed by
up to 4,000 times for 2K resolution images in terms of FPS. Code and viewer can
be found in https://github.com/Srameo/LE3D . |
LE3D: a novel method for HDR 3D scene reconstruction from noisy RAW images enabling real-time rendering and editing |
Existing HDR scene reconstruction methods, while effective, suffer from long training times and inability to render in real-time, limiting their practical applications. |
LE3D leverages 3D Gaussian Splatting (3DGS) and introduces: (1) Cone Scatter Initialization to improve SfM in low-light, (2) Color MLP to represent RAW linear color space, and (3) Depth distortion and near-far regularizations for better scene structure. |
Achieves comparable visual quality to state-of-the-art volumetric rendering methods like RawNeRF.
Reduces training time to 1% of RawNeRF.
Enables real-time rendering at speeds up to 4,000 times faster than RawNeRF for 2K resolution. |
Quantitative metrics on sRGB are slightly lower than RawNeRF, potentially due to sparser scene representation.
Future work includes exploring alternative regularization techniques for further improving structural accuracy. |
hdr view synthesis, 3d gaussian splatting, real-time rendering, raw image processing, computational photography |
2406.05871
Report |
OmniControlNet: Dual-stage Integration for Conditional Image Generation |
Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu |
We provide a two-way integration for the widely adopted ControlNet by
integrating external condition generation algorithms into a single dense
prediction method and incorporating its individually trained image generation
processes into a single model. Despite its tremendous success, ControlNet's
two-stage pipeline has two limitations: it is not self-contained (e.g., it
calls external condition generation algorithms), and it carries large model
redundancy (separately trained models for different types of conditioning
inputs). Our proposed OmniControlNet consolidates 1) the condition generation
(e.g., HED edges, depth maps, user scribble, and animal pose) by a single
multi-tasking dense prediction algorithm under the task embedding guidance and
2) the image generation process for different conditioning types under the
textual embedding guidance. OmniControlNet achieves significantly reduced model
complexity and redundancy while remaining capable of producing images of comparable
quality for conditioned text-to-image generation. |
This paper introduces OmniControlNet, which integrates external condition generation algorithms into a single method and incorporates individually trained image generation processes into a single model. |
The standard ControlNet model suffers from large model redundancy, requiring separate models for different conditioning input types. This paper addresses this by creating a single, integrated model. |
OmniControlNet uses a multi-task dense image prediction model for generating various image conditions (e.g., edges, depth maps). It then integrates these into a single text-to-image generation model guided by textual inversion. |
OmniControlNet significantly reduces model complexity and redundancy compared to existing approaches.
The model produces images of comparable quality to ControlNet for conditioned text-to-image generation.
The multi-task dense image prediction component achieves competitive performance on benchmark datasets for depth and edge detection. |
Adding a new task condition requires training a new embedding for that task.
The integrated stage 1 model increases training complexity and slightly reduces image generation quality compared to using separate expert models. |
text-to-image generation, controlnet, dense image prediction, textual inversion, model integration |
2406.05835
Report |
Mamba YOLO: SSMs-Based YOLO For Object Detection |
Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu |
Propelled by the rapid advancement of deep learning technologies, the YOLO
series has set a new benchmark for real-time object detectors. Researchers have
continuously explored innovative applications of reparameterization, efficient
layer aggregation networks, and anchor-free techniques on the foundation of
YOLO. To further enhance detection performance, Transformer-based structures
have been introduced, significantly expanding the model's receptive field and
achieving notable performance gains. However, such improvements come at a cost,
as the quadratic complexity of the self-attention mechanism increases the
computational burden of the model. Fortunately, the emergence of State Space
Models (SSM) as an innovative technology has effectively mitigated the issues
caused by quadratic complexity. In light of these advancements, we introduce
Mamba-YOLO, a novel object detection model based on SSM. Mamba-YOLO not only
optimizes the SSM foundation but also adapts specifically for object detection
tasks. Given the potential limitations of SSM in sequence modeling, such as
insufficient receptive field and weak image locality, we have designed the
LSBlock and RGBlock. These modules enable more precise capture of local image
dependencies and significantly enhance the robustness of the model. Extensive
experimental results on the publicly available benchmark datasets COCO and VOC
demonstrate that Mamba-YOLO surpasses the existing YOLO series models in both
performance and competitiveness, showcasing its substantial potential and
competitive edge. The PyTorch code is available at
https://github.com/HZAI-ZJNU/Mamba-YOLO |
This paper presents Mamba-YOLO, a novel object detection model based on State Space Models (SSMs) that sets a new performance baseline for YOLO-based detectors while maintaining real-time performance. |
It aims to address limitations of existing CNN- and Transformer-based detectors by leveraging the strengths of SSMs for capturing global dependencies while effectively extracting local features. |
It introduces ODSSBlock, a core module integrating SSMs with the novel LocalSpatial Block (LSBlock) and ResGated Block (RGBlock) to enhance local feature extraction and model robustness. It also leverages VisionClue Merge to preserve visual information for SSM processing. |
Mamba-YOLO significantly outperforms existing YOLO series models in terms of accuracy and efficiency on COCO and VOC datasets.
Mamba-YOLO-T achieves a 3.4% higher AP than the best performing tiny lightweight models while significantly reducing parameters and FLOPs.
Ablation studies demonstrate the effectiveness of individual components, including ODSSBlock, LSBlock, and RGBlock, in enhancing detection performance. |
The model's performance on dense object detection tasks requires further investigation.
Future work will explore the integration of advanced object detection heads and training strategies to further improve Mamba-YOLO's capabilities. |
object detection, state space models, yolo, real-time, computer vision |
2406.05821
Report |
F-LMM: Grounding Frozen Large Multimodal Models |
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy |
Endowing Large Multimodal Models (LMMs) with visual grounding capability can
significantly enhance AIs' understanding of the visual world and their
interaction with humans. However, existing methods typically fine-tune the
parameters of LMMs to learn additional segmentation tokens and overfit
grounding and segmentation datasets. Such a design would inevitably cause a
catastrophic diminution in the indispensable conversational capability of
general AI assistants. In this paper, we comprehensively evaluate
state-of-the-art grounding LMMs across a suite of multimodal question-answering
benchmarks, observing pronounced performance drops that indicate vanishing
general knowledge comprehension and weakened instruction following ability. To
address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in
human-AI conversations -- a straightforward yet effective design based on the
fact that word-pixel correspondences conducive to visual grounding inherently
exist in the attention weights of well-trained LMMs. Using only a few trainable
CNN layers, we can translate word-pixel attention weights to mask logits, which
a SAM-based mask refiner can further optimise. Our F-LMM neither learns special
segmentation tokens nor utilises high-quality grounded instruction-tuning data,
but achieves competitive performance on referring expression segmentation and
panoptic narrative grounding benchmarks while completely preserving LMMs'
original conversational ability. Additionally, with instruction-following
ability preserved and grounding ability obtained, our F-LMM can perform visual
chain-of-thought reasoning and better resist object hallucinations. |
This paper presents F-LMM, a novel method for grounding frozen large multimodal models (LMMs) in human-AI conversations, leveraging existing attention weights as segmentation priors to achieve competitive visual grounding without sacrificing the LMM's conversational abilities. |
Existing methods for grounding LMMs often lead to a decline in their general knowledge and instruction-following abilities, which are crucial for building effective general AI assistants. This paper aims to address this issue by proposing a method that preserves the LMM's original conversational capabilities while enabling visual grounding. |
F-LMM utilizes a mask head consisting of a CNN-based mask decoder and a SAM-based mask refiner. The mask decoder translates word-pixel attention weights from the frozen LMM into mask logits, and the mask refiner further optimizes these predictions using image and language cues. The model is trained on referring expression segmentation and panoptic narrative grounding datasets. |
F-LMM achieves competitive performance on both referring expression segmentation and phrase grounding benchmarks, indicating its effectiveness in visual grounding.
Unlike existing grounding LMMs, F-LMM maintains the original LMM's excellence on general question-answering benchmarks, demonstrating its preserved conversational ability.
F-LMM exhibits improved performance on visual chain-of-thought reasoning and resistance to object hallucinations, highlighting the potential of combining grounding and conversational abilities. |
The study is limited to LMMs with up to 8 billion parameters due to computational constraints.
The paper primarily focuses on vision-language interactions and does not explore other modalities such as video or audio. |
large multimodal models, visual grounding, instruction following, conversational ai, visual chain-of-thought reasoning |
2406.05814
Report |
Unified Text-to-Image Generation and Retrieval |
Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua |
How humans can efficiently and effectively acquire images is a
perennial question. A typical solution is text-to-image retrieval from an
existing database given the text query; however, the limited database typically
lacks creativity. By contrast, recent breakthroughs in text-to-image generation
have made it possible to produce fancy and diverse visual content, but it faces
challenges in synthesizing knowledge-intensive images. In this work, we rethink
the relationship between text-to-image generation and retrieval and propose a
unified framework in the context of Multimodal Large Language Models (MLLMs).
Specifically, we first explore the intrinsic discriminative abilities of MLLMs
and introduce a generative retrieval method to perform retrieval in a
training-free manner. Subsequently, we unify generation and retrieval in an
autoregressive manner and propose an autonomous decision module to
choose the best-matched one between generated and retrieved images as the
response to the text query. Additionally, we construct a benchmark called
TIGeR-Bench, including creative and knowledge-intensive domains, to standardize
the evaluation of unified text-to-image generation and retrieval. Extensive
experimental results on TIGeR-Bench and two retrieval benchmarks, i.e.,
Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our
proposed method. |
This paper proposes TIGeR, a unified framework for text-to-image generation and retrieval within Multimodal Large Language Models (MLLMs). |
This unified approach aims to address the limitations of individual text-to-image generation (struggles with knowledge-intensive concepts) and retrieval (limited to existing databases) methods. |
The framework leverages MLLMs' intrinsic discriminative abilities for semantic matching, employing generative retrieval with forward beam search and reverse re-ranking. An autonomous decision mechanism selects between generated and retrieved images based on user prompts. |
TIGeR outperforms expert generation and retrieval models, as well as existing MLLMs, on the TIGeR-Bench, a newly constructed benchmark for unified text-to-image generation and retrieval.
The proposed generative retrieval method achieves state-of-the-art results on Flickr30K and MS-COCO retrieval benchmarks, surpassing specially trained generative retrieval models.
The study demonstrates the effectiveness of visual modality debiasing and the impact of forward beam search and reverse re-ranking on retrieval performance. |
The decision-making module exhibits a generation preference, potentially due to discrepancies between pre-training data and the TIGeR-Bench.
Further investigation is needed to mitigate modality biases and explore the complex interplay between generation and retrieval within the TIGeR framework. |
text-to-image generation, text-to-image retrieval, multimodal large language models, generative retrieval, semantic matching |
2406.05785
Report |
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions |
Daizong Liu, Yang Liu, Wencan Huang, Wei Hu |
Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific
object that semantically corresponds to a language query from a complicated 3D
scene, has drawn increasing attention in the 3D research community over the
past few years. Compared to 2D visual grounding, this task presents great
potential and challenges due to its closer proximity to the real world and the
complexity of data collection and 3D point cloud source processing. In this
survey, we attempt to provide a comprehensive overview of the T-3DVG progress,
including its fundamental elements, recent research advances, and future
research directions. To the best of our knowledge, this is the first systematic
survey on the T-3DVG task. Specifically, we first provide a general structure
of the T-3DVG pipeline with detailed components in a tutorial style, presenting
a complete background overview. Then, we summarize the existing T-3DVG
approaches into different categories and analyze their strengths and
weaknesses. We also present the benchmark datasets and evaluation metrics to
assess their performances. Finally, we discuss the potential limitations of
existing T-3DVG and share some insights on several promising research
directions. The latest papers are continually collected at
https://github.com/liudaizong/Awesome-3D-Visual-Grounding. |
This paper presents the first comprehensive survey of text-guided 3D visual grounding (T-3DVG), covering fundamental elements, recent advances, and future directions. |
T-3DVG is crucial for multimedia intelligence research and real-world 3D applications like robotic navigation and human-computer interaction. It bridges the gap between language and 3D scenes, enabling retrieval of specific objects from complex point cloud data. |
The authors analyze existing T-3DVG methods, categorizing them based on their architectures (two-stage vs. one-stage) and learning paradigms (fully-supervised vs. weakly-supervised). They also discuss the use of additional modalities and large language models. |
Two-stage methods, while initially lagging, have seen performance improvements by incorporating text guidance to refine object locations.
One-stage methods demonstrate efficiency but face challenges in capturing fine-grained spatial relations.
Multi-modal approaches, leveraging 2D images or multi-view data, consistently outperform those relying solely on 3D point clouds. |
Current methods heavily rely on expensive annotations, hindering their scalability.
There's a need to develop more practical T-3DVG settings, moving beyond single object grounding to handle dense object retrieval and grounding within groups of related scenes. |
text-guided 3d visual grounding, cross-modal reasoning, multimodal learning, 3d scene understanding, object retrieval |
2406.05768
Report |
MLCM: Multistep Consistency Distillation of Latent Diffusion Model |
Qingsong Xie, Zhenyi Liao, Chen chen, Zhijie Deng, Shixiang Tang, Haonan Lu |
Distilling large latent diffusion models (LDMs) into ones that are fast to
sample from is attracting growing research interest. However, the majority of
existing methods face a dilemma where they either (i) depend on multiple
individual distilled models for different sampling budgets, or (ii) sacrifice
generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8)
sampling steps. To address these, we extend the recent multistep consistency
distillation (MCD) strategy to representative LDMs, establishing the Multistep
Latent Consistency Models (MLCMs) approach for low-cost high-quality image
synthesis. MLCM serves as a unified model for various sampling steps due to the
promise of MCD. We further augment MCD with a progressive training strategy to
strengthen inter-segment consistency to boost the quality of few-step
generations. We take the states from the sampling trajectories of the teacher
model as training data for MLCMs to lift the requirements for high-quality
training datasets and to bridge the gap between the training and inference of
the distilled model. MLCM is compatible with preference learning strategies for
further improvement of visual quality and aesthetic appeal. Empirically, MLCM
can generate high-quality, delightful images with only 2-8 sampling steps. On
the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL gets a CLIP Score of
33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 with only 4 steps,
substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and
8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in
applications including controllable generation, image style transfer, and
Chinese-to-image generation. |
The paper proposes Multistep Latent Consistency Models (MLCMs), a novel method for accelerating text-to-image latent diffusion models, enabling high-quality image generation in just 2-8 sampling steps. |
Large latent diffusion models (LDMs) often suffer from slow inference speeds. Existing distillation methods for acceleration either rely on multiple models or compromise quality, particularly with few sampling steps. This work addresses these limitations for faster, higher-quality image generation. |
The method extends multistep consistency distillation (MCD) to LDMs, dividing the denoising trajectory into segments and enforcing consistency within each. It introduces progressive training for inter-segment consistency, utilizes samples from the teacher model for image-free training, and incorporates reward learning for improved human preference alignment. |
MLCM achieves state-of-the-art results with a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 in just 4 steps, surpassing competing baselines.
The model exhibits consistent quality improvement with additional sampling steps.
MLCM demonstrates versatility across applications like controllable generation, image stylization, and Chinese-to-image generation. |
Single-step generation quality using MLCM still holds potential for further improvement.
The broader societal impact of accelerating image generation, including potential misuse for creating misleading or harmful content, requires careful consideration. |
image generation, latent diffusion models, model acceleration, consistency distillation, reward learning |
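The segment-wise idea in the MLCM summary above (dividing the denoising trajectory into segments and enforcing consistency within each) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform split and the function names `segment_boundaries` and `segment_target` are assumptions made for the example.

```python
def segment_boundaries(num_timesteps: int, num_segments: int) -> list[int]:
    """Split the timestep range [0, num_timesteps] into equal segments.

    Assumption: segments are uniform; the source does not specify the split.
    """
    if num_segments < 1 or num_timesteps < num_segments:
        raise ValueError("need 1 <= num_segments <= num_timesteps")
    return [i * num_timesteps // num_segments for i in range(num_segments + 1)]


def segment_target(t: int, boundaries: list[int]) -> int:
    """Return the lower boundary of the segment containing timestep t.

    Within a segment (lower, upper], every noisy state is trained to map
    to the same point on the trajectory: the state at the lower boundary.
    """
    for lower, upper in zip(boundaries, boundaries[1:]):
        if lower < t <= upper:
            return lower
    raise ValueError("timestep outside the trajectory")


boundaries = segment_boundaries(1000, 4)
print(boundaries)                       # segment edges for a 4-step sampler
print(segment_target(600, boundaries))  # timestep 600 maps to its segment start
```

With four segments, sampling then needs one network evaluation per segment, which is how a single model can serve different step budgets in this simplified picture.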
2406.05766
Report |
Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment |
Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng yu, Wanyu Chen, Stan Z. Li |
Multimodal fusion breaks through the barriers between diverse modalities and
has already yielded numerous impressive results. However, in various
specialized fields, it is difficult to obtain sufficient alignment data for
training, which seriously limits the use of previously elegant
models. Thus, semi-supervised learning attempts to achieve multimodal alignment
with fewer matched pairs but traditional methods like pseudo-labeling are
difficult to apply in domains with no label information. To address these
problems, we transform semi-supervised multimodal alignment into a manifold
matching problem and propose a new method based on CLIP, named Gentle-CLIP.
Specifically, we design a novel semantic density distribution loss to explore
implicit semantic alignment information from unpaired multimodal data by
constraining the latent representation distribution with fine granularity, thus
eliminating the need for numerous strictly matched pairs. Meanwhile, we
introduce multi-kernel maximum mean discrepancy as well as self-supervised
contrastive loss to pull separate modality distributions closer and enhance the
stability of the representation distribution. In addition, the contrastive loss
used in CLIP is employed on the supervised matched data to prevent negative
optimization. Extensive experiments conducted on a range of tasks in various
fields, including protein, remote sensing, and the general vision-language
field, demonstrate the effectiveness of our proposed Gentle-CLIP. |
The paper proposes Gentle-CLIP, a semi-supervised learning method for multimodal alignment based on CLIP, designed to address the challenge of limited alignment data in specialized fields by leveraging vast unmatched data. |
Many specialized fields struggle to obtain sufficient alignment data, limiting the effectiveness of traditional multimodal models like CLIP that rely solely on matched pairs for training. |
Gentle-CLIP transforms semi-supervised multimodal alignment into a manifold matching problem. It introduces a novel semantic density distribution (SDD) loss to capture implicit semantic alignment from unpaired data, along with multi-kernel maximum mean discrepancy (MK-MMD) and self-supervised contrastive loss to refine representation alignment and stability. |
Gentle-CLIP outperforms existing semi-supervised methods in protein representation tasks, achieving strong performance on fold classification, enzyme commission number prediction, and other benchmarks.
In remote sensing tasks, Gentle-CLIP consistently improves zero-shot classification and image-text retrieval results compared to baselines, highlighting its ability to learn from limited matched pairs.
Gentle-CLIP demonstrates promising results in general vision-language retrieval tasks, particularly with ViT as the image encoder on the Mini COCO dataset, indicating its broader applicability. |
The performance of Gentle-CLIP relies on the assumption that the semantic distributions of the unmatched data are sufficiently similar.
Further exploration of augmentation techniques that consider common semantics across modalities could potentially enhance Gentle-CLIP's performance. |
multimodal alignment, semi-supervised learning, contrastive learning, manifold matching, clip |
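The Gentle-CLIP summary above mentions multi-kernel maximum mean discrepancy (MK-MMD) as a way to pull two modality distributions closer. The sketch below shows a generic biased estimator of squared MK-MMD with Gaussian kernels. It is an assumption-laden illustration: the bandwidth values, the equal kernel weights, and the function names are hypothetical and not taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Equal-weight average of Gaussian kernels (assumed weighting)."""
    return sum(gaussian_kernel(x, y, bw) for bw in bandwidths) / len(bandwidths)


def mk_mmd_squared(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD between two sets of embeddings.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    def mean_kernel(a_set, b_set):
        total = sum(multi_kernel(a, b, bandwidths) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D embeddings from two modalities.
image_embeddings = [[0.0, 0.0], [0.1, 0.0]]
text_embeddings = [[3.0, 3.0], [3.1, 3.0]]
print(mk_mmd_squared(image_embeddings, text_embeddings))   # large: far apart
print(mk_mmd_squared(image_embeddings, image_embeddings))  # zero: identical sets
```

Used as a loss term, minimizing this quantity moves the two embedding distributions toward each other without requiring matched pairs.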
2406.05723
Report |
Binarized Diffusion Model for Image Super-Resolution |
Zheng Chen, Haotong Qin, Yong Guo, Xiongfei Su, Xin Yuan, Linghe Kong, Yulun Zhang |
Advanced diffusion models (DMs) perform impressively in image
super-resolution (SR), but the high memory and computational costs hinder their
deployment. Binarization, an ultra-compression algorithm, offers the potential
for effectively accelerating DMs. Nonetheless, due to the model structure and
the multi-step iterative attribute of DMs, existing binarization methods result
in significant performance degradation. In this paper, we introduce a novel
binarized diffusion model, BI-DiffSR, for image SR. First, for the model
structure, we design a UNet architecture optimized for binarization. We propose
the consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up)
to maintain dimension consistency and facilitate the full-precision information
transfer. Meanwhile, we design the channel-shuffle-fusion (CS-Fusion) to
enhance feature fusion in skip connections. Second, for the activation
difference across timesteps, we design the timestep-aware redistribution (TaR)
and activation function (TaA). The TaR and TaA dynamically adjust the
distribution of activations based on different timesteps, improving the
flexibility and representation ability of the binarized module. Comprehensive
experiments demonstrate that our BI-DiffSR outperforms existing binarization
methods. Code is available at https://github.com/zhengchen1999/BI-DiffSR. |
This paper proposes BI-DiffSR, a novel binarized diffusion model for efficient and accurate image super-resolution. |
Diffusion models excel in image super-resolution but their high computational and memory demands hinder deployment on resource-constrained devices. Binarization offers a solution, but directly applying existing methods to diffusion models leads to significant performance degradation. |
BI-DiffSR introduces a UNet architecture tailored for binarization, featuring Consistent-Pixel Down/Upsampling (CP-Down/Up) for dimension consistency and Channel-Shuffle Fusion (CS-Fusion) for enhanced feature fusion. Additionally, it incorporates Timestep-Aware Redistribution (TaR) and Activation Function (TaA) to handle varying activation distributions across diffusion timesteps. |
BI-DiffSR significantly outperforms state-of-the-art binarization methods in image super-resolution tasks.
It achieves comparable or even better perceptual quality than the full-precision diffusion model (SR3) while utilizing only 8.3% of the parameters and 20.8% of the computational operations.
The proposed model effectively restores fine details and textures in challenging cases, as demonstrated through visual comparisons. |
The introduction of TaR and TaA, while improving performance, leads to increased parameters and training time.
The fixed timestep grouping strategy in TaR and TaA may not be optimal for all modules due to non-uniform activation changes across timesteps. |
image super-resolution, diffusion models, binarization, model compression, unet architecture |
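For readers unfamiliar with binarization as discussed in the BI-DiffSR entry above, the following sketch shows the generic sign-and-scale scheme used by many binarized networks: each weight is replaced by its sign, and one full-precision scale (the mean absolute value) is kept. This is not BI-DiffSR's own module design (CP-Down/Up, CS-Fusion, TaR, and TaA are not modeled), and the function names are hypothetical.

```python
def binarize_weights(weights):
    """Return (signs, scale) for a flat list of full-precision weights.

    signs: +1 or -1 per weight (1 bit each when packed).
    scale: mean absolute value, kept in full precision.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def binary_dot(signs, scale, activations):
    """Approximate dot(weights, activations) using the binarized weights."""
    return scale * sum(s * a for s, a in zip(signs, activations))


signs, scale = binarize_weights([0.5, -1.5, 1.0, -1.0])
print(signs, scale)
print(binary_dot(signs, scale, [1.0, 2.0, 3.0, 4.0]))
```

The approximation error introduced here accumulates over the many iterative denoising steps of a diffusion model, which is one intuition for why the entry reports that naive binarization degrades performance.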
2406.05649
Report |
GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement |
Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Hsin-Ying Lee |
We propose a novel approach for 3D mesh reconstruction from multi-view
images. Our method takes inspiration from large reconstruction models like LRM
that use a transformer-based triplane generator and a Neural Radiance Field
(NeRF) model trained on multi-view images. However, in our method, we introduce
several important modifications that allow us to significantly enhance 3D
reconstruction quality. First of all, we examine the original LRM architecture
and find several shortcomings. Subsequently, we introduce respective
modifications to the LRM architecture, which lead to improved multi-view image
representation and more computationally efficient training. Second, in order to
improve geometry reconstruction and enable supervision at full image
resolution, we extract meshes from the NeRF field in a differentiable manner
and fine-tune the NeRF model through mesh rendering. These modifications allow
us to achieve state-of-the-art performance on both 2D and 3D evaluation
metrics, such as a PSNR of 28.67 on the Google Scanned Objects (GSO) dataset.
Despite these superior results, our feed-forward model still struggles to
reconstruct complex textures, such as text and portraits on assets. To address
this, we introduce a lightweight per-instance texture refinement procedure.
This procedure fine-tunes the triplane representation and the NeRF color
estimation model on the mesh surface using the input multi-view images in just
4 seconds. This refinement improves the PSNR to 29.79 and achieves faithful
reconstruction of complex textures, such as text. Additionally, our approach
enables various downstream applications, including text- or image-to-3D
generation. |
This paper presents GTR, a novel 3D reconstruction model that generates high-quality meshes with faithful textures from multi-view images in seconds. |
Existing methods struggle to balance high-quality texture reconstruction with accurate geometry extraction. This work aims to improve both aspects of 3D reconstruction from multi-view images. |
The authors propose a three-pronged approach: 1. Modifying the standard LRM architecture for improved multi-view image representation and computational efficiency. 2. Introducing a two-stage training procedure using NeRF volume rendering for initialization, followed by geometry refinement via differentiable mesh rendering. 3. Implementing a per-instance texture refinement procedure for enhancing intricate details on the mesh surface. |
GTR achieves state-of-the-art performance on both 2D and 3D evaluation metrics, surpassing baselines like LRM and InstantMesh.
The proposed method excels at reconstructing complex textures and fine details, including text and portraits.
The model is computationally efficient, generating meshes within a second and requiring only four seconds for texture refinement. |
The current pipeline trains the convolutional encoder from scratch, potentially limiting convergence speed. Exploring pre-trained models like Stable Diffusion's autoencoder could be beneficial.
The mesh rendering stage relies on NeRF for initialization. Investigating alternative methods like NeuS, which directly generates SDFs, might offer further improvements. |
3d reconstruction, mesh generation, texture refinement, multi-view images, neural rendering |
2406.05641
Report |
PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction |
Shangyu Chen, Zizheng Pan, Jianfei Cai, Dinh Phung |
Personalizing a large-scale pretrained Text-to-Image (T2I) diffusion model is
challenging as it typically struggles to make an appropriate trade-off between
its training data distribution and the target distribution, i.e., learning a
novel concept with only a few target images to achieve personalization
(aligning with the personalized target) while preserving text editability
(aligning with diverse text prompts). In this paper, we propose PaRa, an
effective and efficient Parameter Rank Reduction approach for T2I model
personalization by explicitly controlling the rank of the diffusion model
parameters to restrict its initial diverse generation space into a small and
well-balanced target space. Our design is motivated by the fact that taming a
T2I model toward a novel concept such as a specific art style implies a small
generation space. To this end, by reducing the rank of model parameters during
finetuning, we can effectively constrain the space of the denoising sampling
trajectories towards the target. With comprehensive experiments, we show that
PaRa achieves great advantages over existing finetuning approaches on
single/multi-subject generation as well as single-image editing. Notably,
compared to the prevailing fine-tuning technique LoRA, PaRa achieves better
parameter efficiency (2x fewer learnable parameters) and much better target
image alignment. |
This paper proposes PaRa, a novel parameter-efficient framework for personalizing text-to-image diffusion models through parameter rank reduction. |
Existing T2I personalization methods struggle to balance preserving text editability with aligning to target concepts. PaRa addresses this by explicitly controlling the diffusion model parameter rank to constrain image generation to a well-aligned space. |
PaRa reduces the rank of layer outputs during denoising sampling by introducing a low-rank learnable parameter, utilizing QR decomposition to form orthonormal bases. It also enables combining multiple individually fine-tuned PaRa weights for multi-subject generation. |
PaRa achieves better image alignment than LoRA and SVDiff while using fewer learnable parameters.
The framework allows blending multiple personalized concepts for multi-subject generation without additional training on augmented data.
PaRa facilitates stable single-image editing by directly modifying text prompts without requiring noise inversion. |
PaRa currently focuses on reducing the output space, potentially limiting customization requiring space expansion.
Future work could explore methods for both space extension and reduction within the framework. |
text-to-image synthesis, diffusion models, model personalization, parameter rank reduction, image editing |
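One way to read the PaRa method summary above (a low-rank learnable parameter, orthonormalized via QR decomposition, used to reduce the rank of layer outputs) is as projecting a learned subspace out of a weight matrix: W' = W - Q Q^T W. The sketch below illustrates that reading only. The exact update rule is an assumption, Gram-Schmidt stands in for a library QR routine, and all names are hypothetical.

```python
def orthonormal_basis(columns):
    """Orthonormalize column vectors with Gram-Schmidt.

    For linearly independent columns this yields the Q factor of a QR
    decomposition (up to sign conventions).
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm < 1e-12:
            raise ValueError("columns must be linearly independent")
        basis.append([a / norm for a in v])
    return basis


def reduce_rank(weight, basis):
    """Return W - Q Q^T W for weight W (rows = output dims).

    The result can no longer produce outputs along the directions in
    `basis`, so the layer's output space shrinks by len(basis) dimensions.
    """
    d_out, d_in = len(weight), len(weight[0])
    reduced = [row[:] for row in weight]
    for q in basis:
        # Row vector q^T W, one coefficient per input dimension.
        coeffs = [sum(q[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
        for i in range(d_out):
            for j in range(d_in):
                reduced[i][j] -= q[i] * coeffs[j]
    return reduced


# Hypothetical 3x3 layer and a rank-1 learnable parameter B = [1, 1, 0]^T.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
q_basis = orthonormal_basis([[1.0, 1.0, 0.0]])
print(reduce_rank(identity, q_basis))
```

In this reading only the small matrix B is trained, which is consistent with the entry's claim of fewer learnable parameters than LoRA, though the precise parameterization should be checked against the paper.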
2406.05630
Report |
Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion |
Ge Ya Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal |
With recent advances in video prediction, controllable video generation has
been attracting more attention. Generating high fidelity videos according to
simple and flexible conditioning is of particular interest. To this end, we
propose a controllable video generation model using pixel level renderings of
2D or 3D bounding boxes as conditioning. In addition, we also create a bounding
box predictor that, given the initial and ending frames' bounding boxes, can
predict up to 15 bounding boxes per frame for all the frames in a 25-frame
clip. We perform experiments across 3 well-known AV video datasets: KITTI,
Virtual-KITTI 2 and BDD100k. |
The paper introduces Ctrl-V, a novel model that generates controllable autonomous vehicle videos by conditioning on predicted sequences of 2D and 3D bounding boxes. |
Generating controllable, high-fidelity videos is crucial for applications like autonomous vehicle simulation, enabling realistic and customizable virtual environments for training and testing. |
Ctrl-V comprises two main components: 1) a diffusion-based bounding box predictor that forecasts object positions across frames and 2) a ControlNet-adapted video diffusion model that generates videos adhering to the predicted bounding box trajectories. |
Ctrl-V demonstrates the ability to generate high-fidelity videos that closely align with the provided bounding box conditions, as evidenced by quantitative metrics like FVD, LPIPS, SSIM, and PSNR.
The bounding box predictor effectively predicts bounding box trajectories, achieving high alignment scores with ground-truth labels, especially for the initial and final frames.
The video generation model exhibits strong motion control capabilities, accurately depicting object movements and handling uninitialized objects appearing mid-video. |
The current evaluation metrics for bounding box predictions have limitations, as they rely on binary masks and do not consider object tracking IDs.
Further investigation is needed to systematically analyze the model's ability to encode and utilize additional information, such as track IDs and 3D bounding box orientation. |
video generation, controllable video generation, diffusion models, bounding box prediction, autonomous driving |
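The limitation noted for Ctrl-V, that box predictions are scored through binary masks without track IDs, can be illustrated with a small sketch. This is not the authors' evaluation code: the helper names `render_boxes` and `mask_iou`, the `(x0, y0, x1, y1)` pixel convention with exclusive upper bounds, and the use of plain IoU are all assumptions made for illustration.

```python
def render_boxes(boxes, height, width):
    """Rasterize axis-aligned boxes (x0, y0, x1, y1) into a binary mask."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    return inter / union if union else 1.0


# Hypothetical frame: the two objects swap identities between ground truth
# and prediction, but the rendered masks are pixel-for-pixel identical.
truth = render_boxes([(0, 0, 2, 2), (4, 4, 6, 6)], 8, 8)
pred = render_boxes([(4, 4, 6, 6), (0, 0, 2, 2)], 8, 8)
print(mask_iou(truth, pred))
```

Because the mask discards which box belongs to which object, an identity swap goes unpenalized, which is the gap a track-ID-aware metric would close.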
2406.05602
Report |
Can Prompt Modifiers Control Bias? A Comparative Analysis of Text-to-Image Generative Models |
Philip Wootaek Shin, Jihyun Janice Ahn, Wenpeng Yin, Jack Sampson, Vijaykrishnan Narayanan |
It has been shown that many generative models inherit and amplify societal
biases. To date, there is no uniform/systematic agreed standard to
control/adjust for these biases. This study examines the presence and
manipulation of societal biases in leading text-to-image models: Stable
Diffusion, DALL-E 3, and Adobe Firefly. Through a comprehensive analysis
combining base prompts with modifiers and their sequencing, we uncover the
nuanced ways these AI technologies encode biases across gender, race,
geography, and region/culture. Our findings reveal the challenges and potential
of prompt engineering in controlling biases, highlighting the critical need for
ethical AI development promoting diversity and inclusivity.
This work advances AI ethics by not only revealing the nuanced dynamics of
bias in text-to-image generation models but also by offering a novel framework
for future research in controlling bias. Our contributions, spanning comparative
analyses, the strategic use of prompt modifiers, the exploration of prompt
sequencing effects, and the introduction of a bias sensitivity taxonomy, lay the
groundwork for the development of common metrics and standard analyses for
evaluating whether and how future AI models exhibit and respond to requests to
adjust for inherent biases. |
This paper investigates societal biases in text-to-image models (Stable Diffusion, DALL·E 3, Adobe Firefly) and explores if prompt engineering with modifiers can control these biases. |
Understanding and mitigating biases in AI models is crucial to ensure they are fair, inclusive, and do not perpetuate harmful stereotypes. |
The researchers analyzed image outputs for various prompts, including base prompts with added modifiers, to examine bias representation across gender, race, geography, and culture. |
Prompt modifiers can sometimes adjust bias, but simplistic use is not always effective, highlighting the need for more sophisticated strategies.
Some model biases are resistant to control through prompt engineering, demonstrating the deep-rooted nature of these biases.
Prompt sequencing, i.e., the order of base prompt and modifier, can significantly impact the generated images and bias representation. |
The study was limited by a small image dataset and the lack of external human evaluation for bias assessment.
Future work could focus on developing more robust bias-control mechanisms and conducting large-scale human evaluations to assess bias in a nuanced way. |
ai bias, text-to-image generation, prompt engineering, ethical ai, bias mitigation |
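The prompt-sequencing comparison described in this record can be sketched as a small helper that produces the base prompt and both orderings of base prompt and modifier. The function name `build_prompts`, the comma-joined templates, and the example strings are hypothetical; the study's exact prompt wording is not given here.

```python
def build_prompts(base, modifier):
    """Return the base prompt plus both orderings of base and modifier."""
    return {
        "base": base,
        "modifier_first": f"{modifier}, {base}",
        "modifier_last": f"{base}, {modifier}",
    }


# Hypothetical example; each variant would be sent to every model under test.
variants = build_prompts("a portrait of a chef", "smiling")
for name, prompt in variants.items():
    print(name, "->", prompt)
```

Holding the base prompt fixed and varying only the modifier position isolates the ordering effect from the effect of the modifier itself.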
2406.05478
Report |
Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis |
Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang |
The field of image synthesis is currently flourishing due to the advancements
in diffusion models. While diffusion models have been successful, their
computational intensity has prompted the pursuit of more efficient
alternatives. As a representative work, non-autoregressive Transformers (NATs)
have been recognized for their rapid generation. However, a major drawback of
these models is their inferior performance compared to diffusion models. In
this paper, we aim to re-evaluate the full potential of NATs by revisiting the
design of their training and inference strategies. Specifically, we identify
the complexities in properly configuring these strategies and indicate the
possible sub-optimality in existing heuristic-driven designs. Recognizing this,
we propose to go beyond existing methods by directly solving the optimal
strategies in an automatic framework. The resulting method, named AutoNAT,
advances the performance boundaries of NATs notably, and is able to perform
comparably with the latest diffusion models at a significantly reduced
inference cost. The effectiveness of AutoNAT is validated on four benchmark
datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M. Our code is available at
https://github.com/LeapLabTHU/ImprovedNAT. |
This paper proposes AutoNAT, a novel method to automatically search for optimal training and generation strategies for Non-Autoregressive Transformers (NATs) in image synthesis, improving their performance and efficiency. |
NATs offer fast image generation but often lag behind diffusion models in quality due to sub-optimal, heuristically designed training and generation strategies. |
AutoNAT formulates the optimal strategy design as a unified optimization problem and solves it using an alternating optimization algorithm for efficient exploration of the strategy space. |
AutoNAT significantly enhances NATs' performance, achieving results comparable to state-of-the-art diffusion models.
AutoNAT achieves approximately 5x inference speedup compared to diffusion models without sacrificing performance.
The study highlights that optimizing both training and generation strategies is crucial for NATs, with the latter demonstrating a larger impact. |
The paper mainly focuses on optimizing Beta distribution for training strategy; exploring other distributions could be beneficial.
The current work primarily explores image generation; extending AutoNAT to other domains like audio or video generation is a promising direction. |
image synthesis, non-autoregressive transformers, diffusion models, hyperparameter optimization, generative models |
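The Beta-distributed training strategy mentioned in the limitations above can be sketched as follows, assuming the distribution governs the fraction of tokens masked per training sample. The `MASK` id, the function name `mask_tokens`, and the parameter values are hypothetical; in the paper the distribution parameters are what the search optimizes, and they are not reproduced here.

```python
import random

MASK = -1  # hypothetical id standing in for the mask token


def mask_tokens(tokens, alpha, beta, rng):
    """Mask a random subset of tokens; the ratio is drawn from Beta(alpha, beta)."""
    ratio = rng.betavariate(alpha, beta)
    n_mask = max(1, round(ratio * len(tokens)))  # always mask at least one token
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, ratio


rng = random.Random(0)
masked, ratio = mask_tokens(list(range(16)), alpha=2.0, beta=1.0, rng=rng)
print(f"mask ratio {ratio:.2f}:", masked)
```

Changing `alpha` and `beta` shifts how often the model sees heavily or lightly masked inputs, which is the kind of knob an automatic strategy search can tune instead of fixing it by hand.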
2406.05338
Report |
MotionClone: Training-Free Motion Cloning for Controllable Video Generation |
Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin |
Motion-based controllable text-to-video generation uses motion cues to
control the video generation. Previous methods typically require the training
of models to encode motion cues or the fine-tuning of video diffusion models.
However, these approaches often result in suboptimal motion generation when
applied outside the trained domain. In this work, we propose MotionClone, a
training-free framework that enables motion cloning from a reference video to
control text-to-video generation. We employ temporal attention in video
inversion to represent the motions in the reference video and introduce primary
temporal-attention guidance to mitigate the influence of noisy or very subtle
motions within the attention weights. Furthermore, to assist the generation
model in synthesizing reasonable spatial relationships and enhance its
prompt-following capability, we propose a location-aware semantic guidance
mechanism that leverages the coarse location of the foreground from the
reference video and original classifier-free guidance features to guide the
video generation. Extensive experiments demonstrate that MotionClone exhibits
proficiency in both global camera motion and local object motion, with notable
superiority in terms of motion fidelity, textual alignment, and temporal
consistency. |
MotionClone, a training-free framework, clones motion from reference videos for controllable text-to-video generation using temporal attention. |
Addresses limitations of existing motion-guided video generation methods that require motion-specific training or fine-tuning, leading to suboptimal results outside the trained domain. |
Uses temporal attention in video inversion to represent motion from a reference video and guides video generation through primary temporal-attention guidance and location-aware semantic guidance. |
Effectively clones both global camera motion and local object motion.
Achieves superior motion fidelity and textual alignment compared to existing methods.
Demonstrates strong temporal consistency in generated videos. |
Motion in the reference video must be suitable for the objects in the new prompt to avoid unrealistic outputs.
Some generated samples may still retain minor structural elements from the reference video. |
text-to-video generation, motion cloning, temporal attention, video diffusion models, controllable video generation |
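One way to read the primary temporal-attention guidance in this record is as a sparse mask that keeps only the dominant entries of each reference attention row, so that noisy or subtle weights do not steer generation. The sketch below follows that reading; the top-`k` selection, the squared-error form, and the names `primary_mask` and `masked_attention_loss` are assumptions rather than the authors' formulation.

```python
def primary_mask(attn_row, k=1):
    """Return a 0/1 mask marking the k largest weights in one attention row."""
    order = sorted(range(len(attn_row)), key=lambda i: attn_row[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(attn_row))]


def masked_attention_loss(ref_attn, gen_attn, k=1):
    """Squared error between reference and generated temporal attention,
    counted only at the primary entries of the reference."""
    loss = 0.0
    for ref_row, gen_row in zip(ref_attn, gen_attn):
        mask = primary_mask(ref_row, k)
        loss += sum(m * (r - g) ** 2 for m, r, g in zip(mask, ref_row, gen_row))
    return loss


# Toy attention over three frames: only the dominant weight (0.7) is compared.
reference = [[0.1, 0.7, 0.2]]
generated = [[0.3, 0.5, 0.2]]
print(masked_attention_loss(reference, generated))
```

In a guidance setting, the gradient of such a loss with respect to the latent would nudge the generated video toward the reference motion while ignoring the masked-out entries.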
2406.05271
Report |
USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation |
Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, Liu Ren |
The open-vocabulary image segmentation task involves partitioning images into
semantically meaningful segments and classifying them with flexible
text-defined categories. The recent vision-based foundation models such as the
Segment Anything Model (SAM) have shown superior performance in generating
class-agnostic image segments. The main challenge in open-vocabulary image
segmentation now lies in accurately classifying these segments into
text-defined categories. In this paper, we introduce the Universal Segment
Embedding (USE) framework to address this challenge. This framework
comprises two key components: 1) a data pipeline designed to efficiently
curate a large amount of segment-text pairs at various granularities, and 2) a
universal segment embedding model that enables precise segment classification
into a vast range of text-defined categories. The USE model can not only help
open-vocabulary image segmentation but also facilitate other downstream tasks
(e.g., querying and ranking). Through comprehensive experimental studies on
semantic segmentation and part segmentation benchmarks, we demonstrate that the
USE framework outperforms state-of-the-art open-vocabulary segmentation
methods. |
This paper introduces the Universal Segment Embedding (USE) framework for open-vocabulary image segmentation, which can classify image segments into text-defined categories in a zero-shot manner. |
Open-vocabulary image segmentation is crucial for real-world applications requiring flexible and adaptable segmentation models. Existing methods struggle to fully utilize segments from foundation models like SAM. |
The USE framework includes: (1) a data pipeline that automatically generates segment-text pairs with rich semantics at multiple granularities from existing datasets and (2) a lightweight segment embedding model that learns to align segment and text embeddings in a joint vision-language space. |
USE outperforms state-of-the-art two-stage open-vocabulary semantic segmentation methods on ADE20K and Pascal Context benchmarks.
USE demonstrates strong performance on open-vocabulary part segmentation, exceeding VLPart trained on human-annotated parts data.
Ablation studies show the benefits of combining CLIP and DINOv2 in the image encoder and incorporating the cls token for improved performance. |
The current implementation of USE relies on SAM for segment generation, inheriting limitations in capturing parts with blurry boundaries.
Future work can explore more sophisticated architectures for the segment embedding head, such as prompt encoders or cross-attention mechanisms. |
open-vocabulary image segmentation, segment embedding, zero-shot learning, vision-language models, foundation models |
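Once segment and text embeddings live in a joint space, as this record describes, zero-shot classification reduces to a nearest-text lookup. The sketch below assumes cosine similarity as the matching score; the toy three-dimensional vectors, the category names, and the function names are hypothetical and stand in for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_segment(segment_emb, text_embs):
    """Pick the category whose text embedding best matches the segment embedding."""
    return max(text_embs, key=lambda name: cosine(segment_emb, text_embs[name]))


# Hypothetical embeddings for two text-defined categories and one segment.
text_embs = {"wheel": [1.0, 0.0, 0.1], "window": [0.0, 1.0, 0.1]}
print(classify_segment([0.9, 0.2, 0.0], text_embs))
```

The same similarity scores can also be sorted rather than maximized, which covers the querying and ranking uses mentioned in the abstract.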
2406.05184
Report |
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better |
Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna |
Generative text-to-image models enable us to synthesize unlimited amounts of
images in a controllable manner, spurring many recent efforts to train vision
models with synthetic data. However, every synthetic image ultimately
originates from the upstream data used to train the generator. What additional
value does the intermediate generator provide over directly training on
relevant parts of the upstream data? Grounding this question in the setting of
image classification, we compare finetuning on task-relevant, targeted
synthetic data generated by Stable Diffusion -- a generative model trained on
the LAION-2B dataset -- against finetuning on targeted real images retrieved
directly from LAION-2B. We show that while synthetic data can benefit some
downstream tasks, it is universally matched or outperformed by real data from
our simple retrieval baseline. Our analysis suggests that this underperformance
is partially due to generator artifacts and inaccurate task-relevant visual
details in the synthetic images. Overall, we argue that retrieval is a critical
baseline to consider when training with synthetic data -- a baseline that
current methods do not yet surpass. We release code, data, and models at
https://github.com/scottgeng00/unmet-promise. |
This paper investigates whether training on synthetic images generated by text-to-image models like Stable Diffusion offers any benefits over directly training on relevant subsets of the original data used to train the generator (e.g., LAION-2B). |
The use of synthetic data for training vision models is on the rise, but it's crucial to understand if it provides any advantages over directly leveraging the generator's original training data. |
The authors curate targeted synthetic datasets by prompting Stable Diffusion and targeted real datasets by retrieving from LAION-2B. They finetune a pretrained CLIP model on both types of data and compare their performance on five image classification benchmarks. |
Training on targeted real data consistently matches or outperforms training on targeted synthetic data from Stable Diffusion at equivalent data scales.
Increasing the scale of synthetic data does not always close the performance gap and can sometimes even hurt performance.
Analysis suggests that generator artifacts and distorted visual details in synthetic images contribute to their lower performance. |
Compute limitations restricted the exploration of various pretrained backbones and adaptation methods.
The study primarily focused on Stable Diffusion due to the availability of its training data (LAION-2B) for retrieval. |
synthetic data, image classification, data augmentation, text-to-image generation, stable diffusion |
2406.05132
Report |
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination |
Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai |
The integration of language and 3D perception is crucial for developing
embodied agents and robots that comprehend and interact with the physical
world. While large language models (LLMs) have demonstrated impressive language
understanding and generation capabilities, their adaptation to 3D environments
(3D-LLMs) remains in its early stages. A primary challenge is the absence of
large-scale datasets that provide dense grounding between language and 3D
scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset
comprising 40,087 household scenes paired with 6.2 million densely-grounded
scene-language instructions. Our results show that instruction tuning with
3D-GRAND significantly enhances grounding capabilities and reduces
hallucinations in 3D-LLMs. As part of our contributions, we propose a
comprehensive benchmark 3D-POPE to systematically evaluate hallucination in
3D-LLMs, enabling fair comparisons among future models. Our experiments
highlight a scaling effect between dataset size and 3D-LLM performance,
emphasizing the critical role of large-scale 3D-text datasets in advancing
embodied AI research. Notably, our results demonstrate early signals for
effective sim-to-real transfer, indicating that models trained on large
synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and
3D-POPE, we aim to equip the embodied AI community with essential resources and
insights, setting the stage for more reliable and better-grounded 3D-LLMs.
Project website: https://3d-grand.github.io |
The paper introduces 3D-GRAND, a large-scale dataset with 40,087 household scenes and 6.2 million densely grounded scene-language instructions, and 3D-POPE, a benchmark for evaluating object hallucination in 3D-LLMs. |
Existing 3D-LLMs lack large-scale, densely grounded datasets crucial for tasks like robotics and suffer from object hallucination, hindering their reliability and interpretability. |
3D-GRAND leverages LLMs (GPT-4) for scalable and cost-effective dense grounding annotation of synthetic 3D scenes. 3D-POPE uses a polling-based approach with existence questions to assess object hallucination in 3D-LLMs. |
Training with 3D-GRAND significantly reduces object hallucination in 3D-LLMs.
Densely grounded instruction tuning with 3D-GRAND improves the grounding capabilities of 3D-LLMs, achieving state-of-the-art performance on ScanRefer.
Scaling densely grounded data consistently improves grounding accuracy and reduces hallucination, with promising results for sim-to-real transfer from synthetic to real-world 3D scenes. |
The work focuses on room-level 3D-Text pairs, lacking part-level and beyond-room-level annotations.
3D-POPE evaluation is limited to ScanNet scenes and does not include synthetic datasets or more diverse indoor environments. |
3d vision-language, dense grounding, object hallucination, 3d-llm, sim-to-real transfer |
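The polling-based evaluation in this record asks yes/no existence questions, so scoring it is a binary classification tally. The sketch below computes metrics commonly reported for this style of evaluation; which of them 3D-POPE reports, and the name `pope_metrics`, are assumptions.

```python
def pope_metrics(answers, labels):
    """Score yes/no answers to object-existence questions.

    answers, labels: lists of booleans, True meaning "the object exists".
    """
    tp = sum(a and l for a, l in zip(answers, labels))
    fp = sum(a and not l for a, l in zip(answers, labels))
    fn = sum(not a and l for a, l in zip(answers, labels))
    tn = sum(not a and not l for a, l in zip(answers, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_rate": (tp + fp) / len(labels),
    }


# Hypothetical poll: the second "yes" names an object absent from the scene.
print(pope_metrics([True, True, False, True], [True, False, False, True]))
```

A high yes-rate paired with low precision is the signature of object hallucination: the model affirms objects that are not in the scene.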
2406.05082
Report |
CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion |
Xingrui Wang, Xin Li, Zhibo Chen |
Tuning-free long video diffusion has been proposed to generate
extended-duration videos with enriched content by reusing the knowledge from
a pre-trained short video diffusion model without retraining. However, most works
overlook the fine-grained long-term video consistency modeling, resulting in
limited scene consistency (i.e., unreasonable object or background
transitions), especially with multiple text inputs. To mitigate this, we
propose the Consistency Noise Injection, dubbed CoNo, which introduces the
"look-back" mechanism to enhance the fine-grained scene transition between
different video clips, and designs the long-term consistency regularization to
eliminate the content shifts when extending video contents through noise
prediction. In particular, the "look-back" mechanism breaks the noise
scheduling process into three essential parts, where one internal noise
prediction part is injected into two video-extending parts, intending to
achieve a fine-grained transition between two video clips. The long-term
consistency regularization focuses on explicitly minimizing the pixel-wise
distance between the predicted noises of the extended video clip and the
original one, thereby preventing abrupt scene transitions. Extensive
experiments have shown the effectiveness of the above strategies by performing
long-video generation under both single- and multi-text prompt conditions. The
project is available at https://wxrui182.github.io/CoNo.github.io/. |
Proposes Consistency Noise Injection (CoNo), a tuning-free long video diffusion method that enhances long-term consistency in generated videos, especially under multiple text prompts. |
Addresses limitations in existing tuning-free long video generation methods, such as coarse transitions between video clips and lack of explicit long-term content consistency modeling. |
Introduces a 'look-back' mechanism with customized noise shuffling strategies to ensure fine-grained transitions between video clips and proposes long-term consistency regularization to minimize content shifts in extended videos. |
Achieves state-of-the-art scene consistency and perceptual quality in long video generation.
Demonstrates effectiveness under both single- and multi-text prompt conditions.
Outperforms existing methods in quantitative metrics such as FVD, KVD, CLIP-Image, and CLIP-Text, and receives higher ratings in human evaluation for semantic alignment, content consistency, realism, and preference. |
Performance might be limited by the capabilities of the pre-trained base video generation model.
Future work includes exploring prompt engineering to further enhance the continuity and semantic coherence of generated long videos. |
video generation, long video diffusion, text-to-video, scene consistency, tuning-free |
2406.05038
Report |
Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs |
Shentong Mo |
Recent advancements in sequence modeling have led to the development of the
Mamba architecture, noted for its selective state space approach, offering a
promising avenue for efficient long sequence handling. However, its application
in 3D shape generation, particularly at high resolutions, remains
underexplored. Traditional diffusion transformers (DiT) with self-attention
mechanisms, despite their potential, face scalability challenges due to the
cubic complexity of attention operations as input length increases. This
complexity becomes a significant hurdle when dealing with high-resolution voxel
sizes. To address this challenge, we introduce a novel diffusion architecture
tailored for 3D point cloud generation: Diffusion Mamba (DiM-3D). This
architecture forgoes traditional attention mechanisms, instead utilizing the
inherent efficiency of the Mamba architecture to maintain linear complexity
with respect to sequence length. DiM-3D is characterized by fast inference
times and substantially lower computational demands, quantified in reduced
Gflops, thereby addressing the key scalability issues of prior models. Our
empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves
state-of-the-art performance in generating high-fidelity and diverse 3D shapes.
Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud
completion. This not only proves the model's scalability but also underscores
its efficiency in generating detailed, high-resolution voxels necessary for
advanced 3D shape modeling, particularly excelling in environments requiring
high-resolution voxel sizes. Through these findings, we illustrate the
exceptional scalability and efficiency of the Diffusion Mamba framework in 3D
shape generation, setting a new standard for the field and paving the way for
future explorations in high-resolution 3D modeling technologies. |
Introduces DiM-3D, a novel diffusion mamba architecture for efficient and scalable 3D point cloud generation, addressing the computational challenges of traditional methods. |
High-resolution 3D shape generation is crucial for various applications, but existing methods struggle with scalability and efficiency. DiM-3D tackles these limitations. |
Leverages the Mamba architecture's selective state space approach to maintain linear complexity with sequence length, enabling efficient handling of high-resolution voxel data. |
Achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes on the ShapeNet benchmark.
Demonstrates superior results in 3D point cloud completion tasks, highlighting its capacity for conditional generation.
Exhibits strong scalability, with performance improvements observed with increasing model size and the number of classes. |
Computational demands, while reduced, might still pose challenges in resource-constrained environments, particularly with extremely high-resolution data.
Model's generalizability might be affected by the quality and diversity of the training data, potentially limiting its applicability in scenarios with limited or biased data. |
3d shape generation, point cloud generation, diffusion models, mamba architecture, state space models |
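The linear-complexity claim in this record rests on replacing attention with a recurrence that visits each sequence element once per direction. The toy sketch below shows a bidirectional scan with a scalar state and fixed coefficients. It is not the Mamba layer: Mamba makes its parameters input-dependent (the "selective" part), and summing the two directions is an assumption made here for illustration.

```python
def scan(xs, a, b):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t in a single pass."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs, a=0.5, b=1.0):
    """Combine a forward and a backward scan so each position sees both sides."""
    forward = scan(xs, a, b)
    backward = scan(xs[::-1], a, b)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse at the first position decays along the forward direction only.
print(bidirectional_scan([1.0, 0.0, 0.0]))
```

Each direction costs one pass over the sequence, so the total work grows linearly with length, whereas self-attention compares every pair of positions.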
2406.05000
Report |
AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation |
Lianyu Pang, Jian Yin, Baoquan Zhao, Feize Wu, Fu Lee Wang, Qing Li, Xudong Mao |
Recent advances in text-to-image models have enabled high-quality
personalized image synthesis of user-provided concepts with flexible textual
control. In this work, we analyze the limitations of two primary techniques in
text-to-image personalization: Textual Inversion and DreamBooth. When
integrating the learned concept into new prompts, Textual Inversion tends to
overfit the concept, while DreamBooth often overlooks it. We attribute these
issues to the incorrect learning of the embedding alignment for the concept. We
introduce AttnDreamBooth, a novel approach that addresses these issues by
separately learning the embedding alignment, the attention map, and the subject
identity in different training stages. We also introduce a cross-attention map
regularization term to enhance the learning of the attention map. Our method
demonstrates significant improvements in identity preservation and text
alignment compared to the baseline methods. |
This paper proposes AttnDreamBooth, a novel text-to-image personalization approach that addresses limitations in embedding alignment found in Textual Inversion and DreamBooth. |
Balancing identity preservation and text alignment in personalized image synthesis remains a challenge, hindering the generation of high-quality personalized images with flexible textual control. |
AttnDreamBooth separates the learning of embedding alignment, attention map, and subject identity into three stages: 1) optimizing textual embedding for alignment, 2) fine-tuning cross-attention layers for attention map refinement, and 3) fine-tuning the entire U-Net for subject identity. It also introduces a cross-attention map regularization term for enhanced attention map learning. |
AttnDreamBooth demonstrates superior performance in identity preservation and text alignment compared to baseline methods.
It enables text-aligned personalized image generation, even with complex prompts.
User study shows a clear preference for AttnDreamBooth over baselines in terms of identity preservation and text alignment. |
The current implementation uses consistent training steps across different concepts, potentially limiting performance for certain concepts.
The three-stage training method requires approximately 20 minutes on average to learn a concept. |
text-to-image personalization, dreambooth, textual inversion, attention map, embedding alignment |
2406.04906
Report |
RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection |
Liting Huang, Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Shoujin Wang |
The recent advancements in generative AI models, which can create realistic
and human-like content, are significantly transforming how people communicate,
create, and work. While the appropriate use of generative AI models can benefit
society, their misuse poses significant threats to data reliability and
authentication. However, due to a lack of aligned multimodal datasets,
effective and robust methods for detecting machine-generated content are still
in the early stages of development. In this paper, we introduce RU-AI, a new
large-scale multimodal dataset designed for the robust and efficient detection
of machine-generated content in text, image, and voice. Our dataset is
constructed from three large publicly available datasets: Flickr8K, COCO, and
Places205, by combining the original datasets and their corresponding
machine-generated pairs. Additionally, experimental results show that our
proposed unified model, which incorporates a multimodal embedding module with a
multilayer perceptron network, can effectively determine the origin of the data
(i.e., original data samples or machine-generated ones) from RU-AI. However,
future work is still required to address the remaining challenges posed by
RU-AI. The source code and dataset are available at
https://github.com/ZhihaoZhang97/RU-AI. |
This paper introduces RU-AI, a large-scale multimodal dataset for detecting machine-generated content in text, image, and voice. |
Misuse of generative AI threatens data reliability and authentication, yet the lack of aligned multimodal datasets has kept detection methods at an early stage of development. |
RU-AI is built from three public datasets (Flickr8K, COCO, and Places205) by combining the original data with corresponding machine-generated pairs. A unified detection model incorporates a multimodal embedding module with a multilayer perceptron network. |
RU-AI aligns original and machine-generated samples across three modalities: text, image, and voice.
The proposed unified model can effectively determine whether RU-AI samples are original or machine-generated. |
Future work is still required to address the remaining challenges posed by RU-AI. |
machine-generated content detection, multimodal dataset, generative ai, multimodal embedding, multilayer perceptron |
2406.04888
Report |
Zero-Shot Video Editing through Adaptive Sliding Score Distillation |
Lianghan Zhu, Yanqi Bao, Jing Huo, Jing Wu, Yu-Kun Lai, Wenbin Li, Yang Gao |
The burgeoning field of text-based video generation (T2V) has reignited
significant interest in the research of controllable video editing. Although
pre-trained T2V-based editing models have achieved efficient editing
capabilities, current works are still plagued by two major challenges. Firstly,
the inherent limitations of T2V models lead to content inconsistencies and
motion discontinuities between frames. Secondly, the notorious issue of
over-editing significantly disrupts areas that are intended to remain
unaltered. To address these challenges, our work aims to explore a robust
video-based editing paradigm based on score distillation. Specifically, we
propose an Adaptive Sliding Score Distillation strategy, which not only
enhances the stability of T2V supervision but also incorporates both global and
local video guidance to mitigate the impact of generation errors. Additionally,
we modify the self-attention layers during the editing process to further
preserve the key features of the original video. Extensive experiments
demonstrate that these strategies enable us to effectively address the
aforementioned challenges, achieving superior editing performance compared to
existing state-of-the-art methods. |
This paper proposes ASSD, a novel score distillation-based video editing method, enhancing editing quality and addressing limitations in current text-to-video generation models. |
Existing text-based video editing methods suffer from content inconsistencies, motion discontinuities, and over-editing. This work aims to address these challenges using a robust score distillation-based paradigm. |
The paper introduces Adaptive Sliding Score Distillation (ASSD) for robust video editing. It uses a sliding window approach for smoothing gradient information and incorporates a weighted attention fusion mechanism to preserve details from the original video. Additionally, it leverages Stable Diffusion for joint guidance in updating the latent code. |
ASSD effectively reduces contaminations and preserves original video content.
The weighted attention fusion mechanism further improves editing quality by preserving details.
Joint guidance from Stable Diffusion enhances the accuracy of update gradients. |
The performance heavily relies on the capability of the text-to-video model used.
The lack of sufficiently powerful open-source text-to-video models limits the method's potential. |
video editing, text-to-video generation, score distillation, diffusion models, adaptive sliding window |
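As an illustration of the sliding-window smoothing mentioned in the ASSD summary above, the following minimal sketch averages per-frame gradients over every window that covers a frame. The function name, the scalar gradients, and the window layout are hypothetical simplifications, not the authors' implementation.

```python
def sliding_window_average(window_grads, num_frames, stride):
    """Average per-frame gradients over all overlapping windows.

    window_grads[k][i] is the (scalar, for illustration) gradient that
    window k assigns to its i-th frame; window k starts at frame k * stride.
    This layout is an assumption made for the sketch.
    """
    totals = [0.0] * num_frames
    counts = [0] * num_frames
    for k, grads in enumerate(window_grads):
        start = k * stride
        for i, g in enumerate(grads):
            idx = start + i
            if idx < num_frames:  # ignore positions past the last frame
                totals[idx] += g
                counts[idx] += 1
    # Frames not covered by any window keep a zero gradient.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

Frames covered by several windows receive the mean of those windows' gradients, which is one simple way overlapping windows can damp frame-to-frame jumps.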
2406.04875
Report |
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views |
Xiaobiao Du, Haiyang Sun, Shuyun Wang, Zhuojie Wu, Hongwei Sheng, Jiaying Ying, Ming Lu, Tianqing Zhu, Kun Zhan, Xin Yu |
3D cars are commonly used in self-driving systems, virtual/augmented reality,
and games. However, existing 3D car datasets are either synthetic or
low-quality, presenting a significant gap toward the high-quality real-world 3D
car datasets and limiting their applications in practical scenarios. In this
paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar,
offering three distinctive features. (1) High-Volume: 2,500 cars are
meticulously scanned by 3D scanners, obtaining car images and point clouds with
real-world dimensions; (2) High-Quality: Each car is captured in an
average of 200 dense, high-resolution 360-degree RGB-D views, enabling
high-fidelity 3D reconstruction; (3) High-Diversity: The dataset
contains various cars from over 100 brands, collected under three distinct
lighting conditions, including reflective, standard, and dark. Additionally, we
offer detailed car parsing maps for each instance to promote research in car
parsing tasks. Moreover, we remove background point clouds and standardize the
car orientation to a unified axis, enabling background-free reconstruction of
the cars and controllable rendering. We benchmark 3D reconstruction results
with state-of-the-art methods across each lighting condition in 3DRealCar.
Extensive experiments demonstrate that the standard lighting condition part of
3DRealCar can be used to produce a large number of high-quality 3D cars,
improving various 2D and 3D tasks related to cars. Notably, our dataset brings
insight into the fact that recent 3D reconstruction methods face challenges in
reconstructing high-quality 3D cars under reflective and dark lighting
conditions. Our dataset is available at
https://xiaobiaodu.github.io/3drealcar/. |
This paper introduces 3DRealCar, the first large-scale dataset of 3D real cars, offering high volume (2,500 instances), high quality (dense, high-resolution 360-degree RGB-D views), and high diversity (100+ brands, 3 lighting conditions). |
Existing 3D car datasets are limited by being synthetic or low-quality, hindering real-world applications like autonomous driving simulations and realistic 3D modeling. |
Cars were scanned using 3D scanners on smartphones, capturing dense RGB-D images and point clouds. Data preprocessing included background removal, orientation rectification, and point cloud rescaling. The dataset was annotated with car brand, type, color, and parsing maps. |
3DRealCar enables high-quality 3D car reconstruction, especially under standard lighting, as benchmarked with state-of-the-art methods.
Existing methods struggle with reconstructing cars under reflective and dark lighting conditions, posing a new challenge for future research.
3DRealCar enhances the performance of 3D generation and novel view synthesis models by providing real-car priors, improving realism. |
3DRealCar currently only includes car exterior views, limiting its use for interior modeling.
Future work includes expanding the dataset with interior views and exploring methods for robust reconstruction under challenging lighting. |
3d reconstruction, dataset, autonomous driving, car modeling, computer vision |
2406.04746
Report |
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction |
Eduard Poesina, Adriana Valentina Costache, Adrian-Gabriel Chifu, Josiane Mothe, Radu Tudor Ionescu |
Text-to-image generation has recently emerged as a viable alternative to
text-to-image retrieval, due to the visually impressive results of generative
diffusion models. Although query performance prediction is an active research
topic in information retrieval, to the best of our knowledge, there is no prior
study that analyzes the difficulty of queries (prompts) in text-to-image
generation, based on human judgments. To this end, we introduce the first
dataset of prompts which are manually annotated in terms of image generation
performance. In order to determine the difficulty of the same prompts in image
retrieval, we also collect manual annotations that represent retrieval
performance. We thus propose the first benchmark for joint text-to-image prompt
and query performance prediction, comprising 10K queries. Our benchmark
enables: (i) the comparative assessment of the difficulty of prompts/queries in
image generation and image retrieval, and (ii) the evaluation of prompt/query
performance predictors addressing both generation and retrieval. We present
results with several pre-generation/retrieval and post-generation/retrieval
performance predictors, thus providing competitive baselines for future
research. Our benchmark and code are publicly available under the CC BY 4.0
license at https://github.com/Eduard6421/PQPP. |
This paper introduces PQPP, the first manually annotated benchmark for evaluating the difficulty of prompts in text-to-image generation and retrieval. |
This benchmark enables comparative analysis of prompt difficulty across generation and retrieval tasks, facilitating the development of better performance predictors for text-to-image models. |
Researchers collected over 1.5M human relevance judgments for 10K prompts/queries, covering both image generation (using Stable Diffusion and GLIDE) and retrieval (using CLIP and BLIP-2). |
The low correlation between generation and retrieval performance suggests a need for task-specific predictors.
A fine-tuned CLIP model achieves the highest correlation with human judgments for image generation.
A fine-tuned BERT model provides a strong baseline for both generation and retrieval, especially for retrieval precision. |
Subjectivity in human interpretation of prompts for image generation may introduce variability.
The ground-truth image bank for retrieval relies on caption-based pre-filtering, potentially missing relevant images. |
text-to-image generation, text-to-image retrieval, prompt performance prediction, query performance prediction, benchmark |
2406.04675
Report |
OVMR: Open-Vocabulary Recognition with Multi-Modal References |
Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian |
The challenge of open-vocabulary recognition is that the model has no clue about
the new categories it is applied to. Existing works have proposed different methods
to embed category cues into the model, e.g., through few-shot fine-tuning or by
providing category names or textual descriptions to Vision-Language Models.
Fine-tuning is time-consuming and degrades the generalization capability.
Textual descriptions could be ambiguous and fail to depict visual details. This
paper tackles open-vocabulary recognition from a different perspective by
referring to multi-modal clues composed of textual descriptions and exemplar
images. Our method, named OVMR, adopts two innovative components to embed
category cues more robustly. A multi-modal classifier is first
generated by dynamically complementing textual descriptions with image
exemplars. A preference-based refinement module is then applied to fuse
uni-modal and multi-modal classifiers, with the aim of alleviating issues of
low-quality exemplar images or textual descriptions. The proposed OVMR is a
plug-and-play module, and works well with exemplar images randomly crawled from
the Internet. Extensive experiments have demonstrated the promising performance
of OVMR, e.g., it outperforms existing methods across various scenarios and
setups. Code is publicly available at
https://github.com/Zehong-Ma/OVMR. |
This paper presents OVMR, a plug-and-play module that enhances the open-vocabulary recognition capabilities of Vision-Language Models (VLMs) by embedding multi-modal clues (textual descriptions and exemplar images) of novel classes. |
Open-vocabulary recognition is challenging because models have no prior knowledge of unseen categories. Existing methods suffer from limitations like inflexibility, time-consuming fine-tuning, ambiguity in textual descriptions, and varying quality of exemplar images. |
OVMR consists of two modules: 1) A multi-modal classifier generation module that extracts visual tokens from exemplars using a lightweight visual token generator and dynamically fuses them with textual descriptions using a language encoder. 2) A preference-based fusion module that evaluates the performance of uni-modal and multi-modal classifiers on exemplar images and dynamically fuses them based on their performance. |
OVMR achieves comparable performance to state-of-the-art prompt learning methods on 11 classification datasets without requiring fine-tuning.
It outperforms existing few-shot adaptation methods, demonstrating significant improvements on complex datasets like ImageNet.
In open-vocabulary detection, OVMR surpasses previous methods on the LVIS dataset, showing the effectiveness of multi-modal clue embedding. |
The preference-based fusion may have limitations when using very few exemplar images for evaluation.
Future work could explore extending OVMR to other open-vocabulary recognition tasks beyond classification and detection. |
open-vocabulary recognition, vision-language models, multi-modal learning, few-shot learning, classifier fusion |
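The preference-based fusion described in the OVMR summary above can be sketched as follows: each classifier is scored on the exemplar images, and the scores are turned into fusion weights. The softmax over accuracies, the temperature value, and both function names are assumptions chosen for illustration; the paper's actual weighting rule may differ.

```python
import math

def preference_weights(accuracies, temperature=0.1):
    """Softmax over each classifier's accuracy on the exemplar images.

    The softmax and the temperature are illustrative assumptions.
    """
    scaled = [a / temperature for a in accuracies]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_logits(logits_per_classifier, weights):
    """Weighted sum of class logits from several classifiers."""
    num_classes = len(logits_per_classifier[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, logits_per_classifier))
        for c in range(num_classes)
    ]
```

For instance, passing the exemplar accuracies of the text-only, image-only, and multi-modal classifiers to `preference_weights` gives the largest weight to whichever classifier handled the exemplars best.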
2406.04662
Report |
Evaluating and Mitigating IP Infringement in Visual Generative AI |
Zhenting Wang, Chen Chen, Vikash Sehwag, Minzhou Pan, Lingjuan Lyu |
The popularity of visual generative AI models like DALL-E 3, Stable Diffusion
XL, Stable Video Diffusion, and Sora has been increasing. Through extensive
evaluation, we discovered that the state-of-the-art visual generative models
can generate content that bears a striking resemblance to characters protected
by intellectual property rights held by major entertainment companies (such as
Sony, Marvel, and Nintendo), which raises potential legal concerns. This
happens when the input prompt contains the character's name or even just
descriptive details about their characteristics. To mitigate such IP
infringement problems, we also propose a defense method against it. In detail,
we develop a revised generation paradigm that can identify potentially
infringing generated content and prevent IP infringement by utilizing guidance
techniques during the diffusion process. It has the capability to recognize
generated content that may be infringing on intellectual property rights, and
mitigate such infringement by employing guidance methods throughout the
diffusion process without retraining or fine-tuning the pretrained models.
Experiments on well-known character IPs like Spider-Man, Iron Man, and Superman
demonstrate the effectiveness of the proposed defense method. Our data and code
can be found at https://github.com/ZhentingWang/GAI_IP_Infringement. |
This paper presents a systematic evaluation of the risk of intellectual property (IP) infringement in state-of-the-art visual generative AI models, particularly focusing on their ability to generate images resembling copyrighted characters, even without explicitly mentioning their names. The authors also propose a mitigation method to address this problem. |
With the increasing adoption of visual generative AI models, their potential for IP infringement poses serious legal and ethical challenges. This work is important as it highlights the severity of these issues and proposes a method for mitigating them, contributing to the responsible development and deployment of these technologies. |
The authors construct a benchmark of popular copyrighted characters and use a large language model (GPT-4) to craft descriptive prompts that could trigger IP infringement without directly naming the characters. They evaluate seven popular text-to-image and text-to-video generation models for their IP infringement rates. For mitigation, they propose a method combining name blocking, large vision-language model (GPT-4V) detection of infringing content, and classifier-free guidance to steer the generation process away from infringing outputs. |
The evaluation reveals a high prevalence of IP infringement in both open-source and commercial visual generative AI models, with near 100% infringement rates when character names are explicitly mentioned in prompts.
Even with descriptive prompts avoiding character names, the models still exhibit high infringement rates, highlighting the severity of the issue.
The proposed mitigation method effectively reduces IP infringement rates while maintaining language-image alignment quality, demonstrating its potential for enabling more responsible content generation. |
The evaluation primarily focuses on a limited set of characters and visual generative models. Expanding the scope to encompass a wider range of IP-protected content and models would provide a more comprehensive understanding of the problem.
The reliance on large language and vision-language models for mitigation introduces dependencies on the capabilities and potential biases of these models. Exploring alternative or complementary approaches for detecting and mitigating IP infringement could further enhance the robustness of the proposed solution. |
ai ethics, intellectual property, visual generative ai, diffusion models, content moderation |
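The classifier-free guidance step named in the mitigation summary above follows the standard combination of two noise predictions. The sketch below shows that formula on plain lists. Using a prediction conditioned on the infringing concept as the reference term (a negative-prompt style setup) is an assumption about how the steering could work, not a detail confirmed by the summary.

```python
def guided_noise(eps_cond, eps_ref, scale):
    """Classifier-free guidance: eps_ref + scale * (eps_cond - eps_ref).

    eps_cond: noise prediction for the user's prompt.
    eps_ref:  reference prediction; unconditional in standard guidance, or
              (assumption) conditioned on the concept to steer away from.
    """
    return [r + scale * (c - r) for c, r in zip(eps_cond, eps_ref)]
```

With `scale > 1`, the result is pushed further from the reference prediction, which is what moves the sample away from the reference concept when that concept is the infringing character.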
2406.04542
Report |
M&M VTO: Multi-Garment Virtual Try-On and Editing |
Luyang Zhu, Yingwei Li, Nan Liu, Hao Peng, Dawei Yang, Ira Kemelmacher-Shlizerman |
We present M&M VTO, a mix and match virtual try-on method that takes as input
multiple garment images, text description for garment layout and an image of a
person. An example input includes: an image of a shirt, an image of a pair of
pants, "rolled sleeves, shirt tucked in", and an image of a person. The output
is a visualization of how those garments (in the desired layout) would look
like on the given person. Key contributions of our method are: 1) a single
stage diffusion-based model, with no super-resolution cascading, that allows
mixing and matching multiple garments at 1024x512 resolution while preserving and warping
intricate garment details, 2) architecture design (VTO UNet Diffusion
Transformer) to disentangle denoising from person specific features, allowing
for a highly effective finetuning strategy for identity preservation (6MB model
per individual vs 4GB achieved with, e.g., DreamBooth finetuning), solving a
common identity loss problem in current virtual try-on methods, 3) layout
control for multiple garments via text inputs specifically finetuned over
PaLI-3 for virtual try-on task. Experimental results indicate that M&M VTO
achieves state-of-the-art performance both qualitatively and quantitatively, and
opens up new opportunities for virtual try-on via language-guided and
multi-garment try-on. |
This paper introduces M&M VTO, a single-stage diffusion-based virtual try-on method for mixing and matching multiple garments with layout control. |
M&M VTO addresses limitations in existing VTO methods, such as preserving intricate garment details, maintaining person identity, and handling multiple garments with layout variations. |
M&M VTO uses a single-stage diffusion model with progressive training to synthesize high-resolution images. It employs a VTO-UDiT architecture to disentangle person features for efficient finetuning and leverages a finetuned PaLI-3 model for layout control. |
M&M VTO outperforms state-of-the-art methods in preserving garment details and layouts, both qualitatively and quantitatively.
The proposed method allows for layout control using textual descriptions, enabling edits like tucking or rolling up garments.
Efficient finetuning on person features in M&M VTO preserves individual identity without overfitting to specific clothing items. |
M&M VTO faces challenges with uncommon garment combinations and layout editing that requires inpainting unseen areas.
The model does not explicitly incorporate size information for a perfect fit. |
virtual try-on, diffusion models, image synthesis, layout control, person identity preservation |
2406.04343
Report |
Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image |
Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi |
In this paper, we propose Flash3D, a method for scene reconstruction and
novel view synthesis from a single image which is both very generalisable and
efficient. For generalisability, we start from a "foundation" model for
monocular depth estimation and extend it to a full 3D shape and appearance
reconstructor. For efficiency, we base this extension on feed-forward Gaussian
Splatting. Specifically, we predict a first layer of 3D Gaussians at the
predicted depth, and then add additional layers of Gaussians that are offset in
space, allowing the model to complete the reconstruction behind occlusions and
truncations. Flash3D is very efficient, trainable on a single GPU in a day, and
thus accessible to most researchers. It achieves state-of-the-art results when
trained and tested on RealEstate10k. When transferred to unseen datasets like
NYU it outperforms competitors by a large margin. More impressively, when
transferred to KITTI, Flash3D achieves better PSNR than methods trained
specifically on that dataset. In some instances, it even outperforms recent
methods that use multiple views as input. Code, models, demo, and more results
are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/. |
This paper introduces Flash3D, an efficient and generalizable method for reconstructing 3D scenes and synthesizing novel views from a single image using a feed-forward network and Gaussian Splatting. |
Current methods for monocular scene reconstruction are often computationally expensive, limited in generalization ability, or rely on iterative optimization. Flash3D addresses these limitations by leveraging the efficiency of Gaussian Splatting and the generalization capability of a pre-trained monocular depth estimation model. |
Flash3D extends a foundation model for monocular depth estimation by predicting multiple layers of 3D Gaussians for each pixel. The first layer captures visible surfaces guided by the depth estimate, while subsequent layers model occluded and truncated regions. This multi-Gaussian representation, coupled with image padding to capture out-of-view regions, enables the model to reconstruct complete scenes. |
Flash3D achieves state-of-the-art novel view synthesis accuracy on RealEstate10k, outperforming methods specifically designed for single-view scene reconstruction.
The model demonstrates strong cross-domain generalization, achieving state-of-the-art accuracy on NYU and KITTI datasets without being trained on them.
Flash3D exhibits superior performance in view extrapolation compared to existing two-view methods, indicating its ability to effectively model unseen areas. |
As a deterministic, regression-based model, Flash3D may produce blurry renderings in regions with ambiguity, such as large baselines, occlusions, or backward camera motion.
The non-negativity constraint on depth offsets can limit the model's ability to recover scene structure closer to the camera than the initial depth estimate, making it sensitive to failures in the pre-trained depth estimator. |
3d scene reconstruction, novel view synthesis, monocular vision, gaussian splatting, deep learning |
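The layered Gaussians and the non-negativity constraint on depth offsets described in the Flash3D summary above can be illustrated per pixel. In this sketch, softplus is one assumed way to keep offsets positive, and accumulating offsets layer by layer is likewise an assumption; `layer_depths` is a hypothetical helper, not the paper's code.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def layer_depths(base_depth, raw_offsets):
    """Depths of the Gaussian layers for one pixel.

    The first layer sits at the predicted monocular depth. Each later layer
    is pushed further from the camera by a positive offset (assumption:
    offsets accumulate from layer to layer).
    """
    depths = [base_depth]
    for raw in raw_offsets:
        depths.append(depths[-1] + softplus(raw))
    return depths
```

Because every offset is positive, no layer can land in front of the initial depth estimate, which is the limitation the summary notes when the depth estimator overshoots.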
2406.04342
Report |
Learning 1D Causal Visual Representation with De-focus Attention Networks |
Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie Zhou, Jifeng Dai |
Modality differences have led to the development of heterogeneous
architectures for vision and language models. While images typically require 2D
non-causal modeling, texts utilize 1D causal modeling. This distinction poses
significant challenges in constructing unified multi-modal models. This paper
explores the feasibility of representing images using 1D causal modeling. We
identify an "over-focus" issue in existing 1D causal vision models, where
attention overly concentrates on a small proportion of visual tokens. The issue
of "over-focus" hinders the model's ability to extract diverse visual features
and to receive effective gradients for optimization. To address this, we
propose De-focus Attention Networks, which employ learnable bandpass filters to
create varied attention patterns. During training, large and scheduled drop
path rates, and an auxiliary loss on globally pooled features for global
understanding tasks are introduced. These two strategies encourage the model to
attend to a broader range of tokens and enhance network optimization. Extensive
experiments validate the efficacy of our approach, demonstrating that 1D causal
visual representation can perform comparably to 2D non-causal representation in
tasks such as global perception, dense prediction, and multi-modal
understanding. Code is released at
https://github.com/OpenGVLab/De-focus-Attention-Networks. |
This paper proposes De-focus Attention Networks to enhance 1D causal visual modeling by addressing the "over-focus" issue, where attention concentrates excessively on a few tokens, hindering diverse feature extraction and gradient flow. |
Bridging the performance gap between 1D causal and 2D non-causal vision models is crucial for constructing unified and effective multi-modal models. |
The authors introduce De-focus Attention with learnable bandpass filters, incorporating learnable exponential spatial decay and relative position embeddings to create diverse attention patterns. They also employ large scheduled drop path rates and an auxiliary loss on globally pooled features for global understanding tasks to enhance network optimization. |
De-focus Attention Networks achieve comparable or even superior performance to 2D non-causal ViTs on ImageNet classification, object detection, and image-text retrieval.
The proposed method consistently improves performance across various architectures like ViT, Mamba, and RetNet.
The effectiveness of learnable bandpass filters, large drop path rates, and the auxiliary loss is validated through ablation studies. |
The work primarily focuses on image-based tasks; further exploration is needed for other visual modalities.
Future research can investigate the optimal integration of De-focus Attention Networks with existing multi-modal models. |
computer vision, causal modeling, vision transformers, state space models, multi-modal learning |
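The exponential spatial decay mentioned in the De-focus Attention summary above can be sketched as causal attention whose weight on token j, seen from token i, is scaled by gamma^(i - j). Here gamma is a fixed scalar for illustration, whereas the summary describes the decay as learnable. The function name is hypothetical, and the relative position embeddings that are part of the bandpass filters are omitted.

```python
import math

def decayed_causal_attention(scores, gamma):
    """Causal attention weights with exponential decay over distance.

    scores: n x n list of raw attention logits.
    gamma:  decay factor in (0, 1]; gamma = 1 gives plain causal softmax.
    """
    n = len(scores)
    log_gamma = math.log(gamma)
    weights = []
    for i in range(n):
        # Only positions j <= i are visible; add the decay in the log domain.
        logits = [scores[i][j] + (i - j) * log_gamma for j in range(i + 1)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights
```

Using different decay factors in different heads yields different effective attention ranges, which is one way varied attention patterns can arise.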
2406.04341
Report |
Interpreting the Second-Order Effects of Neurons in CLIP |
Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt |
We interpret the function of individual neurons in CLIP by automatically
describing them using text. Analyzing the direct effects (i.e. the flow from a
neuron through the residual stream to the output) or the indirect effects
(overall contribution) fails to capture the neurons' function in CLIP.
Therefore, we present the "second-order lens", analyzing the effect flowing
from a neuron through the later attention heads, directly to the output. We
find that these effects are highly selective: for each neuron, the effect is
significant for <2% of the images. Moreover, each effect can be approximated by
a single direction in the text-image space of CLIP. We describe neurons by
decomposing these directions into sparse sets of text representations. The sets
reveal polysemantic behavior - each neuron corresponds to multiple, often
unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we
mass-produce "semantic" adversarial examples by generating images with concepts
spuriously correlated to the incorrect class. Additionally, we use the
second-order effects for zero-shot segmentation and attribute discovery in
images. Our results indicate that a scalable understanding of neurons can be
used for model deception and for introducing new model capabilities. |
This paper presents an interpretability method for understanding the function of individual neurons in CLIP by describing them using text, focusing on their second-order effects (contributions flowing through subsequent attention heads to the output). |
Interpreting neurons in CLIP is crucial for understanding model limitations, enabling interventions, and potentially uncovering new capabilities. |
The authors introduce a 'second-order lens', analyzing the effect of a neuron's activation flowing through later attention heads to the output. They decompose these second-order effects into sparse sets of text representations, revealing the polysemantic nature of neurons. |
Neurons in later CLIP layers have more significant second-order effects.
Each neuron's second-order effect is highly selective, significantly impacting only a small subset of images.
Neurons exhibit polysemantic behavior, responding to multiple, often unrelated concepts. |
The method does not fully analyze the effects of neurons on attention map patterns (queries and keys).
Mutual effects and dependencies between neurons within and across layers are not explored. |
interpretability, clip, neuron analysis, adversarial examples, zero-shot segmentation |
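The decomposition of a neuron's direction into a sparse set of text representations, described in the summary above, can be illustrated with greedy matching pursuit. This is a stand-in technique chosen for brevity and may differ from the authors' sparse coding procedure. The sketch assumes unit-norm text embeddings, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_decompose(direction, text_embeddings, k):
    """Greedy matching pursuit over a dictionary of text embeddings.

    Returns k (index, coefficient) pairs and the unexplained residual.
    Assumes every text embedding has unit norm.
    """
    residual = list(direction)
    chosen = []
    for _ in range(k):
        scores = [dot(residual, t) for t in text_embeddings]
        # Pick the text embedding most aligned with what is left to explain.
        best = max(range(len(scores)), key=lambda i: abs(scores[i]))
        coef = scores[best]
        chosen.append((best, coef))
        residual = [r - coef * t for r, t in zip(residual, text_embeddings[best])]
    return chosen, residual
```

The indices returned point at the text descriptions that best explain the direction; a neuron whose selected texts are unrelated to one another would show the polysemantic behavior the summary reports.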
2406.04338
Report |
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion |
Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, Yueqi Duan |
In recent years, there has been rapid development in 3D generation models,
opening up new possibilities for applications such as simulating the dynamic
movements of 3D objects and customizing their behaviors. However, current 3D
generative models tend to focus only on surface features such as color and
shape, neglecting the inherent physical properties that govern the behavior of
objects in the real world. To accurately simulate physics-aligned dynamics, it
is essential to predict the physical properties of materials and incorporate
them into the behavior prediction process. Nonetheless, predicting the diverse
materials of real-world objects is still challenging due to the complex nature
of their physical attributes. In this paper, we propose \textbf{Physics3D}, a
novel method for learning various physical properties of 3D objects through a
video diffusion model. Our approach involves designing a highly generalizable
physical simulation system based on a viscoelastic material model, which
enables us to simulate a wide range of materials with high-fidelity
capabilities. Moreover, we distill the physical priors from a video diffusion
model that contains more understanding of realistic object materials. Extensive
experiments demonstrate the effectiveness of our method with both elastic and
plastic materials. Physics3D shows great potential for bridging the gap between
the physical world and virtual neural space, providing a better integration and
application of realistic physical principles in virtual environments. Project
page: https://liuff19.github.io/Physics3D. |
This paper presents Physics3D, a novel framework for learning various physical properties of 3D objects from video diffusion models, enabling the simulation of diverse materials with both elasticity and viscosity. |
Current 3D generative models often prioritize surface features over inherent physical properties, limiting their ability to realistically simulate object dynamics. |
The method employs a viscoelastic Material Point Method (MPM) with elastoplastic and viscoelastic components, and leverages a video generation model (Stable Video Diffusion) to distill physical priors for optimization. |
Successfully simulates complex textured objects with realistic and physically plausible movements.
Outperforms baselines (PhysDreamer, PhysGaussian, DreamGaussian4D) in terms of realism, damping, and motion consistency, as demonstrated by space-time slice visualizations and video quality metrics.
User study confirms Physics3D generates significantly more preferred results regarding quality, realism, and fluency. |
Current method requires manual intervention to define movable objects and filling ranges in complex environments.
Future work aims to automate these processes using large segmentation models and enhance the physics system modeling. |
3d dynamic generation, physical simulation, viscoelastic material point method (mpm), video diffusion model, score distillation sampling (sds) |
2406.04337
Report |
Coherent Zero-Shot Visual Instruction Generation |
Quynh Phung, Songwei Ge, Jia-Bin Huang |
Despite the advances in text-to-image synthesis, particularly with diffusion
models, generating visual instructions that require consistent representation
and smooth state transitions of objects across sequential steps remains a
formidable challenge. This paper introduces a simple, training-free framework
to tackle the issues, capitalizing on the advancements in diffusion models and
large language models (LLMs). Our approach systematically integrates text
comprehension and image generation to ensure visual instructions are visually
appealing and maintain consistency and accuracy throughout the instruction
sequence. We validate the effectiveness by testing multi-step instructions and
comparing the text alignment and consistency with several baselines. Our
experiments show that our approach can visualize coherent and visually pleasing
instructions. |
This paper introduces a training-free framework for generating coherent visual instructions from textual instructions, leveraging pre-trained text-to-image diffusion models and large language models (LLMs). |
Generating visual instructions is crucial for intuitive understanding and overcoming language barriers. Existing text-to-image methods struggle with maintaining consistency and accurately depicting state transitions across multiple steps. |
The framework employs a two-stage process: 1) In-context planning with LLMs to re-caption instructions into descriptive texts, capturing object states and relationships. 2) Adaptive feature-sharing for image generation, using local region constraints from segmentation models and global state similarity constraints from LLMs. |
Re-captioning instructions as descriptive text significantly improves coherence and accuracy compared to using raw instructions.
Adaptive feature sharing with local and global constraints effectively balances object consistency and necessary variations across steps.
The method generates high-quality visual instructions comparable to fine-tuned models, demonstrating the potential of training-free approaches. |
The generation quality is limited by the capabilities of current text-to-image models, sometimes failing to accurately depict specific objects or attributes.
Future work can explore incorporating temporal reasoning and fine-grained control over object transformations. |
visual instruction generation, text-to-image synthesis, diffusion models, large language models, zero-shot learning |
2406.04333
Report |
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model |
Yang Sui, Yanyu Li, Anil Kag, Yerlan Idelbayev, Junli Cao, Ju Hu, Dhritiman Sagar, Bo Yuan, Sergey Tulyakov, Jian Ren |
Diffusion-based image generation models have achieved great success in recent
years by showing the capability of synthesizing high-quality content. However,
these models contain a huge number of parameters, resulting in a significantly
large model size. Saving and transferring them is a major bottleneck for
various applications, especially those running on resource-constrained devices.
In this work, we develop a novel weight quantization method that quantizes the
UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model with 7.9X
smaller size while exhibiting even better generation quality than the original
one. Our approach includes several novel techniques, such as assigning optimal
bits to each layer, initializing the quantized model for better performance,
and improving the training strategy to dramatically reduce quantization error.
Furthermore, we extensively evaluate our quantized model across various
benchmark datasets and through human evaluation to demonstrate its superior
generation quality. |
This paper introduces BitsFusion, a novel weight quantization framework that compresses the UNet weights of SD-v1.5 to 1.99 bits, achieving a 7.9x smaller model size while maintaining or even improving generation quality. |
Large-scale diffusion models are difficult to store and transfer due to their size, especially for resource-constrained devices. Quantization offers a solution by reducing model size without significant architectural changes. |
The method involves per-layer quantization error analysis using MSE and CLIP score to develop a mixed-precision strategy. It also introduces techniques like time embedding pre-computing, balanced integer initialization, and alternating optimization for scaling factors. Training employs a two-stage pipeline with distillation and noise prediction, incorporating quantization error-aware time step sampling. |
The 1.99-bit quantized model consistently outperforms the full-precision SD-v1.5 across various benchmark datasets and evaluation metrics (TIFA, GenEval, CLIP score).
Human evaluation on PartiPrompts shows user preference for BitsFusion over SD-v1.5.
BitsFusion outperforms other quantization methods like LSQ, Q-Diffusion, EfficientDM, and Apple-MBP in CLIP score. |
The compression of VAE and CLIP text encoder is not explored in this work.
The weight quantization techniques could be extended to activation quantization. |
diffusion models, quantization, stable diffusion, model compression, image generation |
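The size arithmetic in the BitsFusion entry above, and the general idea of storing weights as low-bit integers with a shared scaling factor, can be sketched as follows. This is a generic symmetric uniform quantizer written for illustration only; it is not the BitsFusion method, and the function names and the 16-bit baseline are assumptions.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer levels that share one scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1           # largest positive level, e.g. 1 for 2 bits
    qmin = -(2 ** (bits - 1))            # most negative level, e.g. -2 for 2 bits
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / max(qmax, 1)       # one scaling factor for the whole list
    levels = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return levels, scale


def dequantize(levels, scale):
    """Recover approximate float weights from integer levels."""
    return [q * scale for q in levels]


def ideal_compression_ratio(baseline_bits, avg_bits):
    """Size ratio if only the weight bits are counted (no metadata overhead)."""
    return baseline_bits / avg_bits


# Assumed baseline: 16-bit weights; 1.99 is the average bit width quoted above.
ratio = ideal_compression_ratio(16, 1.99)
levels, scale = quantize_symmetric([0.5, -1.0, 0.2, 0.9], bits=2)
```

Counting weight bits alone gives a ratio slightly above 8, a little more than the reported 7.9x; the difference presumably comes from storage that this sketch ignores, such as scaling factors and any layers kept at higher precision.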
2406.04332
Report |
Coarse-To-Fine Tensor Trains for Compact Visual Representations |
Sebastian Loeschcke, Dan Wang, Christian Leth-Espensen, Serge Belongie, Michael J. Kastoryano, Sagie Benaim |
The ability to learn compact, high-quality, and easy-to-optimize
representations for visual data is paramount to many applications such as novel
view synthesis and 3D reconstruction. Recent work has shown substantial success
in using tensor networks to design such compact and high-quality
representations. However, the ability to optimize tensor-based representations,
and in particular, the highly compact tensor train representation, is still
lacking. This has prevented practitioners from deploying the full potential of
tensor networks for visual data. To this end, we propose 'Prolongation
Upsampling Tensor Train (PuTT)', a novel method for learning tensor train
representations in a coarse-to-fine manner. Our method involves the prolonging
or 'upsampling' of a learned tensor train representation, creating a sequence
of 'coarse-to-fine' tensor trains that are incrementally refined. We evaluate
our representation along three axes: (1) compression, (2) denoising
capability, and (3) image completion capability. To assess these axes, we
consider the tasks of image fitting, 3D fitting, and novel view synthesis,
where our method shows an improved performance compared to state-of-the-art
tensor-based methods. For full results see our project webpage:
https://sebulo.github.io/PuTT_website/ |
PuTT is a coarse-to-fine tensor train representation that learns compact visual representations through incremental refinement, surpassing previous tensor-based methods in compression, denoising, and handling incomplete data. |
Existing tensor-based representations struggle with optimization, getting trapped in local minima and failing to utilize the full compression potential of tensor trains, especially with noisy or incomplete data. |
Starting with a low-resolution representation, PuTT iteratively upsamples learned tensor trains using a prolongation operator and TT-SVD for rank control, refining the representation in a coarse-to-fine manner. |
PuTT achieves better compression ratios and higher PSNR/SSIM scores than CP, Tucker, and VM decompositions on 2D and 3D data fitting.
It excels in denoising, outperforming baselines across varying noise levels and exhibiting superior visual quality.
PuTT effectively handles incomplete data, achieving high PSNR/SSIM even with 99% data missing. |
Current implementation of TensoRF’s “shrinkage” process is not compatible with QTT.
PuTT is not specifically designed as a generative model and is not as effective for tasks like image inpainting over large areas. |
tensor networks, tensor train, quantized tensor train, coarse-to-fine learning, visual representation |
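To make the compactness of the tensor train format in the PuTT entry above concrete, the sketch below counts the parameters of a tensor train whose cores have shape (r_{k-1}, n_k, r_k) and compares them with the dense tensor. The rank-capping helper and the example reshaping of a 1024 x 1024 image into ten modes of size 4 (a quantized-tensor-train-style layout) are illustrative assumptions, not the paper's exact configuration.

```python
from math import prod


def capped_ranks(mode_sizes, max_rank):
    """Boundary ranks are 1; inner ranks are capped by max_rank and by the
    largest rank the unfolding at that cut can have."""
    d = len(mode_sizes)
    return [min(prod(mode_sizes[:k]), prod(mode_sizes[k:]), max_rank)
            for k in range(d + 1)]


def tt_param_count(mode_sizes, ranks):
    """Sum of core sizes r_{k-1} * n_k * r_k over all cores."""
    assert len(ranks) == len(mode_sizes) + 1
    assert ranks[0] == 1 and ranks[-1] == 1
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(mode_sizes))


def compression_ratio(mode_sizes, ranks):
    """Dense entry count divided by tensor train parameter count."""
    return prod(mode_sizes) / tt_param_count(mode_sizes, ranks)


# Assumed example: 1024 x 1024 grayscale image viewed as 10 modes of size 4.
modes = [4] * 10
ranks = capped_ranks(modes, max_rank=16)
params = tt_param_count(modes, ranks)
ratio = compression_ratio(modes, ranks)
```

With a rank cap of 16 the cores hold 6,688 numbers against 1,048,576 dense entries. How much reconstruction quality survives at a given rank depends on the data and on the optimization, which is the part PuTT addresses.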
2406.04330
Report |
Parameter-Inverted Image Pyramid Networks |
Xizhou Zhu, Xue Yang, Zhaokai Wang, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai |
Image pyramids are commonly used in modern computer vision tasks to obtain
multi-scale features for precise understanding of images. However, image
pyramids process multiple resolutions of images using the same large-scale
model, which requires significant computational cost. To overcome this issue,
we propose a novel network architecture known as the Parameter-Inverted Image
Pyramid Networks (PIIP). Our core idea is to use models with different
parameter sizes to process different resolution levels of the image pyramid,
thereby balancing computational efficiency and performance. Specifically, the
input to PIIP is a set of multi-scale images, where higher resolution images
are processed by smaller networks. We further propose a feature interaction
mechanism to allow features of different resolutions to complement each other
and effectively integrate information from different spatial scales. Extensive
experiments demonstrate that the PIIP achieves superior performance in tasks
such as object detection, segmentation, and image classification, compared to
traditional image pyramid methods and single-branch networks, while reducing
computational cost. Notably, when applying our method on a large-scale vision
foundation model InternViT-6B, we improve its performance by 1%-2% on detection
and segmentation with only 40%-60% of the original computation. These results
validate the effectiveness of the PIIP approach and provide a new technical
direction for future vision computing tasks. Our code and models are available
at https://github.com/OpenGVLab/PIIP. |
This paper introduces Parameter-Inverted Image Pyramid Networks (PIIP), a novel architecture that enhances multi-scale representation in vision backbones while improving computational efficiency. |
Traditional image pyramids, while effective, impose significant computational overhead by processing images at multiple resolutions with the same large-scale model. PIIP addresses this challenge by using a parameter-inverted design. |
PIIP employs a multi-branch structure with cross-branch interactions and branch merging. Smaller models handle higher-resolution images, while larger models process lower-resolution images. Feature interaction modules facilitate information exchange between branches. |
PIIP achieves superior performance compared to traditional image pyramids and single-branch networks in object detection, instance segmentation, semantic segmentation, and image classification tasks while reducing computational costs.
When applied to the large-scale InternViT-6B model, PIIP improves performance by 1%-2% on detection and segmentation tasks while using only 40%-60% of the original computation.
Extensive ablations provide design guidelines for PIIP, such as prioritizing resolution increase in the largest image branch and limiting the largest model size. |
Current experiments focus on adapting PIIP to existing pre-trained models; future work will explore from-scratch pre-training with PIIP.
The interaction mechanism between branches can be further improved by incorporating more advanced attention mechanisms. |
image pyramid, multi-scale representation learning, vision transformer, computational efficiency, object detection, instance segmentation, semantic segmentation, image classification |
2406.04325
Report |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions |
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang |
We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation
of text-to-video models (T2VMs) via dense and precise captions. The series
comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with
various lengths and sources, developed through carefully designed data
filtering and annotating strategy. 2) ShareCaptioner-Video, an efficient and
capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic
videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that
reached SOTA performance on three advancing video benchmarks. To achieve this,
setting aside costly, non-scalable human annotation, we find that using GPT4V to
caption video with a naive multi-frame or frame-concatenation input strategy
leads to less detailed and sometimes temporally confused results. We argue the
challenge of designing a high-quality video captioning strategy lies in three
aspects: 1) Inter-frame precise temporal change understanding. 2) Intra-frame
detailed content description. 3) Frame-number scalability for arbitrary-length
videos. To this end, we meticulously designed a differential video captioning
strategy, which is stable, scalable, and efficient for generating captions for
videos with arbitrary resolution, aspect ratios, and length. Based on it, we
construct ShareGPT4Video, which contains 40K high-quality videos spanning a
wide range of categories, and the resulting captions encompass rich world
knowledge, object attributes, camera movements, and crucially, detailed and
precise temporal descriptions of events. Based on ShareGPT4Video, we further
develop ShareCaptioner-Video, a superior captioner capable of efficiently
generating high-quality captions for arbitrary videos... |
This paper introduces ShareGPT4Video, a dataset of 40K video-caption pairs with detailed temporal descriptions generated using GPT4V, and ShareCaptioner-Video, a model fine-tuned on this dataset for efficient high-quality video captioning. |
Existing video caption datasets often lack detailed temporal descriptions, limiting the development of large video-language models (LVLMs) and text-to-video models (T2VMs). This work aims to address this gap by providing high-quality, temporally rich video captions. |
The authors develop a Differential Sliding-Window Captioning (DiffSW) strategy that leverages GPT4V to generate detailed descriptions of changes between consecutive keyframes. These differential captions are then summarized into a comprehensive video caption using GPT4. This strategy ensures temporal consistency and detailed content description. |
ShareGPT4Video, containing 40K high-quality video-caption pairs, significantly improves the performance of existing LVLMs like VideoLLaVA and LLaMA-VID on benchmarks like VideoBench, MVBench, and TempCompass.
ShareCaptioner-Video, trained on ShareGPT4Video, enables the efficient generation of high-quality captions for a larger dataset of 4.8M videos, totaling 3000 hours.
T2VMs trained on the detailed captions generated by ShareCaptioner-Video demonstrate improved control over semantic content and camera movement in video generation. |
The current pipeline does not incorporate audio information, limiting its applicability to conversational scenarios.
The dataset relies on videos from existing sources and may contain human faces, requiring users to adhere to the original licenses. |
video captioning, large video-language models, text-to-video generation, multi-modal learning, gpt4v |
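The differential sliding-window idea in the ShareGPT4Video entry above can be sketched as a loop over consecutive keyframes. The three callables are hypothetical stand-ins for the captioning and summarization model calls; passing the previous caption into each change description is an assumption of this sketch, not a detail stated above.

```python
def differential_captions(keyframes, describe_frame, describe_change):
    """Caption the first keyframe, then describe only what changes between
    each pair of consecutive keyframes."""
    if not keyframes:
        return []
    captions = [describe_frame(keyframes[0])]
    for previous, current in zip(keyframes, keyframes[1:]):
        # The window covers two frames plus the most recent caption (assumed).
        captions.append(describe_change(previous, current, captions[-1]))
    return captions


def caption_video(keyframes, describe_frame, describe_change, summarize):
    """Merge the per-window differential captions into one video caption."""
    return summarize(differential_captions(keyframes, describe_frame, describe_change))


# Toy stand-ins so the sketch runs without any model.
toy_caption = caption_video(
    ["f0", "f1", "f2"],
    describe_frame=lambda frame: f"start:{frame}",
    describe_change=lambda prev, curr, last: f"{prev}->{curr}",
    summarize=" | ".join,
)
```

Because each call sees a fixed-size window, the number of model calls grows linearly with the number of keyframes, which is what makes the strategy applicable to videos of arbitrary length.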
2406.04324
Report |
SF-V: Single Forward Video Generation Model |
Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren |
Diffusion-based video generation models have demonstrated remarkable success
in obtaining high-fidelity videos through the iterative denoising process.
However, these models require multiple denoising steps during sampling,
resulting in high computational costs. In this work, we propose a novel
approach to obtain single-step video generation models by leveraging
adversarial training to fine-tune pre-trained video diffusion models. We show
that, through adversarial training, a multi-step video diffusion model,
i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward
pass to synthesize high-quality videos, capturing both temporal and spatial
dependencies in the video data. Extensive experiments demonstrate that our
method achieves competitive generation quality of synthesized videos with
significantly reduced computational overhead for the denoising process (i.e.,
around $23\times$ speedup compared with SVD and $6\times$ speedup compared with
existing works, with even better generation quality), paving the way for
real-time video synthesis and editing. More visualization results are made
publicly available at https://snap-research.github.io/SF-V. |
This paper presents the first single-step image-to-video generation model based on a fine-tuned pre-trained video diffusion model, significantly reducing computational cost while maintaining quality. |
Video diffusion models, while powerful, suffer from high computational costs due to multi-step denoising processes, hindering their wider deployment. |
The authors leverage adversarial training on the latent space of a pre-trained Stable Video Diffusion (SVD) model. They introduce a discriminator with spatial and temporal heads to enhance image quality and motion consistency, respectively. |
The model achieves generation quality comparable to SVD with 16 denoising steps while delivering a ~23x speedup.
It outperforms existing few-step video generation methods like AnimateLCM in both quality and speed.
Ablation studies demonstrate the importance of both spatial and temporal discriminator heads, as well as the impact of noise distribution for optimal performance. |
While single-step denoising is achieved, other components like the temporal VAE decoder still contribute to the overall runtime.
Future work includes accelerating these components for a truly real-time video generation system. |
video generation, diffusion models, adversarial training, single-step generation, latent space |
2406.04322
Report |
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data |
Qihao Liu, Yi Zhang, Song Bai, Adam Kortylewski, Alan Yuille |
We present DIRECT-3D, a diffusion-based 3D generative model for creating
high-quality 3D assets (represented by Neural Radiance Fields) from text
prompts. Unlike recent 3D generative models that rely on clean and well-aligned
3D data, limiting them to single or few-class generation, our model is directly
trained on extensive noisy and unaligned 'in-the-wild' 3D assets, mitigating
the key challenge (i.e., data scarcity) in large-scale 3D generation. In
particular, DIRECT-3D is a tri-plane diffusion model that integrates two
innovations: 1) A novel learning framework where noisy data are filtered and
aligned automatically during the training process. Specifically, after an
initial warm-up phase using a small set of clean data, an iterative
optimization is introduced in the diffusion process to explicitly estimate the
3D pose of objects and select beneficial data based on conditional density. 2)
An efficient 3D representation that is achieved by disentangling object
geometry and color features with two separate conditional diffusion models that
are optimized hierarchically. Given a prompt input, our model generates
high-quality, high-resolution, realistic, and complex 3D objects with accurate
geometric details in seconds. We achieve state-of-the-art performance in both
single-class generation and text-to-3D generation. We also demonstrate that
DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to
alleviate the well-known Janus problem in 2D-lifting methods such as
DreamFusion. The code and models are available for research purposes at:
https://github.com/qihao067/direct3d. |
DIRECT-3D, a diffusion-based 3D generative model for high-quality 3D asset creation from text prompts, trained on noisy and unaligned 'in-the-wild' 3D assets. |
Addresses the challenge of data scarcity in large-scale 3D generation by utilizing readily available, albeit noisy, 'in-the-wild' 3D data, overcoming limitations of previous methods reliant on clean, aligned, and limited datasets. |
Employs a tri-plane diffusion model with two key innovations: 1) Iterative optimization during training for automatic data cleaning and alignment based on conditional density. 2) Disentanglement of object geometry and color features using separate conditional diffusion models optimized hierarchically. |
Achieves state-of-the-art performance in single-class generation, outperforming previous methods by a large margin on standard benchmarks.
Exhibits superior performance in text-to-3D generation compared to previous work (Shap-E), demonstrating higher quality, detail, complexity, and realism as per user studies.
Serves as an effective 3D geometry prior, significantly improving the consistency of 2D-lifting methods like DreamFusion and mitigating issues like the Janus problem. |
Limited compositionality due to the nature of 3D datasets and model design, struggling with novel object combinations.
Potential lack of realistic texture details due to the limitations of current large-scale 3D datasets primarily containing synthetic data. |
text-to-3d generation, diffusion models, neural radiance fields (nerf), 3d geometry prior, in-the-wild 3d data |
2406.04321
Report |
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling |
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo |
In this work, we systematically study music generation conditioned solely on
the video. First, we present a large-scale dataset comprising 190K video-music
pairs, including various genres such as movie trailers, advertisements, and
documentaries. Furthermore, we propose VidMuse, a simple framework for
generating music aligned with video inputs. VidMuse stands out by producing
high-fidelity music that is both acoustically and semantically aligned with the
video. By incorporating local and global visual cues, VidMuse enables the
creation of musically coherent audio tracks that consistently match the video
content through Long-Short-Term modeling. Through extensive experiments,
VidMuse outperforms existing models in terms of audio quality, diversity, and
audio-visual alignment. The code and datasets will be available at
https://github.com/ZeyueT/VidMuse/. |
This paper introduces V2M, a large-scale dataset for video-to-music generation, and proposes VidMuse, a novel method that generates music aligned with video content using a long-short-term modeling approach. |
Video-to-music generation is a challenging task with increasing demand due to the growth of social media platforms. Existing datasets are limited in size, diversity, or focus on specific musical forms like MIDI. |
The authors construct V2M by collecting and meticulously filtering a large corpus of video-music pairs from YouTube and IMDb. VidMuse utilizes a visual encoder, a long-short-term visual module to capture global and local video features, a music token decoder, and an audio codec decoder to generate music. |
VidMuse outperforms existing models in objective metrics for audio quality, diversity, and audio-visual alignment on the V2M benchmark.
Subjective user studies confirm that VidMuse generates music that is better aligned with videos and exhibits superior audio quality and musicality compared to baseline methods.
Ablation studies demonstrate the efficacy of the long-short-term modeling approach and justify the choice of hyperparameters in VidMuse. |
The current codec used in VidMuse limits the audio sampling rate and introduces reconstruction loss.
Training large models like VidMuse requires substantial computational resources. |
video-to-music generation, music generation, multi-modal learning, deep learning, dataset |
2406.04314
Report |
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step |
Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, Liang Zheng |
Recently, Direct Preference Optimization (DPO) has extended its success from
aligning large language models (LLMs) to aligning text-to-image diffusion
models with human preferences. Unlike most existing DPO methods that assume all
diffusion steps share a consistent preference order with the final generated
images, we argue that this assumption neglects step-specific denoising
performance and that preference labels should be tailored to each step's
contribution. To address this limitation, we propose Step-aware Preference
Optimization (SPO), a novel post-training approach that independently evaluates
and adjusts the denoising performance at each step, using a step-aware
preference model and a step-wise resampler to ensure accurate step-aware
supervision. Specifically, at each denoising step, we sample a pool of images,
find a suitable win-lose pair, and, most importantly, randomly select a single
image from the pool to initialize the next denoising step. This step-wise
resampler process ensures the next win-lose image pair comes from the same
image, making the win-lose comparison independent of the previous step. To
assess the preferences at each step, we train a separate step-aware preference
model that can be applied to both noisy and clean images. Our experiments with
Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms
the latest Diffusion-DPO in aligning generated images with complex, detailed
prompts and enhancing aesthetics, while also being more than 20x faster in
training. Code and model:
https://rockeycoss.github.io/spo.github.io/ |
This paper introduces Step-aware Preference Optimization (SPO), a novel post-training approach for aligning text-to-image diffusion models with human preferences by independently evaluating and adjusting the denoising performance at each step. |
Existing Direct Preference Optimization (DPO) methods for diffusion models assume a consistent preference order across all diffusion steps, neglecting step-specific denoising performance and leading to misaligned supervision signals. |
SPO utilizes a step-aware preference model to assess the quality of denoised samples at each step and a step-wise resampler to ensure independent preference evaluation, removing trajectory-level dependency. |
SPO significantly outperforms state-of-the-art DPO methods, including Diffusion-DPO, D3PO, and DDPO, in aligning generated images with complex prompts and enhancing aesthetics, as evaluated by both AI feedback metrics and user studies.
The step-wise resampler with random selection significantly improves performance, acting as effective trajectory augmentation.
SPO trains more than 20x faster than Diffusion-DPO, owing to its more accurate step-aware preference labels. |
The step-aware preference model's performance degrades for noisy samples at very large timesteps.
Future work includes exploring different step-aware preference model architectures and applying SPO to other diffusion-based generation tasks. |
diffusion models, text-to-image generation, preference learning, direct preference optimization, post-training |
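The step-wise resampler described in the SPO entry above can be sketched as follows: at every denoising step a pool of candidates is drawn from the same parent, a win-lose pair is picked by a step-aware score, and a random pool member seeds the next step. The toy denoiser, the toy score, and the choice of the highest and lowest scoring candidates as the pair are assumptions made so the sketch runs on plain floats.

```python
import random


def collect_stepwise_pairs(x_start, num_steps, pool_size,
                           denoise_step, preference_score, rng):
    """Return one (timestep, parent, win, lose) tuple per denoising step."""
    x = x_start
    pairs = []
    for t in range(num_steps, 0, -1):
        # Every candidate in the pool is sampled from the same parent x.
        pool = [denoise_step(x, t, rng) for _ in range(pool_size)]
        ranked = sorted(pool, key=lambda c: preference_score(c, t))
        lose, win = ranked[0], ranked[-1]
        pairs.append((t, x, win, lose))
        # Random selection (not the winner) initializes the next step.
        x = rng.choice(pool)
    return pairs


def toy_denoise_step(x, t, rng):
    """Hypothetical stochastic denoiser on a scalar state."""
    return 0.5 * x + rng.gauss(0.0, 0.1)


def toy_preference_score(candidate, t):
    """Hypothetical step-aware score: prefer values close to zero."""
    return -abs(candidate)


pairs = collect_stepwise_pairs(1.0, num_steps=4, pool_size=5,
                               denoise_step=toy_denoise_step,
                               preference_score=toy_preference_score,
                               rng=random.Random(0))
```

Since both members of each pair share a parent, their comparison reflects only what happened at that step, which is the independence property the entry describes.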
2406.04312
Report |
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization |
Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, Zeynep Akata |
Text-to-Image (T2I) models have made significant advancements in recent
years, but they still struggle to accurately capture intricate details
specified in complex compositional prompts. While fine-tuning T2I models with
reward objectives has shown promise, it suffers from "reward hacking" and may
not generalize well to unseen prompt distributions. In this work, we propose
Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I
models at inference by optimizing the initial noise based on the signal from
one or multiple human preference reward models. Remarkably, solving this
optimization problem with gradient ascent for 50 iterations yields impressive
results on four different one-step models across two competitive benchmarks,
T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds,
ReNO-enhanced one-step models consistently surpass the performance of all
current open-source Text-to-Image models. Extensive user studies demonstrate
that our model is preferred nearly twice as often compared to the popular SDXL
model and is on par with the proprietary Stable Diffusion 3 with 8B parameters.
Moreover, given the same computational resources, a ReNO-optimized one-step
model outperforms widely-used open-source models such as SDXL and
PixArt-$\alpha$, highlighting the efficiency and effectiveness of ReNO in
enhancing T2I model performance at inference time. Code is available at
https://github.com/ExplainableML/ReNO. |
Introduces ReNO, a novel approach that enhances Text-to-Image (T2I) models at inference by optimizing the initial noise based on human preference reward models. |
Existing T2I models struggle to accurately capture intricate details in complex prompts. While fine-tuning with reward objectives is promising, it suffers from 'reward hacking' and generalization issues. ReNO offers an efficient alternative by enhancing models at inference time. |
ReNO leverages distilled one-step T2I models to circumvent exploding/vanishing gradients during backpropagation. It optimizes the initial noise using a combination of reward models (HPSv2, PickScore, ImageReward, CLIP) for a limited number of iterations while regularizing the noise to prevent reward hacking. |
ReNO significantly improves performance on T2I-CompBench and GenEval, with gains of over 20% in some categories.
User studies demonstrate ReNO-enhanced models are preferred nearly twice as often as SDXL and are on par with the proprietary SD3.
ReNO outperforms competing multi-step models given the same computational budget, offering an efficient balance between performance and speed. |
Limitations in current reward models might hinder further performance improvements.
ReNO increases the required VRAM during generation. |
text-to-image generation, noise optimization, reward models, one-step diffusion models, compositionality |
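The ReNO entry above optimizes the initial noise by gradient ascent on a reward while regularizing the noise. A minimal sketch of that loop is given below, under loud assumptions: the generator and reward are toy functions on short lists, the gradient is taken by finite differences rather than backpropagation through a one-step model, and a plain L2 penalty stands in for the paper's noise regularization.

```python
def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function of a list."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def optimize_noise(noise, generate, reward, steps=50, lr=0.1, reg=0.01):
    """Gradient ascent on reward(generate(noise)) minus an L2 noise penalty."""
    def objective(eps):
        return reward(generate(eps)) - reg * sum(e * e for e in eps)

    eps = list(noise)
    for _ in range(steps):
        grad = numerical_grad(objective, eps)
        eps = [e + lr * g for e, g in zip(eps, grad)]
    return eps


def toy_generate(eps):
    """Hypothetical one-step generator: a fixed linear map."""
    return [2.0 * e for e in eps]


def toy_reward(image):
    """Hypothetical reward: highest when every pixel equals 1.0."""
    return -sum((p - 1.0) ** 2 for p in image)


initial = [0.3, -1.2, 0.8]
optimized = optimize_noise(initial, toy_generate, toy_reward, steps=50)
```

Only the noise is updated; the generator stays frozen, which is why the approach works at inference time. The default of 50 steps mirrors the iteration count quoted in the abstract.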
2406.04309
Report |
ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation |
Sergey Zakharov, Katherine Liu, Adrien Gaidon, Rares Ambrus |
The common trade-offs of state-of-the-art methods for multi-shape
representation (a single model "packing" multiple objects) involve trading
modeling accuracy against memory and storage. We show how to encode multiple
shapes represented as continuous neural fields with a higher degree of
precision than previously possible and with low memory usage. Key to our
approach is a recursive hierarchical formulation that exploits object
self-similarity, leading to a highly compressed and efficient shape latent
space. Thanks to the recursive formulation, our method supports spatial and
global-to-local latent feature fusion without needing to initialize and
maintain auxiliary data structures, while still allowing for continuous field
queries to enable applications such as raytracing. In experiments on a set of
diverse datasets, we provide compelling qualitative results and demonstrate
state-of-the-art multi-scene reconstruction and compression results with a
single network per dataset. |
Proposes ReFiNe (Recursive Field Networks), a method to encode multiple shapes as neural fields into a single network, achieving high compression and reconstruction quality by recursively representing shapes and fusing features across different levels of detail. |
Addresses the limitations of current multi-shape representation techniques that compromise detail and accuracy for memory efficiency by enabling high-fidelity representation and compression of multiple shapes within a single network. |
Utilizes a recursive autoencoder to represent shapes as octrees, prunes unoccupied voxels, aggregates features spatially and hierarchically, and employs MLPs to decode features into SDF, SDF+RGB, or NeRF representations. |
Outperforms DeepSDF and Curriculum DeepSDF in reconstruction accuracy on Thingi32 and ShapeNet150 datasets while achieving comparable performance to ROAD with lower memory usage.
Exhibits higher fidelity in reconstructing high-frequency details on the SRN Cars dataset compared to CodeNeRF and SRN.
Demonstrates scalability by encoding the Google Scanned Objects dataset (1030 objects) and the RTMV dataset (40 scenes) with high compression rates and good reconstruction quality. |
Currently limited to representing bounded scenes.
Future work includes extending to unbounded scenes and exploring 3D synthesis using diffusion models. |
neural fields, shape representation, compression, recursive networks, 3d reconstruction |
2406.04303
Report |
Vision-LSTM: xLSTM as Generic Vision Backbone |
Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter |
Transformers are widely used as generic backbones in computer vision, despite
being initially introduced for natural language processing. Recently, the Long
Short-Term Memory (LSTM) has been extended to a scalable and performant
architecture - the xLSTM - which overcomes long-standing LSTM limitations via
exponential gating and parallelizable matrix memory structure. In this report,
we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to
computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process
the sequence of patch tokens from top to bottom while even blocks go from
bottom to top. Experiments show that ViL holds promise to be further deployed
as a new generic backbone for computer vision architectures. |
The paper introduces Vision-LSTM (ViL), a novel backbone for computer vision tasks inspired by the success of xLSTM in language modeling. ViL adapts xLSTM's building blocks to vision by processing image patches in an alternating fashion, enabling efficient handling of non-sequential image data. |
ViL addresses the limitations of traditional Vision Transformers, particularly their quadratic computational complexity that makes them costly for high-resolution images. ViL's linear complexity makes it well-suited for tasks requiring high-resolution inputs, such as medical imaging and semantic segmentation. |
The paper explores various ViL block designs, focusing on multi-directional processing of patch tokens. The final architecture employs alternating mLSTM blocks, with odd blocks processing patches top-to-bottom and even blocks bottom-to-top. Experiments on ImageNet-1K compare different design choices and evaluate performance against existing architectures. |
ViL achieves competitive performance on ImageNet-1K classification, outperforming some heavily optimized ViT models, especially at smaller scales.
ViL demonstrates robustness to different classification designs, indicating flexibility in adapting to various vision tasks.
Despite lacking the inductive bias of convolutions, ViL exhibits competitive performance with CNN-based models like ConvNeXt. |
The paper acknowledges that hyperparameters for larger ViL models are not yet fully optimized due to the computational cost of training.
The current implementation of ViL lacks custom CUDA kernels, which are expected to further improve its speed. |
computer vision, vision transformer, lstm, xlstm, image classification |
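For the Vision-LSTM (ViL) entry above, the alternating traversal can be illustrated with a small sketch. It assumes patches are flattened in row-major order and uses a causal running mean as a hypothetical stand-in for an mLSTM block; the names `traversal_order` and `running_mean_block` are illustrative and are not taken from the paper or its code.

```python
def traversal_order(num_patches, block_index):
    """Return patch indices in the order a block visits them.

    Blocks are numbered from 1. Odd blocks visit patches first-to-last
    (top to bottom under a row-major flattening), even blocks last-to-first.
    """
    order = list(range(num_patches))
    return order if block_index % 2 == 1 else order[::-1]


def running_mean_block(tokens, block_index):
    """Hypothetical stand-in for an mLSTM block: a causal running mean.

    Each output depends only on tokens already visited in this block's
    traversal direction; results are written back at the original positions.
    """
    out = [0.0] * len(tokens)
    total = 0.0
    for step, idx in enumerate(traversal_order(len(tokens), block_index), start=1):
        total += tokens[idx]
        out[idx] = total / step
    return out


tokens = [1.0, 2.0, 3.0, 4.0]  # toy values for a flattened 2x2 patch grid
forward = running_mean_block(tokens, block_index=1)   # odd block: top to bottom
backward = running_mean_block(tokens, block_index=2)  # even block: bottom to top
print(forward, backward)
```

Stacking blocks with alternating directions lets every patch eventually receive context from both sides, even though each single block is causal.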
2406.04295
Report |
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment |
Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang |
Test-time adaptation (TTA) aims to enhance the performance of source-domain
pretrained models when tested on unknown shifted target domains. Traditional
TTA methods primarily adapt model weights based on target data streams, making
model performance sensitive to the amount and order of target data. Recently,
diffusion-driven TTA methods have demonstrated strong performance by using an
unconditional diffusion model, which is also trained on the source domain, to
transform target data into synthetic data as a source domain projection. This
allows the source model to make predictions without weight adaptation. In this
paper, we argue that the domains of the source model and the synthetic data in
diffusion-driven TTA methods are not aligned. To adapt the source model to the
synthetic domain of the unconditional diffusion model, we introduce a
Synthetic-Domain Alignment (SDA) framework to fine-tune the source model with
synthetic data. Specifically, we first employ a conditional diffusion model to
generate labeled samples, creating a synthetic dataset. Subsequently, we use
the aforementioned unconditional diffusion model to add noise to and denoise
each sample before fine-tuning. This process mitigates the potential domain gap
between the conditional and unconditional models. Extensive experiments across
various models and benchmarks demonstrate that SDA achieves superior domain
alignment and consistently outperforms existing diffusion-driven TTA methods.
Our code is available at
https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment. |
This paper proposes Synthetic-Domain Alignment (SDA), a novel Test-Time Adaptation (TTA) framework aligning both source model and target data to a shared synthetic domain derived from a diffusion model. |
Existing TTA methods, whether adapting the model to the target domain or vice versa, struggle with real-world data limitations or domain gaps inherent to synthetic data. SDA aims to overcome these by finding a common ground for adaptation. |
SDA operates in two stages: 1) A labeled synthetic dataset is generated using a conditional diffusion model, then aligned to the synthetic domain by adding noise to and denoising each sample with an unconditional diffusion model. This dataset is used to fine-tune the source model. 2) Target data is projected into the synthetic domain using the same unconditional diffusion model, enabling the fine-tuned model to make predictions on domain-aligned data. |
SDA consistently outperforms state-of-the-art diffusion-driven TTA methods on both ImageNet-C and ImageNet-W benchmarks across various model architectures.
Visualization using Grad-CAM highlights SDA's superior domain alignment compared to methods relying solely on target data projection.
Ablation studies confirm the importance of both synthetic data generation and alignment processes within SDA's framework. |
SDA, relying on diffusion models, inherits their current limitation of low test speed, requiring further research into faster sampling or distillation.
The quality of synthetic data, crucial for SDA's effectiveness, is dependent on the capabilities of the generative diffusion models, an area under active development. |
test-time adaptation, diffusion models, domain alignment, synthetic data, robust image classification |
2406.04277
Report |
VideoTetris: Towards Compositional Text-to-Video Generation |
Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui |
Diffusion models have demonstrated great success in text-to-video (T2V)
generation. However, existing methods may face challenges when handling complex
(long) video generation scenarios that involve multiple objects or dynamic
changes in object numbers. To address these limitations, we propose
VideoTetris, a novel framework that enables compositional T2V generation.
Specifically, we propose spatio-temporal compositional diffusion to precisely
follow complex textual semantics by manipulating and composing the attention
maps of denoising networks spatially and temporally. Moreover, we propose an
enhanced video data preprocessing to enhance the training data regarding motion
dynamics and prompt understanding, equipped with a new reference frame
attention mechanism to improve the consistency of auto-regressive video
generation. Extensive experiments demonstrate that our VideoTetris achieves
impressive qualitative and quantitative results in compositional T2V
generation. Code is available at: https://github.com/YangLing0818/VideoTetris |
VideoTetris is a novel diffusion-based framework that enables compositional text-to-video (T2V) generation by manipulating and composing the attention maps of denoising networks spatially and temporally. |
Existing T2V models struggle with complex scenes involving multiple objects or dynamic changes in object numbers, especially in long video generation with compositional prompts. |
Introduces Spatio-Temporal Compositional Diffusion to precisely follow complex textual semantics. Employs Enhanced Video Data Preprocessing to enhance motion dynamics and prompt understanding. Proposes Reference Frame Attention to improve consistency in auto-regressive video generation. |
Achieves state-of-the-art quality in compositional video generation, accurately placing and maintaining multiple objects with distinct attributes.
Generates high-quality long videos aligned with progressive compositional prompts, seamlessly integrating new characters and maintaining consistency.
Outperforms existing methods in both qualitative and quantitative evaluations, including VBLIP-VQA, VUnidet, and CLIP-SIM. |
Current limitations in long video generation models impact the performance of long compositional videos.
High computational cost and strong control information from ControlNet hinder object consistency and position control in transitions. |
text-to-video generation, diffusion models, compositional generation, long video generation, consistency regularization |
2406.04264
Report |
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding |
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu |
The evaluation of Long Video Understanding (LVU) performance poses an
important but challenging research problem. Despite previous efforts, the
existing video understanding benchmarks are severely constrained by several
issues, especially the insufficient lengths of videos, a lack of diversity in
video types and evaluation tasks, and the inappropriateness for evaluating LVU
performances. To address the above problems, we propose a new benchmark, called
MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and
in-depth evaluation of LVU. MLVU presents the following critical values: 1) The
substantial and flexible extension of video lengths, which enables the
benchmark to evaluate LVU performance across a wide range of durations. 2) The
inclusion of various video genres, e.g., movies, surveillance footage,
egocentric videos, cartoons, game videos, etc., which reflects the models' LVU
performances in different scenarios. 3) The development of diversified
evaluation tasks, which enables a comprehensive examination of MLLMs' key
abilities in long-video understanding. The empirical study with 20 latest MLLMs
reveals significant room for improvement in today's technique, as all existing
methods struggle with most of the evaluation tasks and exhibit severe
performance degradation when handling longer videos. Additionally, it suggests
that factors such as context length, image-understanding quality, and the
choice of LLM backbone can play critical roles in future advancements. We
anticipate that MLVU will advance the research of long video understanding by
providing a comprehensive and in-depth analysis of MLLMs. |
MLVU, a new benchmark for evaluating long video understanding in Multimodal Large Language Models (MLLMs), is proposed, featuring long and diverse videos and a range of tasks. |
Evaluating the long-video understanding (LVU) performance of MLLMs is crucial yet challenging due to limitations in existing benchmarks, including insufficient video length, lack of diversity in video types and tasks, and inappropriateness for LVU evaluation. |
MLVU is constructed with diverse video lengths (3 min to 2+ hours) and genres (movies, surveillance, etc.) and includes 9 LVU-tailored tasks, categorized as holistic, single-detail, and multi-detail understanding, with both multi-choice and free-form generation formats. |
Long-video understanding remains challenging for existing MLLMs, with even the best model (GPT-4o) struggling with tasks demanding fine-grained understanding of long videos.
A significant performance gap exists between open-source and proprietary models.
Context length, image-understanding quality, and the choice of LLM backbone are identified as critical factors influencing LVU performance. |
MLVU could be extended to encompass tasks involving high-resolution videos or more specialized tasks like tracking and low-level processing.
Potential copyright concerns with using copyrighted video material, despite efforts to minimize infringement. |
multimodal learning, long video understanding, benchmarking, large language models, video understanding |
2406.04254
Report |
GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions |
Salvatore Esposito, Qingshan Xu, Kacper Kania, Charlie Hewitt, Octave Mariotti, Lohit Petikam, Julien Valentin, Arno Onken, Oisin Mac Aodha |
We introduce a new generative approach for synthesizing 3D geometry and
images from single-view collections. Most existing approaches predict
volumetric density to render multi-view consistent images. By employing
volumetric rendering using neural radiance fields, they inherit a key
limitation: the generated geometry is noisy and unconstrained, limiting the
quality and utility of the output meshes. To address this issue, we propose
GeoGen, a new SDF-based 3D generative model trained in an end-to-end manner.
Initially, we reinterpret the volumetric density as a Signed Distance Function
(SDF). This allows us to introduce useful priors to generate valid meshes.
However, those priors prevent the generative model from learning details,
limiting the applicability of the method to real-world scenarios. To alleviate
that problem, we make the transformation learnable and constrain the rendered
depth map to be consistent with the zero-level set of the SDF. Through the lens
of adversarial training, we encourage the network to produce higher fidelity
details on the output meshes. For evaluation, we introduce a synthetic dataset
of human avatars captured from 360-degree camera angles, to overcome the
challenges presented by real-world datasets, which often lack 3D consistency
and do not cover all camera angles. Our experiments on multiple datasets show
that GeoGen produces visually and quantitatively better geometry than the
previous generative models based on neural radiance fields. |
This paper presents GeoGen, a new generative model for synthesizing 3D geometry and images from single-view image collections, addressing the limitations of existing neural radiance field-based methods that often produce noisy and unconstrained geometry. |
Generating high-quality 3D geometry from single-view images is crucial for various applications, including content creation, virtual reality, and animation, but existing methods struggle to produce accurate and detailed results. |
GeoGen utilizes a Signed Distance Function (SDF) network within a StyleGAN generative architecture, augmented with an SDF depth map consistency loss to improve geometric accuracy by aligning 3D points with the SDF's zero-level set. |
GeoGen generates visually and quantitatively better geometry than previous neural radiance field-based generative models, as demonstrated through experiments on FFHQ, ShapeNet Cars, and a new synthetic human head dataset.
The proposed SDF depth map consistency loss effectively reduces geometric inaccuracies caused by volumetric integration, leading to more precise 3D reconstructions.
A new synthetic human head dataset with 360-degree views is introduced, addressing the limitations of existing datasets like FFHQ and providing a valuable resource for training and evaluating 3D generative models. |
The reliance on posed images for training, necessitating pose estimation during preprocessing.
The potential for increased computational load if the SDF consistency loss is extended to more points along each ray for further geometric refinement. |
generative models, 3d reconstruction, signed distance function, neural rendering, single-view reconstruction |
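For the GeoGen entry above, the idea of keeping the rendered depth consistent with the SDF's zero-level set can be sketched as follows. This is a minimal illustration under stated assumptions: the analytic `sphere_sdf` stands in for the learned SDF network, and the absolute-value penalty in `depth_consistency_loss` is one plausible form of such a loss, not necessarily the one used in the paper.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance to a sphere at the origin (negative inside)."""
    return math.sqrt(sum(c * c for c in point)) - radius


def point_on_ray(origin, direction, depth):
    """3D point reached by travelling `depth` along a unit-length direction."""
    return tuple(o + depth * d for o, d in zip(origin, direction))


def depth_consistency_loss(sdf, origin, direction, depth):
    """Penalize a rendered depth whose 3D point is off the zero-level set."""
    return abs(sdf(point_on_ray(origin, direction, depth)))


origin = (0.0, 0.0, -3.0)
direction = (0.0, 0.0, 1.0)

# A depth of 2.0 lands exactly on the sphere surface at (0, 0, -1).
exact = depth_consistency_loss(sphere_sdf, origin, direction, 2.0)
# A depth of 2.5 overshoots to (0, 0, -0.5), which is 0.5 inside the surface.
biased = depth_consistency_loss(sphere_sdf, origin, direction, 2.5)
print(exact, biased)
```

The loss is zero only when the depth produced by volumetric integration agrees with the surface implied by the SDF, which is the kind of inaccuracy the summary says the loss reduces.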
2406.04251
Report |
Localized Gaussian Point Management |
Haosen Yang, Chenhao Zhang, Wenqing Wang, Marco Volino, Adrian Hilton, Li Zhang, Xiatian Zhu |
Point management is a critical component in optimizing 3D Gaussian Splatting
(3DGS) models, as the point initiation (e.g., via structure from motion) is
distributionally inappropriate. Typically, the Adaptive Density Control (ADC)
algorithm is applied, leveraging view-averaged gradient magnitude thresholding
for point densification, opacity thresholding for pruning, and regular
all-points opacity reset. However, we reveal that this strategy is limited in
tackling intricate/special image regions (e.g., transparent) as it is unable to
identify all the 3D zones that require point densification, and lacks an
appropriate mechanism to handle the ill-conditioned points with negative
impacts (occlusion due to false high opacity). To address these limitations, we
propose a Localized Point Management (LPM) strategy, capable of identifying
those error-contributing zones in the highest demand for both point addition
and geometry calibration. Zone identification is achieved by leveraging the
underlying multiview geometry constraints, with the guidance of image rendering
errors. We apply point densification in the identified zone, whilst resetting
the opacity of those points residing in front of these regions so that a new
opportunity is created to correct ill-conditioned points. Serving as a
versatile plugin, LPM can be seamlessly integrated into existing 3D Gaussian
Splatting models. Experimental evaluation across both static 3D and dynamic 4D
scenes validates the efficacy of our LPM strategy in boosting a variety of
existing 3DGS models both quantitatively and qualitatively. Notably, LPM
improves both vanilla 3DGS and SpaceTimeGS to achieve state-of-the-art
rendering quality while retaining real-time speeds, outperforming on
challenging datasets such as Tanks & Temples and the Neural 3D Video Dataset. |
This paper introduces Localized Point Management (LPM), a novel point management approach for 3D Gaussian Splatting (3DGS) that improves scene representation and rendering quality. |
Existing point management techniques in 3DGS, like Adaptive Density Control (ADC), rely on global thresholds for point densification, which often overlook under-optimized points and lack a mechanism for handling ill-conditioned points leading to rendering errors. |
LPM leverages multiview geometry constraints and image rendering errors to identify error-contributing 3D zones. It then applies localized point manipulations, including point addition in under-populated areas and opacity reset for potentially ill-conditioned points to improve geometry. |
LPM achieves state-of-the-art results on challenging datasets like Tanks & Temples and DeepBlending, surpassing previous methods in rendering quality.
On the Neural 3D Video Dataset, integrating LPM with SpaceTimeGS yields the best performance, effectively capturing subtle static and dynamic details.
Ablation studies demonstrate the efficacy of individual components of LPM and its robustness to sparse training data. |
The current point densification method still relies on rules from 3DGS and may not be optimal, requiring further exploration.
Future work could focus on extending LPM to address multi-resolution representations in 3DGS. |
3d gaussian splatting, point management, novel view synthesis, multiview geometry, neural rendering |
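For the Localized Point Management entry above, the baseline Adaptive Density Control (ADC) rule that the abstract describes (view-averaged gradient thresholding for densification, opacity thresholding for pruning) can be sketched as below. The function name, the data layout, and the threshold values are illustrative assumptions; this is the global baseline that LPM is said to improve on, not LPM itself.

```python
def adaptive_density_control(grad_sums, view_counts, opacities,
                             grad_threshold, opacity_threshold):
    """Return (densify, prune) index lists for a set of Gaussian points.

    grad_sums[i]   : accumulated gradient magnitude of point i
    view_counts[i] : number of views in which point i was visible
    opacities[i]   : current opacity of point i
    """
    densify, prune = [], []
    for i, (grad_sum, views, opacity) in enumerate(
            zip(grad_sums, view_counts, opacities)):
        avg_grad = grad_sum / views if views > 0 else 0.0
        if opacity < opacity_threshold:
            prune.append(i)        # nearly transparent: remove
        elif avg_grad > grad_threshold:
            densify.append(i)      # large average gradient: add points here
    return densify, prune


# Toy values; the thresholds are illustrative, not taken from the paper.
densify, prune = adaptive_density_control(
    grad_sums=[0.004, 0.001, 0.010],
    view_counts=[10, 10, 10],
    opacities=[0.8, 0.5, 0.001],
    grad_threshold=0.0002,
    opacity_threshold=0.005,
)
print(densify, prune)
```

Because both thresholds are global, a zone that renders poorly but has a small averaged gradient is never selected, which matches the limitation the entry attributes to ADC.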
2406.04221
Report |
Matching Anything by Segmenting Anything |
Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, Fisher Yu |
The robust association of the same objects across video frames in complex
scenes is crucial for many applications, especially Multiple Object Tracking
(MOT). Current methods predominantly rely on labeled domain-specific video
datasets, which limits the cross-domain generalization of learned similarity
embeddings. We propose MASA, a novel method for robust instance association
learning, capable of matching any objects within videos across diverse domains
without tracking labels. Leveraging the rich object segmentation from the
Segment Anything Model (SAM), MASA learns instance-level correspondence through
exhaustive data transformations. We treat the SAM outputs as dense object
region proposals and learn to match those regions from a vast image collection.
We further design a universal MASA adapter which can work in tandem with
foundational segmentation or detection models and enable them to track any
detected objects. Those combinations present strong zero-shot tracking ability
in complex domains. Extensive tests on multiple challenging MOT and MOTS
benchmarks indicate that the proposed method, using only unlabeled static
images, achieves even better performance than state-of-the-art methods trained
with fully annotated in-domain video sequences, in zero-shot association.
Project Page: https://matchinganything.github.io/ |
This paper introduces MASA, a novel method for learning generalizable instance association from unlabeled images, leveraging the Segment Anything Model (SAM) to enable zero-shot object tracking. |
Current object tracking methods rely heavily on labeled domain-specific video datasets, limiting their ability to generalize across domains. MASA addresses this by learning robust instance association from readily available unlabeled images, eliminating the need for costly video annotations. |
MASA leverages SAM's rich object segmentation to establish instance-level correspondence. By applying diverse data transformations to unlabeled images and their corresponding SAM outputs, MASA learns discriminative instance representations through contrastive learning. Additionally, a universal MASA adapter is proposed to enable existing open-world segmentation and detection models to track objects effectively. |
MASA achieves state-of-the-art zero-shot association performance on various MOT benchmarks, including TAO, BDD100K, and YouTube-VIS, surpassing methods trained with fully annotated in-domain video sequences.
The proposed method exhibits strong cross-domain generalization, effectively tracking objects in diverse domains without requiring domain-specific training data.
The introduction of the MASA adapter enables seamless integration with existing open-world segmentation and detection models, enhancing their capabilities for tracking any detected object. |
One limitation lies in addressing temporal inconsistencies in detection or segmentation results across video frames, leading to flickering effects in video visualization.
Another limitation is the lack of a long-term memory system, making the model susceptible to failure in scenarios with severe occlusions. |
object tracking, zero-shot learning, instance association, segment anything model (sam), contrastive learning |
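For the MASA entry above, the contrastive objective over region embeddings from two transformed views of the same image can be sketched with an InfoNCE-style loss. InfoNCE is used here as a common stand-in; the exact loss, the temperature, and the toy 2D embeddings are assumptions rather than details confirmed by the summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same region."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(logit - peak) for logit in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


regions_view_a = [(1.0, 0.0), (0.0, 1.0)]
# Embeddings of the same two regions after a different transformation.
aligned = info_nce(regions_view_a, [(0.9, 0.1), (0.1, 0.9)])
# Same embeddings with the region correspondence swapped.
swapped = info_nce(regions_view_a, [(0.1, 0.9), (0.9, 0.1)])
print(aligned, swapped)
```

The loss is low when each region is most similar to its own counterpart in the other view and high when correspondences are confused, which is the signal that lets association be learned without tracking labels.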
2406.04103
Report |
Multistep Distillation of Diffusion Models via Moment Matching |
Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom |
We present a new method for making diffusion models faster to sample. The
method distills many-step diffusion models into few-step models by matching
conditional expectations of the clean data given noisy data along the sampling
trajectory. Our approach extends recently proposed one-step methods to the
multi-step case, and provides a new perspective by interpreting these
approaches in terms of moment matching. By using up to 8 sampling steps, we
obtain distilled models that outperform not only their one-step versions but
also their original many-step teacher models, obtaining new state-of-the-art
results on the ImageNet dataset. We also show promising results on a large
text-to-image model where we achieve fast generation of high resolution images
directly in image space, without needing autoencoders or upsamplers. |
This paper presents Moment Matching Distillation, a new method to distill many-step diffusion models into faster few-step models. |
Diffusion models, while powerful generative models for various data modalities, suffer from slow sampling speed due to the iterative nature of the denoising process. This method addresses this limitation, making them more practical for real-world applications. |
The method matches conditional expectations of clean data given noisy data throughout the sampling process. It minimizes the L2 distance between moments of the teacher model and a distilled student model, either with an auxiliary denoising model in an alternating optimization scheme or by directly matching gradients of the teacher's loss in parameter space. |
Distilled models using 8 sampling steps achieve state-of-the-art results on ImageNet, even surpassing the original many-step teacher model.
The method allows for fast generation of high-resolution images directly in image space for large text-to-image models, eliminating the need for autoencoders or upsamplers.
The proposed distillation loss provides a clear metric to monitor the progress of the distillation process. |
While effective for 8+ sampling steps, the method's performance for 1-2 steps is not as competitive and needs improvement.
The paper relies on automated image quality metrics and would benefit from human evaluations to complement the findings. |
diffusion models, model distillation, generative models, image generation, moment matching |
2406.04101
Report |
How Far Can We Compress Instant-NGP-Based NeRF? |
Yihang Chen, Qianyi Wu, Mehrtash Harandi, Jianfei Cai |
In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable
capabilities in representing 3D scenes. To expedite the rendering process,
learnable explicit representations have been introduced for combination with
implicit NeRF representation, which however results in a large storage space
requirement. In this paper, we introduce the Context-based NeRF Compression
(CNC) framework, which leverages highly efficient context models to provide a
storage-friendly NeRF representation. Specifically, we excavate both level-wise
and dimension-wise context dependencies to enable probability prediction for
information entropy reduction. Additionally, we exploit hash collision and
occupancy grids as strong prior knowledge for better context modeling. To the
best of our knowledge, we are the first to construct and exploit context models
for NeRF compression. We achieve a size reduction of 100x and 70x
with improved fidelity against the baseline Instant-NGP on Synthetic-NeRF and
Tanks and Temples datasets, respectively. Additionally, we attain 86.7% and
82.3% storage size reduction against the SOTA NeRF compression method BiRF.
Our code is available here: https://github.com/YihangChen-ee/CNC. |
This paper proposes Context-based NeRF Compression (CNC), a novel framework using context models for compressing NeRF models with multi-resolution hash encoding (e.g., Instant-NGP). |
Explicit representations in NeRF, while enabling fast rendering, lead to large storage requirements. CNC addresses this by minimizing information uncertainty in explicit feature encoding through context modeling, enabling storage-efficient NeRF representations. |
CNC leverages level-wise and dimension-wise context models to estimate the probability distribution of feature embeddings for entropy reduction. It utilizes hash collision and occupancy grids from Instant-NGP to improve the accuracy of context modeling. |
CNC achieves over 100x and 70x size reduction on Synthetic-NeRF and Tanks and Temples datasets respectively, while improving fidelity compared to the Instant-NGP baseline.
Compared to BiRF (SOTA NeRF compression), CNC achieves 86.7% and 82.3% size reduction on the two datasets.
Ablation studies validate the importance of both level-wise and dimension-wise context models, the coarse-to-fine contextual order, and the hash fusion module for achieving optimal compression performance. |
A limitation of CNC is the increased training time compared to models without context models.
Future work includes exploring faster implementations of context models and applying the CNC framework to compress dynamic or large-scale NeRFs. |
neural radiance field, nerf compression, context modeling, hash encoding, occupancy grid |
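For the CNC entry above, two pieces of arithmetic are worth making concrete: why a better probability estimate shrinks the coded size, and how a percentage reduction translates into a compression ratio. The sketch below uses the standard ideal code length of -log2(p) bits per symbol; the binary symbols and the two probability tables are toy assumptions, not CNC's actual context models.

```python
import math


def estimated_bits(symbols, prob):
    """Ideal entropy-coded length: sum of -log2 p(symbol) over the stream."""
    return sum(-math.log2(prob[s]) for s in symbols)


def reduction_to_ratio(reduction):
    """Convert a fractional size reduction (0.867 for 86.7%) to an 'x' ratio."""
    return 1.0 / (1.0 - reduction)


symbols = [0, 0, 0, 1, 0, 0, 1, 0]       # toy binarized feature values
no_context = {0: 0.5, 1: 0.5}            # no prior knowledge: 1 bit per symbol
with_context = {0: 0.75, 1: 0.25}        # prediction closer to the true statistics

bits_no_context = estimated_bits(symbols, no_context)
bits_with_context = estimated_bits(symbols, with_context)
ratio_vs_birf = reduction_to_ratio(0.867)

print(bits_no_context, round(bits_with_context, 2), round(ratio_vs_birf, 2))
```

Read this way, the reported 86.7% reduction against BiRF corresponds to a file roughly 7.5 times smaller, and the 82.3% figure to roughly 5.6 times smaller.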
2406.04032
Report |
Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis |
Marianna Ohanyan, Hayk Manukyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi |
We present Zero-Painter, a novel training-free framework for
layout-conditional text-to-image synthesis that facilitates the creation of
detailed and controlled imagery from textual prompts. Our method utilizes
object masks and individual descriptions, coupled with a global text prompt, to
generate images with high fidelity. Zero-Painter employs a two-stage process
involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped
Cross-Attention (ReGCA) blocks, ensuring precise alignment of generated objects
with textual prompts and mask shapes. Our extensive experiments demonstrate
that Zero-Painter surpasses current state-of-the-art methods in preserving
textual details and adhering to mask shapes. |
Introduces Zero-Painter, a training-free framework for layout-conditional text-to-image synthesis, generating images from object masks, individual descriptions, and a global text prompt. |
Addresses challenges in crafting detailed prompts and limitations of traditional text-to-image models in handling intricate descriptions of multiple objects. |
Utilizes a two-stage process: Single Object Generation (SOG) with Prompt-Adjusted Cross-Attention (PACA) for generating individual objects, and Comprehensive Composition (CC) with Region-Grouped Cross-Attention (ReGCA) for seamless object integration based on global prompt and mask-prompt pairs. |
Zero-Painter surpasses state-of-the-art methods in preserving textual details and adhering to mask shapes.
PACA effectively aligns generated objects with individual prompts and prevents generation outside masked areas.
ReGCA ensures coherent background generation and maintains object integrity, even with missing object information in the global prompt. |
Zero-Painter faces limitations in handling overlapping masks, leading to less visually coherent outcomes.
Future work will focus on addressing overlapping mask challenges and further enhancing the framework's efficiency. |
text-to-image synthesis, layout-conditional generation, cross-attention, stable diffusion, image inpainting |
2406.03723
Report |
Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling |
Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang, Pedro Miraldo, Suhas Lohit, Moitreya Chatterjee |
Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have
enabled their near photo-realistic, free-viewpoint rendering. Although these
methods have shown some potential in creating immersive experiences, two
drawbacks limit their ubiquity: (i) a significant reduction in reconstruction
quality when the computing budget is limited, and (ii) a lack of semantic
understanding of the underlying scenes. To address these issues, we introduce
Gear-NeRF, which leverages semantic information from powerful image
segmentation models. Our approach presents a principled way for learning a
spatio-temporal (4D) semantic embedding, based on which we introduce the
concept of gears to allow for stratified modeling of dynamic regions of the
scene based on the extent of their motion. Such differentiation allows us to
adjust the spatio-temporal sampling resolution for each region in proportion to
its motion scale, achieving more photo-realistic dynamic novel view synthesis.
At the same time, almost for free, our approach enables free-viewpoint tracking
of objects of interest - a functionality not yet achieved by existing
NeRF-based methods. Empirical studies validate the effectiveness of our method,
where we achieve state-of-the-art rendering and tracking performance on
multiple challenging datasets. |
Gear-NeRF is a novel dynamic NeRF approach that leverages semantic information from image segmentation models for stratified modeling of 4D scenes, enabling motion-aware sampling for improved novel view synthesis and free-viewpoint object tracking. |
Existing dynamic NeRF methods often suffer from reduced reconstruction quality with limited resources and lack semantic understanding of scenes. |
Gear-NeRF utilizes a 4D semantic embedding to assign gear levels to scene regions based on motion scales, allowing for differentiated spatio-temporal sampling resolutions. |
Achieves state-of-the-art rendering quality on multiple challenging datasets, outperforming baselines in PSNR, SSIM, and LPIPS.
Enables free-viewpoint object tracking with simple user prompts like clicks, achieving over 90% mIoU and accuracy on evaluated datasets.
Demonstrates the effectiveness of motion-aware sampling and semantic embedding through ablation studies. |
Training and inference times are longer compared to some baselines due to the increased sampling density in high-motion regions.
Future work includes exploring different gear assignment strategies and optimizing for faster training and inference. |
neural radiance fields, dynamic scene reconstruction, novel view synthesis, object tracking, semantic segmentation |
2406.03697
Report |
Superpoint Gaussian Splatting for Real-Time High-Fidelity Dynamic Scene Reconstruction |
Diwen Wan, Ruijie Lu, Gang Zeng |
Rendering novel view images in dynamic scenes is a crucial yet challenging
task. Current methods mainly utilize NeRF-based methods to represent the static
scene and an additional time-variant MLP to model scene deformations, resulting
in relatively low rendering quality as well as slow inference speed. To tackle
these challenges, we propose a novel framework named Superpoint Gaussian
Splatting (SP-GS). Specifically, our framework first employs explicit 3D
Gaussians to reconstruct the scene and then clusters Gaussians with similar
properties (e.g., rotation, translation, and location) into superpoints.
Empowered by these superpoints, our method manages to extend 3D Gaussian
splatting to dynamic scenes with only a slight increase in computational
expense. Apart from achieving state-of-the-art visual quality and real-time
rendering under high resolutions, the superpoint representation provides a
stronger manipulation capability. Extensive experiments demonstrate the
practicality and effectiveness of our approach on both synthetic and real-world
datasets. Please see our project page at
https://dnvtmf.github.io/SP_GS.github.io. |
Introduces Superpoint Gaussian Splatting (SP-GS), a novel approach for high-fidelity and real-time rendering in dynamic scenes that clusters similar 3D Gaussians into superpoints to reduce computational expense. |
Rendering novel views in dynamic scenes is crucial but challenging, with existing NeRF-based methods suffering from low rendering quality and slow inference speed. |
SP-GS reconstructs scenes with explicit 3D Gaussians and groups them into superpoints based on similar deformation properties. A deformation network predicts superpoint transformations, enabling efficient rendering. A property reconstruction loss enforces rigidity within superpoints. |
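The superpoint idea can be illustrated with a minimal sketch, assuming each Gaussian is hard-assigned to exactly one superpoint and each superpoint carries a rigid transform (a rotation matrix plus a translation). The function name, the array shapes, and the hard assignment are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def deform_gaussian_centers(centers, assignment, rotations, translations):
    """Move each Gaussian center with the rigid transform of its superpoint.

    centers:      (N, 3) Gaussian centers in a canonical frame
    assignment:   (N,)   superpoint index per Gaussian (hypothetical hard assignment)
    rotations:    (M, 3, 3) one rotation matrix per superpoint
    translations: (M, 3)    one translation vector per superpoint
    """
    R = rotations[assignment]      # (N, 3, 3): gather the transform of each Gaussian
    t = translations[assignment]   # (N, 3)
    return np.einsum("nij,nj->ni", R, centers) + t

# Toy setup: superpoint 0 is static, superpoint 1 rotates 90 degrees about z and lifts by 1.
centers = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
assignment = np.array([0, 1, 1])
rot_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotations = np.stack([np.eye(3), rot_z90])
translations = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = deform_gaussian_centers(centers, assignment, rotations, translations)
```

Under these assumptions, only M transforms need to be predicted per frame rather than N, which is consistent with the claim that the dynamic extension adds little computational expense.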
Achieves real-time rendering on dynamic scenes, up to 227 FPS at 800x800 resolution for synthetic datasets and 117 FPS at 536x960 for real datasets.
Outperforms previous state-of-the-art methods in terms of visual quality and rendering speed on D-NeRF, HyperNeRF, and NeRF-DS datasets.
Demonstrates strong extensibility, supporting applications like non-rigid motion prediction, model distillation, pose estimation, and scene editing. |
Real-world scene reconstruction relies on sparse point clouds, which can be challenging to obtain accurately, especially for dynamic scenes.
Reliance on COLMAP for camera pose estimation in dynamic scenes can be limiting due to its design for static scenes. |
3d reconstruction, novel view synthesis, dynamic scene, gaussian splatting, real-time rendering |
2406.03586
Report |
CountCLIP -- [Re] Teaching CLIP to Count to Ten |
Harshvardhan Mestha, Tejas Agrawal, Karan Bania, Shreyas V, Yash Bhisikar |
Large vision-language models (VLMs) are shown to learn rich joint image-text
representations enabling high performances in relevant downstream tasks.
However, they fail to showcase their quantitative understanding of objects, and
they lack good counting-aware representation. This paper conducts a
reproducibility study of 'Teaching CLIP to Count to Ten' (Paiss et al., 2023),
which presents a method to finetune a CLIP model (Radford et al., 2021) to
improve zero-shot counting accuracy in an image while maintaining the
performance for zero-shot classification by introducing a counting-contrastive
loss term. We improve the model's performance on a smaller subset of their
training data with lower computational resources. We verify these claims by
reproducing their study with our own code. The implementation can be found at
https://github.com/SforAiDl/CountCLIP. |
This paper presents a reproducibility study of 'Teaching CLIP to Count to Ten', focusing on improving the zero-shot counting accuracy of CLIP models while maintaining zero-shot classification performance. |
Count-aware VLMs are crucial for enhancing text-to-image and text-to-video models, enabling the generation of accurate content with the correct number of entities. |
The study fine-tunes a pre-trained CLIP model using a counting-contrastive loss term alongside the regular contrastive loss. Three novel schemes for balancing the loss function based on class frequencies are introduced: λ_norm, λ_modal, and λ_log. Additionally, the counting objective is modified to contrast against all possible incorrect counts (CountPlus). |
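A minimal sketch of a counting objective of this kind, assuming the image and caption embeddings are already computed and that the candidate captions differ only in the stated count. With two candidates it contrasts the true caption against a single counterfactual; with more candidates it becomes a multiclass loss over all incorrect counts, in the spirit of CountPlus. The function names are hypothetical, and `lam` is a plain scalar here because the summary does not define the λ_norm, λ_modal, and λ_log schemes.

```python
import numpy as np

def counting_loss(image_emb, caption_embs, true_idx):
    """Softmax cross-entropy over captions that differ only in the count.

    image_emb:    (D,)   image embedding
    caption_embs: (K, D) one embedding per candidate count caption
    true_idx:     index of the caption with the correct count
    """
    logits = caption_embs @ image_emb
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[true_idx]

def total_loss(clip_loss, count_loss, lam):
    """Regular contrastive loss plus a weighted counting term (lam is an assumed scalar weight)."""
    return clip_loss + lam * count_loss
```

A class-balanced variant would replace the scalar `lam` with a value that depends on how often the true count appears in the training set.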
The study achieves comparable or better zero-shot counting accuracy than the original work, even with a 640 times smaller training dataset.
Balancing the auxiliary loss weight using class frequencies proves effective for improving performance in scenarios with extreme class imbalance and limited data.
Changing the counting objective to a multiclass classification loss, combined with balanced lambda, further enhances performance. |
While improving overall accuracy, class-balancing schemes might compromise the accuracy of data-rich classes.
The models struggle to predict higher-numbered classes (7-10) due to limited training data for these classes. Future work should focus on gathering more diverse training data to address this imbalance. |
vision-language models, zero-shot counting, clip, countbench, class imbalance |
2406.03520
Report |
VideoPhy: Evaluating Physical Commonsense for Video Generation |
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover |
Recent advances in internet-scale video data pretraining have led to the
development of text-to-video generative models that can create high-quality
videos across a broad range of visual concepts and styles. Due to their ability
to synthesize realistic motions and render complex objects, these generative
models have the potential to become general-purpose simulators of the physical
world. However, it is unclear how far we are from this goal with the existing
text-to-video generative models. To this end, we present VideoPhy, a benchmark
designed to assess whether the generated videos follow physical commonsense for
real-world activities (e.g. marbles will roll down when placed on a slanted
surface). Specifically, we curate a list of 688 captions that involve
interactions between various material types in the physical world (e.g.,
solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on
these captions from diverse state-of-the-art text-to-video generative models,
including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere
from Google, Pika). Further, our human evaluation reveals that the existing
models severely lack the ability to generate videos adhering to the given text
prompts, while also lacking physical commonsense. Specifically, the best
performing model, Pika, generates videos that adhere to the caption and
physical laws for only 19.7% of the instances. VideoPhy thus highlights that
the video generative models are far from accurately simulating the physical
world. Finally, we also supplement the dataset with an auto-evaluator,
VideoCon-Physics, to assess semantic adherence and physical commonsense at
scale. |
This paper introduces VideoPhy, a benchmark dataset designed to evaluate the physical commonsense of text-to-video (T2V) generative models. |
Current T2V models are being explored as potential physical world simulators. However, their ability to adhere to real-world physics remains unclear, necessitating a dedicated benchmark like VideoPhy. |
The researchers curated 688 text prompts describing interactions between different states of matter (solid-solid, solid-fluid, fluid-fluid). They generated videos from nine different T2V models using these prompts and conducted human evaluations to assess semantic adherence and physical commonsense. An automatic evaluator, VideoCon-Physics, was also developed for scalable testing. |
Existing T2V models exhibit a significant lack of physical commonsense: the best model (Pika) generates videos that adhere to both the caption and physical laws for only 19.7% of the instances.
Models struggle the most with captions depicting solid-solid interactions, indicating an area for improvement.
VideoCon-Physics, a fine-tuned video-language model, outperforms baselines like GPT-4Vision and Gemini-Pro-Vision-1.5 in evaluating semantic adherence and physical commonsense. |
The study is limited by the scope of the VideoPhy dataset and the diversity of T2V models evaluated.
Human evaluations, while insightful, are expensive and may not capture the nuances of diverse cultural perspectives on physics. |
text-to-video generation, physical commonsense, benchmarking, video understanding, generative ai |
2406.03459
Report |
LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection |
Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang |
In this paper, we present a light-weight detection transformer, LW-DETR,
which outperforms YOLOs for real-time object detection. The architecture is a
simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our
approach leverages recent advanced techniques, such as training-effective
techniques, e.g., improved loss and pretraining, and interleaved window and
global attentions for reducing the ViT encoder complexity. We improve the ViT
encoder by aggregating multi-level feature maps, and the intermediate and final
feature maps in the ViT encoder, forming richer feature maps, and introduce
window-major feature map organization for improving the efficiency of
interleaved attention computation. Experimental results demonstrate that the
proposed approach is superior to existing real-time detectors, e.g., YOLO and
its variants, on COCO and other benchmark datasets. Code and models are
available at (https://github.com/Atten4Vis/LW-DETR). |
This paper introduces LW-DETR, a lightweight detection transformer designed for real-time object detection that outperforms YOLO models. |
Real-time object detection is crucial for various applications, and this work explores the potential of transformers in this domain. |
LW-DETR employs a simple architecture consisting of a ViT encoder, a projector, and a shallow DETR decoder. It leverages:
- Multi-level feature aggregation
- Interleaved window and global attentions for efficiency
- Window-major feature map organization for faster inference
- Effective training techniques like improved loss and pretraining |
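The window-major idea in the list above can be sketched as a layout change: tokens of the same attention window are stored contiguously, so window attention can read each window as one block. This is a minimal NumPy illustration; the function name and the `(H, W, C)` input layout are assumptions, and the actual LW-DETR code may organize its tensors differently.

```python
import numpy as np

def to_window_major(feature_map, window):
    """Reorder an (H, W, C) feature map so each window's tokens are contiguous.

    Returns an array of shape (num_windows, window * window, C).
    """
    h, w, c = feature_map.shape
    assert h % window == 0 and w % window == 0, "H and W must be multiples of the window size"
    x = feature_map.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (rows of windows, cols of windows, window, window, C)
    return x.reshape(-1, window * window, c)

# 4x4 map with values 0..15 in row-major order, split into four 2x2 windows.
fmap = np.arange(16).reshape(4, 4, 1)
windows = to_window_major(fmap, 2)
```

Keeping this layout across layers would avoid shuffling tokens back and forth whenever the encoder switches between window and global attention, which is one plausible reading of why the organization helps inference speed.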
LW-DETR surpasses previous state-of-the-art real-time detectors, including YOLO-NAS, YOLOv8, and RTMDet, on COCO and other benchmarks.
Pretraining on Objects365 significantly boosts LW-DETR performance, demonstrating the benefit of large-scale pretraining for transformer-based detectors.
The analysis highlights the impact of NMS post-processing on latency in non-end-to-end detectors and how tuning the score threshold can improve efficiency. |
The paper focuses solely on real-time detection, and further research is needed to explore its applicability to open-world detection and other vision tasks.
Exploring more complex network architectures, similar to those used in YOLO-NAS, could potentially further enhance LW-DETR's performance. |
object detection, real-time, detection transformer, vision transformer (vit), pretraining |
2406.03417
Report |
CoFie: Learning Compact Neural Surface Representations with Coordinate Fields |
Hanwen Jiang, Haitao Yang, Georgios Pavlakos, Qixing Huang |
This paper introduces CoFie, a novel local geometry-aware neural surface
representation. CoFie is motivated by the theoretical analysis of local SDFs
with quadratic approximation. We find that local shapes are highly compressive
in an aligned coordinate frame defined by the normal and tangent directions of
local shapes. Accordingly, we introduce Coordinate Field, which is a
composition of coordinate frames of all local shapes. The Coordinate Field is
optimizable and is used to transform the local shapes from the world coordinate
frame to the aligned shape coordinate frame. It largely reduces the complexity
of local shapes and benefits the learning of MLP-based implicit
representations. Moreover, we introduce quadratic layers into the MLP to
enhance expressiveness concerning local shape geometry. CoFie is a
generalizable surface representation. It is trained on a curated set of 3D
shapes and works on novel shape instances during testing. When using the same
amount of parameters with prior works, CoFie reduces the shape error by 48% and
56% on novel instances of both training and unseen shape categories. Moreover,
CoFie demonstrates comparable performance to prior works while using 70%
fewer parameters. |
This paper presents CoFie, a novel local geometry-aware neural surface representation that uses a Coordinate Field to transform local shapes into an aligned coordinate system, simplifying their representation and improving learning. |
Existing local-aware neural surface representations often lead to a significant increase in parameters. CoFie addresses this by reducing the complexity of representing local shapes through aligned coordinate frames. |
CoFie represents shapes hierarchically using voxels for coarse geometry and MLP-based neural SDFs for fine-grained details within each voxel. It introduces a learnable Coordinate Field to align local shapes and employs quadratic layers in the MLP to enhance the representation of local shape geometry. |
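A minimal sketch of the two ingredients, assuming each local cell stores an origin and an orthonormal frame whose columns are two tangent directions and the normal. The exact form of the quadratic layer is an assumption (a per-output quadratic form plus a linear term); the names and shapes are illustrative only.

```python
import numpy as np

def world_to_local(points, origin, frame):
    """Express world-space points in a cell's aligned frame.

    points: (N, 3); origin: (3,); frame: (3, 3) with orthonormal columns
    (tangent_1, tangent_2, normal). Row i of the result is frame.T @ (p_i - origin).
    """
    return (points - origin) @ frame

def quadratic_layer(x, A, W, b):
    """Assumed quadratic layer: y_k = x^T A_k x + W_k x + b_k.

    x: (N, D); A: (K, D, D); W: (K, D); b: (K,)
    """
    quad = np.einsum("ni,kij,nj->nk", x, A, x)
    return quad + x @ W.T + b
```

In the aligned frame, a surface patch is locally close to a height function over the tangent plane, which is why a quadratic term is a natural fit for local geometry.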
CoFie reduces shape error by 48% and 56% on novel instances of training and unseen shape categories, respectively, compared to prior art.
CoFie achieves comparable results to prior work while using 70% fewer parameters.
CoFie, using a single shared MLP, demonstrates performance comparable to methods that overfit a specific model for each testing shape. |
CoFie's reliance on local shapes limits its applicability to shape completion tasks, unlike methods with global shape priors.
The fixed cell resolution in CoFie can be problematic when a local cell intersects with thin structures. |
neural surface representation, coordinate field, local geometry, shape auto-decoding, implicit neural representations |
2406.03303
Report |
Learning Visual Prompts for Guiding the Attention of Vision Transformers |
Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar |
Visual prompting infuses visual information into the input image to adapt
models toward specific predictions and tasks. Recently, manually crafted
markers such as red circles are shown to guide the model to attend to a target
region on the image. However, these markers only work on models trained with
data containing those markers. Moreover, finding these prompts requires
guesswork or prior knowledge of the domain on which the model is trained. This
work circumvents manual design constraints by proposing to learn the visual
prompts for guiding the attention of vision transformers. The learned visual
prompt, added to any input image would redirect the attention of the
pre-trained vision transformer to its spatial location on the image.
Specifically, the prompt is learned in a self-supervised manner without
requiring annotations and without fine-tuning the vision transformer. Our
experiments demonstrate the effectiveness of the proposed optimization-based
visual prompting strategy across various pre-trained vision encoders. |
This paper introduces a self-supervised method for learning visual prompts that guide the attention of pre-trained vision transformers without requiring manual design or fine-tuning. |
This is important because it allows for the adaptation of various vision transformers to specific tasks and predictions without relying on dataset biases or manual prompt engineering, which can be limiting. |
The method involves training a deep neural prior to generate a visual prompt (patch). This prompt is then applied to random locations on images, and the attention values of the vision transformer are used to calculate a self-supervised loss. This loss guides the optimization of the prompt to attract attention to its location. |
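The training signal described above can be sketched as two small steps: paste the prompt at a chosen location, then penalize attention mass that falls outside that location. This is an illustration under stated assumptions: the attention map is taken to be a non-negative `(H, W)` array summing to 1 and already resized to image resolution, and the negative log of the in-patch mass stands in for the paper's loss, whose exact form is not given here.

```python
import numpy as np

def paste_prompt(image, prompt, top, left):
    """Return a copy of `image` with `prompt` pasted at (top, left)."""
    out = image.copy()
    ph, pw = prompt.shape[:2]
    out[top:top + ph, left:left + pw] = prompt
    return out

def attention_loss(attn_map, top, left, ph, pw):
    """Assumed loss: negative log of the attention mass inside the prompt region."""
    mass = attn_map[top:top + ph, left:left + pw].sum()
    return -np.log(mass + 1e-8)
```

During optimization, the location would be resampled for every image so that the prompt, not the position, is what attracts attention; the vision transformer itself stays frozen.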
Learned prompts effectively guide attention in various vision transformers, including CLIP variants, DeiT, and DINO.
Optimal prompts are not universal and vary across models and training paradigms.
The method outperforms baselines in keypoint naming tasks, particularly when image context is crucial. |
The work primarily explores the prompt's effectiveness in keypoint naming tasks; further investigation into other vision tasks is needed.
The impact of prompt size and shape on performance warrants more in-depth analysis. |
visual prompting, vision transformers, self-supervised learning, attention mechanisms, prompt optimization |
2406.03293
Report |
Text-to-Image Rectified Flow as Plug-and-Play Priors |
Xiaofeng Yang, Cheng Chen, Xulei Yang, Fayao Liu, Guosheng Lin |
Large-scale diffusion models have achieved remarkable performance in
generative tasks. Beyond their initial training applications, these models have
proven their ability to function as versatile plug-and-play priors. For
instance, 2D diffusion models can serve as loss functions to optimize 3D
implicit models. Rectified flow, a novel class of generative models, enforces a
linear progression from the source to the target distribution and has
demonstrated superior performance across various domains. Compared to
diffusion-based methods, rectified flow approaches offer better
generation quality and efficiency, requiring fewer inference steps. In this
work, we present theoretical and experimental evidence demonstrating that
rectified flow based methods offer similar functionalities to diffusion models
- they can also serve as effective priors. Besides the generative capabilities
of diffusion priors, motivated by the unique time-symmetry properties of
rectified flow models, a variant of our method can additionally perform image
inversion. Experimentally, our rectified flow-based priors outperform their
diffusion counterparts - the SDS and VSD losses - in text-to-3D generation. Our
method also displays competitive performance in image inversion and editing. |
This paper presents the first study on using pretrained rectified flow models as priors for image editing, inversion and 3D generation, similar to how diffusion models are used. |
Rectified flow models are gaining popularity for their superior generation quality and efficiency compared to diffusion models, but their potential as priors remained unexplored. |
The authors propose three methods: 1) RFDS: analogous to SDS loss in diffusion, 2) iRFDS: utilizes time-symmetry of rectified flow for image inversion, 3) RFDS-Rev: a two-stage method to improve RFDS generation quality. |
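To make the analogy with SDS concrete, here is a minimal sketch under one common rectified-flow convention: the noisy sample is the linear interpolation x_t = (1 - t) x + t ε, and the velocity target is ε - x. The residual between a pretrained velocity prediction and this target plays the role that the noise residual plays in SDS. The function name and the convention are assumptions and may differ from the paper's exact formulation.

```python
import numpy as np

def rfds_residual(x, noise, t, velocity_fn):
    """SDS-style residual for a rectified-flow prior (assumed convention).

    x:           current image or rendering, any shape
    noise:       Gaussian noise with the same shape as x
    t:           scalar time in [0, 1]
    velocity_fn: callable (x_t, t) -> predicted velocity; stands in for a pretrained model
    """
    x_t = (1.0 - t) * x + t * noise   # straight-line path between data and noise
    target = noise - x                # constant velocity along that path
    return velocity_fn(x_t, t) - target
```

In a score-distillation setting, this residual would be scaled by a time-dependent weight and backpropagated through the renderer that produced `x`, typically without differentiating through the velocity network.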
RFDS-Rev achieves state-of-the-art performance in text-to-3D generation benchmarks among 2D lifting methods, surpassing diffusion priors.
iRFDS demonstrates competitive performance in image inversion and editing compared to diffusion-based methods.
Rectified flow based priors show faster convergence speed in 3D generation than diffusion priors. |
The proposed methods inherit limitations of 2D models, such as difficulty in generating 3D objects with consistent camera poses.
The priors might inherit biases present in the pretrained text-to-image models. |
rectified flow, diffusion model, generative prior, text-to-3d generation, image inversion |
2406.03280
Report |
FusionBench: A Comprehensive Benchmark of Deep Model Fusion |
Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Do, Dacheng Tao |
Deep model fusion is an emerging technique that unifies the predictions or
parameters of several deep neural networks into a single model in a
cost-effective and data-efficient manner. This enables the unified model to
take advantage of the original models' strengths, potentially exceeding their
performance. Although a variety of deep model fusion techniques have been
introduced, their evaluations tend to be inconsistent and often inadequate to
validate their effectiveness and robustness against distribution shifts. To
address this issue, we introduce FusionBench, which is the first comprehensive
benchmark dedicated to deep model fusion. FusionBench covers a wide range of
tasks, including open-vocabulary image classification, text classification, and
text-to-text generation. Each category includes up to eight tasks with
corresponding task-specific models, featuring both full fine-tuning and LoRA
fine-tuning, as well as models of different sizes, to ensure fair and balanced
comparisons of various multi-task model fusion techniques across different
tasks, model scales, and fine-tuning strategies. We implement and evaluate a
broad spectrum of deep model fusion techniques. These techniques range from
model ensemble methods, which combine the predictions to improve the overall
performance, to model merging, which integrates different models into a single
one, and model mixing methods, which upscale or recombine the components of the
original models. FusionBench now contains 26 distinct tasks, 74 fine-tuned
models, and 16 fusion techniques, and we are committed to consistently
expanding the benchmark with more tasks, models, and fusion techniques. In
addition, we offer a well-documented set of resources and guidelines to aid
researchers in understanding and replicating the benchmark results. Homepage:
https://github.com/tanganke/fusion_bench |
This paper introduces FusionBench, the first comprehensive benchmark dedicated to evaluating deep model fusion techniques across a variety of tasks and model architectures. |
Standardized benchmarks for evaluating deep model fusion are lacking, making it challenging to verify the effectiveness and robustness of these techniques. FusionBench addresses this issue and provides insights into best practices and future research directions. |
FusionBench adopts a modular and extensible platform comprising three core modules: Algorithm Module, Model Pool Module, and Task Pool Module. The benchmark covers a wide range of tasks, including image classification, scene understanding, text classification, and text-to-text generation, using various deep learning models like CLIP, ResNet-50, GPT-2, and Flan-T5. |
Multi-task model fusion algorithms generally outperform pre-trained models, demonstrating the effectiveness of knowledge transfer.
Layer-wise AdaMerging and Weight-Ensembling MoE achieve superior overall performance among the multi-task model fusion methods.
Adaptive model fusion methods may be prone to overfitting on certain tasks when the test data distribution is corrupted, indicating the need for further regularization to improve generalization and robustness. |
FusionBench currently primarily focuses on evaluating deep model fusion for multi-task learning.
Future work includes extending the benchmark by incorporating additional datasets and applications, such as human preference alignment, multi-modal fusion, and reinforcement learning tasks. |
deep model fusion, benchmarking, multi-task learning, model ensemble, model merging, model mixing |
2406.03215
Report |
Searching Priors Makes Text-to-Video Synthesis Better |
Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu |
Significant advancements in video diffusion models have brought substantial
progress to the field of text-to-video (T2V) synthesis. However, existing T2V
synthesis models struggle to accurately generate complex motion dynamics,
leading to a reduction in video realism. One possible solution is to collect
massive data and train the model on it, but this would be extremely expensive.
To alleviate this problem, in this paper, we reformulate the typical T2V
generation process as a search-based generation pipeline. Instead of scaling up
the model training, we employ existing videos as the motion prior database.
Specifically, we divide T2V generation process into two steps: (i) For a given
prompt input, we search existing text-video datasets to find videos with text
labels that closely match the prompt motions. We propose a tailored search
algorithm that emphasizes object motion features. (ii) Retrieved videos are
processed and distilled into motion priors to fine-tune a pre-trained base T2V
model, followed by generating desired videos using the input prompt. By utilizing
the priors gleaned from the searched videos, we enhance the realism of the
generated videos' motion. All operations can be finished on a single NVIDIA RTX
4090 GPU. We validate our method against state-of-the-art T2V models across
diverse prompt inputs. The code will be public. |
This paper introduces a novel search-based text-to-video (T2V) generation pipeline that leverages existing video data to improve the realism of generated videos, particularly in terms of motion dynamics. |
Current T2V models often struggle to generate realistic and complex motion sequences. This work aims to address this limitation by utilizing the abundance of real-world motion information available in existing video datasets. |
The proposed method involves two main steps: 1) **Video Retrieval:** Given a text prompt, semantically similar videos are retrieved from a dataset. 2) **Tuning and Synthesis:** Keyframes are extracted from the retrieved videos, distilled into motion priors, and used to fine-tune a pre-trained T2V model for generating the final video. |
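The retrieval step can be sketched as a nearest-neighbor search over caption embeddings. This is a generic cosine-similarity ranking, assuming the prompt and the dataset captions have already been embedded by some text encoder; it does not reproduce the paper's tailored, motion-focused search algorithm, and the function name is hypothetical.

```python
import numpy as np

def top_k_by_cosine(query_emb, caption_embs, k):
    """Return indices and scores of the k captions most similar to the query.

    query_emb:    (D,)   embedding of the input prompt
    caption_embs: (N, D) embeddings of the dataset captions
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per caption
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]
```

The videos attached to the returned indices would then be the ones processed into motion priors for fine-tuning.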
The method generates videos with more realistic and temporally coherent motion compared to existing T2V models.
User studies confirm that the generated videos are perceived as more realistic and better aligned with the input text prompts.
Ablation studies highlight the importance of both the video retrieval and motion distillation components for achieving high-quality results. |
The method's reliance on text-based video retrieval can be limiting due to semantic ambiguity and the complex relationship between motion and appearance.
The keyframe extraction process may sometimes miss broader dynamic context, focusing solely on detected objects. |
text-to-video synthesis, video diffusion models, motion dynamics, video retrieval, motion distillation |
2406.03184
Report |
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion |
Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Yu Qiao, Lu Sheng |
Existing single image-to-3D creation methods typically involve a two-stage
process, first generating multi-view images, and then using these images for 3D
reconstruction. However, training these two stages separately leads to
significant data bias in the inference phase, thus affecting the quality of
reconstructed results. We introduce a unified 3D generation framework, named
Ouroboros3D, which integrates diffusion-based multi-view image generation and
3D reconstruction into a recursive diffusion process. In our framework, these
two modules are jointly trained through a self-conditioning mechanism, allowing
them to adapt to each other's characteristics for robust inference. During the
multi-view denoising process, the multi-view diffusion model uses the 3D-aware
maps rendered by the reconstruction module at the previous timestep as
additional conditions. The recursive diffusion framework with 3D-aware feedback
unites the entire process and improves geometric consistency. Experiments show
that our framework outperforms separation of these two stages and existing
methods that combine them at the inference phase. Project page:
https://costwen.github.io/Ouroboros3D/ |
This paper introduces Ouroboros3D, a unified image-to-3D creation framework that integrates multi-view image generation and 3D reconstruction into a recursive diffusion process. |
Existing two-stage methods for single image-to-3D creation suffer from data bias during inference, which affects the quality of the reconstructed 3D models. This paper aims to address this issue by proposing a unified framework. |
Ouroboros3D jointly trains a multi-view diffusion model and a feed-forward reconstruction model through a self-conditioning mechanism. During multi-view denoising, the diffusion model utilizes 3D-aware maps (e.g., color and coordinate maps) rendered from the reconstructed 3D model at the previous timestep as additional conditions. |
Ouroboros3D outperforms existing two-stage methods and methods that combine stages during inference in terms of multi-view consistency and 3D reconstruction quality.
The proposed framework effectively mitigates data bias by enabling the two stages to adapt to each other's characteristics.
Experiments demonstrate superior geometric consistency and detail in the generated 3D models. |
The current implementation utilizes 3D Gaussian Splatting as the 3D representation, which might limit its applicability in certain domains.
Future work includes exploring mesh-based 3D representations and extending the framework to handle 3D scenes. |
3d reconstruction, diffusion models, multi-view synthesis, self-conditioning, image-to-3d generation |
2406.03175
Report |
Dynamic 3D Gaussian Fields for Urban Areas |
Tobias Fischer, Jonas Kulhanek, Samuel Rota Bulò, Lorenzo Porzi, Marc Pollefeys, Peter Kontschieder |
We present an efficient neural 3D scene representation for novel-view
synthesis (NVS) in large-scale, dynamic urban areas. Existing works are not
well suited for applications like mixed-reality or closed-loop simulation due
to their limited visual quality and non-interactive rendering speeds. Recently,
rasterization-based approaches have achieved high-quality NVS at impressive
speeds. However, these methods are limited to small-scale, homogeneous data,
i.e. they cannot handle severe appearance and geometry variations due to
weather, season, and lighting and do not scale to larger, dynamic areas with
thousands of images. We propose 4DGF, a neural scene representation that scales
to large-scale dynamic urban areas, handles heterogeneous input data, and
substantially improves rendering speeds. We use 3D Gaussians as an efficient
geometry scaffold while relying on neural fields as a compact and flexible
appearance model. We integrate scene dynamics via a scene graph at global scale
while modeling articulated motions on a local level via deformations. This
decomposed approach enables flexible scene composition suitable for real-world
applications. In experiments, we surpass the state-of-the-art by over 3 dB in
PSNR and more than 200 times in rendering speed. |
This paper introduces a novel neural scene representation method for large-scale, dynamic urban areas, enabling efficient and high-quality novel-view synthesis. |
Existing methods struggle to achieve both high visual quality and fast rendering speeds in complex urban environments, limiting their use in applications like mixed-reality and simulation. |
The method leverages 3D Gaussian primitives for geometry, neural fields for compact and flexible appearance modeling, and a scene graph to handle scene dynamics and transient geometry variations. |
The method outperforms state-of-the-art approaches by over 3dB in PSNR and is more than 200x faster in rendering.
It effectively reconstructs large-scale urban areas from heterogeneous data sources with varying weather, lighting, and seasons.
The approach successfully models non-rigid object motion, such as pedestrians and cyclists, via a deformation head in the scene graph. |
The method currently does not model image distortions caused by the physical image formation process, such as rolling shutter or motion blur.
The assumption of a pinhole camera model might be suboptimal for certain capturing settings, such as equirectangular cameras. |
novel view synthesis, 3d scene representation, neural fields, 3d gaussian splatting, dynamic scenes |
2406.03070
Report |
A-Bench: Are LMMs Masters at Evaluating AI-generated Images? |
Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai |
How to accurately and efficiently assess AI-generated images (AIGIs) remains
a critical challenge for generative models. Given the high costs and extensive
time commitments required for user studies, many researchers have turned
towards employing large multi-modal models (LMMs) as AIGI evaluators, the
precision and validity of which are still questionable. Furthermore,
traditional benchmarks often utilize mostly natural-captured content rather
than AIGIs to test the abilities of LMMs, leading to a noticeable gap for
AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to
diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is
organized under two key principles: 1) Emphasizing both high-level semantic
understanding and low-level visual quality perception to address the intricate
demands of AIGIs. 2) Various generative models are utilized for AIGI creation,
and various LMMs are employed for evaluation, which ensures a comprehensive
validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are
sampled, each paired with question-answers annotated by human experts, and
tested across 18 leading LMMs. We hope that A-Bench will significantly enhance
the evaluation process and promote the generation quality for AIGIs. The
benchmark is available at https://github.com/Q-Future/A-Bench. |
This paper introduces A-Bench, a diagnostic benchmark designed to evaluate the ability of large multi-modal models (LMMs) to assess AI-generated images (AIGIs). |
Accurate and efficient evaluation of AIGIs is crucial, but existing methods using small expert models or traditional benchmarks have limitations. LMMs are increasingly used for evaluation, but their reliability remains questionable. |
A-Bench focuses on high-level semantic understanding (A-Bench$^{P1}$) and low-level quality perception (A-Bench$^{P2}$). It includes 2,864 AIGIs from 16 T2I models, paired with question-answers annotated by human experts, and tests 18 LMMs. |
LMMs outperform random guessing but lag significantly behind human performance in evaluating AIGIs.
LMMs excel at basic semantic understanding but struggle with complex prompts and nuanced quality assessment, particularly in identifying generative distortions.
Proprietary LMMs generally outperform open-source LMMs, but both fall short of human-level evaluation. |
The choice and number of generative models and LMMs used in A-Bench might limit the generalizability of the results.
The rapid evolution of AI might necessitate frequent updates to A-Bench to maintain its relevance. |
ai-generated images, image evaluation, large multi-modal models, benchmarking, semantic understanding |
2406.03035
Report |
Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control |
Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang Liu, Wenhan Luo |
Pose-controllable character video generation is in high demand with extensive
applications for fields such as automatic advertising and content creation on
social media platforms. While existing character image animation methods using
pose sequences and reference images have shown promising performance, they tend
to struggle with incoherent animation in complex scenarios, such as multiple
character animation and body occlusion. Additionally, current methods require
large-scale high-quality videos with stable backgrounds and temporal
consistency as training data; otherwise, their performance deteriorates
greatly. These two issues hinder the practical utilization of character
image animation tools. In this paper, we propose a practical and robust
framework Follow-Your-Pose v2, which can be trained on noisy open-sourced
videos readily available on the internet. Multi-condition guiders are designed
to address the challenges of background stability, body occlusion in
multi-character generation, and consistency of character appearance. Moreover,
to fill the gap of fair evaluation of multi-character pose animation, we
propose a new benchmark comprising approximately 4,000 frames. Extensive
experiments demonstrate that our approach outperforms state-of-the-art methods
by a margin of over 35% across 2 datasets and on 7 metrics. Meanwhile,
qualitative assessments reveal a significant improvement in the quality of
generated video, particularly in scenarios involving complex backgrounds and
multi-character body occlusion, suggesting the superiority of our approach. |
This paper presents Follow-Your-Pose v2, a practical and robust framework for character image animation trained on noisy open-sourced videos, addressing limitations of existing methods in handling complex scenarios like multiple characters and body occlusion. |
Pose-controllable character video generation is crucial for various applications, including automatic advertising and content creation. Existing methods struggle with incoherent animation in complex scenes and require high-quality training data, limiting their practicality. |
FYPv2 employs multi-condition guided generation: optical flow guider for background stability, depth guider for addressing body occlusion in multi-character generation, and reference pose guider for appearance consistency. It's trained on a large-scale noisy dataset from the internet. Additionally, a new benchmark with approximately 4,000 frames is proposed for evaluating multi-character pose animation. |
FYPv2 outperforms state-of-the-art methods by over 35% across 2 datasets and on 7 metrics.
It demonstrates significant improvement in generating temporally consistent and realistic animations, especially in complex backgrounds and multi-character scenes with body occlusion.
The proposed multi-character benchmark provides a valuable resource for evaluating character animation models. |
The model's performance might be affected by extreme pose variations or complex actions not well-represented in the training data.
Future work could explore incorporating more sophisticated temporal modeling techniques for smoother and more natural animations. |
character image animation, pose control, video generation, latent diffusion model, multi-character animation |
2406.02968
Report |
Adversarial Generation of Hierarchical Gaussians for 3D Generative Model |
Sangeek Hyun, Jae-Pil Heo |
Most advances in 3D Generative Adversarial Networks (3D GANs) largely depend
on ray casting-based volume rendering, which incurs demanding rendering costs.
One promising alternative is rasterization-based 3D Gaussian Splatting (3D-GS),
providing a much faster rendering speed and explicit 3D representation. In this
paper, we exploit Gaussians as a 3D representation for 3D GANs by leveraging their
efficient and explicit characteristics. However, in an adversarial framework,
we observe that a naïve generator architecture suffers from training
instability and lacks the capability to adjust the scale of Gaussians. This
leads to model divergence and visual artifacts due to the absence of proper
guidance for initialized positions of Gaussians and densification to manage
their scales adaptively. To address these issues, we introduce a generator
architecture with a hierarchical multi-scale Gaussian representation that
effectively regularizes the position and scale of generated Gaussians.
Specifically, we design a hierarchy of Gaussians where finer-level Gaussians
are parameterized by their coarser-level counterparts; the position of
finer-level Gaussians would be located near their coarser-level counterparts,
and the scale would monotonically decrease as the level becomes finer, modeling
both coarse and fine details of the 3D scene. Experimental results demonstrate
that our method achieves a significantly faster rendering speed (100x) compared to
state-of-the-art 3D consistent GANs with comparable 3D generation capability.
Project page: https://hse1032.github.io/gsgan. |
This paper introduces the use of 3D Gaussian representation with rasterization for efficient 3D GANs, proposing a hierarchical structure that regularizes the positions and scales of Gaussians to improve training stability and generation quality. |
Existing 3D GANs rely heavily on computationally expensive ray casting-based volume rendering. This paper leverages the efficiency of rasterization-based 3D Gaussian Splatting (3D-GS) to accelerate the rendering process significantly. |
The authors propose a hierarchical 3D Gaussian representation for the generator in 3D GANs. This hierarchy encourages coarse-to-fine 3D scene modeling by linking the position and scale parameters of Gaussians at adjacent levels. The generator architecture, based on transformer blocks, implements this hierarchy, ensuring stable training and detailed scene generation. Additionally, anchor Gaussians are introduced to further enhance the regularization process. |
The proposed method achieves significantly faster rendering speeds (over 100 times faster than state-of-the-art methods) while maintaining comparable generation quality.
Experiments on FFHQ and AFHQ-Cat datasets demonstrate the effectiveness of the proposed method in generating realistic and multi-view consistent images.
The hierarchical Gaussian representation stabilizes the training process, especially during the early stages, compared to naive 3D Gaussian implementations in GANs. |
The number of Gaussians used is fixed and not adapted based on scene complexity, potentially limiting representation capacity for diverse scenes.
The scale hierarchy relies on hyperparameters that might need adjustment based on the dataset and resolution. |
generative adversarial networks (gans), 3d gaussian splatting, rasterization, hierarchical representation, efficient rendering |
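The coarse-to-fine constraint described for the hierarchical Gaussians (finer-level positions stay near their coarser-level counterparts, and scales shrink as the level becomes finer) can be illustrated with a minimal NumPy sketch. The specific parameterization below is an assumption for illustration, not the authors' generator: `refine_level`, the `tanh`-bounded offset, the `sigmoid` shrink factor, and the choice of two children per parent are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine_level(parent_pos, parent_scale, raw_offset, raw_shrink, children_per_parent=2):
    """Derive finer-level Gaussians from coarser-level ones (hypothetical parameterization).

    parent_pos:   (P, 3) centers of the coarse Gaussians
    parent_scale: (P, 3) positive per-axis scales of the coarse Gaussians
    raw_offset:   (P * children_per_parent, 3) unconstrained network output
    raw_shrink:   (P * children_per_parent, 3) unconstrained network output
    """
    pos = np.repeat(parent_pos, children_per_parent, axis=0)
    scale = np.repeat(parent_scale, children_per_parent, axis=0)
    # tanh bounds the offset to one parent scale per axis, so children stay near the parent.
    child_pos = pos + scale * np.tanh(raw_offset)
    # sigmoid lies in (0, 1), so every child is smaller than its parent.
    child_scale = scale * sigmoid(raw_shrink)
    return child_pos, child_scale

rng = np.random.default_rng(0)
levels = [(rng.normal(size=(4, 3)), np.full((4, 3), 0.5))]
for _ in range(3):
    p, s = levels[-1]
    n_children = p.shape[0] * 2
    # Random arrays stand in for generator outputs in this sketch.
    levels.append(refine_level(p, s, rng.normal(size=(n_children, 3)),
                               rng.normal(size=(n_children, 3))))

for depth, (p, s) in enumerate(levels):
    print(depth, p.shape[0], float(s.max()))
```

Because the constraints are built into the parameterization rather than enforced by a loss, no choice of raw outputs can place a fine Gaussian far from its parent or make it larger, which is one plausible way to obtain the regularization the summary describes.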
2406.02965
Report |
Understanding the Impact of Negative Prompts: When and How Do They Take Effect? |
Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, Cho-Jui Hsieh |
The concept of negative prompts, emerging from conditional generation models
like Stable Diffusion, allows users to specify what to exclude from the
generated images. Despite the
widespread use of negative prompts, their intrinsic mechanisms remain largely
unexplored. This paper presents the first comprehensive study to uncover how
and when negative prompts take effect. Our extensive empirical analysis
identifies two primary behaviors of negative prompts. Delayed Effect: The
impact of negative prompts is observed after positive prompts render
corresponding content. Deletion Through Neutralization: Negative prompts delete
concepts from the generated image through a mutual cancellation effect in
latent space with positive prompts. These insights reveal significant potential
real-world applications; for example, we demonstrate that negative prompts can
facilitate object inpainting with minimal alterations to the background via a
simple adaptive algorithm. We believe our findings will offer valuable insights
for the community in capitalizing on the potential of negative prompts. |
This paper presents the first comprehensive study uncovering the mechanisms of negative prompts in conditional image generation models, particularly their delayed effect and how they delete concepts through neutralization in latent space. |
Despite the popularity of negative prompts for controlling image generation, their intrinsic mechanisms remain largely unexplored, hindering the full utilization of their potential. |
The authors conduct extensive empirical analysis, visualizing cross-attention maps across diffusion steps and analyzing estimated noises, to understand when and how negative prompts take effect. |
Negative prompts exhibit a delayed effect, influencing generation only after positive prompts render corresponding content.
Negative prompts delete objects by neutralizing positive signals in latent space through a subtractive process.
Introducing negative prompts too early can lead to the paradoxical generation of the undesired object ("Reverse Activation") due to the interplay of data distribution guidance and prompt guidance. |
The study primarily focuses on noun and adjective-based negative prompts, leaving other parts of speech unexplored.
Future work can explore incorporating negative prompts during model training as a form of data augmentation. |
negative prompts, diffusion models, image generation, controllable inpainting, reverse activation |
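The subtractive mechanism and the delayed use of negative prompts can be sketched with the standard classifier-free guidance combination, in which the negative-prompt prediction takes the place of the empty-prompt prediction. The toy setup below is an assumption for illustration only: noise predictions are modeled as sums of concept directions, and `guided_noise`, `negative_branch`, and `start_step` are hypothetical names rather than the paper's algorithm.

```python
import numpy as np

def guided_noise(eps_pos, eps_neg, guidance_scale):
    """Classifier-free guidance with the negative branch replacing the empty-prompt branch."""
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

def negative_branch(step, start_step, eps_empty, eps_neg):
    """Use the empty-prompt prediction until start_step, then switch to the negative prompt."""
    return eps_neg if step >= start_step else eps_empty

# Toy linear model (assumption): each concept contributes one direction to the noise.
man = np.array([1.0, 0.0])
glasses = np.array([0.0, 1.0])
eps_empty = np.zeros(2)
eps_pos = man + glasses      # positive prompt: "a man with glasses"
eps_neg = glasses            # negative prompt: "glasses"
scale = 7.5

standard = guided_noise(eps_pos, eps_empty, scale)
with_negative = guided_noise(eps_pos, eps_neg, scale)
print(standard)        # guidance amplifies both concept directions
print(with_negative)   # the "glasses" direction is no longer amplified

# Delayed application: the negative prompt only enters from step 3 onward.
schedule = [negative_branch(t, 3, eps_empty, eps_neg) for t in range(6)]
```

In this toy model the shared "glasses" component cancels inside the difference `eps_pos - eps_neg`, so guidance stops amplifying it. Delaying the switch to the negative branch corresponds to the finding that negative prompts introduced too early can backfire.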
2406.02923
Report |
Rethinking Spiking Neural Networks as State Space Models |
Malyaban Bal, Abhronil Sengupta |
Spiking neural networks (SNNs) are posited as a biologically plausible
alternative to conventional neural architectures, with their core computational
framework resting on the extensively studied leaky integrate-and-fire (LIF)
neuron design. The stateful nature of LIF neurons has spurred ongoing
discussions about the ability of SNNs to process sequential data, akin to
recurrent neural networks (RNNs). Despite this, there remains a significant gap
in the exploration of current SNNs within the realm of long-range dependency
tasks. In this study, to extend the analysis of neuronal dynamics beyond
the simplistic LIF mechanism, we present a novel class of stochastic spiking
neuronal models grounded in state space models. We expand beyond the scalar
hidden state representation of LIF neurons, which traditionally comprises only
the membrane potential, by proposing an n-dimensional hidden state.
Additionally, we enable fine-tuned formulation of neuronal dynamics across each
layer by introducing learnable parameters, as opposed to the fixed dynamics in
LIF neurons. We also develop a robust framework for scaling these neuronal
models to deep SNN-based architectures, ensuring efficient parallel training
while also adeptly addressing the challenge of non-differentiability of
the stochastic spiking operation during the backward phase. Our models attain
state-of-the-art performance among SNN models across diverse long-range
dependency tasks, encompassing the Long Range Arena benchmark, permuted
sequential MNIST, and the Speech Command dataset. Moreover, we provide an
analysis of the energy efficiency advantages, emphasizing the sparse activity
pattern intrinsic to this spiking model. |
This paper proposes Stochastic Spiking Structured State Space Models (S6), a novel class of neuronal models inspired by biological neurons and based on state space models, to improve spiking neural networks' (SNNs) ability to process long-range dependencies in sequential data. |
Current SNNs, primarily based on the leaky integrate-and-fire (LIF) neuron model, struggle with long-range dependencies due to their simplified dynamics and limited hidden state representation. This limits their application in tasks like natural language processing and time-series analysis where long-term dependencies are crucial. |
The authors replace the scalar hidden state of LIF neurons with an n-dimensional hidden state, enabling richer temporal information encoding. They use a stochastic spiking mechanism instead of the deterministic one in LIF models. They formulate the neuronal dynamics as a convolution operation to enable parallel training and inference, enhancing scalability and energy efficiency. |
S6-based SNNs achieve state-of-the-art performance among SNN models on various long-range dependency tasks, including the Long Range Arena benchmark, permuted sequential MNIST, and the Speech Command dataset.
The model outperforms traditional non-spiking transformer-based architectures on these tasks, demonstrating its capability to handle long sequences effectively.
Analysis shows the model offers significant energy efficiency gains due to the sparse spiking activity inherent in the S6 model. |
The model's performance was primarily evaluated on classification-based long-range dependency tasks. Future work can explore its application to generative tasks.
To fully realize the energy and power efficiency benefits, future steps could involve deploying the model on edge devices and neuromorphic chips like Intel Loihi 2. |
spiking neural networks, state space models, long-range dependencies, sequence modeling, neuromorphic computing |
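The claim that neuronal dynamics with an n-dimensional hidden state can be rewritten as a convolution, which is what enables parallel training, can be checked on a small linear state space model. The sketch below is a generic illustration under stated assumptions, not the paper's neuron: the diagonal `A`, the random `B` and `C`, and the sigmoid-Bernoulli spiking rule are hypothetical choices.

```python
import numpy as np

def ssm_recurrent(A, B, C, x):
    """Sequential form: h_t = A h_{t-1} + B x_t, y_t = C h_t, with an n-dimensional state h."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolution(A, B, C, x):
    """Same output from the kernel K_k = C A^k B applied as a causal convolution."""
    L = len(x)
    kernel = np.empty(L)
    v = B.copy()
    for k in range(L):
        kernel[k] = C @ v
        v = A @ v
    return np.convolve(x, kernel)[:L]

rng = np.random.default_rng(0)
n, L = 4, 32
A = np.diag([0.9, 0.7, 0.5, 0.2])   # stable toy dynamics (assumption)
B = rng.normal(size=n)
C = rng.normal(size=n)
x = rng.normal(size=L)

y_rec = ssm_recurrent(A, B, C, x)
y_conv = ssm_convolution(A, B, C, x)

# Stochastic spiking (assumed rule): fire with probability sigmoid(y_t).
spike_prob = 1.0 / (1.0 + np.exp(-y_conv))
spikes = (rng.random(L) < spike_prob).astype(int)
print(np.allclose(y_rec, y_conv), int(spikes.sum()))
```

The two functions agree because unrolling the recurrence gives y_t as a sum over k of C A^k B x_{t-k}. The convolutional form has no step-to-step dependency, so all time steps can be computed at once. The sampling step is non-differentiable, which is the issue the abstract says the framework handles during the backward phase.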
2406.02918
Report |
U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation |
Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, Yixuan Yuan |
U-Net has become a cornerstone in various visual applications such as image
segmentation and diffusion probabilistic models. While numerous innovative
designs and improvements have been introduced by incorporating transformers or
MLPs, these networks remain limited to linear modeling of patterns and suffer
from deficient interpretability. To address these challenges, our intuition is
inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in
terms of accuracy and interpretability, which reshape the neural network
learning via the stack of non-linear learnable activation functions derived
from the Kolmogorov-Arnold representation theorem. Specifically, in this paper,
we explore the untapped potential of KANs in improving backbones for vision
tasks. We investigate, modify and re-design the established U-Net pipeline by
integrating the dedicated KAN layers on the tokenized intermediate
representation, termed U-KAN. Rigorous medical image segmentation benchmarks
verify the superiority of U-KAN by higher accuracy even with less computation
cost. We further delve into the potential of U-KAN as an alternative U-Net
noise predictor in diffusion models, demonstrating its applicability in
generating task-oriented model architectures. These endeavours unveil valuable
insights and shed light on the prospect that with U-KAN, you can make a strong
backbone for medical image segmentation and generation. Project page:
https://yes-ukan.github.io/ |
This paper proposes U-KAN, a novel framework integrating Kolmogorov-Arnold Networks (KANs) into the U-Net architecture, aiming to improve accuracy, efficiency, and interpretability in vision tasks, particularly medical image segmentation. |
Existing U-Net variations, while advanced, remain limited to linear modeling of complex patterns and lack interpretability, hindering their reliability and explainability in critical applications like medical imaging. |
U-KAN employs a two-phase encoder-decoder structure. It utilizes convolutional blocks for initial feature extraction and introduces tokenized KAN blocks at higher-level representations to capture complex patterns. Additionally, it leverages skip connections for detailed feature fusion. |
U-KAN outperforms state-of-the-art segmentation models, including U-Net++, Att-UNet, and U-Mamba, on BUSI, GlaS, and CVC-ClinicDB datasets, achieving higher IoU and F1 scores.
The method demonstrates superior efficiency with fewer parameters and comparable or lower GFLOPs than most compared methods, except for U-NeXt.
As a diffusion model backbone, Diffusion U-KAN exhibits superior generative capabilities compared to conventional U-Net-based diffusion models, achieving better FID and IS scores on the tested medical datasets. |
The paper primarily focuses on medical image analysis, exploring segmentation and generation tasks. Further research is needed to validate its effectiveness in broader vision applications.
The impact of different KAN layer configurations and their interplay with other architectural components warrants further investigation to unlock the full potential of U-KAN. |
u-net, kolmogorov-arnold networks, medical image segmentation, image generation, diffusion models |
2406.02917
Report |
A comprehensive and FAIR comparison between MLP and KAN representations for differential equations and operator networks |
Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, George Em Karniadakis |
Kolmogorov-Arnold Networks (KANs) were recently introduced as an alternative
representation model to MLP. Herein, we employ KANs to construct
physics-informed machine learning models (PIKANs) and deep operator models
(DeepOKANs) for solving differential equations for forward and inverse
problems. In particular, we compare them with physics-informed neural networks
(PINNs) and deep operator networks (DeepONets), which are based on the standard
MLP representation. We find that although the original KANs based on the
B-splines parameterization lack accuracy and efficiency, modified versions
based on low-order orthogonal polynomials have comparable performance to PINNs
and DeepONets, although they still lack robustness, as they may diverge for
different random seeds or higher order orthogonal polynomials. We visualize
their corresponding loss landscapes and analyze their learning dynamics using
information bottleneck theory. Our study follows the FAIR principles so that
other researchers can use our benchmarks to further advance this emerging
topic. |
This work systematically compares Kolmogorov-Arnold Networks (KANs) to Multilayer Perceptrons (MLPs) for solving differential equations and operator learning problems, focusing on their accuracy, efficiency, and learning dynamics. |
Despite the popularity of MLPs in scientific machine learning, they have limitations in interpretability and efficiency. KANs offer a potentially more interpretable and accurate alternative, making their systematic evaluation crucial. |
The authors benchmark various KAN architectures against MLP-based models (PINNs, DeepONets) on several problems: function approximation, Hamiltonian dynamics, Helmholtz equation, Navier-Stokes equation, Allen-Cahn equation, Burgers' equation, and Darcy flow. They analyze accuracy, training time, and use the Information Bottleneck theory to understand learning dynamics. |
Modified Chebyshev KANs (cPIKANs) show comparable accuracy to PINNs, sometimes outperforming them, but with increased training time.
cPIKANs are more robust to noise than DeepONets in operator learning tasks but require more computational resources.
Both PINNs and cPIKANs exhibit similar learning dynamics through fitting, diffusion, and total diffusion stages, as revealed by the Information Bottleneck analysis. |
Training cPIKANs is computationally more expensive than PINNs, especially for high-dimensional problems.
cPIKANs exhibit sensitivity to initialization and choice of polynomial order, sometimes leading to instability. |
scientific machine learning, kolmogorov-arnold networks, physics-informed neural networks, operator learning, information bottleneck |
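A KAN layer built on orthogonal polynomials, as in the Chebyshev variants compared here, can be sketched as follows: each input-output edge carries its own learnable univariate function, expressed as a sum of Chebyshev polynomials. This is a minimal illustration under assumptions, not the authors' implementation. The `tanh` squashing into [-1, 1], the coefficient layout, and the names `chebyshev_basis` and `chebyshev_kan_layer` are hypothetical.

```python
import numpy as np

def chebyshev_basis(x, degree):
    """Evaluate T_0..T_degree at x in [-1, 1]; result has shape x.shape + (degree + 1,)."""
    T = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        # Three-term recurrence: T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)
        T.append(2.0 * x * T[-1] - T[-2])
    return np.stack(T[: degree + 1], axis=-1)

def chebyshev_kan_layer(x, coeffs):
    """x: (batch, d_in); coeffs: (d_in, d_out, degree + 1), one polynomial per edge."""
    degree = coeffs.shape[-1] - 1
    basis = chebyshev_basis(np.tanh(x), degree)       # (batch, d_in, degree + 1)
    # Output j sums the edge functions phi_ij(x_i) over all inputs i.
    return np.einsum("bik,iok->bo", basis, coeffs)

rng = np.random.default_rng(0)
coeffs = rng.normal(scale=0.1, size=(2, 3, 5))        # d_in=2, d_out=3, degree=4
x = rng.normal(size=(8, 2))
y = chebyshev_kan_layer(x, coeffs)
print(y.shape)
```

The polynomial degree enters through the last axis of `coeffs`, which is the knob the summary associates with instability at higher orders.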
2406.02915
Report |
Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models |
Jinhao Li, Haopeng Li, Sarah Erfani, Lei Feng, James Bailey, Feng Liu |
It has recently been discovered that using a pre-trained vision-language
model (VLM), e.g., CLIP, to align a whole query image with several finer text
descriptions generated by a large language model can significantly enhance
zero-shot performance. However, in this paper, we empirically find that the
finer descriptions tend to align more effectively with local areas of the query
image rather than the whole image, and then we theoretically validate this
finding. Thus, we present a method called weighted visual-text cross alignment
(WCA). This method begins with a localized visual prompting technique, designed
to identify local visual areas within the query image. The local visual areas
are then cross-aligned with the finer descriptions by creating a similarity
matrix using the pre-trained VLM. To determine how well a query image aligns
with each category, we develop a score function based on the weighted
similarities in this matrix. Extensive experiments demonstrate that our method
significantly improves zero-shot performance across various datasets, achieving
results that are even comparable to few-shot learning methods. |
This paper proposes Weighted Visual-Text Cross Alignment (WCA), a method that improves zero-shot visual classification by aligning fine-grained text descriptions with local visual areas of an image using localized visual prompting. |
Aligning whole images with fine-grained descriptions can be suboptimal, as such descriptions often better match specific image regions. WCA addresses this limitation by focusing on local alignment, leading to improved performance. |
WCA first segments an image into local patches using localized visual prompting. Then, it cross-aligns these patches with fine-grained text descriptions generated by a large language model for each category, creating a similarity matrix. Finally, a weighted aggregation scheme, considering the relevance of both patches and descriptions, determines the final image-category alignment score. |
WCA significantly outperforms existing zero-shot methods on various benchmarks, including ImageNet, CUB, and Oxford Pets.
The method shows particularly strong improvements on tasks where standard CLIP models struggle, indicating its effectiveness in handling complex visual recognition scenarios.
WCA even achieves performance comparable to few-shot learning methods, highlighting its potential for learning with limited labeled data. |
WCA might be less effective for tasks requiring holistic image understanding rather than object-centric recognition.
The method's performance could be hindered when images contain multiple objects of varying sizes, as patch weights might not always accurately capture the importance of smaller objects. |
visual-text cross alignment, zero-shot classification, vision-language models, visual prompting, large language models |
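The scoring step of WCA (a patch-by-description similarity matrix reduced to one score per category by weighting both sides) can be sketched in a few lines. The weighting used below is an assumption for illustration: patch weights come from each patch's similarity to the whole image, description weights from each description's similarity to the category label, both passed through a softmax. Random vectors stand in for VLM embeddings, and `wca_score` is a hypothetical name.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def wca_score(image_feat, patch_feats, label_feat, desc_feats):
    """Weighted cross-alignment score between N local patches and M descriptions."""
    image_feat, patch_feats = l2_normalize(image_feat), l2_normalize(patch_feats)
    label_feat, desc_feats = l2_normalize(label_feat), l2_normalize(desc_feats)
    sim = patch_feats @ desc_feats.T                 # (N, M) cosine similarities
    patch_w = softmax(patch_feats @ image_feat)      # relevance of each patch (assumed form)
    desc_w = softmax(desc_feats @ label_feat)        # relevance of each description (assumed form)
    return float(patch_w @ sim @ desc_w)

rng = np.random.default_rng(0)
dim, n_patches, n_desc = 16, 5, 4
image_feat = rng.normal(size=dim)
patch_feats = rng.normal(size=(n_patches, dim))
categories = {
    name: (rng.normal(size=dim), rng.normal(size=(n_desc, dim)))
    for name in ["cat", "dog", "bird"]
}
scores = {name: wca_score(image_feat, patch_feats, label, descs)
          for name, (label, descs) in categories.items()}
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Since both weight vectors sum to one, the score is a convex combination of cosine similarities and stays in [-1, 1], which keeps scores comparable across categories with different numbers of descriptions.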
2406.02881
Report |
Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter |
Peng Xing, Ning Wang, Jianbo Ouyang, Zechao Li |
The remarkable advancement in text-to-image generation models significantly
boosts the research in ID customization generation. However, existing
personalization methods cannot simultaneously satisfy high fidelity and
high-efficiency requirements. Their main bottleneck lies in the prompt image
encoder, which produces weak alignment signals with the text-to-image model and
significantly increased model size. Towards this end, we propose a lightweight
Inv-Adapter, which first extracts diffusion-domain representations of ID images
utilizing a pre-trained text-to-image model via DDIM image inversion, without
an additional image encoder. Benefiting from the high alignment of the extracted
ID prompt features and the intermediate features of the text-to-image model, we
then embed them efficiently into the base text-to-image model by carefully
designing a lightweight attention adapter. We conduct extensive experiments to
assess ID fidelity, generation loyalty, speed, and training parameters, all of
which show that the proposed Inv-Adapter is highly competitive in ID
customization generation and model scale. |
This paper proposes Inv-Adapter, a lightweight method for high-fidelity ID customization in text-to-image generation, utilizing DDIM image inversion to extract diffusion-domain representations of ID images and embedding them efficiently via a lightweight attention adapter. |
Existing personalization methods struggle to achieve both high fidelity and high efficiency in ID customization generation due to weak alignment signals and increased model size from prompt image encoders. |
Inv-Adapter extracts diffusion features from pre-trained text-to-image models via DDIM inversion and injects them into both self and cross attention layers using a lightweight Embedded Attention Adapter. |
Inv-Adapter achieves state-of-the-art performance in generating faithful, detailed, and high-fidelity images while maintaining high efficiency.
It effectively preserves ID information while aligning with textual prompts, demonstrated by quantitative metrics (CLIP-I, DINO, FACE-SIM) and qualitative results.
The lightweight design results in smaller training parameters and faster generation speed compared to other methods. |
The current training dataset lacks diversity in face poses, limiting the model's ability to generalize to different viewpoints.
Image inversion introduces a speed bottleneck, which could be addressed in future work with model acceleration techniques like LCM. |
id customization generation, text-to-image generation, image inversion, attention adapter, diffusion models |
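DDIM image inversion, which Inv-Adapter uses to obtain diffusion-domain features, runs the deterministic DDIM update in the noising direction. The sketch below shows only that update rule on toy arrays. The constant `eps_model` is a stand-in assumption for a real noise-prediction network, and `ddim_step`, `ddim_invert`, and the `alphas` schedule are hypothetical names and values.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM move between two noise levels (alpha = cumulative alpha-bar)."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(x0, eps_model, alphas):
    """Walk from a clean latent toward noise; alphas decreases from near 1."""
    x, trajectory = x0, [x0]
    for t in range(len(alphas) - 1):
        # Inversion reuses the noise predicted at the current level for the next one.
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alphas[t], alphas[t + 1])
        trajectory.append(x)
    return trajectory

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))
direction = rng.normal(size=(4, 4))
eps_model = lambda x, t: direction          # toy stand-in (assumption), ignores x and t
alphas = np.linspace(0.9999, 0.05, 20)

trajectory = ddim_invert(x0, eps_model, alphas)

# Denoising with the same update in the opposite direction retraces the path.
x = trajectory[-1]
for t in reversed(range(len(alphas) - 1)):
    x = ddim_step(x, eps_model(x, t + 1), alphas[t + 1], alphas[t])
print(np.allclose(x, x0))
```

With a real network the round trip is only approximate, because the noise predicted at one level is reused for the next. The loop also costs one network call per step, consistent with the speed bottleneck listed among the limitations.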
2406.02820
Report |
ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models |
Kiymet Akdemir, Pinar Yanardag |
Text-to-image diffusion models have recently taken center stage as pivotal
tools in promoting visual creativity across an array of domains such as comic
book artistry, children's literature, game development, and web design. These
models harness the power of artificial intelligence to convert textual
descriptions into vivid images, thereby enabling artists and creators to bring
their imaginative concepts to life with unprecedented ease. However, one of the
significant hurdles that persist is the challenge of maintaining consistency in
character generation across diverse contexts. Variations in textual prompts,
even if minor, can yield vastly different visual outputs, posing a considerable
problem in projects that require a uniform representation of characters
throughout. In this paper, we introduce a novel framework designed to produce
consistent character representations from a single text prompt across diverse
settings. Through both quantitative and qualitative analyses, we demonstrate
that our framework outperforms existing methods in generating characters with
consistent visual identities, underscoring its potential to transform creative
industries. By addressing the critical challenge of character consistency, we
not only enhance the practical utility of these models but also broaden the
horizons for artistic and creative expression. |
This paper introduces ORACLE, a novel framework that leverages mutual information to ensure consistent character generation across diverse settings from a single text prompt in text-to-image diffusion models. |
The work addresses the critical challenge of maintaining visual consistency in character generation across different contexts, which is crucial for storytelling, brand identity, and character recognition in various creative applications. |
ORACLE works in three steps: 1. It generates a grid of candidate character images from a text prompt using a pre-trained diffusion model. 2. It identifies and removes inconsistent images from the candidate set using mutual information-based filtering. 3. It trains a personalized model (e.g., LoRA) on the refined image set to generate consistent characters in various contexts. |
ORACLE outperforms existing methods in generating characters with consistent visual identities across diverse settings, as demonstrated through qualitative and quantitative comparisons.
User study confirms that ORACLE produces characters that are both consistent and relevant to the given text prompts.
The framework is highly versatile and applicable for various creative tasks like story illustration, object generation, and 3D character modeling. |
Despite consistent input images, the underlying diffusion model may still introduce minor inconsistencies in details like clothing.
The current implementation requires manual cropping of the generated character grid, which can be automated in future work. |
text-to-image synthesis, diffusion models, character consistency, mutual information, personalization |
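The filtering step (drop candidates that share little information with the rest of the set) can be illustrated with a simple histogram-based mutual information estimate. How ORACLE actually estimates mutual information is not stated in this summary, so everything below is an assumption: the pixel-intensity histogram estimator, the mean-pairwise ranking, and the names `mutual_information` and `filter_consistent` are hypothetical.

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram estimate of MI (in nats) between two same-size grayscale images in [0, 1]."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins, range=[[0, 1], [0, 1]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def filter_consistent(images, keep):
    """Keep the `keep` images with the highest mean pairwise MI to the others."""
    n = len(images)
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(images[i], images[j])
    mean_mi = mi.sum(axis=1) / (n - 1)
    order = np.argsort(-mean_mi)
    return sorted(order[:keep].tolist())

rng = np.random.default_rng(0)
base = rng.random((32, 32))
# Three noisy variants of one "character" plus one unrelated image.
images = [np.clip(base + 0.02 * rng.normal(size=base.shape), 0.0, 1.0) for _ in range(3)]
images.append(rng.random((32, 32)))

kept = filter_consistent(images, keep=3)
print(kept)
```

A pixel-level estimator like this would be sensitive to pose and background changes, so a practical system would more plausibly compare learned features. The sketch only shows the ranking-and-pruning logic.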
2406.02720
Report |
3D-HGS: 3D Half-Gaussian Splatting |
Haolin Li, Jinyang Liu, Mario Sznaier, Octavia Camps |
Photo-realistic 3D Reconstruction is a fundamental problem in 3D computer
vision. This domain has seen considerable advancements owing to the advent of
recent neural rendering techniques. These techniques predominantly focus
on learning volumetric representations of 3D scenes and refining these
representations via loss functions derived from rendering. Among these, 3D
Gaussian Splatting (3D-GS) has emerged as a significant method, surpassing
Neural Radiance Fields (NeRFs). 3D-GS uses parameterized 3D Gaussians for
modeling both spatial locations and color information, combined with a
tile-based fast rendering technique. Despite its superior rendering performance
and speed, the use of 3D Gaussian kernels has inherent limitations in
accurately representing discontinuous functions, notably at edges and corners
for shape discontinuities, and across varying textures for color
discontinuities. To address this problem, we propose to employ 3D Half-Gaussian
(3D-HGS) kernels, which can be used as a plug-and-play kernel. Our experiments
demonstrate their capability to improve the performance of current 3D-GS
related methods and achieve state-of-the-art rendering performance on various
datasets without compromising rendering speed. |
This paper introduces 3D Half-Gaussian Splatting (3D-HGS), a novel plug-and-play reconstruction kernel designed to enhance the accuracy of 3D scene reconstruction in neural rendering. The key innovation lies in splitting the traditional 3D Gaussian kernel into two halves, each with learnable opacity, enabling better representation of discontinuities in shape and color often found at edges, corners, and texture-rich areas. |
Accurately reconstructing 3D scenes with photorealism is crucial for various applications such as VR, media production, and autonomous driving. While existing methods like 3D Gaussian Splatting (3D-GS) have achieved impressive speed and quality, they struggle with discontinuities. This work addresses this limitation, aiming for state-of-the-art performance without sacrificing rendering speed. |
The method starts with a 3D scene representation obtained through Structure from Motion. Instead of 3D Gaussians, 3D Half-Gaussians, defined by a splitting plane and individual opacities for each half, are used as reconstruction kernels. These are projected onto the image plane and blended to synthesize novel views. The parameters of these kernels, including the splitting plane normal and opacities, are optimized by minimizing a loss function comparing rendered images to ground truth. |
3D-HGS, when implemented within existing 3D-GS frameworks, demonstrates state-of-the-art rendering performance on datasets like Mip-NeRF360, Tanks & Temples, and Deep Blending.
The method excels at capturing fine-grained details, high-frequency textures, complex lighting, and shadow areas, surpassing previous state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS) and qualitative visual comparisons.
Ablation studies confirm the effectiveness of the 3D Half-Gaussian kernel compared to other kernel choices and highlight the impact of training strategies, including the learning rate for the normal of the splitting plane. |
Despite improvements in novel view synthesis, 3D-HGS still faces challenges with geometry reconstruction in featureless areas, requiring further research.
The ethical implications of generating highly realistic 3D scenes, including potential misuse for disinformation and privacy violations, are acknowledged, highlighting the need for responsible development and deployment of such technology. |
3d reconstruction, neural rendering, gaussian splatting, novel view synthesis, discontinuity modeling |
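To make the 3D-HGS method description above concrete, here is a minimal sketch of evaluating a single half-Gaussian kernel at a 3D point. It assumes the splitting plane passes through the Gaussian mean and that each half simply scales the shared Gaussian density by its own opacity. The function name and parameter names are hypothetical, and the actual method works on projected 2D splats inside a rasterizer rather than on raw 3D points.

```python
import numpy as np

def half_gaussian_value(x, mean, cov, normal, alpha_pos, alpha_neg):
    """Evaluate a half-Gaussian kernel at point x (illustrative only).

    The Gaussian is split by a plane through `mean` with normal `normal`.
    Points on the positive side use `alpha_pos`, the others `alpha_neg`.
    """
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Unnormalized Gaussian density exp(-0.5 * d^T cov^{-1} d).
    density = np.exp(-0.5 * d @ np.linalg.solve(cov, d))
    # Select the opacity of the half that contains x.
    alpha = alpha_pos if d @ np.asarray(normal, dtype=float) >= 0.0 else alpha_neg
    return alpha * density
```

With equal opacities on both sides this reduces to an ordinary Gaussian kernel, which is why the half-Gaussian can act as a drop-in replacement.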
2406.02549
Report |
Dreamguider: Improved Training free Diffusion-based Conditional Generation |
Nithin Gopalakrishnan Nair, Vishal M Patel |
Diffusion models have emerged as a formidable tool for training-free
conditional generation. However, a key hurdle in inference-time guidance
techniques is the need for compute-heavy backpropagation through the diffusion
network for estimating the guidance direction. Moreover, these techniques often
require handcrafted parameter tuning on a case-by-case basis. Although some
recent works have introduced minimal compute methods for linear inverse
problems, a generic lightweight guidance solution to both linear and non-linear
guidance problems is still missing. To this end, we propose Dreamguider, a
method that enables inference-time guidance without compute-heavy
backpropagation through the diffusion network. The key idea is to regulate the
gradient flow through a time-varying factor. Moreover, we propose an empirical
guidance scale that works for a wide variety of tasks, hence removing the need
for handcrafted parameter tuning. We further introduce an effective lightweight
augmentation strategy that significantly boosts the performance during
inference-time guidance. We present experiments using Dreamguider on multiple
tasks across multiple datasets and models to show the effectiveness of the
proposed modules. To facilitate further research, we will make the code public
after the review process. |
This paper introduces Dreamguider, a method for inference-time guidance in diffusion models that avoids computationally expensive backpropagation through the network, enabling zero-shot conditional generation. |
Existing inference-time guidance techniques for diffusion models often require heavy computations and case-by-case parameter tuning, limiting their practicality. Dreamguider addresses these limitations with a lightweight and generic approach. |
Dreamguider regulates the gradient flow during inference using a time-varying factor and employs a gradient-dependent scaling factor for automatic parameter tuning. It also introduces DiffuseAugment, a differentiable augmentation strategy, to enhance sampling quality. |
Dreamguider achieves superior performance on linear inverse problems (e.g., super-resolution, colorization) compared to DPS and MGD.
For non-linear tasks (e.g., sketch-to-face, ID guidance), Dreamguider outperforms existing methods like FreeDoM and MGD in terms of image quality and sampling speed.
The proposed empirical scaling factor and DiffuseAugment effectively enhance the performance of zero-shot conditional generation. |
Direct application to latent diffusion models for linear inverse problems is limited due to VAE reconstruction errors.
While the empirical scaling factor demonstrates effectiveness, a comprehensive mathematical analysis for optimal parameter estimation is left for future work. |
diffusion models, inference-time guidance, zero-shot learning, conditional generation, image restoration |
2406.02548
Report |
Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation |
Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan |
Recent works on open-vocabulary 3D instance segmentation show strong promise,
but at the cost of slow inference speed and high computation requirements. This
high computation cost is typically due to their heavy reliance on 3D clip
features, which require computationally expensive 2D foundation models like
Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a
consequence, this hampers their applicability in many real-world applications
that require both fast and accurate predictions. To this end, we propose a fast
yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO
3D, that effectively leverages only 2D object detection from multi-view RGB
images for open-vocabulary 3D instance segmentation. We address this task by
generating class-agnostic 3D masks for objects in the scene and associating
them with text prompts. We observe that the projection of class-agnostic 3D
point cloud instances already holds instance information; thus, using SAM might
only result in redundancy that unnecessarily increases the inference time. We
empirically find that a better performance of matching text prompts to 3D masks
can be achieved in a faster fashion with a 2D object detector. We validate our
Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios:
(i) with ground truth masks, where labels are required for given object
proposals, and (ii) with class-agnostic 3D proposals generated from a 3D
proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on
both datasets while obtaining up to ~16x speedup compared to the
best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D
achieves mean average precision (mAP) of 24.7% while operating at 22 seconds
per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D. |
Proposes Open-YOLO 3D, a fast and accurate open-vocabulary 3D instance segmentation method using 2D object detection from multi-view RGB images. |
Existing open-vocabulary 3D instance segmentation methods are computationally expensive and slow, hindering real-world applications requiring fast and accurate predictions. |
Generates class-agnostic 3D masks and associates them with text prompts using a 2D open-vocabulary object detector to create low-granularity label maps for each frame, then uses these maps to predict labels for the 3D masks. |
Achieves state-of-the-art performance on ScanNet200 and Replica datasets.
Up to 16x faster than existing methods.
Demonstrates the effectiveness of 2D object detection for open-vocabulary 3D instance segmentation. |
Relies solely on a 3D proposal network, potentially missing small objects.
Could benefit from incorporating fast 2D instance segmentation for enhanced 3D proposal generation. |
3d instance segmentation, open-vocabulary, 2d object detection, multi-view, zero-shot learning |
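The label-map idea in the Open-YOLO 3D method description above can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes 2D detections arrive as integer pixel boxes with class ids, that the points of a 3D mask have already been projected to pixel coordinates in each frame, and that a plain majority vote over frames assigns the label. All function names are hypothetical.

```python
import numpy as np

def label_map_from_boxes(height, width, boxes):
    """Rasterize detections into a per-pixel class map (-1 means no label).

    boxes: iterable of (x0, y0, x1, y1, class_id); later boxes overwrite
    earlier ones where they overlap.
    """
    label_map = np.full((height, width), -1, dtype=int)
    for x0, y0, x1, y1, class_id in boxes:
        label_map[y0:y1, x0:x1] = class_id
    return label_map

def vote_mask_label(label_maps, pixels_per_frame):
    """Pick the most frequent class under the projected points of one 3D mask.

    pixels_per_frame: list of (N_i, 2) integer arrays of (row, col) positions,
    one array per frame, for the mask points visible in that frame.
    """
    votes = []
    for label_map, pixels in zip(label_maps, pixels_per_frame):
        if len(pixels) == 0:
            continue
        labels = label_map[pixels[:, 0], pixels[:, 1]]
        votes.extend(labels[labels >= 0].tolist())
    if not votes:
        return -1
    return int(np.bincount(votes).argmax())
```

Because the 3D mask already carries the instance identity, only a coarse per-pixel class label is needed from each frame, which is what lets a box detector stand in for a 2D segmentation model.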
2406.02547
Report |
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning |
Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou |
Training models with longer in-context lengths is a significant challenge for
multimodal models due to substantial GPU memory and computational costs. This
exploratory study does not present state-of-the-art models; rather, it
introduces an innovative method designed to increase in-context text length in
multi-modality large language models (MLLMs) efficiently. We present Visualized
In-Context Text Processing (VisInContext), which processes long in-context text
using visual tokens. This technique significantly reduces GPU memory usage and
floating point operations (FLOPs) for both the training and inference stages. For
instance, our method expands the pre-training in-context text length from 256
to 2048 tokens with nearly the same FLOPs for a 56 billion parameter MOE model.
Experimental results demonstrate that models trained with VisInContext deliver
superior performance on common downstream benchmarks for in-context few-shot
evaluation. Additionally, VisInContext is complementary to existing methods for
increasing in-context text length and enhances document understanding
capabilities, showing great potential in document QA tasks and sequential
document retrieval. |
This paper proposes VisInContext, a novel method to increase the in-context text length of Multimodal Large Language Models (MLLMs) by rendering text as images, thereby reducing computational cost. |
Training MLLMs with long in-context lengths is crucial for complex tasks like document understanding but is hindered by high GPU memory and computational costs. |
VisInContext converts long text into images and processes them using a visual encoder alongside regular images. It introduces Token Masking and Text-Centric Contrastive Learning (TCCL) to ensure the model effectively learns text semantics from these rendered images. |
VisInContext significantly improves performance on multimodal downstream tasks by increasing the effective in-context text length.
It achieves comparable performance to raw text inputs when using rendered text images for text-only in-context examples.
VisInContext significantly improves the model's document understanding abilities on tasks like DocVQA and OCRVQA. |
Currently, VisInContext uses a fixed image size even for short texts, leading to potential inefficiencies.
Future work will explore dynamically adjusting image sizes to optimize token usage. |
multimodal learning, large language models, document understanding, in-context learning, computational efficiency |
2406.02541
Report |
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting |
Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen |
Recent advancements in zero-shot video diffusion models have shown promise
for text-driven video editing, but challenges remain in achieving high temporal
consistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting
(3DGS)-based video refiner designed to enhance temporal consistency in
zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian
optimizing process tailored for editing dynamic monocular videos. In the first
stage, Video-3DGS employs an improved version of COLMAP, referred to as
MC-COLMAP, which processes original videos using a Masked and Clipped approach.
For each video clip, MC-COLMAP generates the point clouds for dynamic
foreground objects and complex backgrounds. These point clouds are utilized to
initialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) aiming to represent
foreground and background views. Both foreground and background views are then
merged with a 2D learnable parameter map to reconstruct full views. In the
second stage, we leverage the reconstruction ability developed in the first
stage to impose the temporal constraints on the video diffusion model. To
demonstrate the efficacy of Video-3DGS on both stages, we conduct extensive
experiments across two related tasks: Video Reconstruction and Video Editing.
Video-3DGS trained with 3k iterations significantly improves video
reconstruction quality (+3 PSNR, +7 PSNR increase) and training efficiency
(x1.9, x4.5 times faster) over NeRF-based and 3DGS-based state-of-the-art methods
on DAVIS dataset, respectively. Moreover, it enhances video editing by ensuring
temporal consistency across 58 dynamic monocular videos. |
This paper introduces Video-3DGS, a two-stage 3D Gaussian Splatting based framework that reconstructs and refines dynamic monocular video scenes, leading to significant improvements in both video reconstruction and editing. |
Existing zero-shot video diffusion models face challenges in achieving high temporal consistency due to their limited understanding of individual video scenes. Video-3DGS aims to address this limitation by leveraging the per-scene representation power of 3DGS. |
In the first stage, Video-3DGS utilizes an improved COLMAP (MC-COLMAP) to generate foreground and background point clouds, which are used to initialize and optimize two sets of 3D Gaussians. These 3D Gaussians, along with a 2D learnable parameter map, enable high-fidelity video reconstruction. In the second stage, the pre-optimized Video-3DGS serves as a plug-and-play refiner for zero-shot video editors, enhancing temporal consistency by fine-tuning color and opacity parameters while maintaining structural fidelity. |
Video-3DGS significantly outperforms NeRF-based and 3DGS-based state-of-the-art methods in video reconstruction quality and training efficiency on the DAVIS dataset.
It consistently enhances temporal consistency and overall editing quality across three off-the-shelf video editors (Text2Video-Zero, TokenFlow, and RAVE) on 58 challenging monocular videos.
User studies confirm a strong preference for Video-3DGS-refined edits over baseline outputs, highlighting its effectiveness in improving video editing quality. |
Video-3DGS faces challenges when foreground objects exhibit extremely large motion, and it may struggle with edits requiring significant changes to object shapes.
Future work includes exploring the potential of Video-3DGS as a fundamental framework for 4D novel view synthesis. |
video editing, 3d gaussian splatting, temporal consistency, zero-shot learning, video reconstruction |
2406.02539
Report |
Parrot: Multilingual Visual Instruction Tuning |
Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye |
The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V
has marked a significant step towards artificial general intelligence. Existing
methods mainly focus on aligning vision encoders with LLMs through supervised
fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs'
inherent ability to react to multiple languages progressively deteriorate as
the training process evolves. We empirically find that the imbalanced SFT
datasets, primarily composed of English-centric image-text pairs, lead to
significantly reduced performance in non-English languages. This is due to the
failure of aligning the vision encoder and LLM with multilingual tokens during
the SFT process. In this paper, we introduce Parrot, a novel method that
utilizes textual guidance to drive visual token alignment at the language
level. Parrot makes the visual tokens condition on diverse language inputs and
uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens.
Specifically, to enhance non-English visual tokens alignment, we compute the
cross-attention using the initial visual features and textual embeddings, the
result of which is then fed into the MoE router to select the most relevant
experts. The selected experts subsequently convert the initial visual tokens
into language-specific visual tokens. Moreover, considering the current lack of
benchmarks for evaluating multilingual capabilities within the field, we
collect and make available a Massive Multilingual Multimodal Benchmark which
includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our
method not only demonstrates state-of-the-art performance on multilingual
MMBench and MMMB, but also excels across a broad range of multimodal tasks.
Both the source code and the training dataset of Parrot will be made publicly
available. |
This paper introduces Parrot, a novel method to enhance multilingual capabilities in Multimodal Large Language Models (MLLMs) by leveraging textual guidance to drive visual token alignment at the language level, addressing the issue of English-centric bias in training data. |
Multilingual capability is crucial for MLLMs to cater to diverse linguistic groups and ensure equitable access to AI benefits across different regions and languages. |
Parrot utilizes a Mixture-of-Experts (MoE) module to convert English-biased visual features into language-specific embeddings based on input language, enabling the model to better understand and generate responses in various languages. |
Parrot achieves state-of-the-art performance on both MMBench and MMMB multilingual benchmarks, surpassing existing methods in most languages.
The model shows competitive performance across a broad range of multimodal tasks, indicating its effectiveness beyond multilingual capabilities.
Parrot achieves these improvements with substantially less multilingual training data than other models, demonstrating its efficiency in low-resource scenarios. |
MLLMs, including Parrot, may still face challenges in accurately understanding complex language-specific contexts and may exhibit hallucinations.
The current implementation of Parrot relies on CLIP for visual processing, limiting its ability to process high-resolution images effectively. |
multimodal large language models, multilingual alignment, mixture-of-experts, textual guidance, visual token alignment |
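The routing step described in the Parrot abstract above (cross-attention between visual features and text embeddings, followed by an MoE router and experts that transform the visual tokens) can be sketched roughly as below. This is an illustrative sketch under several assumptions that are not taken from the source: the cross-attended features are mean-pooled before routing, the experts are single linear maps, all experts are mixed softly instead of selecting a top-k subset, and the result is added back residually. All names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def language_aware_visual_tokens(visual, text, router_w, expert_ws):
    """visual: (Nv, d), text: (Nt, d), router_w: (d, E), expert_ws: (E, d, d)."""
    d = visual.shape[1]
    # Cross-attention: visual tokens attend to the text embeddings.
    attn = softmax(visual @ text.T / np.sqrt(d))           # (Nv, Nt)
    fused = attn @ text                                     # (Nv, d)
    # Router: pooled text-conditioned features give one weight per expert.
    gate = softmax(fused.mean(axis=0) @ router_w)           # (E,)
    # Each expert is a linear map applied to the initial visual tokens.
    expert_out = np.einsum("nd,edk->enk", visual, expert_ws)  # (E, Nv, d)
    mixed = np.tensordot(gate, expert_out, axes=1)          # (Nv, d)
    # Residual connection (an assumption) keeps the original visual content.
    return visual + mixed, gate
```

The point of the sketch is that the gate depends on the prompt text, so the same image can yield different visual tokens for prompts in different languages.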
2406.02535
Report |
Enhancing 2D Representation Learning with a 3D Prior |
Mehmet Aygün, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan |
Learning robust and effective representations of visual data is a fundamental
task in computer vision. Traditionally, this is achieved by training models
with labeled data which can be expensive to obtain. Self-supervised learning
attempts to circumvent the requirement for labeled data by learning
representations from raw unlabeled visual data alone. However, unlike humans
who obtain rich 3D information from their binocular vision and through motion,
the majority of current self-supervised methods are tasked with learning from
monocular 2D image collections. This is noteworthy as it has been demonstrated
that shape-centric visual processing is more robust compared to texture-biased
automated methods. Inspired by this, we propose a new approach for
strengthening existing self-supervised methods by explicitly enforcing a strong
3D structural prior directly into the model during training. Through
experiments, across a range of datasets, we demonstrate that our 3D aware
representations are more robust compared to conventional self-supervised
baselines. |
This paper introduces a novel method to enhance the robustness of self-supervised learning (SSL) by explicitly incorporating 3D structural information during training. |
Current SSL methods primarily focus on 2D image collections, leading to representations that may over-rely on texture and exhibit limited robustness. This work draws inspiration from the human visual system's use of 3D cues for robust understanding. |
The method leverages a proxy 3D reconstruction task. A pre-trained SSL backbone extracts image representations, which are then used to generate 3D triplane features. Volume rendering reconstructs the input image and its depth, using pseudo-depth obtained from pre-trained monocular depth models. A distillation loss from the frozen SSL backbone prevents forgetting of previously learned features. |
The 3D-aware representations demonstrate improved robustness on benchmarks like ImageNet-Rendition, ImageNet-Sketch, and PUG, outperforming baselines that lack 3D priors.
The method does not compromise performance on other downstream tasks, showing comparable or improved results on ImageNet classification, iNat21 fine-grained classification, and NYU-DepthV2 depth estimation.
Analysis confirms an increased shape bias in the learned representations, supporting the hypothesis that incorporating 3D knowledge encourages more robust and shape-centric feature learning. |
The method relies on pseudo-depth maps during training, which could introduce limitations depending on the accuracy and generalization of the pre-trained depth estimation model.
Future work could explore incorporating semantic information during training or investigating alternative 3D representations beyond triplanes. |
self-supervised learning, 3d reconstruction, robustness, shape bias, representation learning |
2406.02528
Report |
Scalable MatMul-free Language Modeling |
Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian |
Matrix multiplication (MatMul) typically dominates the overall computational
cost of large language models (LLMs). This cost only grows as LLMs scale to
larger embedding dimensions and context lengths. In this work, we show that
MatMul operations can be completely eliminated from LLMs while maintaining
strong performance at billion-parameter scales. Our experiments show that our
proposed MatMul-free models achieve performance on-par with state-of-the-art
Transformers that require far more memory during inference at a scale up to at
least 2.7B parameters. We investigate the scaling laws and find that the
performance gap between our MatMul-free models and full precision Transformers
narrows as the model size increases. We also provide a GPU-efficient
implementation of this model which reduces memory usage by up to 61% over an
unoptimized baseline during training. By utilizing an optimized kernel during
inference, our model's memory consumption can be reduced by more than 10x
compared to unoptimized models. To properly quantify the efficiency of our
architecture, we build a custom hardware solution on an FPGA which exploits
lightweight operations beyond what GPUs are capable of. We processed
billion-parameter scale models at 13W beyond human readable throughput, moving
LLMs closer to brain-like efficiency. This work not only shows how far LLMs can
be stripped back while still performing effectively, but also points at the
types of operations future accelerators should be optimized for in processing
the next generation of lightweight LLMs. Our code implementation is available
at https://github.com/ridgerchu/matmulfreellm. |
This paper introduces the first scalable MatMul-free language model (MatMul-free LM) that eliminates matrix multiplication operations by utilizing additive operations in dense layers and element-wise Hadamard products for self-attention-like functions. |
Matrix multiplication (MatMul) is a computationally expensive operation that dominates the cost of large language models (LLMs), particularly as models scale to larger sizes. This work aims to address this challenge by developing a more efficient architecture. |
The paper proposes a novel architecture called MatMul-free LM, which replaces MatMul operations with ternary accumulations in dense layers and employs a MatMul-free token mixer based on a modified Gated Recurrent Unit (GRU). The model is trained using a surrogate gradient method and a large learning rate. |
The MatMul-free LM achieves performance on par with state-of-the-art Transformers while using significantly less memory during inference.
The scaling law analysis reveals that the performance gap between the MatMul-free LM and full-precision Transformers narrows as model size increases.
A custom FPGA implementation demonstrates the hardware efficiency of the MatMul-free LM, achieving brain-like efficiency at billion-parameter scales. |
The MatMul-free LM has not been tested on extremely large-scale models (e.g., 100B+ parameters) due to computational constraints.
Further research is needed to explore the potential of MatMul-free architectures for other natural language processing tasks beyond language modeling. |
language modeling, matrix multiplication, ternary networks, fpga acceleration, efficient deep learning |
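The ternary accumulation mentioned in the MatMul-free LM method description above can be illustrated with a small sketch. With weights restricted to {-1, 0, +1}, each output of a dense layer is a sum of some inputs minus a sum of others, so no multiplications are needed. The absmean quantizer shown here is an assumption for illustration, and the function names are hypothetical. An efficient implementation would use fused kernels rather than a Python loop.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Map real weights to {-1, 0, +1} using an absmean scale (assumed scheme)."""
    scale = np.abs(w).mean() + eps
    w_ternary = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_ternary, scale

def ternary_dense(x, w_ternary):
    """Dense layer with ternary weights, computed with additions only.

    x: (d_in,) activations, w_ternary: (d_in, d_out) with entries in {-1, 0, 1}.
    """
    out = np.zeros(w_ternary.shape[1], dtype=x.dtype)
    for j in range(w_ternary.shape[1]):
        col = w_ternary[:, j]
        # Add inputs where the weight is +1, subtract where it is -1.
        out[j] = x[col == 1].sum() - x[col == -1].sum()
    return out
```

The result matches `x @ w_ternary` exactly, which is a convenient check when experimenting with the idea.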
2406.02511
Report |
V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation |
Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, Wei Yang |
In the field of portrait video generation, the use of single images to
generate portrait videos has become increasingly prevalent. A common approach
involves leveraging generative models to enhance adapters for controlled
generation. However, control signals (e.g., text, audio, reference image, pose,
depth map, etc.) can vary in strength. Among these, weaker conditions often
struggle to be effective due to interference from stronger conditions, posing a
challenge in balancing these conditions. In our work on portrait video
generation, we identified audio signals as particularly weak, often
overshadowed by stronger signals such as facial pose and reference image.
However, direct training with weak signals often leads to difficulties in
convergence. To address this, we propose V-Express, a simple method that
balances different control signals through the progressive training and the
conditional dropout operation. Our method gradually enables effective control
by weak conditions, thereby achieving generation capabilities that
simultaneously take into account the facial pose, reference image, and audio.
The experimental results demonstrate that our method can effectively generate
portrait videos controlled by audio. Furthermore, a potential solution is
provided for the simultaneous and effective use of conditions of varying
strengths. |
Presents V-Express, a novel method for generating high-quality portrait videos synchronized with input audio, balancing control signals of varying strengths through progressive training and conditional dropout operations. |
Addresses the challenge in portrait video generation where weaker control signals, like audio, are often overshadowed by stronger ones (e.g., pose, reference image), limiting control and synchronization. |
Utilizes a Latent Diffusion Model (LDM) with ReferenceNet, V-Kps Guider, and Audio Projection to handle control inputs. Progressive training gradually incorporates control, while conditional dropout prevents shortcut learning from dominant signals. |
Effectively generates high-quality portrait videos synchronized with audio input.
Maintains consistency in facial identity and pose guided by reference images and V-Kps.
Demonstrates a balanced approach to integrating multiple control signals with varying strengths. |
Limited multilingual support due to the English-centric Wav2Vec2 audio encoder.
Slow generation speed due to the autoregressive diffusion process for multi-frame generation. |
portrait video generation, audio-driven video generation, latent diffusion model, control signal balancing, conditional dropout |
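The conditional dropout operation in the V-Express description above can be sketched as randomly blanking stronger conditions during training so that the model cannot rely on them alone. This is a minimal illustration, assuming each condition is an array and that a dropped condition is replaced by zeros. The function name, the dictionary layout, and the probabilities are hypothetical and not taken from the paper.

```python
import numpy as np

def apply_conditional_dropout(conditions, drop_probs, rng):
    """Randomly zero out conditions for one training step.

    conditions: dict mapping a condition name to its array.
    drop_probs: dict mapping a condition name to its drop probability
                (conditions not listed are never dropped).
    rng: a numpy.random.Generator.
    """
    kept = {}
    for name, value in conditions.items():
        p = drop_probs.get(name, 0.0)
        # Dropping a strong condition forces the model to use the others.
        kept[name] = np.zeros_like(value) if rng.random() < p else value
    return kept
```

In a setup like this, one would typically give the stronger signals (reference image, pose) a nonzero drop probability and leave the weak audio signal untouched.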
2406.02509
Report |
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation |
Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat |
Recently, video diffusion models have emerged as expressive generative tools
for high-quality video content creation readily available to general users.
However, these models often do not offer precise control over camera poses for
video generation, limiting the expression of cinematic language and user
control. To address this issue, we introduce CamCo, which allows fine-grained
Camera pose Control for image-to-video generation. We equip a pre-trained
image-to-video generator with accurately parameterized camera pose input using
Plücker coordinates. To enhance 3D consistency in the videos produced, we
integrate an epipolar attention module in each attention block that enforces
epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on
real-world videos with camera poses estimated through structure-from-motion
algorithms to better synthesize object motion. Our experiments show that CamCo
significantly improves 3D consistency and camera control capabilities compared
to previous models while effectively generating plausible object motion.
Project page: https://ir1d.github.io/CamCo/ |
This paper presents CamCo, a novel image-to-video generation framework that enables fine-grained camera control and ensures 3D consistency in generated videos. |
Controlling camera motion is crucial for cinematic expression and practical applications of generated videos, but existing video generation models often lack this capability. |
CamCo leverages Plücker coordinates for accurate camera pose representation and integrates an epipolar constraint attention module to enforce geometric consistency across frames. The model is trained on a dataset augmented with dynamic videos and their camera pose annotations. |
CamCo significantly improves 3D consistency and camera control accuracy compared to previous state-of-the-art methods.
The model demonstrates superior visual quality, as evidenced by FID and FVD metrics.
CamCo effectively generates plausible object motion in addition to camera ego-motion. |
The model currently cannot generate complex camera intrinsic changes (e.g., dolly zoom).
The output video length and resolution are limited, potentially restricting its application in large-scale scenes. |
video generation, camera control, 3d consistency, diffusion models, epipolar constraint |
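For readers unfamiliar with the Plücker coordinates mentioned in the CamCo entry above, here is a minimal sketch of computing them for a single camera ray. It uses the common convention of concatenating the moment `o x d` with the unit direction `d`. The exact ordering and normalization used by CamCo are not stated in the text above, so treat those details as assumptions. The function name is hypothetical.

```python
import numpy as np

def plucker_ray(origin, direction):
    """Return the 6D Plücker embedding (o x d, d) of a ray.

    origin: camera center in world coordinates, shape (3,).
    direction: ray direction in world coordinates, shape (3,), any length.
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)                 # unit direction
    moment = np.cross(np.asarray(origin, dtype=float), d)
    return np.concatenate([moment, d])
```

A useful property is that the embedding does not change when the origin slides along the ray, so it describes the line itself rather than a particular point on it.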
2406.02507
Report |
Guiding a Diffusion Model with a Bad Version of Itself |
Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine |
The primary axes of interest in image-generating diffusion models are image
quality, the amount of variation in the results, and how well the results align
with a given condition, e.g., a class label or a text prompt. The popular
classifier-free guidance approach uses an unconditional model to guide a
conditional model, leading to simultaneously better prompt alignment and
higher-quality images at the cost of reduced variation. These effects seem
inherently entangled, and thus hard to control. We make the surprising
observation that it is possible to obtain disentangled control over image
quality without compromising the amount of variation by guiding generation
using a smaller, less-trained version of the model itself rather than an
unconditional model. This leads to significant improvements in ImageNet
generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using
publicly available networks. Furthermore, the method is also applicable to
unconditional diffusion models, drastically improving their quality. |
This paper introduces "autoguidance," a novel method for enhancing image quality in diffusion models by guiding generation using a smaller, less-trained version of the model itself, rather than an unconditional model like in classifier-free guidance (CFG). |
This is important because it provides disentangled control over image quality and variation, addressing limitations of CFG, which entangles these aspects and can lead to over-simplified image compositions. |
The method leverages the observation that score matching in diffusion models leads to over-emphasis of low-probability regions. By using a weaker model trained on the same task and data distribution, autoguidance identifies and reduces errors in the stronger model's predictions, leading to improved sample quality without sacrificing variation. |
Autoguidance achieves significant FID and DINO improvements on ImageNet-512 and ImageNet-64, setting new records for these datasets.
It allows for independent control of image quality and variation, enabling the generation of diverse and high-fidelity images.
The method can be applied to both conditional and unconditional diffusion models, substantially improving the quality of unconditional generation, which is typically poor. |
One limitation is the need for early snapshots of smaller models for optimal guidance, which might not be readily available for all large-scale generators.
Future work could explore formalizing the conditions under which autoguidance is beneficial and developing better guidelines for selecting the best guiding model. |
diffusion models, image generation, classifier-free guidance, autoguidance, image quality |
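As a rough illustration of the guidance step summarized in the autoguidance entry above, the sketch below extrapolates from a weak model's denoised prediction toward the strong model's prediction. It assumes the same extrapolation form commonly used for classifier-free guidance, with the weak model taking the place of the unconditional one. The function name `autoguide`, the flat list-of-floats representation, and the weight `w` are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence


def autoguide(strong: Sequence[float], weak: Sequence[float], w: float) -> List[float]:
    """Extrapolate from the weak prediction toward (and past) the strong one.

    strong: denoised prediction of the main model (hypothetical flat vector)
    weak:   denoised prediction of the smaller, less-trained guiding model
    w:      guidance weight; w == 1 returns the strong prediction unchanged,
            w > 1 pushes the result further away from the weak prediction
    """
    if len(strong) != len(weak):
        raise ValueError("predictions must have the same length")
    return [wk + w * (s - wk) for s, wk in zip(strong, weak)]


if __name__ == "__main__":
    # Toy numbers only; real predictions would be image-sized tensors.
    print(autoguide([1.0, 2.0], [0.5, 1.0], w=2.0))
```

With `w = 1` the guiding model has no effect, which is a convenient sanity check when wiring this into a sampler.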
2406.02495
Report |
GenS: Generalizable Neural Surface Reconstruction from Multi-View Images |
Rui Peng, Xiaodong Gu, Luyang Tang, Shihe Shen, Fanqi Yu, Ronggang Wang |
Combining the signed distance function (SDF) and differentiable volume
rendering has emerged as a powerful paradigm for surface reconstruction from
multi-view images without 3D supervision. However, current methods are impeded
by requiring long-time per-scene optimizations and cannot generalize to new
scenes. In this paper, we present GenS, an end-to-end generalizable neural
surface reconstruction model. Unlike coordinate-based methods that train a
separate network for each scene, we construct a generalized multi-scale volume
to directly encode all scenes. Compared with existing solutions, our
representation is more powerful, which can recover high-frequency details while
maintaining global smoothness. Meanwhile, we introduce a multi-scale
feature-metric consistency to impose the multi-view consistency in a more
discriminative multi-scale feature space, which is robust to the failures of
the photometric consistency. And the learnable feature can be self-enhanced to
continuously improve the matching accuracy and mitigate aggregation ambiguity.
Furthermore, we design a view contrast loss to force the model to be robust to
those regions covered by few viewpoints through distilling the geometric prior
from dense input to sparse input. Extensive experiments on popular benchmarks
show that our model can generalize well to new scenes and outperform existing
state-of-the-art methods, even those employing ground-truth depth supervision.
Code is available at https://github.com/prstrive/GenS. |
This paper presents GenS, an end-to-end generalizable neural surface reconstruction model that efficiently reconstructs detailed 3D structures from multi-view images without requiring expensive per-scene optimization. |
Current neural surface reconstruction methods suffer from lengthy per-scene optimization and lack of generalization to new scenes, limiting their applicability. |
GenS leverages a generalized multi-scale volume to represent scenes efficiently. It introduces multi-scale feature-metric consistency for robust multi-view matching and a view contrast loss to improve reconstruction accuracy for sparsely viewed regions. |
GenS outperforms state-of-the-art generalizable methods and even some per-scene optimization methods on the DTU dataset.
The model demonstrates strong generalization ability on the BlendedMVS dataset.
Ablation studies confirm the effectiveness of each proposed component. |
The model struggles with scenes containing large camera motion.
Future work will focus on handling challenging scenarios with improved aggregation features. |
neural surface reconstruction, generalizable model, multi-view consistency, multi-scale volume, view contrast loss |
2406.02485
Report |
Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation |
Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger |
Controllable text-to-image (T2I) diffusion models have shown impressive
performance in generating high-quality visual content through the incorporation
of various conditions. Current methods, however, exhibit limited performance
when guided by skeleton human poses, especially in complex pose conditions such
as side or rear perspectives of human figures. To address this issue, we
present Stable-Pose, a novel adapter model that introduces a coarse-to-fine
attention masking strategy into a vision Transformer (ViT) to gain accurate
pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose
conditions within pre-trained Stable Diffusion, providing a refined and
efficient way of aligning pose representation during image synthesis. We
leverage the query-key self-attention mechanism of ViTs to explore the
interconnections among different anatomical parts in human pose skeletons.
Masked pose images are used to smoothly refine the attention maps based on
target pose-related features in a hierarchical manner, transitioning from
coarse to fine levels. Additionally, our loss function is formulated to
allocate increased emphasis to the pose region, thereby augmenting the model's
precision in capturing intricate pose details. We assessed the performance of
Stable-Pose across five public datasets under a wide range of indoor and
outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the
LAION-Human dataset, marking around 13% improvement over the established
technique ControlNet. The project link and code are available at
https://github.com/ai-med/StablePose. |
Stable-Pose, a novel adapter model for controllable text-to-image (T2I) diffusion models, improves pose control in human image synthesis by employing a coarse-to-fine attention masking strategy within a vision transformer (ViT). |
Current T2I models struggle with accurate pose guidance, particularly in complex poses (side or rear views). Stable-Pose addresses this by effectively aligning pose representation during image synthesis. |
Stable-Pose integrates a trainable ViT unit into pre-trained T2I models like Stable Diffusion. It utilizes a coarse-to-fine masking approach in the self-attention mechanism to focus on pose-related regions and a pose-mask guided loss for enhanced pose fidelity. |
Stable-Pose achieves superior pose accuracy (AP and CAP) compared to state-of-the-art methods on five datasets.
The model exhibits robust performance in challenging scenarios like side/back poses and multiple figures.
Stable-Pose maintains comparable image quality (FID and KID) and text-image alignment (CLIP score) to other methods. |
Stable-Pose's inference time is slightly longer due to the ViT's self-attention mechanism.
The model's performance with conditions other than pose (e.g., edge maps) has yet to be evaluated. |
text-to-image generation, diffusion models, pose control, vision transformer, attention mechanism |
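The Stable-Pose entry above mentions a loss that places increased emphasis on the pose region. One simple way such a loss could look is a mean squared error in which elements inside a pose mask receive extra weight. The sketch below is only an assumption about the general shape of such a loss: the name `pose_weighted_mse`, the parameter `extra_weight`, and the normalization by the sum of weights are all hypothetical choices, not the paper's formulation.

```python
from typing import Sequence


def pose_weighted_mse(
    pred: Sequence[float],
    target: Sequence[float],
    pose_mask: Sequence[float],
    extra_weight: float = 1.0,
) -> float:
    """Weighted MSE: elements with pose_mask == 1 count (1 + extra_weight) times.

    pose_mask holds values in [0, 1]; 1 marks a pose-related element.
    extra_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    if not (len(pred) == len(target) == len(pose_mask)):
        raise ValueError("inputs must have the same length")
    total = 0.0
    weight_sum = 0.0
    for p, t, m in zip(pred, target, pose_mask):
        weight = 1.0 + extra_weight * m
        total += weight * (p - t) ** 2
        weight_sum += weight
    return total / weight_sum


if __name__ == "__main__":
    mask = [1.0, 0.0]
    # The same error costs more inside the pose region than outside it.
    print(pose_weighted_mse([1.0, 0.0], [0.0, 0.0], mask))
    print(pose_weighted_mse([0.0, 1.0], [0.0, 0.0], mask))
```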
2406.02461
Report |
RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting |
Qi Wang, Ruijie Lu, Xudong Xu, Jingbo Wang, Michael Yu Wang, Bo Dai, Gang Zeng, Dan Xu |
The advancement of diffusion models has pushed the boundary of text-to-3D
object generation. While it is straightforward to composite objects into a
scene with reasonable geometry, it is nontrivial to texture such a scene
perfectly due to style inconsistency and occlusions between objects. To tackle
these problems, we propose a coarse-to-fine 3D scene texturing framework,
referred to as RoomTex, to generate high-fidelity and style-consistent textures
for untextured compositional scene meshes. In the coarse stage, RoomTex first
unwraps the scene mesh to a panoramic depth map and leverages ControlNet to
generate a room panorama, which is regarded as the coarse reference to ensure
the global texture consistency. In the fine stage, based on the panoramic image
and perspective depth maps, RoomTex will refine and texture every single object
in the room iteratively along a series of selected camera views, until this
object is completely painted. Moreover, we propose to maintain superior
alignment between RGB and depth spaces via subtle edge detection methods.
Extensive experiments show our method is capable of generating high-quality and
diverse room textures, and more importantly, supporting interactive
fine-grained texture control and flexible scene editing thanks to our
inpainting-based framework and compositional mesh input. Our project page is
available at https://qwang666.github.io/RoomTex/. |
Proposes RoomTex, a coarse-to-fine 3D scene texturing framework, for generating high-fidelity and style-consistent textures for untextured compositional scene meshes. |
Automating scene texturing is important for various industries (gaming, filming, AR/VR) but challenging due to style inconsistency and occlusions between objects in a scene. |
Uses a coarse stage to generate a style-consistent room panorama from a panoramic depth map and text prompt. Then, a fine stage refines the panorama and iteratively textures each object from different viewpoints using depth-guided inpainting and an edge detection module for RGB-depth alignment. |
Generates high-quality, diverse, and style-consistent room textures on par with those in professional datasets.
Supports interactive fine-grained texture control, enabling users to edit specific areas using sketches or text descriptions.
Enables flexible scene editing by leveraging the compositional nature of the input mesh, allowing for adding, removing, or modifying individual objects. |
Iterative inpainting may not capture all object views in one run, leading to potential texture inconsistencies.
Fine-grained details on generated objects, especially those with complex topology, can be challenging to texture. |
scene texturing, scene generation, texture synthesis, diffusion models, 3d scene understanding |
2406.02407
Report |
WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections |
Yuze Wang, Junyi Wang, Yue Qi |
Novel View Synthesis (NVS) from unconstrained photo collections is
challenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has
shown promise for photorealistic and real-time NVS of static scenes. Building
on 3DGS, we propose an efficient point-based differentiable rendering framework
for scene reconstruction from photo collections. Our key innovation is a
residual-based spherical harmonic coefficients transfer module that adapts 3DGS
to varying lighting conditions and photometric post-processing. This
lightweight module can be pre-computed and ensures efficient gradient
propagation from rendered images to 3D Gaussian attributes. Additionally, we
observe that the appearance encoder and the transient mask predictor, the two
most critical parts of NVS from unconstrained photo collections, can be
mutually beneficial. We introduce a plug-and-play lightweight spatial attention
module to simultaneously predict transient occluders and latent appearance
representation for each image. After training and preprocessing, our method
aligns with the standard 3DGS format and rendering pipeline, facilitating
seamless integration into various 3DGS applications. Extensive experiments on
diverse datasets show our approach outperforms existing approaches on the
rendering quality of novel view and appearance synthesis with fast convergence and
rendering speed. |
WE-GS, an efficient point-based differentiable rendering framework, reconstructs scenes from unconstrained photo collections, effectively handling appearance variations and transient occluders. |
Existing methods struggle to balance rendering quality, speed, and storage efficiency when dealing with real-world photo collections containing varying lighting and moving objects. |
The framework introduces: (1) a residual-based Spherical Harmonic coefficient transfer module for efficient appearance modeling under varying lighting, and (2) a lightweight spatial attention module to simultaneously predict transient masks and latent appearance representations for each image. |
Achieves state-of-the-art novel view and appearance synthesis quality on PhotoTourism and NeRF-OSR datasets.
Significantly reduces storage requirements (over 2x compared to 3DGS) while maintaining real-time rendering speed.
Demonstrates superior efficiency with fast training times (over 17x faster than NeRF-based methods). |
The performance of WE-GS can be affected by the quality of the initial 3D Gaussian estimates from SfM.
Further exploration of the trade-off between rendering quality and efficiency is possible. |
novel view synthesis, unconstrained photo collection, appearance modeling, real-time rendering, 3d gaussian splatting |
2406.02395
Report |
GrootVL: Tree Topology is All You Need in State Space Model |
Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan |
State space models, employing recursively propagated features,
demonstrate strong representation capabilities comparable to Transformer models
and superior efficiency. However, constrained by the inherent geometric
constraints of sequences, they still fall short in modeling long-range
dependencies. To address this issue, we propose the GrootVL network, which
first dynamically generates a tree topology based on spatial relationships and
input features. Then, feature propagation is performed based on this graph,
thereby breaking the original sequence constraints to achieve stronger
representation capabilities. Additionally, we introduce a linear complexity
dynamic programming algorithm to enhance long-range interactions without
increasing computational cost. GrootVL is a versatile multimodal framework that
can be applied to both visual and textual tasks. Extensive experiments
demonstrate that our method significantly outperforms existing structured state
space models on image classification, object detection and segmentation.
Besides, by fine-tuning large language models, our approach achieves consistent
improvements in multiple textual tasks at minor training cost. |
This paper proposes GrootVL, a novel framework employing an input-aware tree topology for feature propagation in state-space models to enhance long-range dependency modeling for both visual and language tasks. |
Existing state-space models, while efficient, struggle to capture long-range dependencies. Fixed scanning strategies used for adapting to vision tasks fail to preserve 2D structural information, limiting their effectiveness. |
GrootVL utilizes a tree-scanning algorithm to dynamically generate a tree topology based on input features, enabling more effective long-range interactions. It employs a linear complexity dynamic programming algorithm for efficient propagation. |
GrootVL significantly outperforms existing structured state-space models on image classification, object detection, and segmentation tasks.
GrootV, the visual sub-network, achieves competitive performance with CNN and Transformer-based approaches on ImageNet, MSCOCO, and ADE20K benchmarks.
GrootL, the language sub-network, consistently improves language representation for pre-trained large language models with minor training cost, as demonstrated on various language understanding benchmarks. |
The tree structure in GrootVL requires specific hardware optimization.
Future work could explore the generalization of the tree topology to other applications beyond vision and language tasks. |
state-space models, long-range dependencies, tree topology, dynamic programming, multi-modal learning |
2406.02347
Report |
Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation |
Clement Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin |
In this paper, we propose an efficient, fast, and versatile distillation
method to accelerate the generation of pre-trained diffusion models: Flash
Diffusion. The method reaches state-of-the-art performances in terms of FID and
CLIP-Score for few steps image generation on the COCO2014 and COCO2017
datasets, while requiring only several GPU hours of training and fewer
trainable parameters than existing methods. In addition to its efficiency, the
versatility of the method is also exposed across several tasks such as
text-to-image, inpainting, face-swapping, super-resolution and using different
backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-α),
as well as adapters. In all cases, the method drastically reduced the
number of sampling steps while maintaining very high-quality image generation.
The official implementation is available at
https://github.com/gojasper/flash-diffusion. |
This paper introduces Flash Diffusion, a novel distillation method designed to accelerate the image generation process of pre-trained diffusion models. |
Diffusion models, while powerful, suffer from slow generation speeds due to the iterative nature of their sampling process. Flash Diffusion addresses this by significantly reducing the number of sampling steps required, making them more practical for real-time applications. |
The method trains a student model to predict the output of a multi-step teacher model in a single step. It uses a combination of a distillation loss, an adversarial loss to enhance sample quality, and a distribution matching loss to ensure the student model's output closely resembles the teacher's learned data distribution. |
Flash Diffusion achieves state-of-the-art FID and CLIP scores for few-step image generation on COCO2014 and COCO2017 datasets.
The method demonstrates versatility by effectively performing across various tasks such as text-to-image, inpainting, super-resolution, and face-swapping.
It exhibits strong compatibility with different diffusion model architectures like UNet and DiT, as well as with adapters. |
Further reduction in the number of function evaluations (NFEs) is desirable to push the boundaries of real-time generation.
Exploring the application of direct preference optimization techniques on the student model could potentially lead to further enhancements in sample quality. |
diffusion models, distillation, image generation, fast sampling, generative models |
2406.02230
Report |
I4VGen: Image as Stepping Stone for Text-to-Video Generation |
Xiefan Guo, Jinlin Liu, Miaomiao Cui, Di Huang |
Text-to-video generation has lagged behind text-to-image synthesis in quality
and diversity due to the complexity of spatio-temporal modeling and limited
video-text datasets. This paper presents I4VGen, a training-free and
plug-and-play video diffusion inference framework, which enhances text-to-video
generation by leveraging robust image techniques. Specifically, following
text-to-image-to-video, I4VGen decomposes the text-to-video generation into two
stages: anchor image synthesis and anchor image-guided video synthesis.
Correspondingly, a well-designed generation-selection pipeline is employed to
achieve a visually realistic and semantically faithful anchor image, and an
innovative Noise-Invariant Video Score Distillation Sampling is incorporated to
animate the image to a dynamic video, followed by a video regeneration process
to refine the video. This inference strategy effectively mitigates the
prevalent issue of non-zero terminal signal-to-noise ratio. Extensive
evaluations show that I4VGen not only produces videos with higher visual
realism and textual fidelity but also integrates seamlessly into existing
image-to-video diffusion models, thereby improving overall video quality. |
Introduces I4VGen, a training-free and plug-and-play inference framework for text-to-video generation that leverages image synthesis techniques to improve video quality and text-alignment. |
Text-to-video generation lags behind text-to-image generation due to complex spatio-temporal modeling and limited video-text datasets. I4VGen aims to bridge this gap by leveraging robust image generation techniques without additional training. |
I4VGen decomposes the process into two stages: 1) Anchor image synthesis: Generates multiple candidate images from the text prompt and selects the best one using a reward-based mechanism. 2) Anchor image-guided video synthesis: Animates the static anchor image using Noise-Invariant Video Score Distillation Sampling (NI-VSDS) and refines it through a video regeneration process. |
Significantly improves the visual realism and textual fidelity of generated videos.
Outperforms existing text-to-video generation methods in benchmark evaluations (VBench) across various aspects like temporal consistency, frame quality, and text alignment.
Demonstrates versatility by seamlessly integrating with existing image-to-video diffusion models and enabling user-provided image animation. |
Inference time is longer than baseline models, though shorter than some methods like FreeInit.
Direct integration with FreeInit doesn't yield significant improvements. |
text-to-video generation, diffusion models, image-guided synthesis, score distillation sampling, video quality enhancement |
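The anchor image synthesis stage in the I4VGen entry above (generate several candidates, then keep the one a reward model prefers) can be sketched as a plain best-of-N selection. Everything named here is a placeholder: `generate` and `reward` stand in for a text-to-image model and a reward model, and `num_candidates` is an arbitrary illustrative default, not a value from the paper.

```python
from typing import Callable, TypeVar

Image = TypeVar("Image")


def select_anchor(
    prompt: str,
    generate: Callable[[str, int], Image],
    reward: Callable[[str, Image], float],
    num_candidates: int = 4,
) -> Image:
    """Generate num_candidates images from different seeds and return the best.

    generate(prompt, seed) -> image   (hypothetical text-to-image call)
    reward(prompt, image)  -> float   (hypothetical reward model, higher is better)
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [generate(prompt, seed) for seed in range(num_candidates)]
    return max(candidates, key=lambda image: reward(prompt, image))


if __name__ == "__main__":
    # Toy stand-ins: the "image" is a string and the reward is its seed digit.
    toy_generate = lambda prompt, seed: f"{prompt}-{seed}"
    toy_reward = lambda prompt, image: float(image[-1])
    print(select_anchor("cat", toy_generate, toy_reward))
```

The selected anchor would then be handed to the video stage; that part is not sketched here because the source does not give enough detail about NI-VSDS to do so safely.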
2406.02058
Report |
OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding |
Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Jian Zhang |
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting
(3DGS) capable of 3D point-level open vocabulary understanding. Our primary
motivation stems from observing that existing 3DGS-based open vocabulary
methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D
point-level tasks due to weak feature expressiveness and inaccurate 2D-3D
feature associations. To ensure robust feature presentation and 3D point-level
understanding, we first employ SAM masks without cross-frame associations to
train instance features with 3D consistency. These features exhibit both
intra-object consistency and inter-object distinction. Then, we propose a
two-stage codebook to discretize these features from coarse to fine levels. At
the coarse level, we consider the positional information of 3D points to
achieve location-based clustering, which is then refined at the fine level.
Finally, we introduce an instance-level 3D-2D feature association method that
links 3D points to 2D masks, which are further associated with 2D CLIP
features. Extensive experiments, including open vocabulary-based 3D object
selection, 3D point cloud understanding, click-based 3D object selection, and
ablation studies, demonstrate the effectiveness of our proposed method. Project
page: https://3d-aigc.github.io/OpenGaussian |
This paper introduces OpenGaussian, a 3DGS-based method for 3D point-level open vocabulary understanding by associating high-dimensional CLIP features with 3D Gaussian points. |
Existing 3DGS-based open vocabulary methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations, hindering applications requiring 3D point-level understanding, like robotics. |
The method involves: 1) Training 3D point-level instance features with intra-mask smoothing and inter-mask contrastive loss using SAM masks, 2) Discretizing these features using a two-level coarse-to-fine codebook, and 3) Proposing a training-free instance-level 2D-3D association method based on IoU and feature distance to associate CLIP features with 3D instances. |
OpenGaussian outperforms existing methods in open-vocabulary 3D object selection and point cloud understanding tasks.
The method enables accurate click-based 3D object selection without requiring SAM feature supervision.
The two-level codebook and instance-level 2D-3D feature association are shown to be crucial for achieving high performance. |
The method relies on the accuracy of pre-trained SAM masks for instance feature learning.
The computational cost of rendering and processing numerous Gaussians remains a challenge for large-scale scenes. |
3d gaussian splatting, open vocabulary understanding, 3d point cloud segmentation, instance feature learning, 2d-3d feature association |
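The OpenGaussian entry above describes associating 3D instances with 2D masks using IoU and feature distance. The sketch below shows one plausible way to combine the two signals: compute mask IoU, compute a feature distance, and pick the candidate with the highest combined score. The scoring rule `iou * (1 - distance)`, the mean absolute distance, and all function names are assumptions made for illustration; the source does not specify the exact rule.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_abs_distance(f: Sequence[float], g: Sequence[float]) -> float:
    """Mean absolute difference between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(f, g)) / len(f)


def best_matching_mask(
    rendered_mask: Set[Pixel],
    rendered_feature: Sequence[float],
    candidates: List[Tuple[Set[Pixel], Sequence[float]]],
) -> int:
    """Return the index of the 2D candidate that best matches a rendered 3D instance.

    The combined score iou * (1 - distance) is an illustrative assumption.
    """
    best_index, best_score = -1, float("-inf")
    for index, (mask, feature) in enumerate(candidates):
        score = mask_iou(rendered_mask, mask) * (
            1.0 - mean_abs_distance(rendered_feature, feature)
        )
        if score > best_score:
            best_index, best_score = index, score
    return best_index


if __name__ == "__main__":
    rendered = {(0, 0), (0, 1)}
    cands = [({(5, 5)}, [1.0, 0.0]), ({(0, 0), (0, 1)}, [1.0, 0.0])]
    print(best_matching_mask(rendered, [1.0, 0.0], cands))
```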
2406.02021
Report |
MetaMixer Is All You Need |
Seokju Yun, Dongheon Lee, Youngmin Ro |
The Transformer, composed of self-attention and a Feed-Forward Network (FFN), has
revolutionized the landscape of network design across various vision tasks. FFN
is a versatile operator seamlessly integrated into nearly all AI models to
effectively harness rich representations. Recent works also show that FFN
functions like key-value memories. Thus, akin to the query-key-value mechanism
within self-attention, FFN can be viewed as a memory network, where the input
serves as query and the two projection weights operate as keys and values,
respectively. We hypothesize that the importance lies in the query-key-value
framework itself rather than in self-attention. To verify this, we propose
converting self-attention into a more FFN-like efficient token mixer with only
convolutions while retaining the query-key-value framework, namely FFNification.
Specifically, FFNification replaces query-key and attention coefficient-value
interactions with large kernel convolutions and adopts GELU activation function
instead of softmax. The derived token mixer, FFNified attention, serves as
key-value memories for detecting locally distributed spatial patterns, and
operates in the opposite dimension to the ConvNeXt block within each
corresponding sub-operation of the query-key-value framework. Building upon the
above two modules, we present a family of Fast-Forward Networks. Our FFNet
achieves remarkable performance improvements over previous state-of-the-art
methods across a wide range of tasks. The strong and general performance of our
proposed method validates our hypothesis and leads us to introduce MetaMixer, a
general mixer architecture that does not specify sub-operations within the
query-key-value framework. We show that using only simple operations like
convolution and GELU in the MetaMixer can achieve superior performance. |
The paper introduces MetaMixer, a general mixer architecture based on the query-key-value framework, and proposes FFNification, a process that adapts self-attention to be more efficient by incorporating design elements from Feed-Forward Networks (FFNs). |
This work aims to shift the focus from specific modules like self-attention to a more general understanding of mixer design, emphasizing the importance of the query-key-value framework in achieving high performance across various tasks. |
The authors analyze FFNs in vision models, demonstrating their function as key-value memories. They then propose FFNification, which replaces expensive operations in self-attention with more efficient alternatives like depthwise convolution and GELU activation. They further validate the efficacy of the MetaMixer framework by introducing Fast-Forward Network (FFNet), a family of models built using FFNified attention and ConvNeXt blocks, and evaluate its performance on diverse tasks including image classification, object detection, semantic segmentation, super-resolution, 3D semantic segmentation, and time series forecasting. |
FFNet models achieve state-of-the-art performance across a wide range of tasks, demonstrating a superior performance-speed trade-off compared to both transformer-based and convolution-based methods.
The use of large-kernel depthwise convolution in FFNified attention enables efficient context aggregation and broader Effective Receptive Fields (ERFs), leading to improved performance.
FFNet exhibits strong robustness, outperforming existing models on benchmark datasets designed to test generalization capabilities. |
The effectiveness of MetaMixer-based convolutional mixers in large-scale datasets and generative modeling remains unproven.
While the proposed method offers a novel perspective on mixer design, it leverages recent advancements, and future work should explore its full potential in a broader range of scenarios. |
metamixer, ffnification, query-key-value framework, convolutional mixer, deep learning |
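To make the "FFN as key-value memory" view in the MetaMixer entry above concrete, the sketch below treats the input as a query, the first projection as a set of keys, and the second projection as the matching values, with GELU applied to the query-key scores in place of softmax. It is a toy, pure-Python illustration of that reading only; it does not implement FFNified attention or the large-kernel convolutions, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn_as_memory(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> List[float]:
    """Read from a key-value memory: sum_i gelu(query . key_i) * value_i.

    keys[i] plays the role of one column of the first FFN projection and
    values[i] one row of the second projection (biases omitted for brevity).
    """
    if len(keys) != len(values):
        raise ValueError("need exactly one value per key")
    coefficients = [gelu(sum(q * k for q, k in zip(query, key))) for key in keys]
    out_dim = len(values[0])
    return [
        sum(c * value[d] for c, value in zip(coefficients, values))
        for d in range(out_dim)
    ]


if __name__ == "__main__":
    # A query aligned with the first key mostly retrieves the first value.
    print(ffn_as_memory([3.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```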
2406.01970
Report |
The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise |
Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, Minhao Cheng |
Diffusion models have achieved remarkable success in text-to-image generation
tasks; however, the role of initial noise has been rarely explored. In this
study, we identify specific regions within the initial noise image, termed
trigger patches, that play a key role for object generation in the resulting
images. Notably, these patches are "universal" and can be generalized across
various positions, seeds, and prompts. To be specific, extracting these patches
from one noise and injecting them into another noise leads to object generation
in targeted areas. We identify these patches by analyzing the dispersion of
object bounding boxes across generated images, leading to the development of a
posterior analysis technique. Furthermore, we create a dataset consisting of
Gaussian noises labeled with bounding boxes corresponding to the objects
appearing in the generated images and train a detector that identifies these
patches from the initial noise. To explain the formation of these patches, we
reveal that they are outliers in Gaussian noise, and follow distinct
distributions through two-sample tests. Finally, we find the misalignment
between prompts and the trigger patch patterns can result in unsuccessful image
generations. The study proposes a reject-sampling strategy to obtain optimal
noise, aiming to improve prompt adherence and positional diversity in image
generation. |
This paper discovers and leverages "trigger patches" – specific regions in the initial noise of diffusion models that strongly influence object location in generated images. |
This work provides a new understanding of how diffusion models work and offers a way to improve control over image generation, addressing limitations in adhering to prompt instructions. |
The authors first use a posterior analysis method, calculating "trigger entropy" to quantify object position consistency across images generated from the same noise. Then, they train a detector directly on noise to identify trigger patches, achieving promising results. They further investigate the nature of trigger patches, hypothesizing and verifying that they are outliers in the Gaussian noise distribution. |
Trigger patches exist: Specific patches in the initial noise consistently lead to object generation at their corresponding locations across different prompts.
Trigger patches can be detected directly from noise: A trained detector shows promising performance in identifying these patches without running the diffusion process.
Trigger patches are outliers: They deviate significantly from the standard Gaussian distribution of initial noise. |
The paper has not fully explored the case of multiple trigger patches within a single noise.
The dataset used for analysis is limited to five object classes and 25 prompts. |
diffusion models, image generation, object detection, trigger patches, positional bias |
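The rejection-sampling idea mentioned in the trigger-patch entry above can be illustrated generically: draw Gaussian noise repeatedly and keep the first draw that a scoring function accepts. The source does not state the acceptance criterion, so `score_fn` and `threshold` below are placeholders (in practice the score might come from a trained trigger-patch detector); the flat list representation of the noise is also a simplification.

```python
import random
from typing import Callable, List, Optional


def rejection_sample_noise(
    score_fn: Callable[[List[float]], float],
    threshold: float,
    size: int,
    max_tries: int = 1000,
    seed: int = 0,
) -> Optional[List[float]]:
    """Return the first standard-normal noise vector whose score passes the threshold.

    score_fn is a hypothetical stand-in for whatever judges a noise sample.
    Returns None if no draw is accepted within max_tries attempts.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        noise = [rng.gauss(0.0, 1.0) for _ in range(size)]
        if score_fn(noise) >= threshold:
            return noise
    return None


if __name__ == "__main__":
    # Toy criterion: accept noise that contains at least one value above 2.0.
    accepted = rejection_sample_noise(max, threshold=2.0, size=16)
    print(accepted is not None)
```

Capping the number of attempts keeps the loop from running forever when the criterion is too strict for the noise distribution.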
2406.01956
Report |
Enhance Image-to-Image Generation with LLaVA Prompt and Negative Prompt |
Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li |
This paper presents a novel approach to enhance image-to-image generation by
leveraging the multimodal capabilities of the Large Language and Vision
Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and
generates textual descriptions, hereinafter LLaVA-generated prompts. These
prompts, along with the original image, are fed into the image-to-image
generation pipeline. This enriched representation guides the generation process
towards outputs that exhibit a stronger resemblance to the input image.
Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts
in promoting image similarity. We observe a significant improvement in the
visual coherence between the generated and input images compared to traditional
methods. Future work will explore fine-tuning LLaVA prompts for increased
control over the creative process. By providing more specific details within
the prompts, we aim to achieve a delicate balance between faithfulness to the
original image and artistic expression in the generated outputs. |
This paper proposes a novel framework that enhances image-to-image generation by incorporating LLaVA-generated prompts into Stable Diffusion, resulting in outputs that exhibit a stronger resemblance to the input image. |
Relying solely on input images for generation can lead to deviations from user intent. This framework addresses these limitations by leveraging LLaVA's image understanding to create more accurate and detailed prompts, enhancing control and fidelity in image generation. |
The input image is analyzed by LLaVA to generate textual descriptions (prompts). These prompts, along with the original image, are fed into Stable Diffusion to guide the generation process toward outputs that closely resemble the input. |
LLaVA-generated prompts significantly improve visual coherence between generated and input images compared to traditional methods.
Quantitative image similarity metrics (RMSE, PSNR, FSIM, SSIM, UIQ, SRE) confirm that LLaVA-generated prompts lead to the generation of more similar images.
Extensive experiments across various scenarios consistently demonstrate the effectiveness of the proposed approach in enhancing image similarity. |
Limitations in LLaVA's negative prompt generation accuracy require further investigation.
Future work will explore fine-tuning LLaVA prompts to achieve a balance between faithfulness to the original image and artistic expression. |
image-to-image generation, llava, stable diffusion, multimodal prompt generation, image similarity |
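Two of the similarity metrics named in the report above (RMSE and PSNR) can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code: images are assumed to be flat lists of pixel intensities, and the helper names `rmse` and `psnr` and the default `max_value` of 255 are assumptions made for the example.

```python
import math

def rmse(a, b):
    """Root-mean-square error between two equally sized flat pixel lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = rmse(a, b)
    if err == 0:
        return math.inf
    return 20.0 * math.log10(max_value / err)
```

Lower RMSE and higher PSNR both indicate that a generated image is closer to the input image, which is the direction of improvement the report attributes to LLaVA-generated prompts.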
2406.01954
Report |
Plug-and-Play Diffusion Distillation |
Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte, Wei-An Lin, Hui Qu, Mingi Kwon, Ratheesh Kalarot |
Diffusion models have shown tremendous results in image generation. However,
due to the iterative nature of the diffusion process and its reliance on
classifier-free guidance, inference times are slow. In this paper, we propose a
new distillation approach for guided diffusion models in which an external
lightweight guide model is trained while the original text-to-image model
remains frozen. We show that our method reduces the inference computation of
classifier-free guided latent-space diffusion models by almost half, and only
requires 1% trainable parameters of the base model. Furthermore, once trained,
our guide model can be applied to various fine-tuned, domain-specific versions
of the base diffusion model without the need for additional training: this
"plug-and-play" functionality drastically improves inference computation while
maintaining the visual fidelity of generated images. Empirically, we show that
our approach is able to produce visually appealing results and achieve a
comparable FID score to the teacher with as few as 8 to 16 steps. |
This paper introduces a novel distillation approach for guided diffusion models, using an external lightweight guide model trained alongside a frozen text-to-image model, effectively reducing inference computation without compromising image quality. |
Diffusion models, while powerful in image generation, suffer from slow inference times due to their iterative process and reliance on classifier-free guidance. This work addresses this limitation by significantly reducing computational cost and preserving the advantages of the base model. |
The method involves training a lightweight guide model that takes guidance values, time and text embeddings, and latent image representations as input. This model injects feature maps into the decoder of the original diffusion model to guide image generation. Two guide model architectures are explored: one based on ControlNet and a simplified 'tiny' version. The method is further enhanced by incorporating sampling steps distillation, progressively reducing the steps required for high-quality image generation. |
The proposed approach reduces inference computation for classifier-free guided latent-space diffusion models by almost half, and the trainable guide model amounts to only 1% of the base model's parameters.
The guide model, once trained, can be applied to various fine-tuned, domain-specific versions of the base diffusion model without requiring additional training, enabling a 'plug-and-play' functionality.
Visualizations of the guide model's feature map injections provide insights into how classifier-free guidance influences image generation at different timesteps. |
Unlike classifier-free guidance, the proposed approach may be less efficient when running in batches due to the parallel execution of the U-Net and the guide module.
Future work could explore the application of this distillation method to pixel-based diffusion models. |
diffusion models, distillation, image generation, classifier-free guidance, inference time reduction |
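The "almost half" saving described in the report above follows from how classifier-free guidance is usually run: each sampling step evaluates the base model twice, once with and once without the text condition. The sketch below shows the standard guidance mix and the resulting pass count. It is an illustration only: the function names are hypothetical, noise predictions are plain lists, and the convention assumed is that a guidance scale of 1 returns the conditional prediction (some implementations parameterize the scale differently).

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance mix of two noise predictions."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def base_model_calls(num_steps, use_cfg):
    """Base-model forward passes: CFG needs a conditional and an unconditional pass per step."""
    return num_steps * (2 if use_cfg else 1)
```

With 16 sampling steps, plain classifier-free guidance costs 32 base-model passes, while a single guided pass per step costs 16 plus the overhead of the lightweight guide model, which is why the saving is close to but not exactly half.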
2406.01900
Report |
Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation |
Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen |
We present Follow-Your-Emoji, a diffusion-based framework for portrait
animation, which animates a reference portrait with target landmark sequences.
The main challenge of portrait animation is to preserve the identity of the
reference portrait and transfer the target expression to this portrait while
maintaining temporal consistency and fidelity. To address these challenges,
Follow-Your-Emoji equips the powerful Stable Diffusion model with two
well-designed technologies. Specifically, we first adopt a new explicit motion
signal, namely expression-aware landmark, to guide the animation process. We
discover this landmark can not only ensure the accurate motion alignment
between the reference portrait and target motion during inference but also
increase the ability to portray exaggerated expressions (e.g., large pupil
movements) and avoid identity leakage. Then, we propose a facial fine-grained
loss to improve the model's ability of subtle expression perception and
reference portrait appearance reconstruction by using both expression and
facial masks. Accordingly, our method demonstrates significant performance in
controlling the expression of freestyle portraits, including real humans,
cartoons, sculptures, and even animals. By leveraging a simple and effective
progressive generation strategy, we extend our model to stable long-term
animation, thus increasing its potential application value. To address the lack
of a benchmark for this field, we introduce EmojiBench, a comprehensive
benchmark comprising diverse portrait images, driving videos, and landmarks. We
show extensive evaluations on EmojiBench to verify the superiority of
Follow-Your-Emoji. |
Follow-Your-Emoji, a diffusion-based framework for portrait animation, enables animating diverse reference portraits (e.g., humans, cartoons, sculptures, animals) using target landmark sequences while preserving identity and achieving high fidelity. |
Existing methods struggle to maintain identity and generate high-quality animations, particularly for uncommon portrait styles and subtle expressions. |
The framework utilizes: (1) Expression-aware landmarks for accurate motion alignment and exaggerated expression portrayal; (2) Facial fine-grained loss to enhance facial appearance and expression generation; (3) Progressive generation strategy for long-term animation stability. |
Follow-Your-Emoji effectively animates portraits in diverse styles with accurate expression transfer and identity preservation.
The proposed expression-aware landmarks and facial fine-grained loss improve animation quality, especially for subtle expressions.
Quantitative and qualitative evaluations on EmojiBench demonstrate the superiority of Follow-Your-Emoji over existing methods. |
The reliance on MediaPipe for landmark detection can be limiting for certain portrait styles.
Future work includes exploring alternative landmark detection methods and further improving long-term animation coherence. |
portrait animation, diffusion models, expression-aware landmarks, facial fine-grained loss, emojibench |
2406.01733
Report |
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching |
Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang |
Diffusion Transformers have recently demonstrated unprecedented generative
capabilities for various tasks. The encouraging results, however, come with the
cost of slow inference, since each denoising step requires inference on a
transformer model with a large scale of parameters. In this study, we make an
interesting and somewhat surprising observation: the computation of a large
proportion of layers in the diffusion transformer, through introducing a
caching mechanism, can be readily removed even without updating the model
parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68%
of the computation in the cache steps (46.84% for all steps), with less than
0.01 drop in FID. To achieve this, we introduce a novel scheme, named
Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for
diffusion transformers. Specifically, by leveraging the identical structure of
layers in transformers and the sequential nature of diffusion, we explore
redundant computations between timesteps by treating each layer as the
fundamental unit for caching. To address the challenge of the exponential
search space in deep models for identifying layers to cache and remove, we
propose a novel differentiable optimization objective. An input-invariant yet
timestep-variant router is then optimized, which can finally produce a static
computation graph. Experimental results show that L2C largely outperforms
samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at
the same inference speed. |
This paper introduces Learning-to-Cache (L2C), a novel caching mechanism to accelerate inference for diffusion transformers, exploiting layer redundancy across different timesteps. |
Diffusion transformers excel in generative tasks but suffer from slow inference due to their large-scale parameter inference at each denoising step. L2C aims to accelerate this process without compromising image quality. |
L2C leverages the identical layer structure in transformers and the sequential nature of diffusion to identify redundant computations. It employs a differentiable optimization objective to learn an input-invariant but timestep-variant router, enabling a static computation graph for efficient layer caching. |
L2C significantly outperforms samplers with fewer steps (DDIM, DPM-Solver) and prior cache-based methods at the same inference speed.
Experiments on DiT and U-ViT show that a large proportion of the layer computation (up to 93.68% in the cache steps for U-ViT-H/2) can be removed through caching with negligible FID degradation (<0.01).
The learned caching patterns reveal distinct sparsity for DiT and U-ViT, suggesting architectural variations influence layer redundancy in diffusion transformers. |
The effectiveness of L2C is dependent on the trained diffusion model architecture, limiting its generalizability.
The current L2C implementation is capped at 2x speedup due to the two-step inference scheme, requiring further development for higher acceleration. |
diffusion models, transformers, inference acceleration, caching, generative models |
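The layer-caching idea in the L2C report above can be illustrated with a toy loop. This is a sketch under simplifying assumptions, not the paper's implementation: layers are scalar residual functions, the sampler update between timesteps is omitted, and the per-timestep `skip` table stands in for the learned input-invariant, timestep-variant router. All names are hypothetical.

```python
def run_with_layer_cache(layers, x0, num_steps, skip):
    """Apply residual layers over num_steps, reusing cached layer outputs.

    layers: list of callables f(x) -> residual update (toy stand-ins).
    skip[t][i]: True means step t reuses layer i's output cached at an earlier step.
    Returns the final value and the number of layer evaluations actually run.
    """
    cache = [None] * len(layers)
    evaluations = 0
    x = x0
    for t in range(num_steps):
        for i, layer in enumerate(layers):
            if skip[t][i] and cache[i] is not None:
                residual = cache[i]      # reuse the stored output, no computation
            else:
                residual = layer(x)      # recompute and refresh the cache
                cache[i] = residual
                evaluations += 1
            x = x + residual
    return x, evaluations
```

Because the `skip` table depends only on the timestep and not on the input, the set of layers to run is known ahead of time, which is what allows a static computation graph.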
2406.01595
Report |
MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild |
Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song |
We present MultiPly, a novel framework to reconstruct multiple people in 3D
from monocular in-the-wild videos. Reconstructing multiple individuals moving
and interacting naturally from monocular in-the-wild videos poses a challenging
task. Addressing it necessitates precise pixel-level disentanglement of
individuals without any prior knowledge about the subjects. Moreover, it
requires recovering intricate and complete 3D human shapes from short video
sequences, intensifying the level of difficulty. To tackle these challenges, we
first define a layered neural representation for the entire scene, composited
by individual human and background models. We learn the layered neural
representation from videos via our layer-wise differentiable volume rendering.
This learning process is further enhanced by our hybrid instance segmentation
approach which combines the self-supervised 3D segmentation and the promptable
2D segmentation module, yielding reliable instance segmentation supervision
even under close human interaction. A confidence-guided optimization
formulation is introduced to optimize the human poses and shape/appearance
alternately. We incorporate effective objectives to refine human poses via
photometric information and impose physically plausible constraints on human
dynamics, leading to temporally consistent 3D reconstructions with high
fidelity. The evaluation of our method shows the superiority over prior art on
publicly available datasets and in-the-wild videos. |
This paper introduces MultiPly, a novel framework for reconstructing detailed 3D human models of multiple people from in-the-wild monocular videos. |
Reconstructing multiple interacting individuals in 3D from monocular videos is crucial for applications like AR/VR and 4D social activity replay but remains a challenging task due to occlusions, complex dynamics, and depth ambiguities. |
MultiPly utilizes a layered neural representation for the scene, combining individual human and background models. It leverages layer-wise differentiable volume rendering for learning and a hybrid instance segmentation approach combining self-supervised 3D and promptable 2D segmentation (using SAM). A confidence-guided optimization strategy alternates between optimizing pose and shape/appearance based on per-frame confidence. |
MultiPly outperforms state-of-the-art methods in multi-person 3D reconstruction from monocular video, showing significant improvements in metrics like V-IoU and Chamfer distance.
The proposed framework achieves superior novel view synthesis results compared to existing methods, generating sharper images with fewer artifacts.
MultiPly demonstrates robust instance segmentation capabilities, surpassing baseline methods in accuracy, particularly in scenes with close human interaction. |
The model's complexity scales linearly with the number of people, limiting its efficiency for crowded scenes.
The current method does not explicitly model hands, presenting an opportunity for future work by integrating expressive hand models like SMPL-X. |
3d human reconstruction, multi-person, monocular video, neural implicit representation, instance segmentation |
2406.01594
Report |
DiffUHaul: A Training-Free Method for Object Dragging in Images |
Omri Avrahami, Rinon Gal, Gal Chechik, Ohad Fried, Dani Lischinski, Arash Vahdat, Weili Nie |
Text-to-image diffusion models have proven effective for solving many image
editing tasks. However, the seemingly straightforward task of seamlessly
relocating objects within a scene remains surprisingly challenging. Existing
methods addressing this problem often struggle to function reliably in
real-world scenarios due to lacking spatial reasoning. In this work, we propose
a training-free method, dubbed DiffUHaul, that harnesses the spatial
understanding of a localized text-to-image model, for the object dragging task.
Blindly manipulating layout inputs of the localized model tends to cause low
editing performance due to the intrinsic entanglement of object representation
in the model. To this end, we first apply attention masking in each denoising
step to make the generation more disentangled across different objects and
adopt the self-attention sharing mechanism to preserve the high-level object
appearance. Furthermore, we propose a new diffusion anchoring technique: in the
early denoising steps, we interpolate the attention features between source and
target images to smoothly fuse new layouts with the original appearance; in the
later denoising steps, we pass the localized features from the source images to
the interpolated images to retain fine-grained object details. To adapt
DiffUHaul to real-image editing, we apply a DDPM self-attention bucketing that
can better reconstruct real images with the localized model. Finally, we
introduce an automated evaluation pipeline for this task and showcase the
efficacy of our method. Our results are reinforced through a user preference
study. |
Proposes DiffUHaul, a training-free method for dragging objects in images using the spatial understanding of a localized text-to-image model (BlobGEN). |
Addresses the challenge of seamlessly relocating objects in images, a task that remains difficult for existing image editing techniques. |
Utilizes BlobGEN's spatial understanding and introduces: (1) Gated self-attention masking to improve disentanglement, (2) Soft anchoring mechanism for fusing source object appearance with target location, (3) DDPM self-attention bucketing for real image editing. |
Achieves superior object dragging performance compared to baselines, quantitatively and qualitatively.
Demonstrates robustness in avoiding object traces, a common issue in other methods.
Preferred by human evaluators in a user study for its effectiveness and realism. |
Limitations in handling object rotation, resizing, and collisions.
Future work includes addressing these limitations and exploring applications in other creative tasks. |
object dragging, image editing, diffusion models, localized text-to-image generation, attention mechanisms |
2406.01593
Report |
Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting |
Shaojie Ma, Yawei Luo, Yi Yang |
3D reconstruction and simulation, while interrelated, have distinct
objectives: reconstruction demands a flexible 3D representation adaptable to
diverse scenes, whereas simulation requires a structured representation to
model motion principles effectively. This paper introduces the Mesh-adsorbed
Gaussian Splatting (MaGS) method to resolve such a dilemma. MaGS constrains 3D
Gaussians to hover on the mesh surface, creating a mutual-adsorbed
mesh-Gaussian 3D representation that combines the rendering flexibility of 3D
Gaussians with the spatial coherence of meshes. Leveraging this representation,
we introduce a learnable Relative Deformation Field (RDF) to model the relative
displacement between the mesh and 3D Gaussians, extending traditional
mesh-driven deformation paradigms that only rely on ARAP prior, thus capturing
the motion of each 3D Gaussian more precisely. By jointly optimizing meshes, 3D
Gaussians, and RDF, MaGS achieves both high rendering accuracy and realistic
deformation. Extensive experiments on the D-NeRF and NeRF-DS datasets
demonstrate that MaGS can generate competitive results in both reconstruction
and simulation. |
Proposes MaGS, a novel method that combines 3D Gaussian Splatting with mesh representations for unified 3D reconstruction and simulation of dynamic objects from monocular videos. |
Addresses the challenge of simultaneously achieving flexible 3D reconstruction and physically plausible simulations, which existing methods struggle to achieve within a single framework. |
Utilizes a two-stage approach: 1) extracts a static mesh and estimates deformation field from 3D Gaussians, 2) introduces mesh-adsorbed Gaussians and a learnable Relative Deformation Field (RDF) to model fine-grained motions while preserving spatial coherence. |
Achieves state-of-the-art results on D-NeRF and NeRF-DS datasets, demonstrating superior rendering quality and accuracy compared to existing methods.
Enables realistic and user-interactive simulations like dragging by directly manipulating the mesh and propagating deformations to the adsorbed Gaussians.
Ablation studies highlight the contribution of mesh-adsorbed Gaussians and RDF in improving reconstruction and simulation fidelity. |
Performance depends on the accuracy of the initial mesh, posing challenges for low-resolution images or limited viewing angles.
Future work includes extending MaGS to handle topology changes and incorporating physical priors for more realistic simulations. |
3d reconstruction, 3d simulation, gaussian splatting, mesh deformation, dynamic scenes |
2406.01592
Report |
Text-guided Controllable Mesh Refinement for Interactive 3D Modeling |
Yun-Chun Chen, Selena Ling, Zhiqin Chen, Vladimir G. Kim, Matheus Gadelha, Alec Jacobson |
We propose a novel technique for adding geometric details to an input coarse
3D mesh guided by a text prompt. Our method is composed of three stages. First,
we generate a single-view RGB image conditioned on the input coarse geometry
and the input text prompt. This single-view image generation step allows the
user to pre-visualize the result and offers stronger conditioning for
subsequent multi-view generation. Second, we use our novel multi-view normal
generation architecture to jointly generate six different views of the normal
images. The joint view generation reduces inconsistencies and leads to sharper
details. Third, we optimize our mesh with respect to all views and generate a
fine, detailed geometry as output. The resulting method produces an output
within seconds and offers explicit user control over the coarse structure,
pose, and desired details of the resulting 3D mesh. Project page:
https://text-mesh-refinement.github.io. |
This paper introduces a novel technique for refining coarse 3D meshes by adding geometric details guided by text prompts. |
Existing text-to-3D methods often lack control over the generated shape's structure, limiting their utility for artists. This method allows for detailed 3D mesh creation while maintaining control over both global structure and local details. |
The method employs a three-stage process: 1) generating a single-view RGB preview image from the input mesh and text, 2) using a novel multi-view ControlNet to generate consistent normal images from multiple viewpoints guided by the preview image and input mesh, and 3) refining the input mesh based on the generated multi-view normals. |
The method produces high-quality 3D meshes with better geometric details than state-of-the-art methods, as demonstrated by quantitative and subjective evaluations.
The method offers control over the level of detail and pose of the final mesh.
The method is significantly faster (at least 90x) than competing methods due to its reliance on feed-forward networks and direct mesh optimization. |
The limited number of views and image resolution used during training restricts the level of detail achievable.
Mesh refinement relies on an external image segmentation model, potentially introducing artifacts if the segmentation is inaccurate. |
3d mesh refinement, text-guided generation, multi-view controlnet, differentiable rendering, interactive 3d modeling |
2406.01584
Report |
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model |
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu |
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D
vision and language tasks. However, their ability to reason about spatial
arrangements remains limited. In this work, we introduce Spatial Region GPT
(SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
SpatialRGPT advances VLMs' spatial understanding through two key innovations:
(1) a data curation pipeline that enables effective learning of regional
representation from 3D scene graphs, and (2) a flexible plugin module for
integrating depth information into the visual encoder of existing VLMs. During
inference, when provided with user-specified region proposals, SpatialRGPT can
accurately perceive their relative directions and distances. Additionally, we
propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations
encompassing indoor, outdoor, and simulated environments, for evaluating 3D
spatial cognition in VLMs. Our results demonstrate that SpatialRGPT
significantly enhances performance in spatial reasoning tasks, both with and
without local region prompts. The model also exhibits strong generalization
capabilities, effectively reasoning about complex spatial relations and
functioning as a region-aware dense reward annotator for robotic tasks. Code,
dataset, and benchmark will be released at
https://www.anjiecheng.me/SpatialRGPT |
Spatial Region GPT (SpatialRGPT) enhances the spatial reasoning abilities of Vision Language Models (VLMs) by incorporating a region representation module and a flexible plugin for depth information. |
Existing VLMs struggle with spatial reasoning tasks, limiting their application in fields like robotics and augmented reality where precise spatial awareness is crucial. |
The authors introduce: (1) a data curation pipeline to build 3D scene graphs from 2D images, generating region-aware spatial reasoning QAs; (2) a novel VLM architecture integrating depth information through a plugin module; and (3) SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations for evaluating 3D spatial cognition in VLMs. |
SpatialRGPT significantly outperforms existing VLMs on the newly introduced SpatialRGPT-Bench, demonstrating superior spatial reasoning capabilities.
The model effectively generalizes its learned spatial knowledge to real-world applications, functioning as a region-aware dense reward annotator for robotics.
SpatialRGPT exhibits proficiency in complex spatial reasoning tasks, surpassing the capabilities of current leading vision-language models like GPT-4V. |
The current implementation uses axis-aligned bounding boxes, which can be less accurate than oriented bounding boxes in estimating object dimensions, especially for partially elevated objects.
Future work could explore integrating object pose estimation to improve the accuracy of object representation and spatial reasoning. |
vision language models, spatial reasoning, 3d scene understanding, region-aware representation, depth information |
2406.01583
Report |
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP |
Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi |
Recent works have explored how individual components of the CLIP-ViT model
contribute to the final representation by leveraging the shared image-text
representation space of CLIP. These components, such as attention heads and
MLPs, have been shown to capture distinct image features like shape, color or
texture. However, understanding the role of these components in arbitrary
vision transformers (ViTs) is challenging. To this end, we introduce a general
framework which can identify the roles of various components in ViTs beyond
CLIP. Specifically, we (a) automate the decomposition of the final
representation into contributions from different model components, and (b)
linearly map these contributions to CLIP space to interpret them via text.
Additionally, we introduce a novel scoring function to rank components by their
importance with respect to specific features. Applying our framework to various
ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the
roles of different components concerning particular image features. These
insights facilitate applications such as image retrieval using text
descriptions or reference images, visualizing token importance heatmaps, and
mitigating spurious correlations. |
This paper presents a general framework for interpreting vision transformers (ViTs) by decomposing representations into contributions from individual components (like attention heads) and mapping them to CLIP space for text-based interpretation. |
Understanding how ViTs process information and which components contribute to specific image features is crucial for improving their interpretability and reliability. |
The framework utilizes: 1) **AutoDecompose:** An algorithm that automatically decomposes representations into component contributions by traversing the model's computational graph. 2) **CompAlign:** A method for mapping component contributions to CLIP's image representation space using trained linear maps, allowing for text-based interpretation via CLIP's text encoder. 3) **Scoring Function:** A novel function that quantifies the importance of each component for specific image features. |
ImageNet-trained ViTs exhibit significant redundancy, with multiple layers encoding similar features.
The scoring function successfully ranks components based on their relevance to specific features, enabling applications like targeted image retrieval and token importance visualization.
The framework allows for zero-shot mitigation of spurious correlations in datasets like Waterbirds by ablating components highly associated with confounding factors. |
The analysis primarily considers direct contributions from the last few layers and doesn't fully explore indirect contributions or finer component decompositions.
Future work could investigate higher-order contributions and more granular decompositions, potentially identifying specific directions or subspaces within component contributions strongly associated with certain properties. |
vision transformers, interpretability, clip, representation learning, feature attribution |
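The decomposition described in the report above treats the final representation as a sum of per-component contributions, which can then be compared against a feature direction. The sketch below illustrates that idea only. It assumes contributions are already available as small vectors, the component names are made up, and the plain dot-product ranking is a stand-in, not the paper's scoring function.

```python
def total_representation(contributions):
    """Sum per-component contribution vectors into the final representation."""
    dim = len(next(iter(contributions.values())))
    return [sum(vec[d] for vec in contributions.values()) for d in range(dim)]

def rank_components(contributions, direction):
    """Rank components by dot product with a feature direction (illustrative score only)."""
    scores = {
        name: sum(v * w for v, w in zip(vec, direction))
        for name, vec in contributions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In the framework described above, the direction would come from a text embedding in CLIP space after the contributions have been linearly mapped there, so the ranking reflects how strongly each component encodes the named feature.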
2406.01579
Report |
Tetrahedron Splatting for 3D Generation |
Chun Gu, Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang |
3D representation is essential to the significant advance of 3D generation
with 2D diffusion priors. As a flexible representation, NeRF was the first to be
adopted for 3D representation. With density-based volumetric rendering, however, it
suffers from both intensive computational overhead and inaccurate mesh
extraction. Using a signed distance field and Marching Tetrahedra, DMTet allows
for precise mesh extraction and real-time rendering but is limited in handling
large topological changes in meshes, leading to optimization challenges.
Alternatively, 3D Gaussian Splatting (3DGS) is favored in both training and
rendering efficiency while falling short in mesh extraction. In this work, we
introduce a novel 3D representation, Tetrahedron Splatting (TeT-Splatting),
that supports easy convergence during optimization, precise mesh extraction,
and real-time rendering simultaneously. This is achieved by integrating
surface-based volumetric rendering within a structured tetrahedral grid while
preserving the desired ability of precise mesh extraction, and a tile-based
differentiable tetrahedron rasterizer. Furthermore, we incorporate eikonal and
normal consistency regularization terms for the signed distance field to
improve generation quality and stability. Critically, our representation can be
trained without mesh extraction, making the optimization process easier to
converge. Our TeT-Splatting can be readily integrated in existing 3D generation
pipelines, along with polygonal mesh for texture optimization. Extensive
experiments show that our TeT-Splatting strikes a superior tradeoff among
convergence speed, render efficiency, and mesh quality as compared to previous
alternatives under varying 3D generation settings. |
This paper introduces Tetrahedron Splatting (TeT-Splatting), a novel 3D representation for 3D generation that leverages volumetric rendering within a structured tetrahedral grid. |
Existing 3D representations for 3D generation face trade-offs between convergence speed, render efficiency, and mesh quality. This work aims to address these limitations and enable high-fidelity 3D generation. |
The method integrates surface-based volumetric rendering into a tetrahedral grid, enabling precise mesh extraction through Marching Tetrahedra. It employs a tile-based fast differentiable rasterizer for real-time rendering and incorporates eikonal and normal consistency regularization for improved generation quality. |
TeT-Splatting demonstrates a superior trade-off among convergence speed, render efficiency, and mesh quality compared to alternatives like Instant-NGP, DMTet, and 3DGS.
The method achieves rapid and stable convergence in 3D generation tasks, effectively handling topological changes, unlike DMTet.
Evaluations with both vanilla and rich diffusion priors show TeT-Splatting produces high-fidelity 3D content with detailed geometries and textures. |
TeT-Splatting struggles with modeling high-frequency features due to the limitations of using tetrahedra as rendering primitives.
The implemented rasterizer's rendering speed, although real-time, is slower than 3DGS and could be further improved. |
3d generation, 3d representation, tetrahedron splatting, volumetric rendering, diffusion models |
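As a rough illustration of the eikonal regularization mentioned in the TeT-Splatting entry above, the sketch below penalizes deviations of a signed distance field's gradient norm from 1. This is the commonly used form of the eikonal term, not necessarily the exact loss in the paper. The names `eikonal_loss` and `sphere_sdf`, the finite-difference gradient, and the sample points are assumptions made for illustration.

```python
import numpy as np


def eikonal_loss(sdf, points, h=1e-4):
    """Mean squared deviation of the SDF gradient norm from 1.

    `sdf` maps an (N, D) array of points to N signed distances. The gradient
    is estimated with central finite differences of step `h` (an assumption;
    a training pipeline would normally use automatic differentiation).
    """
    points = np.asarray(points, dtype=float)
    grads = np.zeros_like(points)
    for axis in range(points.shape[1]):
        offset = np.zeros(points.shape[1])
        offset[axis] = h
        grads[:, axis] = (sdf(points + offset) - sdf(points - offset)) / (2.0 * h)
    grad_norms = np.linalg.norm(grads, axis=1)
    return float(np.mean((grad_norms - 1.0) ** 2))


def sphere_sdf(points, radius=1.0):
    """Exact signed distance to a sphere centered at the origin (toy example)."""
    return np.linalg.norm(points, axis=1) - radius


sample_points = np.array([[0.5, 0.2, -0.1], [1.5, -0.3, 0.8], [-2.0, 1.0, 0.4]])
# A true SDF has unit gradient norm, so its loss is close to zero.
loss_true_sdf = eikonal_loss(sphere_sdf, sample_points)
# Scaling the field by 2 doubles the gradient norm, so the loss is close to 1.
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), sample_points)
```

A field that satisfies the eikonal property behaves like a true distance function, which is why such a term is typically used to stabilize SDF optimization.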
2406.01561
Report |
Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation |
Mingyuan Zhou, Zhendong Wang, Huangjie Zheng, Hai Huang |
Diffusion-based text-to-image generation models trained on extensive
text-image pairs have shown the capacity to generate photorealistic images
consistent with textual descriptions. However, a significant limitation of
these models is their slow sample generation, which requires iterative
refinement through the same network. In this paper, we enhance Score identity
Distillation (SiD) by developing long and short classifier-free guidance (LSG)
to efficiently distill pretrained Stable Diffusion models without using real
training data. SiD aims to optimize a model-based explicit score matching loss,
utilizing a score-identity-based approximation alongside the proposed LSG for
practical computation. By training exclusively with fake images synthesized
with its one-step generator, SiD equipped with LSG rapidly improves FID and
CLIP scores, achieving state-of-the-art FID performance while maintaining a
competitive CLIP score. Specifically, its data-free distillation of Stable
Diffusion 1.5 achieves a record low FID of 8.15 on the COCO-2014 validation
set, with a CLIP score of 0.304 at an LSG scale of 1.5, and a FID of 9.56 with
a CLIP score of 0.313 at an LSG scale of 2. We will make our PyTorch
implementation and distilled Stable Diffusion one-step generators available at
https://github.com/mingyuanzhou/SiD-LSG |
This paper introduces a novel method combining Classifier-Free Guidance (CFG) with Score Identity Distillation (SiD) to effectively distill Stable Diffusion models into one-step generators, using only synthesized fake images. |
Diffusion models, while powerful for text-to-image generation, are computationally expensive due to their iterative nature. This work addresses this limitation by enabling fast, one-step generation without sacrificing performance. |
The study introduces "long and short guidance" (LSG) strategies for injecting CFG into SiD. It explores enhancing CFG for the teacher network, reducing it for the student network, and a combined approach for optimized FID and CLIP score balance. |
The proposed SiD-LSG achieves state-of-the-art FID scores among one-step distillation methods on the COCO-2014 dataset.
The method demonstrates successful distillation of both SD 1.5 and 2.1-base, achieving FID scores as low as 9.56 and 10.97, respectively, while maintaining competitive CLIP scores.
A record low FID of 8.15 is achieved with SD1.5 distillation by reducing the guidance scale and extending training time, outperforming even the teacher model. |
The current SiD-LSG implementation shows limitations in reaching the full text-image alignment capabilities of the teacher model, suggesting future exploration of multi-step generation or model size increase.
While FP16 mixed precision accelerates training, it currently limits achieving the lowest FID and highest CLIP scores compared to FP32, necessitating further optimization research. |
text-to-image generation, diffusion models, model distillation, classifier-free guidance, stable diffusion |
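For readers unfamiliar with classifier-free guidance (CFG), which the SiD-LSG entry above builds on, the following minimal sketch shows the standard CFG combination of conditional and unconditional predictions. It illustrates only generic CFG; how the paper lengthens or shortens the guidance scale for the teacher and student networks is not reproduced here. The function name `cfg_combine` and the toy arrays are assumptions.

```python
import numpy as np


def cfg_combine(pred_uncond, pred_cond, scale):
    """Standard classifier-free guidance.

    Extrapolates from the unconditional prediction toward the conditional one.
    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    prediction, and scale > 1 amplifies the effect of the condition.
    """
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    pred_cond = np.asarray(pred_cond, dtype=float)
    return pred_uncond + scale * (pred_cond - pred_uncond)


# Toy predictions standing in for network outputs (hypothetical values).
uncond = np.array([0.0, 1.0, 2.0])
cond = np.array([1.0, 1.0, 4.0])
guided = cfg_combine(uncond, cond, 1.5)
```

The guidance scales of 1.5 and 2 reported in the abstract correspond to the `scale` argument in this generic form.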
2406.01493
Report |
Learning Temporally Consistent Video Depth from Video Diffusion Priors |
Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Matteo Poggi, Yiyi Liao |
This work addresses the challenge of video depth estimation, which expects
not only per-frame accuracy but, more importantly, cross-frame consistency.
Instead of directly developing a depth estimator from scratch, we reformulate
the prediction task into a conditional generation problem. This allows us to
leverage the prior knowledge embedded in existing video generation models,
thereby reducing learning difficulty and enhancing generalizability.
Concretely, we study how to tame the public Stable Video Diffusion (SVD) to
predict reliable depth from input videos using a mixture of image depth and
video depth datasets. We empirically confirm that a procedural training
strategy -- first optimizing the spatial layers of SVD and then optimizing the
temporal layers while keeping the spatial layers frozen -- yields the best
results in terms of both spatial accuracy and temporal consistency. We further
examine the sliding window strategy for inference on arbitrarily long videos.
Our observations indicate a trade-off between efficiency and performance, with
a one-frame overlap already producing favorable results. Extensive experimental
results demonstrate the superiority of our approach, termed ChronoDepth, over
existing alternatives, particularly in terms of the temporal consistency of the
estimated depth. Additionally, we highlight the benefits of more consistent
video depth in two practical applications: depth-conditioned video generation
and novel view synthesis. Our project page is available at
https://jhaoshao.github.io/ChronoDepth/. |
This paper introduces ChronoDepth, a novel video depth estimation method that prioritizes temporal consistency by leveraging pre-trained video generation models (specifically, Stable Video Diffusion). |
Temporal consistency in video depth estimation is crucial for eliminating flickering artifacts and ensuring realistic 3D applications, yet current methods struggle to achieve both temporal consistency and spatial accuracy. |
The authors reformulate depth estimation as a conditional denoising diffusion generation task. They propose a two-stage fine-tuning strategy: optimizing spatial layers with single-frame depths, then freezing them and optimizing temporal layers using randomly sized video clips. For inference, a novel temporal inpainting strategy enhances consistency across clips. |
ChronoDepth achieves state-of-the-art temporal consistency on benchmark datasets, surpassing both image and video depth estimation methods.
It maintains comparable spatial accuracy to state-of-the-art single-image depth estimators.
ChronoDepth demonstrates superior performance in downstream applications like depth-conditioned video generation and novel view synthesis. |
The reliance on synthetic datasets for training might limit generalization to diverse real-world scenarios.
Future work could explore larger and more varied datasets, as well as alternative video generation models. |
video depth estimation, temporal consistency, video diffusion models, stable video diffusion, conditional denoising diffusion |
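The sliding-window inference with a one-frame overlap described in the ChronoDepth entry above can be sketched as a simple index computation. This shows only how a long video might be split into overlapping clips; how the overlapping frames condition the next clip is specific to the paper and is not modeled. The function name `sliding_windows` and its parameters are hypothetical.

```python
def sliding_windows(num_frames, window, overlap=1):
    """Split frame indices [0, num_frames) into clips of at most `window` frames.

    Consecutive clips share `overlap` frames. Returns a list of (start, end)
    pairs with `end` exclusive. The last clip may be shorter than `window`.
    """
    if overlap < 0 or window <= overlap:
        raise ValueError("window must be larger than a non-negative overlap")
    if num_frames <= 0:
        return []
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans


clips = sliding_windows(num_frames=10, window=4, overlap=1)
```

A larger overlap gives each clip more shared context at the cost of more clips to process, which matches the efficiency versus performance trade-off noted in the abstract.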
2406.01476
Report |
DreamPhysics: Learning Physical Properties of Dynamic 3D Gaussians with Video Diffusion Priors |
Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, Rynson W. H. Lau |
Dynamic 3D interaction has witnessed great interest in recent works, while
creating such 4D content remains challenging. One solution is to animate 3D
scenes with physics-based simulation, and the other is to learn the deformation
of static 3D objects with the distillation of video generative models. The
former one requires assigning precise physical properties to the target object,
otherwise the simulated results would become unnatural. The latter tends to
formulate the video with minor motions and discontinuous frames, due to the
absence of physical constraints in deformation learning. We argue that video
generative models, trained on real-world captured data, are capable of judging
physical phenomena in simulation environments. To this end, we propose
DreamPhysics in this work, which estimates physical properties of 3D Gaussian
Splatting with video diffusion priors. DreamPhysics supports both image- and
text-conditioned guidance, optimizing physical parameters via score
distillation sampling with frame interpolation and log gradient. Based on a
material point method simulator with proper physical parameters, our method can
generate 4D content with realistic motions. Experimental results demonstrate
that, by distilling the prior knowledge of video diffusion models, inaccurate
physical properties can be gradually refined for high-quality simulation. Codes
are released at: https://github.com/tyhuang0428/DreamPhysics. |
DreamPhysics, a novel framework, leverages video diffusion priors to estimate physical properties for dynamic 3D Gaussian Splatting (GS), enabling the generation of realistic 4D content. |
Creating dynamic 3D content with realistic physics remains challenging. Existing methods either rely on manual assignment of physical properties, leading to unnatural results, or learn deformation from video data lacking physical constraints, resulting in limited and unrealistic motion. |
DreamPhysics employs a Material Point Method (MPM) simulator to animate 3D GS scenes. It leverages Score Distillation Sampling (SDS) to optimize physical parameters based on video diffusion models' guidance, ensuring adherence to realistic physical behavior during animation. |
DreamPhysics effectively distills physical priors from video diffusion models, enabling accurate estimation of physical properties for 3D objects.
The framework supports both image- and text-conditioned optimization, broadening its applicability.
Compared to existing 4D generation methods, DreamPhysics achieves more realistic motion simulation and faster training. |
The range of simulated motions is currently limited, requiring further exploration of various physical constraints.
Current evaluation metrics for simulated videos rely on visual quality, necessitating the development of physics-based metrics for more comprehensive assessment. |
4d content generation, physics-based simulation, video diffusion models, 3d gaussian splatting, score distillation sampling |
2406.01467
Report |
RaDe-GS: Rasterizing Depth in Gaussian Splatting |
Baowen Zhang, Chuan Fang, Rakesh Shrestha, Yixun Liang, Xiaoxiao Long, Ping Tan |
Gaussian Splatting (GS) has proven to be highly effective in novel view
synthesis, achieving high-quality and real-time rendering. However, its
potential for reconstructing detailed 3D shapes has not been fully explored.
Existing methods often suffer from limited shape accuracy due to the discrete
and unstructured nature of Gaussian splats, which complicates the shape
extraction. While recent techniques like 2D GS have attempted to improve shape
reconstruction, they often reformulate the Gaussian primitives in ways that
reduce both rendering quality and computational efficiency. To address these
problems, our work introduces a rasterized approach to render the depth maps
and surface normal maps of general 3D Gaussian splats. Our method not only
significantly enhances shape reconstruction accuracy but also maintains the
computational efficiency intrinsic to Gaussian Splatting. Our approach achieves
a Chamfer distance error comparable to NeuraLangelo on the DTU dataset and
similar training and rendering time as traditional Gaussian Splatting on the
Tanks & Temples dataset. Our method is a significant advancement in Gaussian
Splatting and can be directly integrated into existing Gaussian Splatting-based
methods. |
This paper introduces RaDe-GS, a novel rasterized method for computing depth and normal maps of general 3D Gaussian splats, enhancing 3D shape reconstruction accuracy in Gaussian Splatting while maintaining its computational efficiency. |
Gaussian Splatting is efficient for novel view synthesis but struggles with accurate 3D shape reconstruction due to the discrete nature of Gaussian splats. Existing methods trying to address this compromise rendering quality and efficiency. |
The authors derive a closed-form solution for intersections of light rays and Gaussian splats, enabling efficient depth map calculation. They leverage the approximate affine projection to compute spatially varying depth within projected Gaussian splats, enabling rasterization for depth and normal map computation. |
RaDe-GS achieves a Chamfer distance error of 0.69 mm on the DTU dataset, comparable to NeuraLangelo and surpassing other Gaussian Splatting methods.
The method maintains similar training and rendering time as traditional Gaussian Splatting on the Tanks & Temples dataset (around 17.8 minutes).
It achieves high-quality novel view synthesis, outperforming other Gaussian Splatting methods in PSNR and perceptual metrics. |
Current TSDF fusion is limited to low-resolution voxel grids for large scenes, impacting surface extraction accuracy.
Reconstruction of reflective surfaces is limited by the simple color function in 3D GS, potentially addressed by incorporating advanced color representations. |
gaussian splatting, 3d reconstruction, novel view synthesis, depth map estimation, rasterization |
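The RaDe-GS entry above reports shape accuracy as a Chamfer distance. As a hedged illustration, the sketch below computes a symmetric Chamfer distance between two small point sets by brute force, averaging the two directional nearest-neighbor terms. The official DTU evaluation involves additional steps (such as masking and outlier thresholds) that are not reproduced here, and the function name `chamfer_distance` is an assumption.

```python
import numpy as np


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets of shape (N, D) and (M, D).

    Averages the mean nearest-neighbor distance from A to B and from B to A.
    Brute force with O(N * M) memory, so only suitable for small point sets.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances via broadcasting: result has shape (N, M).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1).mean()
    b_to_a = dists.min(axis=0).mean()
    return 0.5 * (a_to_b + b_to_a)


reconstruction = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ground_truth = np.array([[0.0, 0.0, 0.0]])
cd = chamfer_distance(reconstruction, ground_truth)
```

In this toy case one reconstructed point lies 1 unit from the ground truth, so the A-to-B term is 0.5, the B-to-A term is 0, and the result is 0.25.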
2406.01460
Report |
MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization |
Yu Zhang, Qi Zhang, Zixuan Gong, Yiwei Shi, Yepeng Liu, Duoqian Miao, Yang Liu, Ke Liu, Kun Yi, Wei Fan, Liang Hu, Changwei Wang |
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable
success, leading to rapid advancements in multimodal studies. However, CLIP
faces a notable challenge in terms of inefficient data utilization. It relies
on a single contrastive supervision for each image-text pair during
representation learning, disregarding a substantial amount of valuable
information that could offer richer supervision. Additionally, the retention of
non-informative tokens leads to increased computational demands and time costs,
particularly in CLIP's ViT image encoder. To address these issues, we propose
Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the
frequency transform's sensitivity to both high and low-frequency variations,
which complements the spatial domain's sensitivity limited to low-frequency
variations only. By incorporating frequency transforms and token-level
alignment, we expand CLIP's single supervision into multi-domain and
multi-level supervision, enabling a more thorough exploration of informative
image features. Additionally, we introduce a token merging method guided by
comprehensive semantics from the frequency and spatial domains. This allows us
to merge tokens into multi-granularity tokens with a controllable compression
rate to accelerate CLIP. Extensive experiments validate the effectiveness of
our design. |
Proposes MLIP, a Multi-Perspective Language-Image Pretraining framework, which introduces frequency domain analysis and token merging to improve CLIP's data efficiency and training speed. |
CLIP suffers from inefficient data utilization and high computational costs due to its reliance on single contrastive supervision and the presence of non-informative tokens. |
MLIP splits the image encoder into Frequency and Spatial Stages for multi-domain supervision. It implements joint spatial-frequency token alignment for fine-grained representation learning and utilizes token merging guided by frequency-spatial information for acceleration. |
MLIP achieves competitive zero-shot and linear-probe image classification accuracy compared to CLIP and its variants.
MLIP demonstrates superior performance in zero-shot image-text retrieval tasks, particularly in recall@1 metrics.
MLIP achieves a better computation-performance balance than other CLIP-like models. |
MLIP is currently only explored with ViT-based architectures, limiting its applicability to CNN-based models.
The token merging in MLIP poses challenges for its application in dense vision downstream tasks like segmentation. |
multimodal learning, vision-language pretraining, contrastive learning, frequency domain analysis, token merging |
2406.01388
Report |
AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation |
Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang |
As cutting-edge Text-to-Image (T2I) generation models already excel at
producing remarkable single images, an even more challenging task, i.e.,
multi-turn interactive image generation, begins to attract the attention of
related research communities. This task requires models to interact with users
over multiple turns to generate a coherent sequence of images. However, since
users may switch subjects frequently, current efforts struggle to maintain
subject consistency while generating diverse images. To address this issue, we
introduce a training-free multi-agent framework called AutoStudio. AutoStudio
employs three agents based on large language models (LLMs) to handle
interactions, along with a stable diffusion (SD) based agent for generating
high-quality images. Specifically, AutoStudio consists of (i) a subject manager
to interpret interaction dialogues and manage the context of each subject, (ii)
a layout generator to generate fine-grained bounding boxes to control subject
locations, (iii) a supervisor to provide suggestions for layout refinements,
and (iv) a drawer to complete image generation. Furthermore, we introduce a
Parallel-UNet to replace the original UNet in the drawer, which employs two
parallel cross-attention modules for exploiting subject-aware features. We also
introduce a subject-initialized generation method to better preserve small
subjects. Our AutoStudio hereby can generate a sequence of multi-subject images
interactively and consistently. Extensive experiments on the public CMIGBench
benchmark and human evaluations show that AutoStudio maintains multi-subject
consistency across multiple turns well, and it also raises the state-of-the-art
performance by 13.65% in average Frechet Inception Distance and 2.83% in
average character-character similarity. |
This paper proposes AutoStudio, a training-free multi-agent framework for multi-turn interactive image generation, which addresses the challenge of maintaining multi-subject consistency over multiple turns. |
Existing methods struggle to maintain consistency across multiple subjects in interactive image generation tasks, especially when users frequently switch subjects or provide complex instructions. |
AutoStudio employs three LLM-based agents for dialogue interpretation, layout generation, and layout supervision, along with a stable diffusion-based agent enhanced by a Parallel-UNet and a subject-initialized generation method for image synthesis. |
AutoStudio outperforms existing methods on CMIGBench, demonstrating superior performance in maintaining multi-subject consistency and generating high-quality images.
The proposed Parallel-UNet architecture and subject-initialized generation method effectively enhance subject consistency during image generation.
Human evaluation confirms AutoStudio's ability to generate images that align better with user intentions. |
AutoStudio may exhibit limitations in generating intricate details, particularly in close-interaction scenarios between subjects.
The use of multiple agents can increase computational time and resource requirements. |
multi-turn interactive image generation, multi-agent framework, subject consistency, stable diffusion, layout generation |
2406.01334
Report |
HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models |
Mengcheng Li, Hongwen Zhang, Yuxiang Zhang, Ruizhi Shao, Tao Yu, Yebin Liu |
Recent years have witnessed a trend of the deep integration of the generation
and reconstruction paradigms. In this paper, we extend the ability of
controllable generative models for a more comprehensive hand mesh recovery
task: direct hand mesh generation, inpainting, reconstruction, and fitting in a
single framework, which we name Holistic Hand Mesh Recovery (HHMR). Our key
observation is that different kinds of hand mesh recovery tasks can be achieved
by a single generative model with strong multimodal controllability, and in
such a framework, realizing different tasks only requires giving different
signals as conditions. To achieve this goal, we propose an all-in-one diffusion
framework based on graph convolution and attention mechanisms for holistic hand
mesh recovery. In order to achieve strong control generation capability while
ensuring the decoupling of multimodal control signals, we map different
modalities to a shared feature space and apply cross-scale random masking in
both modality and feature levels. In this way, the correlation between
different modalities can be fully exploited during the learning of hand priors.
Furthermore, we propose Condition-aligned Gradient Guidance to enhance the
alignment of the generated model with the control signals, which significantly
improves the accuracy of the hand mesh reconstruction and fitting. Experiments
show that our novel framework can realize multiple hand mesh recovery tasks
simultaneously and outperform the existing methods in different tasks, which
provides more possibilities for subsequent downstream applications including
gesture recognition, pose generation, mesh editing, and so on. |
This paper presents HHMR, a unified graph diffusion-based framework for holistic hand mesh recovery, enabling simultaneous direct generation, inpainting, reconstruction, and fitting. |
Unifying these tasks within a single framework can enhance their mutual benefits and improve efficiency compared to separate models. |
The method utilizes a U-shaped graph convolutional network with self- and cross-attention to learn hand priors from various input conditions (images, skeletons, etc.) and progressively denoise a 3D hand mesh. It also employs random masking and a condition-aligned gradient guidance strategy for enhanced control and accuracy. |
HHMR generates more diverse and realistic hand meshes compared to PCA-based methods.
It achieves comparable single-hypothesis reconstruction results and superior multi-hypothesis results on the FreiHAND dataset, outperforming state-of-the-art approaches.
The condition-aligned gradient guidance significantly improves accuracy in 2D hand mesh fitting tasks. |
The model might not perform well with extremely noisy or incomplete input conditions.
Increasing denoising steps for higher precision comes with increased computational cost. |
hand mesh recovery, diffusion models, generative models, graph convolutional networks, multimodal learning |
2406.01210
Report |
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer |
Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen |
Cross-modal transformers have demonstrated superiority in various vision
tasks by effectively integrating different modalities. This paper first
critiques prior token exchange methods which replace less informative tokens
with inter-modal features, and demonstrates that exchange-based methods underperform
cross-attention mechanisms, while the computational demand of the latter
inevitably restricts its use with longer sequences. To surmount the
computational challenges, we propose GeminiFusion, a pixel-wise fusion approach
that capitalizes on aligned cross-modal representations. GeminiFusion elegantly
combines intra-modal and inter-modal attentions, dynamically integrating
complementary information across modalities. We employ a layer-adaptive noise
to adaptively control their interplay on a per-layer basis, thereby achieving a
harmonized fusion process. Notably, GeminiFusion maintains linear complexity
with respect to the number of input tokens, ensuring this multimodal framework
operates with efficiency comparable to unimodal networks. Comprehensive
evaluations across multimodal image-to-image translation, 3D object detection
and arbitrary-modal semantic segmentation tasks, including RGB, depth, LiDAR,
event data, etc., demonstrate the superior performance of our GeminiFusion
against leading-edge techniques. The PyTorch code is available at
https://github.com/JiaDingCN/GeminiFusion |
This paper introduces GeminiFusion, an efficient pixel-wise multimodal fusion module for vision transformers that leverages the inherent alignment of multi-modality input in vision tasks, outperforming token exchange methods like TokenFusion. |
Multimodal fusion in vision transformers is often limited by either the sub-optimality of token exchange methods or the computational overhead of cross-attention mechanisms. GeminiFusion addresses these limitations, offering both efficiency and state-of-the-art performance. |
GeminiFusion prioritizes interactions between spatially co-located patches from different modalities using a pixel-wise attention mechanism. It incorporates a relation discriminator to improve feature selection and layer-adaptive noise for better self/cross-attention balance. |
GeminiFusion consistently outperforms TokenFusion on multimodal semantic segmentation tasks, achieving improvements up to 3.4% in mIoU on the DeLiVER dataset.
It also excels in image-to-image translation, showing significant improvements in FID/KID scores on the Taskonomy dataset.
GeminiFusion demonstrates efficiency gains over TokenFusion, achieving comparable inference latency to unimodal networks. |
GeminiFusion, in its current form, is primarily designed for homogeneous modalities and might not be directly applicable to heterogeneous data like images paired with audio or text.
Further research is needed to extend its capabilities to handle heterogeneous data combinations. |
multimodal fusion, vision transformer, geminifusion, semantic segmentation, image-to-image translation |
2406.01203
Report |
Scaling Up Deep Clustering Methods Beyond ImageNet-1K |
Nikolas Adaloglou, Felix Michels, Kaspar Senft, Diana Petrusheva, Markus Kollmann |
Deep image clustering methods are typically evaluated on small-scale balanced
classification datasets while feature-based $k$-means has been applied on
proprietary billion-scale datasets. In this work, we explore the performance of
feature-based deep clustering approaches on large-scale benchmarks whilst
disentangling the impact of the following data-related factors: i) class
imbalance, ii) class granularity, iii) easy-to-recognize classes, and iv) the
ability to capture multiple classes. Consequently, we develop multiple new
benchmarks based on ImageNet21K. Our experimental analysis reveals that
feature-based $k$-means is often unfairly evaluated on balanced datasets.
However, deep clustering methods outperform $k$-means across most large-scale
benchmarks. Interestingly, $k$-means underperforms on easy-to-classify
benchmarks by large margins. The performance gap, however, diminishes on the
highest data regimes such as ImageNet21K. Finally, we find that non-primary
cluster predictions capture meaningful classes (i.e. coarser classes). |
This paper presents a comprehensive experimental study on large-scale image clustering methods and benchmarks, focusing on factors like class imbalance, granularity, ease of classification, and multi-label capture. |
Existing deep image clustering methods are often evaluated on small, balanced datasets, limiting their applicability to real-world, large-scale scenarios. This work addresses this gap by exploring their performance on challenging, large-scale benchmarks. |
The study creates new benchmarks based on ImageNet21K, varying factors like class imbalance and granularity. It compares the performance of feature-based k-means with deep clustering methods like TEMI and SCANv2 on these benchmarks. |
Deep clustering methods outperform k-means on most benchmarks, except for cases with highly coarse labels or the largest dataset scales.
K-means performs poorly on benchmarks with easily classifiable classes, suggesting limitations in capturing irregular class shapes.
Non-primary cluster predictions from clustering methods can capture meaningful secondary classes like coarser labels. |
The study relies on pre-trained feature extractors, limiting the exploration of feature learning in clustering.
The sensitivity of SCANv2 to mini-batch size poses computational challenges for large-scale datasets. |
image clustering, large-scale benchmarks, class imbalance, class granularity, multi-label clustering |
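The feature-based k-means baseline discussed in the entry above clusters fixed, pre-extracted image features. A minimal sketch of Lloyd's algorithm on a feature matrix is given below. The initial centers are passed in explicitly to keep the example deterministic, the toy 2D features stand in for real embeddings, and the function names are hypothetical. The study's actual k-means setup (initialization, feature normalization, scale) is not specified here.

```python
import numpy as np


def assign_labels(features, centers):
    """Index of the nearest center (Euclidean distance) for every feature."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def kmeans(features, init_centers, num_iters=20):
    """Plain Lloyd's algorithm on an (N, D) feature matrix.

    Alternates between assigning features to the nearest center and moving
    each center to the mean of its members. Empty clusters keep their center.
    """
    features = np.asarray(features, dtype=float)
    centers = np.array(init_centers, dtype=float)
    for _ in range(num_iters):
        labels = assign_labels(features, centers)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = features[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign_labels(features, centers), centers


# Toy features: two well-separated groups (hypothetical values).
toy_features = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(toy_features, init_centers=[[0.0, 0.0], [10.0, 10.0]])
```

Because each point is assigned purely by distance to a cluster mean, this baseline implicitly assumes roughly compact clusters in feature space.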
2406.01188
Report |
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation |
Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang |
Recent diffusion-based human image animation techniques have demonstrated
impressive success in synthesizing videos that faithfully follow a given
reference identity and a sequence of desired movement poses. Despite this,
there are still two limitations: i) an extra reference model is required to
align the identity image with the main video branch, which significantly
increases the optimization burden and model parameters; ii) the generated video
is usually short in time (e.g., 24 frames), hampering practical applications.
To address these shortcomings, we present a UniAnimate framework to enable
efficient and long-term human video generation. First, to reduce the
optimization difficulty and ensure temporal coherence, we map the reference
image along with the posture guidance and noise video into a common feature
space by incorporating a unified video diffusion model. Second, we propose a
unified noise input that supports random noised input as well as first frame
conditioned input, which enhances the ability to generate long-term video.
Finally, to further efficiently handle long sequences, we explore an
alternative temporal modeling architecture based on a state space model to
replace the original computation-consuming temporal Transformer. Extensive
experimental results indicate that UniAnimate achieves superior synthesis
results over existing state-of-the-art counterparts in both quantitative and
qualitative evaluations. Notably, UniAnimate can even generate highly
consistent one-minute videos by iteratively employing the first frame
conditioning strategy. Code and models will be publicly available. Project
page: https://unianimate.github.io/. |
This paper proposes UniAnimate, a novel video diffusion model framework for consistent and efficient human image animation, addressing limitations of existing methods in handling long video generation and appearance misalignment. |
Human image animation is a challenging task crucial for various applications like video creation and virtual reality. Existing methods face limitations in maintaining temporal consistency, aligning appearance with reference images, and generating long videos. |
UniAnimate leverages a unified video diffusion model to encode both reference image and video content in a shared feature space for enhanced appearance alignment. It introduces a unified noised input supporting both random and first-frame conditioned videos for smooth transitions in long sequences. Additionally, it explores temporal Mamba, an efficient alternative to temporal Transformers for long-range temporal modeling. |
UniAnimate demonstrates superior performance over state-of-the-art methods on benchmark datasets, achieving higher visual quality, identity preservation, and temporal consistency.
The proposed unified video diffusion model significantly improves appearance alignment compared to using separate networks for reference image and video generation.
Temporal Mamba proves to be an effective and efficient alternative to temporal Transformers for long video generation, exhibiting comparable performance with reduced memory consumption. |
Generating fine-grained details in facial and hand regions remains challenging.
Occasional inconsistencies in completing invisible parts across different video segments may lead to temporal artifacts. |
video generation, human image animation, diffusion model, temporal modeling, appearance alignment |
2406.01159
Report |
Dimba: Transformer-Mamba Diffusion Models |
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang |
This paper unveils Dimba, a new text-to-image diffusion model that employs a
distinctive hybrid architecture combining Transformer and Mamba elements.
Specifically, Dimba's sequentially stacked blocks alternate between Transformer
and Mamba layers, and integrate conditional information through the
cross-attention layer, thus capitalizing on the advantages of both
architectural paradigms. We investigate several optimization strategies,
including quality tuning and resolution adaptation, and identify critical
configurations necessary for large-scale image generation. The model's flexible
design supports scenarios that cater to specific resource constraints and
objectives. When scaled appropriately, Dimba offers substantial throughput and
a reduced memory footprint relative to conventional pure Transformer-based
benchmarks. Extensive experiments indicate that Dimba achieves performance
comparable to these benchmarks in terms of image quality, artistic
rendering, and semantic control. We also report several intriguing properties
of architecture discovered during evaluation and release checkpoints in
experiments. Our findings emphasize the promise of large-scale hybrid
Transformer-Mamba architectures in the foundational stage of diffusion models,
suggesting a bright future for text-to-image generation. |
This paper introduces Dimba, a novel text-to-image diffusion model that leverages a hybrid architecture combining Transformer and Mamba layers for enhanced efficiency and performance. |
Existing text-to-image models often suffer from high memory requirements and limitations in handling long contexts. Dimba addresses these limitations by integrating the strengths of both Transformer and Mamba architectures. |
Dimba interleaves Transformer and Mamba layers, incorporating conditional information through cross-attention. The authors trained Dimba using a large-scale, curated image-text dataset with a focus on aesthetic quality, employing techniques like quality tuning and resolution adaptation. |
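The interleaving described above can be illustrated with a small layer-schedule helper. This is a minimal sketch, not the authors' code: the function name `build_layer_schedule` and the `mamba_per_attention` ratio are hypothetical, and the actual ratio and block internals used by Dimba are not given here.

```python
def build_layer_schedule(num_blocks, mamba_per_attention=1):
    """Return layer names for a hybrid stack.

    Each block holds one Transformer (attention) layer followed by
    `mamba_per_attention` Mamba layers. The ratio is an illustrative
    assumption, not a value taken from the paper.
    """
    schedule = []
    for _ in range(num_blocks):
        schedule.append("transformer")
        schedule.extend(["mamba"] * mamba_per_attention)
    return schedule


if __name__ == "__main__":
    # Two blocks with a 1:1 ratio alternate strictly between the two layer types.
    print(build_layer_schedule(2))
```

Raising `mamba_per_attention` shifts the stack toward Mamba layers, which is one way to picture the throughput and memory trade-off the summary mentions.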
Dimba achieves image quality and semantic alignment comparable to existing diffusion models, as evidenced by FID scores and T2I-CompBench results.
The hybrid architecture allows for flexibility in balancing throughput and memory requirements based on specific needs.
Quality tuning with a curated dataset significantly improves the aesthetic quality of generated images. |
Dimba may inherit biases from the training data, impacting its ability to generate certain styles, scenes, or objects.
Potential negative social impacts, such as perpetuating stereotypes, need to be addressed in future research. |
text-to-image generation, diffusion models, hybrid architecture, transformer, mamba |
2406.01125
Report |
$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers |
Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, Tao Chen |
Diffusion models are widely recognized for generating high-quality and
diverse images, but their poor real-time performance has led to numerous
acceleration works, primarily focusing on UNet-based structures. Despite the
stronger results achieved by diffusion transformers (DiT), there is still a
lack of exploration regarding the impact of DiT structure on generation, as
well as no acceleration framework tailored to the DiT
architecture. To tackle these challenges, we conduct an investigation into the
correlation between DiT blocks and image generation. Our findings reveal that
the front blocks of DiT are associated with the outline of the generated
images, while the rear blocks are linked to the details. Based on this insight,
we propose an overall training-free inference acceleration framework
$\Delta$-DiT: using a designed cache mechanism to accelerate the rear DiT
blocks in the early sampling stages and the front DiT blocks in the later
stages. Specifically, a DiT-specific cache mechanism called $\Delta$-Cache is
proposed, which considers the inputs of the previous sampling image and reduces
the bias in the inference. Extensive experiments on PIXART-$\alpha$ and DiT-XL
demonstrate that the $\Delta$-DiT can achieve a $1.6\times$ speedup on the
20-step generation and even improves performance in most cases. In the scenario
of 4-step consistency model generation and the more challenging $1.12\times$
acceleration, our method significantly outperforms existing methods. Our code
will be publicly available. |
This paper introduces Δ-DiT, a training-free inference acceleration method for diffusion transformers (DiT) that leverages a novel cache mechanism called Δ-Cache. |
Existing diffusion model acceleration techniques primarily focus on UNet architectures, while DiT models lack dedicated acceleration frameworks despite their success. |
The paper first analyzes challenges in applying existing cache methods to DiT and proposes Δ-Cache, which caches feature map deviations instead of the maps themselves to preserve information. It then investigates the impact of DiT blocks on generation, finding that front blocks contribute to outlines while rear blocks contribute to details. Δ-DiT leverages this by caching rear blocks in early sampling stages (outline generation) and front blocks in later stages (detail generation). |
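The idea of caching a deviation rather than a feature map can be sketched as follows. This is a simplified illustration under stated assumptions, not the Δ-DiT implementation: the class name `DeltaCache`, the `refresh` flag, and the toy blocks are hypothetical. On a refresh step the span of blocks is run in full and the difference between its output and input is stored; on a cached step the stored difference is added to the current input, so the result still depends on the current input.

```python
import numpy as np


class DeltaCache:
    """Cache the deviation (output minus input) of a span of blocks."""

    def __init__(self):
        self.delta = None  # deviation stored at the last refresh step

    def run(self, blocks, x, refresh):
        if refresh or self.delta is None:
            # Full computation: run every block and remember the deviation.
            out = x
            for block in blocks:
                out = block(out)
            self.delta = out - x
            return out
        # Cached step: skip the blocks and reuse the stored deviation.
        return x + self.delta


if __name__ == "__main__":
    # Toy "blocks" standing in for DiT blocks (hypothetical).
    blocks = [lambda h: 2.0 * h, lambda h: h + 1.0]
    cache = DeltaCache()
    x_prev = np.array([1.0, 2.0])
    x_curr = np.array([1.5, 2.5])
    print(cache.run(blocks, x_prev, refresh=True))   # exact output
    print(cache.run(blocks, x_curr, refresh=False))  # approximation from the cache
```

Which block span is cached at which sampling stage (rear blocks early, front blocks late) would be decided outside this class.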
Δ-DiT achieves a 1.6x speedup on 20-step generation with comparable or better generation quality compared to baseline models.
Δ-DiT outperforms existing methods in the more challenging setting of 4-step consistency model generation, where the acceleration ratio is 1.12x.
The proposed Δ-Cache method is compatible with various advanced solvers and consistently outperforms baseline methods. |
The exploration of the relationship between DiT blocks and generated images is preliminary and coarse-grained.
Future work could explore more fine-grained search or learning strategies for further improvements. |
diffusion models, transformers, inference acceleration, cache mechanism, image generation |
2406.01069
Report |
UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment |
Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li |
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to
simulate human subjective perception of image visual quality and aesthetic
appeal. Existing methods typically address these tasks independently due to
distinct learning objectives. However, they neglect the underlying
interconnectedness of both tasks, which hinders the learning of task-agnostic
shared representations for human subjective perception. To confront this
challenge, we propose Unified vision-language pre-training of Quality and
Aesthetics (UniQA), to learn general perceptions of the two tasks, thereby
benefiting them simultaneously. Addressing the absence of text in the IQA
datasets and the presence of textual noise in the IAA datasets, (1) we utilize
multimodal large language models (MLLMs) to generate high-quality text
descriptions; (2) the generated text for IAA serves as metadata to purify noisy
IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we
further propose a lightweight adapter that utilizes versatile cues to fully
exploit the extensive knowledge of the pre-trained model. Extensive experiments
demonstrate that our approach attains new state-of-the-art performance on
both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and
few-label image assessment capabilities. The source code will be available at
https://github.com/zht8506/UniQA. |
This paper proposes UniQA, a novel method for unified vision-language pre-training for both Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) tasks. |
Existing methods often address IQA and IAA independently, neglecting the interconnectedness of human perception of image quality and aesthetics. This unified approach aims to learn generalizable representations for both tasks, enhancing their effectiveness and efficiency. |
The proposed UniQA utilizes Multimodal Large Language Models (MLLMs) to generate quality- and aesthetics-related descriptions for IQA and IAA datasets. It leverages these descriptions to pre-train a vision-language model and introduces a lightweight Multi-Cue Integration Adapter to fine-tune the pre-trained model on specific IQA and IAA datasets. |
UniQA achieves state-of-the-art performance on multiple benchmark datasets for both IQA and IAA tasks.
The model demonstrates excellent zero-shot and few-label image assessment capabilities, indicating its strong generalization ability and data efficiency.
Qualitative results showcasing the model's ability to retrieve images based on quality- and aesthetics-related queries provide further evidence of its effectiveness. |
Captions generated by MLLMs often have similar structures, potentially limiting the diversity and richness of representations learned during pre-training.
Exploring methods to enhance the diversity of MLLM-generated captions, such as integrating multiple MLLMs or using in-context learning, is an important direction for future research. |
image quality assessment, image aesthetic assessment, vision-language pre-training, multimodal large language models, zero-shot learning |
2406.01062
Report |
SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models |
Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, Praveen Krishnan |
While diffusion models have significantly advanced the quality of image
generation, their capability to accurately and coherently render text within
these images remains a substantial challenge. Conventional diffusion-based
methods for scene text generation are typically limited by their reliance on an
intermediate layout output. This dependency often results in a constrained
diversity of text styles and fonts, an inherent limitation stemming from the
deterministic nature of the layout generation phase. To address these
challenges, this paper introduces SceneTextGen, a novel diffusion-based model
specifically designed to circumvent the need for a predefined layout stage. By
doing so, SceneTextGen facilitates a more natural and varied representation of
text. The novelty of SceneTextGen lies in its integration of three key
components: a character-level encoder for capturing detailed typographic
properties, coupled with a character-level instance segmentation model and a
word-level spotting model to address the issues of unwanted text generation and
minor character inaccuracies. We validate the performance of our method by
demonstrating improved character recognition rates on generated images across
different public visual text datasets in comparison to both standard
diffusion-based methods and text-specific methods. |
Introduces SceneTextGen, a novel diffusion-based model for scene text generation that surpasses the limitations of predefined layouts, allowing flexible text placement and diverse text styles. |
Current diffusion models struggle to generate text within images that is both visually appealing and contextually relevant. Existing methods are limited by predefined layouts, restricting diversity in font styles and text positioning. |
SceneTextGen utilizes a character-level encoder to capture typographic properties and integrates this information into the cross-attention layers of a diffusion model. It also employs word-level and character-level losses to ensure text accuracy and prevent excessive text generation. |
SceneTextGen outperforms existing models in OCR-based text recognition scores, indicating its ability to generate clear and accurate text.
SceneTextGen demonstrates superior diversity in font styles compared to methods relying on predefined layouts.
The model shows strong generalization capability, achieving robust OCR scores on datasets beyond its training data. |
SceneTextGen faces challenges in generating complex visual elements in conjunction with text.
The model's performance in terms of text accuracy and coherence decreases with increasing text length. |
text generation, image generation, diffusion models, scene text, computer vision |
2406.01042
Report |
Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting |
Fang Li, Hao Zhang, Narendra Ahuja |
Gaussian Splatting (GS) has significantly elevated scene reconstruction
efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance
Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS
methods, whether based on GS or NeRF, primarily rely on camera parameters
provided by COLMAP and even utilize sparse point clouds generated by COLMAP for
initialization, which lack accuracy and are time-consuming. This sometimes
results in poor dynamic scene representation, especially in scenes with large
object movements or extreme camera conditions, e.g., small translations combined
with large rotations. Some studies simultaneously optimize the estimation of
camera parameters and scenes, supervised by additional information like depth,
optical flow, etc. obtained from off-the-shelf models. Using this unverified
information as ground truth can reduce robustness and accuracy, a problem that
frequently occurs for long monocular videos (e.g., hundreds of frames or more). We
propose a novel approach that learns a high-fidelity 4D GS scene representation
with self-calibration of camera parameters. It includes the extraction of 2D
point features that robustly represent 3D structure, and their use for
subsequent joint optimization of camera parameters and 3D structure towards
overall 4D scene optimization. We demonstrate the accuracy and time efficiency
of our method through extensive quantitative and qualitative experimental
results on several standard benchmarks. The results show significant
improvements over state-of-the-art methods for 4D novel view synthesis. The
source code will be released soon at https://github.com/fangli333/SC-4DGS. |
This paper proposes SC-4DGS, a novel method for high-fidelity 4D novel view synthesis of dynamic scenes using Gaussian Splatting with self-calibrated camera parameters, eliminating the need for camera priors and handling videos of varying lengths. |
Current 4D NVS methods often rely on external camera parameter estimation tools like COLMAP, which can be inaccurate and time-consuming, especially for dynamic scenes with large object movements or complex camera trajectories. SC-4DGS addresses these limitations by jointly optimizing camera parameters and scene representation. |
The method employs a three-step process: 1) Structural Point Extraction (SPE) to establish 2D-3D correspondences of structural points across frames. 2) Joint optimization of camera parameters and 3D structural points using extracted 2D points and their correspondence. 3) Optimization of dynamic scene representation using a Canonical Field and a Deformation Field, initialized with the optimized 3D structural points. |
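Step 2 jointly optimizes camera parameters and 3D structural points from 2D observations. A common objective for this kind of joint optimization is the reprojection error under a pinhole camera; whether SC-4DGS uses exactly this loss is an assumption here. The sketch below uses a single shared focal length, and all function and parameter names are hypothetical.

```python
import numpy as np


def project(points_world, rotation, translation, focal, principal_point):
    """Project Nx3 world points to Nx2 pixels with a pinhole camera."""
    # World -> camera coordinates.
    cam = points_world @ rotation.T + translation
    # Perspective division, then scale by focal length and shift to the image centre.
    return focal * cam[:, :2] / cam[:, 2:3] + principal_point


def reprojection_error(points_world, observed_uv, rotation, translation,
                       focal, principal_point):
    """Mean squared pixel distance between projected and observed 2D points."""
    residual = project(points_world, rotation, translation,
                       focal, principal_point) - observed_uv
    return float(np.mean(np.sum(residual ** 2, axis=1)))


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
    R, t = np.eye(3), np.zeros(3)
    pp = np.array([50.0, 50.0])
    obs = project(pts, R, t, 100.0, pp)
    print(obs)
    print(reprojection_error(pts, obs, R, t, 100.0, pp))
```

In a joint optimization, this scalar would be minimized over the per-frame rotations and translations, the shared focal length, and the 3D points at once.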
SC-4DGS achieves comparable or superior novel view synthesis quality to state-of-the-art methods on benchmark datasets like NeRF-DS and DAVIS.
The proposed method demonstrates more robust and accurate camera parameter estimation compared to COLMAP and RoDynRF, especially in scenes with extreme camera motions.
SC-4DGS efficiently handles long monocular videos, overcoming limitations of existing methods like RoDynRF. |
The method currently assumes a constant focal length throughout the video, limiting its applicability to scenarios with zoom effects.
Reliance on ground truth motion masks as input poses challenges for scenes with complex, high-speed fluid motion, suggesting future work in automatic motion segmentation. |
novel view synthesis, gaussian splatting, self-calibration, dynamic scene reconstruction, camera parameter estimation |
2406.01020
Report |
CLIP-Guided Attribute Aware Pretraining for Generalizable Image Quality Assessment |
Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim |
In no-reference image quality assessment (NR-IQA), the challenge of limited
dataset sizes hampers the development of robust and generalizable models.
Conventional methods address this issue by utilizing large datasets to extract
rich representations for IQA. Some approaches also propose IQA based on
vision-language models (VLMs), but the domain gap between generic VLMs and IQA
constrains their scalability. In this work, we propose a novel pretraining
framework that constructs a generalizable representation for IQA by selectively
extracting quality-related knowledge from VLM and leveraging the scalability of
large datasets. Specifically, we carefully select optimal text prompts for five
representative image quality attributes and use VLM to generate pseudo-labels.
Numerous attribute-aware pseudo-labels can be generated with large image
datasets, allowing our IQA model to learn rich representations about image
quality. Our approach achieves state-of-the-art performance on multiple IQA
datasets and exhibits remarkable generalization capabilities. Leveraging these
strengths, we propose several applications, such as evaluating image generation
models and training image enhancement models, demonstrating our model's
real-world applicability. We will make the code available for access. |
Presents ATTIQA, a novel pretraining framework for IQA that leverages CLIP's knowledge and large datasets to construct a generalizable and robust attribute-aware representation space. |
Addresses the limitations of traditional IQA methods, which suffer from small dataset sizes and poor generalization abilities, by effectively integrating VLM (CLIP) and large-scale data pretraining for enhanced IQA. |
Employs a two-stage approach: 1) Prompt Selection: Utilizes GPT-4 to generate candidate prompts and selects optimal prompts for 5 key attributes via proxy tasks measuring distortion intensity and human perception alignment. 2) Pretraining Pipeline: Generates attribute-aware pseudo-labels using CLIP with selected prompts on a large dataset (ImageNet) and trains the IQA model with a ranking-based loss for enhanced robustness. |
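The summary states that the IQA model is trained on pseudo-labels with a ranking-based loss but does not give its form. One plausible instance is a pairwise margin ranking loss, sketched below as an assumption: the function name, the margin value, and the all-pairs construction are illustrative and not taken from the paper.

```python
import numpy as np


def pairwise_margin_ranking_loss(pred, pseudo, margin=0.1):
    """Penalise image pairs whose predicted order disagrees with the pseudo-labels."""
    pred = np.asarray(pred, dtype=float)
    pseudo = np.asarray(pseudo, dtype=float)
    # sign[i, j] is +1 if image i should score above image j, -1 if below, 0 if tied.
    sign = np.sign(pseudo[:, None] - pseudo[None, :])
    diff = pred[:, None] - pred[None, :]
    # Hinge on the signed score gap; zero once the gap exceeds the margin.
    losses = np.maximum(0.0, margin - sign * diff)
    mask = sign != 0  # ignore ties and self-pairs
    if not mask.any():
        return 0.0
    return float(losses[mask].mean())


if __name__ == "__main__":
    # Predictions that agree with the pseudo-label order incur no loss.
    print(pairwise_margin_ranking_loss([0.9, 0.1], [1.0, 0.0]))
    # Reversed predictions are penalised.
    print(pairwise_margin_ranking_loss([0.1, 0.9], [1.0, 0.0]))
```

A ranking objective only uses the order of the pseudo-labels, not their absolute values, which is one reason it can be more tolerant of noisy labels than direct regression.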
Achieves state-of-the-art performance on multiple IQA and aesthetic quality datasets, demonstrating significant improvements over existing methods.
Exhibits superior generalization capabilities in cross-dataset validation, indicating robustness and adaptability to unseen data.
Successfully applied as a metric for evaluating generative models and guiding image enhancement through reinforcement learning, highlighting its practical value in real-world scenarios. |
Current attribute focus is limited to five common attributes, potentially overlooking other relevant image quality factors.
Future work will explore expanding the representation space to incorporate additional attributes and further enhance the model's comprehensiveness. |
image quality assessment, vision language model, clip, pretraining, generalization |
2406.00985
Report |
MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models |
Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, Siwei Lyu |
Text-driven image synthesis has made significant advancements with the
development of diffusion models, transforming how visual content is generated
from text prompts. Despite these advances, text-driven image editing, a key
area in computer graphics, faces unique challenges. A major challenge is making
simultaneous edits across multiple objects or attributes. Applying existing
editing methods sequentially for multi-aspect edits increases computational
demands and reduces efficiency. In this paper, we address these challenges with significant
contributions. Our main contribution is the development of MultiEdits, a method
that seamlessly manages simultaneous edits across multiple attributes. In
contrast to previous approaches, MultiEdits not only preserves the quality of
single attribute edits but also significantly improves the performance of
multitasking edits. This is achieved through an innovative attention
distribution mechanism and a multi-branch design that operates across several
processing heads. Additionally, we introduce the PIE-Bench++ dataset, an
expansion of the original PIE-Bench dataset, to better support evaluating
image-editing tasks involving multiple objects and attributes simultaneously.
This dataset is a benchmark for evaluating text-driven image editing methods in
multifaceted scenarios. Dataset and code are available at
https://mingzhenhuang.com/projects/MultiEdits.html. |
Introduces MultiEdits, a method for text-driven image editing that efficiently handles simultaneous edits across multiple attributes. |
Addresses the limitations of existing methods that struggle with multi-aspect editing due to computational overhead and error accumulation in sequential applications. |
Utilizes an attention grouping mechanism to categorize edits, employs multiple target branches for parallel processing, and leverages cross-branch interactions for consistency. |
Outperforms state-of-the-art methods in terms of editing effectiveness and efficiency on the introduced PIE-Bench++ dataset.
Demonstrates robustness across varying numbers of editing aspects.
Maintains content and background preservation during multi-aspect editing. |
Limitations in handling text editing within images and dramatic background changes.
Future work includes exploring the semantic order of edits and addressing the limitations above. |
text-driven image editing, multi-aspect editing, diffusion models, attention mechanism, pie-bench++ dataset |
2406.00908
Report |
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation |
Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He |
Video generation has made remarkable progress in recent years, especially
since the advent of the video diffusion models. Many video generation models
can produce plausible synthetic videos, e.g., Stable Video Diffusion (SVD).
However, most video models can only generate low frame rate videos due to the
limited GPU memory as well as the difficulty of modeling a large set of frames.
The training videos are always uniformly sampled at a specified interval for
temporal compression. Previous methods increase the frame rate by either
training a video interpolation model in pixel space as a postprocessing stage
or training an interpolation model in latent space for a specific base video
model. In this paper, we propose a training-free video interpolation method for
generative video diffusion models, which is generalizable to different models
in a plug-and-play manner. We investigate the non-linearity in the feature
space of video diffusion models and transform a video model into a
self-cascaded video diffusion model by incorporating the designed hidden
state correction modules. The self-cascaded architecture and the correction
module are proposed to retain the temporal consistency between key frames and
the interpolated frames. Extensive evaluations are performed on multiple
popular video models to demonstrate the effectiveness of the proposed method;
notably, our training-free method is even comparable to trained
interpolation models supported by huge compute resources and large-scale
datasets. |
This paper introduces a training-free video interpolation method for generative video diffusion models, enhancing their frame rate generation capabilities without requiring additional training data or parameter updates. |
Existing video generation models often produce low frame rate videos due to GPU memory limitations and challenges in modeling long sequences. Current interpolation methods necessitate training or are model-specific, hindering their generalizability. |
The method transforms the target video model into a self-cascaded architecture with hidden state correction modules. These modules refine hidden states within the transformer blocks for improved temporal consistency across generated frames. |
The method generates high frame rate (2x and 4x) videos with superior visual quality and temporal consistency compared to direct inference and latent space back-projection.
ZeroSmooth maintains key frame content effectively during high frame rate generation, evidenced by high PSNR and SSIM scores.
The approach exhibits competitive performance against training-based video interpolation methods while remaining entirely training-free. |
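Key-frame preservation above is reported with PSNR and SSIM. For reference, the standard PSNR definition is 10·log10(MAX² / MSE). The sketch below implements that textbook formula; the value range and the choice of which frames to compare in the paper's evaluation are not stated here, so `max_value=1.0` is an assumption.

```python
import numpy as np


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally shaped frames."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0.0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


if __name__ == "__main__":
    key_frame = np.zeros((4, 4))
    regenerated = np.full((4, 4), 0.1)
    print(psnr(key_frame, regenerated))
```

Higher values mean the regenerated key frame is closer to the original.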
The interpolation performance heavily depends on the quality and consistency of the base video model's generated frames.
Future work could explore extending this method to handle variable frame rate interpolation for more flexible video generation. |
video generation, video interpolation, diffusion models, training-free, self-cascaded architecture |
2406.00830
Report |
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection |
Yang Cao, Yihan Zeng, Hang Xu, Dan Xu |
Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of
objects from an arbitrary list of novel categories in 3D scenes, which remains
a very challenging problem. In this work, we propose CoDAv2, a unified
framework designed to innovatively tackle both the localization and
classification of novel 3D objects, under the condition of limited base
categories. For localization, the proposed 3D Novel Object Discovery (3D-NOD)
strategy utilizes 3D geometries and 2D open-vocabulary semantic priors to
discover pseudo labels for novel objects during training. 3D-NOD is further
extended with an Enrichment strategy that significantly enriches the novel
object distribution in the training scenes, and then enhances the model's
ability to localize more novel objects. The 3D-NOD with Enrichment is termed
3D-NODE. For classification, the Discovery-driven Cross-modal Alignment (DCMA)
module aligns features from 3D point clouds and 2D/textual modalities,
employing both class-agnostic and class-specific alignments that are
iteratively refined to handle the expanding vocabulary of objects. In addition,
2D box guidance boosts classification accuracy against complex background
noise; the resulting module is coined Box-DCMA. Extensive evaluation demonstrates the
superiority of CoDAv2. CoDAv2 outperforms the best-performing method by a large
margin (AP_Novel of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2).
Source code and pre-trained models are available at the GitHub project page. |
This paper presents CoDAv2, an open-vocabulary 3D object detection framework that can localize and classify novel 3D objects by learning from limited base categories. |
Open-Vocabulary 3D Object Detection (OV-3DDet) is crucial for real-world applications where novel object categories are frequently encountered. |
CoDAv2 employs a 3D Novel Object Discovery with Enrichment (3D-NODE) strategy for localization and a Discovery-driven Cross-Modal Alignment with box guidance (Box-DCMA) module for classification. |
CoDAv2 significantly outperforms previous state-of-the-art methods, achieving AP_Novel scores of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2.
3D-NODE effectively discovers novel objects during training by leveraging 3D geometry and 2D semantic priors, leading to improved localization.
Box-DCMA aligns 3D features with 2D/textual features from CLIP, enhancing classification accuracy and effectively discriminating against background noise. |
The open-vocabulary ability decreases when tested with a large number of novel categories due to the limitations of using only point cloud data.
Future work may explore incorporating multi-modality inputs to enhance performance on larger vocabularies. |
open-vocabulary 3d object detection, 3d novel object discovery, cross-modal alignment, 3d perception, multi-modality learning |
2406.00687
Report |
Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors |
Ohad Rahamim, Hilit Segev, Idan Achituve, Yuval Atzmon, Yoni Kasten, Gal Chechik |
Generating 3D visual scenes is at the forefront of visual generative AI, but
current 3D generation techniques struggle with generating scenes with multiple
high-resolution objects. Here we introduce Lay-A-Scene, which solves the task
of Open-set 3D Object Arrangement, effectively arranging unseen objects. Given
a set of 3D objects, the task is to find a plausible arrangement of these
objects in a scene. We address this task by leveraging pre-trained
text-to-image models. We personalize the model and explain how to generate
images of a scene that contains multiple predefined objects without neglecting
any of them. Then, we describe how to infer the 3D poses and arrangement of
objects from a 2D generated image by finding a consistent projection of objects
onto the 2D scene. We evaluate the quality of Lay-A-Scene using 3D objects from
Objaverse and human raters and find that it often generates coherent and
feasible 3D object arrangements. |
Lay-A-Scene is a novel method for open-set 3D object arrangement, leveraging pre-trained text-to-image diffusion models to arrange unseen 3D objects into plausible scenes based on textual descriptions. |
Generating 3D scenes with multiple, high-resolution objects is a challenging problem in visual generative AI. Existing methods often struggle with object neglect and generating coherent layouts. Lay-A-Scene addresses these challenges by leveraging the rich spatial understanding of text-to-image models. |
Lay-A-Scene uses a two-stage approach: 1) **Personalized Image Generation:** Fine-tunes a pre-trained text-to-image model with rendered views of the input objects to generate a scene image incorporating them. 2) **Transformation Optimization:** Infers 3D object poses from the scene image using a novel method that combines Perspective-n-Points with physical constraints to find plausible object placements. |
Outperforms baseline methods in terms of FID, KID, and CLIP similarity scores, indicating more realistic and textually-aligned scene generation.
Human raters significantly prefer Lay-A-Scene-generated layouts over random and circular arrangements, demonstrating its ability to create more plausible and aesthetically pleasing scenes.
Ablation studies highlight the importance of both the personalization stage and the constrained Perspective-n-Points method in achieving high-quality results. |
Limited to arranging objects and does not generate scene context.
Performance depends on the underlying text-to-image personalization method, which can suffer from object neglect, particularly with a large number of objects. |
3d scene synthesis, text-to-image generation, object arrangement, personalization, perspective-n-points |
2406.00670
Report |
Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation |
Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng |
Pre-trained vision-language models, e.g., CLIP, have been successfully
applied to zero-shot semantic segmentation. Existing CLIP-based approaches
primarily utilize visual features from the last layer to align with text
embeddings, while they neglect the crucial information in intermediate layers
that contain rich object details. However, we find that directly aggregating
the multi-level visual features weakens the zero-shot ability for novel
classes. The large differences between the visual features from different
layers make these features hard to align well with the text embeddings. We
resolve this problem by introducing a series of independent decoders to align
the multi-level visual features with the text embeddings in a cascaded way,
forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is
flexible and can be easily applied to existing zero-shot semantic segmentation
methods. Experimental results show that our simple Cascade-CLIP achieves
superior zero-shot performance on segmentation benchmarks, like COCO-Stuff,
Pascal-VOC, and Pascal-Context. Our code is available at:
https://github.com/HVision-NKU/Cascade-CLIP |
The paper proposes Cascade-CLIP, a cascaded vision-language embedding alignment framework, for zero-shot semantic segmentation using multi-level features from pre-trained CLIP models. |
Existing CLIP-based methods for zero-shot semantic segmentation primarily use features from the last layer, neglecting the rich object details present in intermediate layers. Directly aggregating multi-level features degrades performance due to large feature discrepancies between layers, weakening CLIP's zero-shot capability. |
Cascade-CLIP splits the CLIP visual encoder into stages and aligns multi-level visual features with text embeddings using cascaded decoders. It employs a Neighborhood Gaussian Aggregation (NGA) module to fuse multi-level features within each stage, assigning weights based on feature block proximity. |
Cascade-CLIP significantly improves zero-shot segmentation performance on COCO-Stuff, Pascal-VOC, and Pascal-Context datasets.
It effectively captures object details and boundaries by leveraging multi-level features, outperforming methods relying solely on last-layer features.
The cascaded alignment with independent decoders and NGA module effectively addresses the challenge of feature discrepancies between different layers in CLIP. |
The performance improvement with increasing cascaded decoders plateaus after a certain point.
Further exploration of optimal stage splitting and feature aggregation strategies within Cascade-CLIP is possible. |
zero-shot learning, semantic segmentation, vision-language models, clip, multi-level features |
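To illustrate the Neighborhood Gaussian Aggregation idea in the Cascade-CLIP entry above, here is a minimal sketch rather than the authors' implementation. It assumes that the weight of each block falls off as a Gaussian of its index distance to the last block of the stage. The function names, the `sigma` parameter, and the use of fixed (non-learned) weights are hypothetical choices made for illustration.

```python
import math


def neighborhood_gaussian_weights(num_blocks, sigma=1.0):
    """Normalized weights for blocks 0..num_blocks-1.

    Assumption: the last block of the stage is the reference, and closer
    blocks receive larger weights.
    """
    ref = num_blocks - 1
    raw = [math.exp(-((i - ref) ** 2) / (2.0 * sigma ** 2)) for i in range(num_blocks)]
    total = sum(raw)
    return [w / total for w in raw]


def aggregate_stage(block_features, sigma=1.0):
    """Weighted sum of per-block feature vectors (each a list of floats)."""
    weights = neighborhood_gaussian_weights(len(block_features), sigma)
    dim = len(block_features[0])
    return [sum(w * f[d] for w, f in zip(weights, block_features)) for d in range(dim)]
```

Each stage would produce one such aggregated feature, which its own decoder then aligns with the text embeddings.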
2406.00637
Report |
Representing Animatable Avatar via Factorized Neural Fields |
Chunjin Song, Zhijie Wu, Bastian Wandt, Leonid Sigal, Helge Rhodin |
For reconstructing high-fidelity human 3D models from monocular videos, it is
crucial to maintain consistent large-scale body shapes along with finely
matched subtle wrinkles. This paper explores the observation that the per-frame
rendering results can be factorized into a pose-independent component and a
corresponding pose-dependent equivalent to facilitate frame consistency.
Pose-adaptive textures can be further improved by restricting frequency bands of
these two components. In detail, pose-independent outputs are expected to be
low-frequency, while high-frequency information is linked to pose-dependent
factors. We achieve a coherent preservation of both coarse body contours across
the entire input video and fine-grained texture features that are time-variant
with a dual-branch network with distinct frequency components. The first branch
takes coordinates in canonical space as input, while the second branch
additionally considers features outputted by the first branch and pose
information of each frame. Our network integrates the information predicted by
both branches and utilizes volume rendering to generate photo-realistic 3D
human images. Through experiments, we demonstrate that our network surpasses
the neural radiance fields (NeRF) based state-of-the-art methods in preserving
high-frequency details and ensuring consistent body contours. |
This paper introduces a novel two-branch neural network that factorizes animatable avatar rendering into pose-independent and pose-dependent components, associating them with low and high frequencies respectively, to improve avatar representation learning from monocular videos. |
Reconstructing high-fidelity human avatars from monocular videos requires preserving both consistent large-scale body shapes and fine-grained, time-variant details like wrinkles, which is challenging for existing methods. |
The method uses skeletal deformation to obtain canonical coordinates and employs a dual-branch network. One branch processes low-frequency pose-independent information, while the other handles high-frequency pose-dependent details. A common loss function encourages information maximization in the pose-independent branch. The final output merges both branches' results and uses SDF-based volume rendering to generate images. |
The method outperforms state-of-the-art approaches in novel view synthesis, demonstrating superior texture detail and shape preservation.
It exhibits significant improvement in novel pose rendering, generating more realistic and artifact-free results, particularly for challenging unseen poses.
The approach excels in 3D shape reconstruction, capturing both smooth body surfaces and intricate geometric details like wrinkles more effectively. |
The model's reliance on dense MLP computations within the volume rendering framework poses limitations on real-time applications.
The current framework lacks explicit pattern editing capabilities. |
avatar representation learning, neural rendering, monocular human reconstruction, frequency-aware factorization, signed distance function (sdf) |
2406.00633
Report |
Improving GFlowNets for Text-to-Image Diffusion Alignment |
Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Josh Susskind, Navdeep Jaitly, Shuangfei Zhai |
Diffusion models have become the de facto approach for generating
visual data, which are trained to match the distribution of the training
dataset. In addition, we also want to control generation to fulfill desired
properties such as alignment to a text description, which can be specified with
a black-box reward function. Prior works fine-tune pretrained diffusion models
to achieve this goal through reinforcement learning-based algorithms.
Nonetheless, they suffer from issues including slow credit assignment as well
as low quality in their generated samples. In this work, we explore techniques
that do not directly maximize the reward but rather generate high-reward images
with relatively high probability -- a natural scenario for the framework of
generative flow networks (GFlowNets). To this end, we propose the
Diffusion Alignment with GFlowNet (DAG) algorithm to
post-train diffusion models with black-box property functions. Extensive
experiments on Stable Diffusion and various reward specifications corroborate
that our method could effectively align large-scale text-to-image diffusion
models with given reward information. |
Presents DAG, a novel GFlowNet-based algorithm, for post-training text-to-image diffusion models to optimize black-box reward functions, improving large-scale text-to-image alignment. |
Addresses the limitation of traditional diffusion models in controlling generation towards outputs with specific, desirable properties defined by reward functions, crucial in fields like drug discovery. |
Leverages GFlowNets to train generative models to produce objects with probability proportional to a reward function, proposing both a detailed-balance (DB) objective and a novel KL-based objective with REINFORCE gradient. |
DAG effectively incorporates reward characteristics into generated images, improving aesthetics, compressibility, and text-image alignment.
Both DAG-DB and DAG-KL demonstrate significantly faster credit assignment than the DDPO baseline across various reward functions.
Qualitative analysis showcases DAG's ability to gradually improve alignment over training, handling complex concepts and relationships better than the baseline. |
The current implementation uses single-step transitions due to GPU memory constraints, limiting exploration of more sophisticated GFlowNet objectives.
Future work could explore using DAG for posterior approximate inference, treating the reward function as likelihood information. |
diffusion models, text-to-image synthesis, gflownets, reinforcement learning, generative ai |
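The detailed-balance (DB) condition behind the DB-based objective in the DAG entry above can be sketched as follows. This is the generic GFlowNet form for a single transition, not the paper's exact loss. The mapping to diffusion stated in the comments (states as noisy images, forward policy as the denoising step, backward policy as the noising step) is an assumption, and the function name is hypothetical.

```python
import math


def detailed_balance_loss(log_flow_s, log_pf, log_flow_next, log_pb):
    """Squared log-space residual of detailed balance for one transition s -> s'.

    Detailed balance requires F(s) * P_F(s' | s) = F(s') * P_B(s | s').
    Assumed diffusion reading: s is a noisy image x_t, s' is x_{t-1},
    P_F is the denoising step, P_B is the noising step, and the flow of a
    terminal state is tied to the reward.
    """
    residual = (log_flow_s + log_pf) - (log_flow_next + log_pb)
    return residual ** 2


# A balanced transition: 2 * 0.5 == 4 * 0.25, so the loss is (numerically) zero.
balanced = detailed_balance_loss(math.log(2.0), math.log(0.5), math.log(4.0), math.log(0.25))
```

Because the loss is defined per transition, it can be applied to single denoising steps, which fits the single-step setting listed under the limitations.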
2406.00609
Report |
SuperGaussian: Repurposing Video Models for 3D Super Resolution |
Yuan Shen, Duygu Ceylan, Paul Guerrero, Zexiang Xu, Niloy J. Mitra, Shenlong Wang, Anna Frühstück |
We present a simple, modular, and generic method that upsamples coarse 3D
models by adding geometric and appearance details. While generative 3D models
now exist, they do not yet match the quality of their counterparts in image and
video domains. We demonstrate that it is possible to directly repurpose
existing (pretrained) video models for 3D super-resolution and thus sidestep
the problem of the shortage of large repositories of high-quality 3D training
models. We describe how to repurpose video upsampling models, which are not 3D
consistent, and combine them with 3D consolidation to produce 3D-consistent
results. As output, we produce high quality Gaussian Splat models, which are
object centric and effective. Our method is category agnostic and can be easily
incorporated into existing 3D workflows. We evaluate our proposed SuperGaussian
on a variety of 3D inputs, which are diverse both in terms of complexity and
representation (e.g., Gaussian Splats or NeRFs), and demonstrate that our
simple method significantly improves the fidelity of the final 3D models. Check
our project website for details: supergaussian.github.io |
Presents SuperGaussian, a simple and generic method that leverages pre-trained video upsampling models to perform 3D super-resolution, enhancing the resolution and detail of coarse 3D models in a category-agnostic manner. |
Current generative 3D models lag behind their image and video counterparts in quality due to limitations in 3D representation and the availability of large, high-quality 3D training datasets. This method overcomes these challenges by repurposing readily available video models. |
Renders a video of the coarse 3D input from multiple viewpoints, upsamples the video using a pre-trained video upsampler (optionally fine-tuned on 3D data), and reconstructs a 3D-consistent output in the form of Gaussian Splats. |
Demonstrates superior performance over image-based upsampling methods both qualitatively and quantitatively.
Successfully upsamples diverse 3D inputs, including Gaussian Splats, NeRFs, low-poly meshes, and noisy 3D reconstructions.
Shows improved performance after fine-tuning the video upsampler on a dataset of low-resolution Gaussian Splats. |
Limited by the generalization and inference speed of pre-trained video models.
Unable to recover missing or occluded parts in the input 3D model, requiring sufficient viewpoint coverage. |
3d super-resolution, video upsampling, category-agnostic, 3d generation, gaussian splatting |
2406.00598
Report |
Efficient Neural Light Fields (ENeLF) for Mobile Devices |
Austin Peng |
Novel view synthesis (NVS) is a challenge in computer vision and graphics,
focusing on generating realistic images of a scene from unobserved camera
poses, given a limited set of authentic input images. Neural radiance fields
(NeRF) achieved impressive results in rendering quality by utilizing volumetric
rendering. However, NeRF and its variants are unsuitable for mobile devices due
to the high computational cost of volumetric rendering. Emerging research in
neural light fields (NeLF) eliminates the need for volumetric rendering by
directly learning a mapping from ray representation to pixel color. NeLF has
demonstrated its capability to achieve results similar to NeRF but requires a
more extensive, computationally intensive network that is not mobile-friendly.
Unlike existing works, this research builds upon the novel network architecture
introduced by MobileR2L and aggressively applies a compression technique
(channel-wise structure pruning) to produce a model that runs efficiently on
mobile devices with lower latency and smaller sizes, with a slight decrease in
performance. |
ENeLF compresses a neural light field (NeLF) network to enable real-time novel view synthesis on mobile devices, sacrificing minimal performance for efficiency. |
NeRF methods are computationally expensive, hindering mobile deployment. NeLFs, while faster, still present challenges in model size and latency. ENeLF addresses these limitations. |
ENeLF leverages MobileR2L's efficient CNN backbone and super-resolution modules, incorporating channel-wise structure pruning and reordering BN and CONV layers for compression. |
ENeLF achieves significant reductions in model parameters, FLOPs, and size compared to MobileR2L.
It maintains competitive performance, with only slightly worse PSNR, SSIM, and LPIPS scores.
Pruning enables faster inference speeds, suitable for mobile devices. |
Training time for ENeLF remains high due to data distillation.
The pruned model exhibits some loss of detail rendering, particularly fine-grained features. |
novel view synthesis, neural light field, pruning, mobile devices, real-time rendering |
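The channel-wise pruning mentioned in the ENeLF entry above can be illustrated with a small sketch. The source does not state the ranking criterion, so the L1-norm ranking used here is an assumption, as are the function names and the `keep_ratio` parameter. Layers are reduced to plain nested lists to keep the example self-contained.

```python
def channel_l1_norms(weights):
    """weights[out_channel] is a flat list of that filter's coefficients."""
    return [sum(abs(w) for w in filt) for filt in weights]


def select_channels(weights, keep_ratio):
    """Indices of the output channels to keep, in their original order."""
    norms = channel_l1_norms(weights)
    num_keep = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda c: norms[c], reverse=True)
    return sorted(ranked[:num_keep])


def prune_layer(weights, next_weights, keep_ratio):
    """Drop weak output channels and the matching input channels of the next layer.

    next_weights[out_channel][in_channel] is a scalar (1x1 convolution) for brevity.
    """
    keep = select_channels(weights, keep_ratio)
    pruned = [weights[c] for c in keep]
    pruned_next = [[row[c] for c in keep] for row in next_weights]
    return pruned, pruned_next, keep
```

Removing whole channels shrinks both the current layer and the input side of the next one, which is why structured pruning reduces parameters and FLOPs without needing sparse kernels.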
2406.00508
Report |
FlowIE: Efficient Image Enhancement via Rectified Flow |
Yixuan Zhu, Wenliang Zhao, Ao Li, Yansong Tang, Jie Zhou, Jiwen Lu |
Image enhancement holds extensive applications in real-world scenarios due to
complex environments and limitations of imaging devices. Conventional methods
are often constrained by their tailored models, resulting in diminished
robustness when confronted with challenging degradation conditions. In
response, we propose FlowIE, a simple yet highly effective flow-based image
enhancement framework that estimates straight-line paths from an elementary
distribution to high-quality images. Unlike previous diffusion-based methods
that suffer from long inference times, FlowIE constructs a linear many-to-one
transport mapping via conditioned rectified flow. The rectification straightens
the trajectories of probability transfer, accelerating inference by an order of
magnitude. This design enables our FlowIE to fully exploit rich knowledge in
the pre-trained diffusion model, rendering it well-suited for various
real-world applications. Moreover, we devise a faster inference algorithm,
inspired by Lagrange's Mean Value Theorem, harnessing midpoint tangent
direction to optimize path estimation, ultimately yielding visually superior
results. Thanks to these designs, our FlowIE adeptly manages a diverse range of
enhancement tasks within a concise sequence of fewer than 5 steps. Our
contributions are rigorously validated through comprehensive experiments on
synthetic and real-world datasets, unveiling the compelling efficacy and
efficiency of our proposed FlowIE. Code is available at
https://github.com/EternalEvan/FlowIE. |
This paper proposes FlowIE, a flow-based image enhancement framework that uses rectified flow to leverage the generative priors of pre-trained diffusion models for fast and high-quality image enhancement. |
Existing image enhancement methods, including predictive, GAN-based, and diffusion-based methods, struggle with either robustness, efficiency, or adaptability. FlowIE addresses these limitations by combining the strengths of pre-trained diffusion models with the efficiency of rectified flow. |
FlowIE employs a conditioned rectified flow model to learn a many-to-one mapping from a simple elementary distribution to clean images. It uses an initial stage model for coarse restoration, a ControlNet branch for guidance, and a mean value sampling technique for accurate path prediction. |
FlowIE achieves state-of-the-art results on blind face restoration, surpassing previous methods on FID and IDS while maintaining competitive scores on other metrics.
FlowIE demonstrates superior performance on blind image super-resolution, achieving high MANIQA scores and exhibiting efficient inference speed comparable to one-step GAN-based methods.
FlowIE shows strong generalization capabilities, effectively extending to tasks like face color enhancement and face inpainting with minimal fine-tuning. |
The performance of FlowIE may be compromised when dealing with images that have undergone extremely severe degradation.
The inference speed could be further enhanced by exploring more efficient sampling strategies or alternative flow-based models. |
image enhancement, rectified flow, diffusion model, generative prior, mean value sampling |
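The few-step sampling idea in the FlowIE entry above can be illustrated with a toy ODE sampler. The sketch below compares plain Euler steps with the classical midpoint rule, which evaluates the velocity at the middle of each step. It is assumed to be close in spirit to the paper's mean value sampling but is not the authors' algorithm, and the toy velocity field and all function names are hypothetical.

```python
def straight_velocity(x, t, target):
    """Toy field whose trajectories are straight lines reaching `target` at t = 1."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_euler(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, h = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def sample_midpoint(x0, velocity, steps):
    """Same integration, but each step follows the velocity at its midpoint."""
    x, h = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * h
        v0 = velocity(x, t)
        x_mid = [xi + 0.5 * h * vi for xi, vi in zip(x, v0)]
        v_mid = velocity(x_mid, t + 0.5 * h)
        x = [xi + h * vi for xi, vi in zip(x, v_mid)]
    return x
```

On a perfectly straight flow both samplers reach the target in a single step. When the learned path is still slightly curved, the midpoint direction reduces the discretization error for the same number of steps.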
2406.00505
Report |
Improving Text Generation on Images with Synthetic Captions |
Jun Young Koh, Sang Hyun Park, Joy Song |
The recent emergence of latent diffusion models such as SDXL and SD 1.5 has
shown significant capability in generating highly detailed and realistic
images. Despite their remarkable ability to produce images, generating accurate
text within images still remains a challenging task. In this paper, we examine
the validity of fine-tuning approaches in generating legible text within the
image. We propose a low-cost approach by leveraging SDXL without any
time-consuming training on large-scale datasets. The proposed strategy employs
a fine-tuning technique that examines the effects of data refinement levels and
synthetic captions. Moreover, our results demonstrate how our small-scale
fine-tuning approach can improve the accuracy of text generation in different
scenarios without the need for additional multimodal encoders. Our experiments
show that with the addition of random letters to our raw dataset, our model's
performance improves in producing well-formed visual text. |
This paper introduces a low-cost fine-tuning approach for SDXL to enhance the generation of legible text within images, leveraging synthetic captions and data refinement. |
Generating accurate text within images remains a challenge for text-to-image diffusion models, hindering their application in tasks demanding clear visual text. |
The study explores fine-tuning SDXL with varying ratios of original data, synthetic captions (random characters and detailed descriptions), and refined captions (manual and automatic). |
Data refinement level significantly impacts the model's ability to render accurate text.
Adding synthetic data with random characters improves performance, especially with large datasets.
Solely relying on synthetic data leads to performance degradation, highlighting the importance of real data. |
Over-reliance on synthetic data can lead to mode collapse, necessitating further investigation into diverse synthetic data.
The model exhibits semantic leakage, struggling to disentangle text content from visual attributes, requiring exploration of dense captions and diverse datasets. |
synthetic data, diffusion models, text generation, image generation, multimodal learning |
2406.00457
Report |
The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP |
Hidir Yesiltepe, Yusuf Dalva, Pinar Yanardag |
Diffusion models have become prominent in creating high-quality images.
However, unlike GAN models celebrated for their ability to edit images in a
disentangled manner, diffusion-based text-to-image models struggle to achieve
the same level of precise attribute manipulation without compromising image
coherence. In this paper, we show that CLIP, which is often used in popular
text-to-image diffusion models such as Stable Diffusion, is capable of
performing disentangled editing in a zero-shot manner. Through both qualitative and quantitative
comparisons with state-of-the-art editing methods, we show that our approach
yields competitive results. This insight may open opportunities for applying
this method to various tasks, including image and video editing, providing a
lightweight and efficient approach for disentangled editing. |
This paper reveals that CLIP, a popular model used in text-to-image diffusion models, can function as a zero-shot image editing tool via its EOS token. |
Diffusion models excel at generating high-quality images but struggle with disentangled editing (changing specific attributes without affecting others) unlike GANs. This work offers a simple, efficient approach for disentangled editing within diffusion models. |
The method leverages the EOS (end-of-sentence) token representation from CLIP's text encoder to modify the source text embedding, guiding the diffusion model to generate an image reflecting the desired attribute change. |
The EOS token method achieves comparable qualitative results to state-of-the-art editing methods like SEGA, Ledits++, and Cycle Diffusion.
It demonstrates effectiveness in various editing tasks, including facial attribute changes, background replacement, and NSFW content moderation.
A user study confirms its competitiveness in terms of edit quality and disentanglement capabilities. |
The method inherits CLIP's biases which might lead to unintended attribute changes.
Further exploration is needed to fully understand and exploit the potential of CLIP's EOS token for image editing. |
diffusion models, image editing, disentangled editing, clip, zero-shot learning |
2406.00449
Report |
Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging |
Jiahua Dong, Hui Yin, Hongliu Li, Wenbo Li, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan |
Deep unfolding methods have made impressive progress in restoring 3D
hyperspectral images (HSIs) from 2D measurements through convolutional neural
networks or Transformers in spectral compressive imaging. However, they cannot
efficiently capture long-range dependencies using global receptive fields,
which significantly limits their performance in HSI reconstruction. Moreover,
these methods may suffer from local context neglect if we directly utilize
Mamba to unfold a 2D feature map as a 1D sequence for modeling global
long-range dependencies. To address these challenges, we propose a novel Dual
Hyperspectral Mamba (DHM) to explore both global long-range dependencies and
local contexts for efficient HSI reconstruction. After learning informative
parameters to estimate degradation patterns of the CASSI system, we use them to
scale the linear projection and provide the noise level for the denoiser (i.e., our
proposed DHM). Specifically, our DHM consists of multiple dual hyperspectral S4
blocks (DHSBs) to restore original HSIs. Particularly, each DHSB contains a
global hyperspectral S4 block (GHSB) to model long-range dependencies across
the entire high-resolution HSIs using global receptive fields, and a local
hyperspectral S4 block (LHSB) to address local context neglect by establishing
structured state-space sequence (S4) models within local windows. Experiments
verify the benefits of our DHM for HSI reconstruction. The source codes and
models will be available at https://github.com/JiahuaDong/DHM. |
This paper presents Dual Hyperspectral Mamba (DHM), a novel deep unfolding method for reconstructing Hyperspectral Images (HSIs) from compressed measurements acquired by a Coded Aperture Snapshot Spectral Imaging (CASSI) system. |
Existing deep unfolding methods for HSI reconstruction struggle to efficiently capture long-range dependencies and often neglect local context, limiting their performance. |
DHM employs a multi-stage unfolding framework. It learns parameters to estimate degradation patterns of the CASSI system, which are used to scale linear projections and provide noise levels for the denoiser. The core of DHM is the Dual Hyperspectral S4 block (DHSB), consisting of a global hyperspectral S4 block (GHSB) to model long-range dependencies with global receptive fields and a local hyperspectral S4 block (LHSB) to address local context neglect by applying S4 models within local windows. |
DHM significantly outperforms state-of-the-art deep unfolding methods for HSI reconstruction in both quantitative and qualitative evaluations.
The method effectively captures both global and local contexts, leading to improved restoration of fine details and reduced artifacts.
DHM achieves superior performance while maintaining lower model complexity and computational cost compared to existing approaches. |
The paper assumes a specific degradation model of the CASSI system, which might limit its generalizability to other compressive imaging systems.
Future work could explore extending DHM to handle different noise models and incorporate other priors for HSI reconstruction. |
hyperspectral image reconstruction, deep unfolding, coded aperture snapshot spectral imaging (cassi), structured state space sequence (s4) models, global and local context modeling |
2406.00448
Report |
Bilateral Guided Radiance Field Processing |
Yuehao Wang, Chaoyi Wang, Bingchen Gong, Tianfan Xue |
Neural Radiance Fields (NeRF) achieves unprecedented performance in novel view
synthesis by exploiting multi-view consistency. When
capturing multiple inputs, image signal processing (ISP) in modern cameras will
independently enhance them, including exposure adjustment, color correction,
local tone mapping, etc. While these processing steps greatly improve image quality,
they often break the multi-view consistency assumption, leading to "floaters"
in the reconstructed radiance fields. To address this concern without
compromising visual aesthetics, we aim to first disentangle the enhancement by
ISP at the NeRF training stage and re-apply user-desired enhancements to the
reconstructed radiance fields at the finishing stage. Furthermore, to make the
re-applied enhancements consistent between novel views, we need to perform
image signal processing in 3D space (i.e. "3D ISP"). For this goal, we adopt
the bilateral grid, a locally-affine model, as a generalized representation of
ISP processing. Specifically, we optimize per-view 3D bilateral grids with
radiance fields to approximate the effects of camera pipelines for each input
view. To achieve user-adjustable 3D finishing, we propose to learn a low-rank
4D bilateral grid from a given single view edit, lifting photo enhancements to
the whole 3D scene. We demonstrate our approach can boost the visual quality of
novel view synthesis by effectively removing floaters and performing
enhancements from user retouching. The source code and our data are available
at: https://bilarfpro.github.io. |
This paper introduces a bilateral guided training and finishing approach for Neural Radiance Fields (NeRF) to address photometric inconsistencies and enable advanced editing. |
Modern camera image signal processing (ISP) introduces inconsistencies across multi-view images, causing artifacts in NeRF reconstructions. This work aims to disentangle and leverage ISP effects for improved quality and editing. |
The authors employ differentiable 3D bilateral grids to approximate per-view ISP enhancements during NeRF training. For finishing, a novel low-rank 4D bilateral grid lifts 2D view edits to the 3D scene. |
The method achieves state-of-the-art novel view synthesis quality on challenging scenes with significant photometric variation.
It effectively removes floaters caused by inconsistent ISP processing across views.
The 4D bilateral grid enables consistent and intuitive 3D scene retouching by lifting 2D editing operations. |
The approach struggles to handle transient objects like moving clouds.
Lifting sophisticated local edits with high fidelity remains a challenge. |
neural radiance fields, novel view synthesis, image signal processing, bilateral grid, 3d scene editing |
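The locally-affine bilateral grid model in the entry above can be sketched for a single pixel. Each grid cell stores a 3x4 affine color transform, and a pixel selects its cell from its position and a guidance value. This sketch uses nearest-neighbor lookup and Rec. 601 luma as guidance, both simplifying assumptions (bilateral grids are usually sliced with trilinear interpolation), and all function names are hypothetical.

```python
def apply_affine(matrix, rgb):
    """Apply a 3x4 affine color transform to one RGB pixel."""
    r, g, b = rgb
    return tuple(row[0] * r + row[1] * g + row[2] * b + row[3] for row in matrix)


def luminance(rgb):
    """Rec. 601 luma, used here as the guidance value."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def slice_grid(grid, x, y, rgb):
    """Nearest-neighbor lookup of the affine matrix for a pixel.

    grid[gz][gy][gx] is a 3x4 matrix; x, y and the RGB values lie in [0, 1].
    """
    depth, height, width = len(grid), len(grid[0]), len(grid[0][0])
    gx = min(int(x * width), width - 1)
    gy = min(int(y * height), height - 1)
    gz = min(int(luminance(rgb) * depth), depth - 1)
    return grid[gz][gy][gx]


def enhance_pixel(grid, x, y, rgb):
    """Look up the local affine transform and apply it to the pixel."""
    return apply_affine(slice_grid(grid, x, y, rgb), rgb)
```

Because the transform depends on both position and brightness, a coarse grid can approximate edge-aware operations such as local tone mapping, which is why it serves as a compact stand-in for per-view ISP effects.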
2406.00434
Report |
MoDGS: Dynamic Gaussian Splatting from Causually-captured Monocular Videos |
Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lv, Peng Wang, Wenping Wang, Junhui Hou |
In this paper, we propose MoDGS, a new pipeline to render novel-view images
in dynamic scenes using only casually captured monocular videos. Previous
monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid
movement of input cameras to construct multiview consistency but fail to
reconstruct dynamic scenes on casually captured input videos whose cameras are
static or move slowly. To address this challenging task, MoDGS adopts recent
single-view depth estimation methods to guide the learning of the dynamic
scene. Then, a novel 3D-aware initialization method is proposed to learn a
reasonable deformation field and a new robust depth loss is proposed to guide
the learning of dynamic scene geometry. Comprehensive experiments demonstrate
that MoDGS is able to render high-quality novel view images of dynamic scenes
from just a casually captured monocular video, which outperforms baseline
methods by a significant margin. |
MoDGS introduces a novel pipeline for rendering novel-view images of dynamic scenes from casually captured monocular videos, addressing the limitations of previous methods that rely on large camera movements. |
Existing monocular dynamic view synthesis methods struggle with casually captured videos where camera movement is limited, hindering accurate 3D scene reconstruction. |
MoDGS leverages single-view depth estimation for 3D guidance and introduces a 3D-aware initialization scheme for the deformation field. It further enhances depth supervision using a novel ordinal depth loss that accounts for scale inconsistencies across frames. |
MoDGS successfully synthesizes high-quality novel-view images from casually captured monocular videos, outperforming baseline methods.
The 3D-aware initialization scheme significantly improves reconstruction quality compared to random initialization.
The ordinal depth loss proves more robust than traditional depth losses, leading to smoother depth maps and sharper edge preservation. |
MoDGS struggles to reconstruct unseen regions, leading to artifacts in novel views.
Training time remains comparable to existing dynamic view synthesis (DVS) methods, and reconstruction quality relies heavily on the accuracy of single-view depth estimation. |
novel view synthesis, monocular video, dynamic scenes, gaussian splatting, depth estimation |
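The ordinal depth loss in the MoDGS entry above supervises the order of depths rather than their values. The summary does not give the exact formula, so the hinge-style penalty below is an assumption, and the function name and the explicit `pairs` argument are hypothetical.

```python
def ordinal_depth_loss(rendered, reference, pairs):
    """Mean hinge penalty over pixel pairs whose rendered depth order
    contradicts the order in the reference (estimated) depth.

    rendered, reference: flat lists of per-pixel depths.
    pairs: list of (i, j) pixel index pairs to compare.
    """
    total = 0.0
    for i, j in pairs:
        ref_diff = reference[i] - reference[j]
        sign = (ref_diff > 0) - (ref_diff < 0)  # -1, 0 or +1
        total += max(0.0, -sign * (rendered[i] - rendered[j]))
    return total / len(pairs)
```

Only the sign of the reference difference enters the loss, so rescaling or shifting the estimated depth of a frame leaves it unchanged. This is the property that makes an ordinal loss tolerant of scale inconsistencies across frames.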
2406.00432
Report |
Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner |
Xing Cui, Peipei Li, Zekun Li, Xuannan Liu, Yueying Zou, Zhaofeng He |
Flexible and accurate drag-based editing is a challenging task that has
recently garnered significant attention. Current methods typically model this
problem as automatically learning "how to drag" through point dragging and
often produce one deterministic estimation, which presents two key limitations:
1) Overlooking the inherently ill-posed nature of drag-based editing, where
multiple results may correspond to a given input, as illustrated in Fig. 1; 2)
Ignoring the constraint of image quality, which may lead to unexpected
distortion. To alleviate this, we propose LucidDrag, which shifts the focus
from "how to drag" to a paradigm of "what-then-how". LucidDrag comprises an
intention reasoner and a collaborative guidance sampling mechanism. The former
infers several optimal editing strategies, identifying what content and what
semantic direction to be edited. Based on the former, the latter addresses "how
to drag" by collaboratively integrating existing editing guidance with the
newly proposed semantic guidance and quality guidance. Specifically, semantic
guidance is derived by establishing a semantic editing direction based on
reasoned intentions, while quality guidance is achieved through classifier
guidance using an image fidelity discriminator. Both qualitative and
quantitative comparisons demonstrate the superiority of LucidDrag over previous
methods. The code will be released. |
This paper introduces LucidDrag, a novel framework for drag-based image editing that shifts from a "how to drag" to a "what-then-how" paradigm. |
Existing drag-based editing methods often produce deterministic results and may neglect the semantic ambiguity of drag intentions and the preservation of image quality. LucidDrag addresses these limitations by first understanding the user's editing intention. |
LucidDrag consists of an intention reasoner and a collaborative guidance sampling mechanism. The intention reasoner, using a large vision-language model (LVLM) and a large language model (LLM), deduces possible editing intentions. The collaborative guidance sampling combines editing guidance with semantic and quality guidance based on the reasoned intentions, ensuring both accurate and high-quality editing. |
LucidDrag demonstrates superior semantic understanding and generates diverse editing results aligned with user intentions.
Quantitative evaluations show that LucidDrag outperforms existing methods in both dragging accuracy and image quality.
Ablation studies confirm the importance of the intention reasoner and the quality guidance for achieving high-quality and semantically accurate editing results. |
Dragging complex objects over long distances remains challenging due to limitations in object comprehension and tracking.
Manually tuning hyperparameters can be sub-optimal and future work could explore LLM-based automatic hyperparameter determination. |
image editing, drag-based editing, diffusion models, large language models, semantic understanding |
2406.00427
Report |
You Only Need Less Attention at Each Stage in Vision Transformers |
Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He |
The advent of Vision Transformers (ViTs) marks a substantial paradigm shift
in the realm of computer vision. ViTs capture the global information of images
through self-attention modules, which perform dot product computations among
patchified image tokens. While self-attention modules empower ViTs to capture
long-range dependencies, the computational complexity grows quadratically with
the number of tokens, which is a major hindrance to the practical application
of ViTs. Moreover, the self-attention mechanism in deep ViTs is also
susceptible to the attention saturation issue. Accordingly, we argue against
the necessity of computing the attention scores in every layer, and we propose
the Less-Attention Vision Transformer (LaViT), which computes only a few
attention operations at each stage and calculates the subsequent feature
alignments in other layers via attention transformations that leverage the
previously calculated attention scores. This novel approach can mitigate two
primary issues plaguing traditional self-attention modules: the heavy
computational burden and attention saturation. Our proposed architecture offers
superior efficiency and ease of implementation, merely requiring matrix
multiplications that are highly optimized in contemporary deep learning
frameworks. Moreover, our architecture demonstrates exceptional performance
across various vision tasks including classification, detection and
segmentation. |
This paper proposes LaViT, a novel Vision Transformer architecture that enhances efficiency by re-parameterizing attention scores from previous layers, thus mitigating computational burden and attention saturation. |
Addressing the quadratic computational complexity and attention saturation issues in Vision Transformers is crucial for their practical application in computer vision tasks. |
LaViT employs Less Attention layers that apply transformations to previously computed attention scores, uses residual connections for attention downsampling across stages, and introduces a Diagonality Preserving loss to maintain inter-token relationships in the transformed attention matrices. |
LaViT achieves state-of-the-art performance on ImageNet-1K classification with reduced computational cost compared to existing ViT models.
It demonstrates superior object detection results on COCO2017, outperforming both CNN and Transformer counterparts.
LaViT also excels in semantic segmentation on ADE20K, surpassing Swin Transformer in terms of mIoU while being computationally more efficient. |
The selection of the starting layer for Less Attention in deep ViTs needs careful consideration for optimal performance.
Further investigation into alternative transformation functions for attention re-parameterization could potentially yield additional benefits. |
vision transformer, self-attention, computational efficiency, attention saturation, image classification, object detection, semantic segmentation |
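The Less Attention idea summarized above can be illustrated with a small sketch. It assumes, purely for illustration, that the transformation is a single learned N x N matrix applied to the previous attention map followed by a row-wise softmax; the names `less_attention` and `transform` are hypothetical and the paper's exact parameterization may differ.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Dense matrix product for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def less_attention(prev_attn, transform, values):
    """Reuse an earlier attention map instead of recomputing Q K^T.

    prev_attn: N x N attention matrix from a previous layer.
    transform: N x N matrix (assumed learned; hypothetical form).
    values:    N x d value vectors of the current layer.
    """
    mixed = matmul(prev_attn, transform)      # only matrix multiplications
    attn = [softmax(row) for row in mixed]    # re-normalize each row
    return attn, matmul(attn, values)

prev = [[0.9, 0.1], [0.2, 0.8]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn, out = less_attention(prev, identity, identity)
```

The point of the sketch is that the layer performs no query-key dot products: its cost is two matrix multiplications and a softmax over an already available N x N map.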
2406.00272
Report |
Temporally Consistent Object Editing in Videos using Extended Attention |
AmirHossein Zamani, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky |
Image generation and editing have seen a great deal of advancements with the
rise of large-scale diffusion models that allow user control of different
modalities such as text, mask, depth maps, etc. However, controlled editing of
videos still lags behind. Prior work in this area has focused on using 2D
diffusion models to globally change the style of an existing video. On the
other hand, in many practical applications, editing localized parts of the
video is critical. In this work, we propose a method to edit videos using a
pre-trained inpainting image diffusion model. We systematically redesign the
forward path of the model by replacing the self-attention modules with an
extended version of attention modules that creates frame-level dependencies. In
this way, we ensure that the edited information will be consistent across all
the video frames no matter what the shape and position of the masked area is.
We qualitatively compare our results with state-of-the-art in terms of accuracy
on several video editing tasks like object retargeting, object replacement, and
object removal tasks. Simulations demonstrate the superior performance of the
proposed strategy. |
This paper presents a new method for temporally consistent video editing using a pre-trained inpainting image diffusion model with mask and text guidance. |
Controlled editing of localized regions in videos while maintaining temporal consistency remains a challenge. Existing methods struggle with inconsistencies across frames, especially when masks change shape or position, and often require costly fine-tuning or training. |
The authors extend the self-attention mechanism in a pre-trained inpainting diffusion model to incorporate frame-level dependencies. This allows the model to consider information from multiple frames during the editing process, leading to temporally consistent results. |
The method achieves high-quality object replacement, a task not addressed by previous mask-guided approaches.
It demonstrates competitive performance on object removal, matching the visual fidelity of state-of-the-art methods.
The approach excels in consistent video object retargeting, surpassing existing techniques in visual quality and temporal coherence. |
While achieving competitive results on object removal, there is room for improvement to match state-of-the-art quantitative performance.
Future work could explore generalizing the method to a wider range of video editing tasks beyond the ones explored in this paper. |
video editing, diffusion models, temporal consistency, inpainting, object retargeting |
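A minimal sketch of the extended attention described above, assuming the simplest variant in which each frame's queries attend to keys and values pooled from every frame. The function names and the plain-list tensor layout are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def extended_attention(frames_q, frames_k, frames_v):
    """Each frame's queries attend to keys/values gathered from all frames."""
    all_k = [k for frame in frames_k for k in frame]
    all_v = [v for frame in frames_v for v in frame]
    return [attention(q, all_k, all_v) for q in frames_q]

# Two frames with the same query token but different per-frame keys/values.
frames_q = [[[1.0, 0.0]], [[1.0, 0.0]]]
frames_k = [[[1.0, 0.0]], [[0.0, 1.0]]]
frames_v = [[[1.0, 0.0]], [[0.0, 1.0]]]
shared = extended_attention(frames_q, frames_k, frames_v)
```

With ordinary per-frame self-attention the two frames above would produce different outputs; because the key/value pool is shared, identical queries yield identical outputs, which is the source of the cross-frame consistency.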
2406.00259
Report |
PuzzleFusion++: Auto-agglomerative 3D Fracture Assembly by Denoise and Verify |
Zhengqing Wang, Jiacheng Chen, Yasutaka Furukawa |
This paper proposes a novel "auto-agglomerative" 3D fracture assembly method,
PuzzleFusion++, resembling how humans solve challenging spatial puzzles.
Starting from individual fragments, the approach 1) aligns and merges fragments
into larger groups akin to agglomerative clustering and 2) repeats the process
iteratively in completing the assembly akin to auto-regressive methods.
Concretely, a diffusion model denoises the 6-DoF alignment parameters of the
fragments simultaneously, and a transformer model verifies and merges pairwise
alignments into larger ones, whose process repeats iteratively. Extensive
experiments on the Breaking Bad dataset show that PuzzleFusion++ outperforms
all other state-of-the-art techniques by significant margins across all
metrics, in particular by over 10% in part accuracy and 50% in Chamfer
distance. The code will be available on our project page:
https://puzzlefusion-plusplus.github.io. |
Presents PuzzleFusion++, an auto-agglomerative 3D fracture assembly method that simulates human puzzle-solving by iteratively aligning and merging fragments into larger groups using a diffusion model and a transformer for verification. |
Addresses the challenging problem of 3D fracture assembly, with applications in archaeology, forensics, biochemistry, and more. |
Uses PointNet++ and VQ-VAE to encode fragments, a diffusion model to denoise 6-DoF alignment parameters, and a transformer to verify pairwise alignments and merge them. |
PuzzleFusion++ significantly outperforms six state-of-the-art methods across all metrics on the Breaking Bad dataset, including over 10% improvement in part accuracy and over 50% in Chamfer distance.
The auto-agglomerative process is shown to be crucial for handling complex assemblies with many fragments.
The method demonstrates robustness even with fewer sampling steps in the diffusion model. |
Limitations include challenges with local geometric ambiguity and small fracture surfaces leading to misaligned fragments.
Future work will focus on improving inference speed and scaling to assemblies with up to 100 fragments. |
3d fracture assembly, diffusion models, transformers, auto-agglomerative, point cloud processing |
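The merge step of the auto-agglomerative loop can be sketched with a union-find structure. This is only an illustration: the verifier scores are taken as given inputs (in the paper they come from a transformer), the `threshold` value is a hypothetical parameter, and the sketch covers one merge round rather than the full denoise-verify-repeat cycle.

```python
def merge_verified_pairs(num_fragments, scored_pairs, threshold=0.5):
    """Group fragments whose pairwise alignment passed verification.

    scored_pairs: iterable of (fragment_a, fragment_b, verifier_score).
    Returns the groups as sorted lists of fragment indices.
    """
    parent = list(range(num_fragments))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for i in range(num_fragments):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

groups = merge_verified_pairs(5, [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)])
```

In the example, fragments 0, 1, and 2 end up in one group while the rejected pair (3, 4) stays separate; a subsequent round would treat each group as a single rigid piece.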
2406.00258
Report |
Artemis: Towards Referential Understanding in Complex Videos |
Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian |
Videos carry rich visual information including object description, action,
interaction, etc., but the existing multimodal large language models (MLLMs)
fall short in referential understanding scenarios such as video-based
referring. In this paper, we present Artemis, an MLLM that pushes video-based
referential understanding to a finer level. Given a video, Artemis receives a
natural-language question with a bounding box in any video frame and describes
the referred target in the entire video. The key to achieving this goal lies in
extracting compact, target-specific video features, where we set a solid
baseline by tracking and selecting spatiotemporal features from the video. We
train Artemis on the newly established VideoRef45K dataset with 45K video-QA
pairs and design a computationally efficient, three-stage training procedure.
Results are promising both quantitatively and qualitatively. Additionally, we
show that Artemis can be integrated with video grounding and text summarization
tools to understand more complex scenarios. Code and data are available at
https://github.com/qiujihao19/Artemis. |
Introduces Artemis, a multimodal large language model (MLLM) baseline for fine-level video understanding, specifically video-based referential understanding. |
Existing MLLMs fall short in referential understanding scenarios for videos, lacking the ability to comprehend and describe target actions in complex, longer videos. |
Utilizes a three-stage training approach: video-text pre-training, video-based instruction tuning, and video-based referring instruction tuning. Employs RoI tracking and selection to extract compact, target-specific video features, reducing redundancy and enhancing training efficiency. |
Outperforms existing MLLMs in video-based referring benchmarks, demonstrating superior comprehensiveness and accuracy in describing target actions.
Serves as a building block for complex video understanding tasks, including multi-round dialogues with grounding and long video understanding with summarization.
Achieves competitive performance in general video question answering tasks, highlighting the transferability of its fine-level understanding capabilities. |
Reliance on external tracking algorithms for RoI generation can introduce inaccuracies, impacting overall performance.
Susceptibility to general video understanding challenges like spatial-temporal aliasing, which can lead to inaccurate descriptions of visual content. |
multimodal large language models, video understanding, referential understanding, video-based referring, roi tracking and selection |
2406.00121
Report |
Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations |
Tiancheng Shen, Jun Hao Liew, Long Mai, Lu Qi, Jiashi Feng, Jiaya Jia |
Advances in text-based image generation and editing have revolutionized
content creation, enabling users to create impressive content from imaginative
text prompts. However, existing methods are not designed to work well with the
oversimplified prompts that are often encountered in typical scenarios when
users start their editing with only vague or abstract purposes in mind. Those
scenarios demand elaborate ideation efforts from the users to bridge the gap
between such vague starting points and the detailed creative ideas needed to
depict the desired results. In this paper, we introduce the task of Image
Editing Recommendation (IER). This task aims to automatically generate diverse
creative editing instructions from an input image and a simple prompt
representing the users' under-specified editing purpose. To this end, we
introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal
framework designed specifically for edit-instruction generation. We train
Creativity-VLA on our edit-instruction dataset specifically curated for IER. We
further enhance our model with a novel 'token-for-localization' mechanism,
enabling it to support both global and local editing operations. Our
experimental results demonstrate the effectiveness of Creativity-VLA in suggesting
instructions that not only contain engaging creative elements but also maintain
high relevance to both the input image and the user's initial hint. |
This paper introduces Image Editing Recommendation (IER), a novel task to bridge the creativity gap in image editing by automatically generating diverse creative editing instructions from an input image and a simple user prompt. |
Existing image editing tools often require detailed instructions, making it challenging for users with vague ideas to achieve their desired results. This work aims to ease the ideation process and make image editing more accessible. |
The authors propose Creativity-VLA, a multimodal framework trained on a curated instruction dataset. This framework leverages a Vision Language Model (VLM) for visual understanding and creative reasoning, and employs a novel 'token-for-localization' mechanism to support both global and local image edits. |
Creativity-VLA outperforms existing image editing tools (MagicBrush, InstructDiffusion) and VLMs (LLaVA-v1.5, GPT-4V) in generating creative and relevant editing suggestions, according to a user study.
The proposed method effectively bridges the gap between vague editing hints and concrete instructions, as demonstrated by improved CLIP similarity scores and qualitative comparisons.
The 'token-for-localization' mechanism enables Creativity-VLA to suggest both global and local edits, broadening its applicability and allowing for more fine-grained control over image modifications. |
The current model sometimes struggles to balance image alignment with substantial modifications based on user feedback.
Future work could explore incorporating user feedback during the editing process for iterative improvement and personalization. |
image editing, vision-language model, creativity, instruction generation, token-for-localization |
2406.00093
Report |
Bootstrap3D: Improving 3D Content Creation with Synthetic Data |
Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang |
Recent years have witnessed remarkable progress in multi-view diffusion
models for 3D content creation. However, there remains a significant gap in
image quality and prompt-following ability compared to 2D diffusion models. A
critical bottleneck is the scarcity of high-quality 3D assets with detailed
captions. To address this challenge, we propose Bootstrap3D, a novel framework
that automatically generates an arbitrary quantity of multi-view images to
assist in training multi-view diffusion models. Specifically, we introduce a
data generation pipeline that employs (1) 2D and video diffusion models to
generate multi-view images based on constructed text prompts, and (2) our
fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting
inaccurate captions. Leveraging this pipeline, we have generated 1 million
high-quality synthetic multi-view images with dense descriptive captions to
address the shortage of high-quality 3D data. Furthermore, we present a
Training Timestep Reschedule (TTR) strategy that leverages the denoising
process to learn multi-view consistency while maintaining the original 2D
diffusion prior. Extensive experiments demonstrate that Bootstrap3D can
generate high-quality multi-view images with superior aesthetic quality,
image-text alignment, and maintained view consistency. |
This paper proposes Bootstrap3D, a framework leveraging Multimodal Large Language Models (MLLMs) and diffusion models to generate high-quality synthetic data for training multi-view diffusion models, addressing the scarcity of high-quality 3D data with detailed captions. |
This is important because the lack of high-quality 3D data hinders the development of 3D content creation models, leading to lower quality and less diverse results compared to 2D models. |
The method consists of 1) a data generation pipeline using 2D/video diffusion models and a fine-tuned 3D-aware MV-LLaVA for data generation, filtering, and caption rewriting, and 2) a Training Timestep Reschedule (TTR) strategy to fine-tune multi-view diffusion models using both synthetic and real data. |
Bootstrap3D generates 1 million multi-view images with detailed captions, suitable for training multi-view diffusion models.
The framework significantly improves text-to-3D generation quality, achieving better image-text alignment, higher visual fidelity, and improved view consistency.
Quantitative evaluations show Bootstrap3D outperforms state-of-the-art methods on various metrics, including CLIP score, CLIP-R score, and FID. |
Current sparse view reconstruction models, mainly trained on limited datasets like Objaverse, may not fully utilize the potential of the generated data.
Detecting subtle view inconsistencies remains challenging, potentially leading to blurred areas in the final 3D reconstructions. |
3d content creation, multi-view diffusion models, synthetic data generation, multimodal large language models, data augmentation |
2405.21075
Report |
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis |
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun |
In the quest for artificial general intelligence, Multi-modal Large Language
Models (MLLMs) have emerged as a focal point in recent advancements. However,
the predominant focus remains on developing their capabilities in static image
understanding. The potential of MLLMs in processing sequential visual data is
still insufficiently explored, highlighting the absence of a comprehensive,
high-quality assessment of their performance. In this paper, we introduce
Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of
MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks
through four key features: 1) Diversity in video types, spanning 6 primary
visual domains with 30 subfields to ensure broad scenario generalizability; 2)
Duration in temporal dimension, encompassing short-, medium-, and
long-term videos, ranging from 11 seconds to 1 hour, for robust contextual
dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides
video frames, including subtitles and audios, to unveil the all-round
capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual
labeling by expert annotators to facilitate precise and reliable model
assessment. 900 videos with a total of 256 hours are manually selected and
annotated by repeatedly viewing all the video content, resulting in 2,700
question-answer pairs. With Video-MME, we extensively evaluate various
state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as
open-source image models like InternVL-Chat-V1.5 and video models like
LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the
best-performing commercial model, significantly outperforming the open-source
models. Our dataset along with these findings underscores the need for further
improvements in handling longer sequences and multi-modal data. Project Page:
https://video-mme.github.io |
This paper introduces Video-MME, the first comprehensive multi-modal benchmark designed to evaluate Multi-modal Large Language Models (MLLMs) on video understanding tasks. |
Current MLLMs primarily focus on static image understanding. Evaluating MLLMs on video data is crucial for assessing their ability to handle the dynamic nature of real-world scenarios, paving the way for artificial general intelligence. |
The authors curated a dataset of 900 videos across diverse scenarios, annotated with 2,700 multiple-choice questions. These videos vary in duration (11 seconds to 1 hour) and are enriched with subtitles and audio tracks. They benchmarked state-of-the-art MLLMs, including GPT-4, Gemini 1.5 Pro, and open-source models, using accuracy as the evaluation metric. |
Gemini 1.5 Pro is the best-performing commercial model (75.7% accuracy), significantly outperforming open-source models.
Integrating subtitles and audio significantly enhances video understanding, particularly for longer videos.
MLLM performance declines as video duration increases, indicating limitations in processing long sequences and highlighting the need for architectural and training data improvements. |
The study primarily focuses on multiple-choice questions, potentially limiting the assessment of MLLMs' generative capabilities for video understanding.
Future work includes developing more robust MLLM architectures for long context modeling and creating datasets with more complex temporal reasoning scenarios. |
multi-modal large language models, video understanding, benchmarking, temporal reasoning, multi-modal evaluation |
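Since the benchmark reports multiple-choice accuracy and compares it across video durations, the bookkeeping can be sketched as follows. The record layout and the duration labels are assumptions made for illustration; this is not the benchmark's official evaluation script.

```python
from collections import defaultdict

def accuracy_by_duration(records):
    """Compute per-duration and overall multiple-choice accuracy.

    records: iterable of (duration_label, predicted_option, correct_option).
    Assumes at least one record is supplied.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for duration, predicted, answer in records:
        total[duration] += 1
        correct[duration] += int(predicted == answer)
    per_bucket = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_bucket, overall

records = [
    ("short", "A", "A"),
    ("short", "B", "A"),
    ("long", "C", "C"),
    ("long", "D", "D"),
]
per_bucket, overall = accuracy_by_duration(records)
```

Reporting the per-duration breakdown next to the overall number is what makes a trend such as declining accuracy on longer videos visible.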
2405.21074
Report |
Latent Intrinsics Emerge from Training to Relight |
Xiao Zhang, William Gao, Seemandhar Jain, Michael Maire, David A. Forsyth, Anand Bhattad |
Image relighting is the task of showing what a scene from a source image
would look like if illuminated differently. Inverse graphics schemes recover an
explicit representation of geometry and a set of chosen intrinsics, then
relight with some form of renderer. However, error control for inverse graphics
is difficult, and inverse graphics methods can represent only the effects of
the chosen intrinsics. This paper describes a relighting method that is
entirely data-driven, where intrinsics and lighting are each represented as
latent variables. Our approach produces SOTA relightings of real scenes, as
measured by standard metrics. We show that albedo can be recovered from our
latent intrinsics without using any example albedos, and that the albedos
recovered are competitive with SOTA methods. |
This paper introduces a novel data-driven image relighting method that learns latent representations of scene intrinsics and lighting conditions for relighting images of real scenes. |
Existing inverse graphics-based relighting methods face challenges in error control and are limited in representing complex lighting effects. This work explores a purely data-driven approach for accurate and generalizable relighting. |
The method utilizes an autoencoder framework with two encoders to extract intrinsic features from a target scene image and extrinsic features from a reference lighting image. A constrained scaling mechanism combines these features, restricting information flow from the reference image to prevent feature leakage. The decoder then generates the relit image. |
The method achieves state-of-the-art relighting accuracy on a real-world dataset, outperforming existing unsupervised methods and competing with supervised approaches.
The learned latent intrinsic representation enables zero-shot albedo estimation, achieving competitive results with state-of-the-art albedo estimation methods without requiring explicit albedo training data.
The method successfully generalizes to synthetically generated images with significant lighting variations, demonstrating its ability to infer high-level lighting concepts. |
The method currently relies on paired relighting data from the same scene, which can be resource-intensive to acquire.
The latent representation of intrinsics poses challenges for applications requiring explicit intrinsic information like depth or normals. |
image relighting, intrinsic image decomposition, unsupervised learning, deep learning, computer vision |
2405.21066
Report |
Mixed Diffusion for 3D Indoor Scene Synthesis |
Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, Federico Tombari |
Realistic conditional 3D scene synthesis significantly enhances and
accelerates the creation of virtual environments, which can also provide
extensive training data for computer vision and robotics research among other
applications. Diffusion models have shown great performance in related
applications, e.g., making precise arrangements of unordered sets. However,
these models have not been fully explored in floor-conditioned scene synthesis
problems. We present MiDiffusion, a novel mixed discrete-continuous diffusion
model architecture, designed to synthesize plausible 3D indoor scenes from
given room types, floor plans, and potentially pre-existing objects. We
represent a scene layout by a 2D floor plan and a set of objects, each defined
by its category, location, size, and orientation. Our approach uniquely
implements structured corruption across the mixed discrete semantic and
continuous geometric domains, resulting in a better conditioned problem for the
reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our
experimental results demonstrate that MiDiffusion substantially outperforms
state-of-the-art autoregressive and diffusion models in floor-conditioned 3D
scene synthesis. In addition, our models can handle partial object constraints
via a corruption-and-masking strategy without task specific training. We show
MiDiffusion maintains clear advantages over existing approaches in scene
completion and furniture arrangement experiments. |
Introduces MiDiffusion, a novel mixed discrete-continuous diffusion model for synthesizing plausible 3D indoor scenes from room types, floor plans, and potentially pre-existing objects. |
Realistic conditional 3D scene synthesis accelerates the creation of virtual environments and provides training data for computer vision and robotics. |
Combines Denoising Diffusion Probabilistic Models (DDPM) for continuous geometric attributes and Discrete Denoising Diffusion Probabilistic Models (D3PM) for discrete semantic labels. Employs a time-variant transformer-based denoising network conditioned on floor plan features. |
MiDiffusion outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis on the 3D-FRONT dataset.
Generates more realistic scene layouts with accurate geometric arrangement and adherence to boundary constraints.
Handles partial object constraints (e.g., scene completion) via a corruption-and-masking strategy without task-specific training. |
Current object representation as bounding box features and labels is not highly precise for 3D.
Requires a model retrieving strategy to compose the final 3D scene. |
3d scene synthesis, diffusion models, mixed discrete-continuous, floor plan conditioned, scene completion |
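The mixed corruption described above can be sketched for a single object. The sketch assumes a Gaussian forward step for the continuous geometry and, as a simplification, an absorbing mask state for the discrete category; the `MASK` label, the `corrupt_object` helper, and the shared `alpha_bar` schedule are hypothetical and may not match the paper's transition matrices.

```python
import math
import random

MASK = "[MASK]"  # hypothetical absorbing label for corrupted categories

def corrupt_object(obj, alpha_bar, rng):
    """Jointly corrupt one object: Gaussian noise on geometry, masking on label.

    obj: dict with "category" (str) and "geometry" (list of floats, e.g.
         location, size, orientation).
    alpha_bar: cumulative signal level in [0, 1] (1 = clean, 0 = fully corrupted).
    rng: a random.Random instance, passed in for reproducibility.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    geometry = [signal * g + noise * rng.gauss(0.0, 1.0) for g in obj["geometry"]]
    # Keep the label with probability alpha_bar, otherwise replace it by MASK.
    category = obj["category"] if rng.random() < alpha_bar else MASK
    return {"category": category, "geometry": geometry}

rng = random.Random(0)
chair = {"category": "chair", "geometry": [1.5, 0.0, 2.0]}
clean = corrupt_object(chair, 1.0, rng)   # no corruption at alpha_bar = 1
noisy = corrupt_object(chair, 0.0, rng)   # fully corrupted at alpha_bar = 0
```

A denoising network would then be trained to recover both the clean geometry and the original label from such corrupted objects, conditioned on the floor plan.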
2405.21059
Report |
Unified Directly Denoising for Both Variance Preserving and Variance Exploding Diffusion Models |
Jingjing Wang, Dan Zhang, Feng Luo |
Previous work has demonstrated that, in the Variance Preserving (VP)
scenario, the nascent Directly Denoising Diffusion Models (DDDM) can generate
high-quality images in one step while achieving even better performance in
multistep sampling. However, the Pseudo-LPIPS loss used in DDDM leads to
concerns about the bias in assessment. Here, we propose a unified DDDM (uDDDM)
framework that generates images in one-step/multiple steps for both Variance
Preserving (VP) and Variance Exploding (VE) cases. We provide theoretical
proofs of the existence and uniqueness of the model's solution paths, as well
as the non-intersecting property of the sampling paths. Additionally, we
propose an adaptive Pseudo-Huber loss function to balance the convergence to
the true solution and the stability of the convergence process. Through a
comprehensive evaluation, we demonstrate that uDDDMs achieve FID scores
comparable to the best-performing methods available for CIFAR-10 in both VP and
VE. Specifically, uDDDM achieves one-step generation on CIFAR10 with FID of
2.63 and 2.53 for VE and VP respectively. By extending the sampling to 1000
steps, we further reduce FID score to 1.71 and 1.65 for VE and VP respectively,
setting state-of-the-art performance in both cases. |
This paper introduces uDDDM, a unified Directly Denoising Diffusion Model framework that generates high-quality images in one or multiple steps for both Variance Preserving (VP) and Variance Exploding (VE) diffusion processes. |
The work addresses limitations of existing one-step generative models like Consistency Models and TRACT, aiming to improve efficiency and quality of image generation with diffusion models. |
The paper proposes a unified framework for VP and VE diffusion, introduces an adaptive Pseudo-Huber loss function for training, and provides theoretical proofs for properties like existence, uniqueness, and non-intersection of solution paths. |
uDDDMs achieve FID scores comparable to the best-performing methods for CIFAR-10 in both VP and VE.
The model achieves one-step generation on CIFAR10 with FID of 2.63 (VE) and 2.53 (VP), outperforming StyleGAN2-ADA.
Extending sampling to 1000 steps further reduces FID to 1.71 (VE) and 1.65 (VP), setting new state-of-the-art performance. |
Training uDDDM requires additional memory to store intermediate estimations, posing challenges for large datasets.
The VE model consistently underperforms compared to the VP model, potentially due to suboptimal loss function hyperparameters and noise scheduling strategies. |
diffusion models, generative models, image generation, one-step generation, variance exploding/preserving sde |
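For reference, the standard Pseudo-Huber loss that the adaptive variant builds on can be written in a few lines. The sketch uses a fixed constant `c`; how uDDDM adapts `c` during training is not specified in this summary and is deliberately left out.

```python
import math

def pseudo_huber(x, y, c):
    """Pseudo-Huber distance sqrt(||x - y||^2 + c^2) - c between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(sq_dist + c * c) - c

l2_like = pseudo_huber([3.0, 4.0], [0.0, 0.0], 0.0)    # c = 0 recovers the L2 norm
smoothed = pseudo_huber([3.0, 4.0], [0.0, 0.0], 12.0)  # larger c damps the loss
```

For errors much smaller than `c` the loss behaves like a scaled squared error, and for errors much larger than `c` it grows roughly linearly, which is the trade-off between accuracy and stability that the choice of `c` controls.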
2405.21050
Report |
Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models |
Xinxi Zhang, Song Wen, Ligong Han, Felix Juefei-Xu, Akash Srivastava, Junzhou Huang, Hao Wang, Molei Tao, Dimitris N. Metaxas |
Adapting large-scale pre-trained generative models in a parameter-efficient
manner is gaining traction. Traditional methods like low rank adaptation
achieve parameter efficiency by imposing constraints but may not be optimal for
tasks requiring high representation capacity. We propose a novel spectrum-aware
adaptation framework for generative models. Our method adjusts both singular
values and their basis vectors of pretrained weights. Using the Kronecker
product and efficient Stiefel optimizers, we achieve parameter-efficient
adaptation of orthogonal matrices. We introduce Spectral Orthogonal
Decomposition Adaptation (SODA), which balances computational efficiency and
representation capacity. Extensive evaluations on text-to-image diffusion
models demonstrate SODA's effectiveness, offering a spectrum-aware alternative
to existing fine-tuning methods. |
This paper introduces SODA, a novel spectrum-aware adaptation framework for generative models, which improves parameter efficiency by leveraging the spectral space of pre-trained weights. |
Adapting large-scale pre-trained generative models like Stable Diffusion to specific tasks requires parameter-efficient fine-tuning methods that can capture complex data representations without extensive retraining. |
SODA adjusts both singular values and singular vectors during fine-tuning, employing a Kronecker product to rotate the singular vectors for parameter efficiency. It utilizes SVD or LQ/QR decomposition to decompose pre-trained weights and updates spectral and basis components separately. |
SODA outperforms baselines like LoRA and OFT in subject and style personalization tasks for text-to-image diffusion models.
Jointly adjusting magnitude and orientation of decomposed weights improves utilization of model priors and reduces overfitting.
The Stiefel optimizer used in SODA is robust and achieves better performance than Cayley parameterization. |
SODA's training is slower than LoRA due to the Stiefel optimizer.
Future work will focus on accelerating optimization algorithms and applying SODA to large language models. |
parameter-efficient fine-tuning, generative models, text-to-image diffusion, spectrum-aware adaptation, stiefel optimization |
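As an illustration of the SODA entry above, the following is a minimal NumPy sketch of spectrum-aware adaptation: a weight is decomposed by SVD, its singular values are shifted, and its singular vectors are rotated by an orthogonal matrix built as a Kronecker product of small orthogonal factors. The function names, the choice to rotate the left singular vectors, and the factor sizes are assumptions made for illustration; the Stiefel optimizer itself is not shown.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix yields an orthogonal Q (hypothetical stand-in
    # for a learned orthogonal factor).
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def kron_rotation(factors):
    # The Kronecker product of orthogonal matrices is itself orthogonal,
    # so a large rotation can be stored as a few small factors.
    rotation = np.eye(1)
    for factor in factors:
        rotation = np.kron(rotation, factor)
    return rotation

def spectral_adapt(w, rotation, delta_s):
    # Decompose the pretrained weight, rotate the left singular vectors,
    # and shift the singular values (assumed form of the update).
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u @ rotation) @ np.diag(s + delta_s) @ vt

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))            # toy "pretrained" weight, rank 6
factors = [random_orthogonal(2, rng), random_orthogonal(3, rng)]
rotation = kron_rotation(factors)          # 6x6 rotation from 4 + 9 entries
w_new = spectral_adapt(w, rotation, delta_s=0.1 * np.ones(6))
```

In this toy setting the 6x6 rotation is stored with 13 numbers instead of 36, which is the kind of saving the Kronecker structure is meant to provide.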
2405.21048
Report |
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling |
Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind |
Diffusion models have emerged as a powerful tool for generating high-quality
images from textual descriptions. Despite their successes, these models often
exhibit limited diversity in the sampled images, particularly when sampling
with a high classifier-free guidance weight. To address this issue, we present
Kaleido, a novel approach that enhances the diversity of samples by
incorporating autoregressive latent priors. Kaleido integrates an
autoregressive language model that encodes the original caption and generates
latent variables, serving as abstract and intermediary representations for
guiding and facilitating the image generation process. In this paper, we
explore a variety of discrete latent representations, including textual
descriptions, detection bounding boxes, object blobs, and visual tokens. These
representations diversify and enrich the input conditions to the diffusion
models, enabling more diverse outputs. Our experimental results demonstrate
that Kaleido effectively broadens the diversity of the generated image samples
from a given textual description while maintaining high image quality.
Furthermore, we show that Kaleido adheres closely to the guidance provided by
the generated latent variables, demonstrating its capability to effectively
control and direct the image generation process. |
This paper introduces Kaleido, a novel approach that enhances the diversity of samples generated by diffusion models from textual descriptions by incorporating autoregressive latent priors. |
Existing text-to-image diffusion models often lack diversity in their generated images, particularly when using high classifier-free guidance weights. This limits their practical applications where diverse visual interpretations are desired. |
Kaleido utilizes an autoregressive language model to generate latent variables from the original caption. These variables (textual descriptions, bounding boxes, object blobs, or visual tokens) act as abstract representations to guide the diffusion model's image generation process. |
Kaleido effectively broadens the diversity of generated image samples from a given textual description.
Kaleido maintains high image quality comparable to standard diffusion models.
The generated latent variables offer interpretability and control over the image generation process. |
Training Kaleido can be more complex and resource-intensive than standard diffusion models.
Identifying the most effective latent variables for optimal diversity might require extensive experimentation. |
diffusion models, image generation, text-to-image synthesis, diversity, autoregressive models |
2405.21013
Report |
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond |
Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang |
Text-rich images have significant and extensive value, deeply integrated into
various aspects of human life. Notably, both visual cues and linguistic symbols
in text-rich images play crucial roles in information transmission but are
accompanied by diverse challenges. Therefore, the efficient and effective
understanding of text-rich images is a crucial litmus test for the capability
of Vision-Language Models. We have crafted an efficient vision-language model,
StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
The significant design of StrucTexTv3 is presented in the following aspects:
Firstly, we adopt a combination of an effective multi-scale reduced visual
transformer and a multi-granularity token sampler (MG-Sampler) as a visual
token generator, successfully solving the challenges of high-resolution input
and complex representation learning for text-rich images. Secondly, we enhance
the perception and comprehension abilities of StrucTexTv3 through instruction
learning, seamlessly integrating various text-oriented tasks into a unified
framework. Thirdly, we have curated a comprehensive collection of high-quality
text-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like
incidental scenes, office documents, web pages, and screenshots, thereby
improving the robustness of our model. Our method achieved SOTA results in
text-rich image perception tasks, and significantly improved performance in
comprehension tasks. Among multimodal models with an LLM decoder of approximately
1.8B parameters, it stands out as a leader, which also makes deployment on
edge devices feasible. In summary, the StrucTexTv3 model, featuring efficient
structural design, outstanding performance, and broad adaptability, offers
robust support for diverse intelligent application tasks involving text-rich
images, thus exhibiting immense potential for widespread application. |
This paper introduces StrucTexTv3, an efficient vision-language model designed for perception and comprehension tasks on text-rich images, addressing challenges of high-resolution inputs and complex representation learning. |
Efficiently understanding text-rich images, crucial for information transmission in many aspects of human life, is a significant test for Vision-Language Models. Current methods struggle with high-resolution input and require large resources. |
StrucTexTv3 leverages a hierarchical vision transformer, a multi-granularity token sampler (MG-Sampler), and a 1.8B-parameter LLM. It is trained on TIM-30M, a dataset of 30 million text-rich images, using a three-stage training pipeline: pre-training, multi-task pre-training, and supervised fine-tuning. |
StrucTexTv3 achieves state-of-the-art performance on various benchmarks, including text spotting, document parsing, and key information extraction.
It demonstrates competitive results in document-oriented VQA, table image understanding, and text image translation, outperforming models with significantly larger LLM sizes.
The model's efficiency allows for potential deployment on edge devices. |
Limited context handling for multi-page documents and videos.
Further research needed on scaling laws for larger datasets and models. |
vision-language model, text-rich images, high-resolution input, multimodal learning, instruction learning |
2405.20985
Report |
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models |
Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou |
The visual projector, which bridges the vision and language modalities and
facilitates cross-modal alignment, serves as a crucial component in MLLMs.
However, measuring the effectiveness of projectors in vision-language alignment
remains under-explored, which currently can only be inferred from the
performance of MLLMs on downstream tasks. Motivated by the problem, this study
examines the projector module by interpreting the vision-language semantic flow
within MLLMs. Specifically, we trace back the semantic relevance flow from
generated language tokens to raw visual encoder patches and the intermediate
outputs produced by projectors. Our findings reveal that compressive projectors
(e.g., QFormer) abstract visual patches into a limited set of semantic
concepts, such as objects or attributes, resulting in a 'double abstraction'
phenomenon. This involves a first visual semantic abstraction by the projector
referring to pre-defined query tokens, and a second extraction by the LLM based
on text instructions. The double abstraction is inefficient in training and
will result in cumulative vision semantics deficiency. To mitigate this issue,
we propose the key insight of 'Decouple Compression from Abstraction' (DeCo),
that is, compressing the visual token number at the patch level by projectors
and allowing the LLM to handle visual semantic abstraction entirely.
Consequently, we adopt a simple compressor, i.e., 2D Adaptive Pooling, to
downsample visual patches in a parameter-free manner. Empirical evaluation
demonstrates that DeCo surpasses traditional compressive projectors regarding
both performance and efficiency. It achieves performance gains of 0.9%, 7.1%,
and 2.9% across the MLLM Benchmarks, Visual Localization, and Open-ended VQA
tasks with fewer trainable parameters and faster convergence speed. |
This paper proposes DeCo, a novel method for Multimodal Large Language Models (MLLMs) that decouples compression from visual semantic abstraction, improving efficiency and spatial understanding. |
Existing MLLM visual projectors suffer from a "double abstraction" problem, where visual semantics are redundantly extracted by both the projector and the LLM, leading to inefficiencies and semantic loss. |
The paper introduces R-GAE, a new explainability tool to analyze vision-language semantic flow in MLLMs. It then proposes DeCo, which uses simple, parameter-free 2D adaptive average pooling to compress visual tokens at the patch level, leaving semantic abstraction to the LLM. |
DeCo outperforms existing compressive projectors on various MLLM benchmarks, visual localization, and open-ended VQA tasks.
DeCo demonstrates faster training convergence compared to other compressive projectors due to its parameter-free compression mechanism.
DeCo exhibits superior spatial understanding capabilities and robustness across different vision backbones, image resolutions, and LLMs. |
High compression ratios in DeCo might lead to substantial visual information loss compared to semantic-level compression.
The advantages of DeCo are more pronounced under limited training resources (GPUs and data), and its significance might diminish with abundant resources. |
multimodal large language models, vision-language alignment, projector module, semantic abstraction, explainability |
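To make the DeCo entry above concrete, here is a minimal NumPy sketch of parameter-free 2D adaptive average pooling applied to a sequence of patch tokens. The function name, the 24x24 input grid, the 12x12 output grid, and the feature width are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def adaptive_avg_pool_tokens(tokens, grid, out_grid):
    """Pool (H*W, C) patch tokens down to (h*w, C) by averaging 2D bins."""
    h, w = grid
    oh, ow = out_grid
    n, c = tokens.shape
    assert n == h * w, "token count must match the patch grid"
    x = tokens.reshape(h, w, c)
    out = np.empty((oh, ow, c), dtype=float)
    for i in range(oh):
        # Bin edges: floor for the start, ceil for the end, so bins cover
        # the grid even when h is not divisible by oh.
        r0 = (i * h) // oh
        r1 = ((i + 1) * h + oh - 1) // oh
        for j in range(ow):
            c0 = (j * w) // ow
            c1 = ((j + 1) * w + ow - 1) // ow
            out[i, j] = x[r0:r1, c0:c1].mean(axis=(0, 1))
    return out.reshape(oh * ow, c)

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((24 * 24, 8))   # 576 toy patch features
pooled = adaptive_avg_pool_tokens(patch_tokens, (24, 24), (12, 12))
print(pooled.shape)  # (144, 8)
```

The pooling has no trainable weights, so the only learned component of such a projector would be whatever linear or MLP layer follows it.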
2405.20971
Report |
Amortizing intractable inference in diffusion models for vision, language, and control |
Siddarth Venkatraman, Moksh Jain, Luca Scimeca, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector-Brooks, Yoshua Bengio, Glen Berseth, Nikolay Malkin |
Diffusion models have emerged as effective distribution estimators in vision,
language, and reinforcement learning, but their use as priors in downstream
tasks poses an intractable posterior inference problem. This paper studies
amortized sampling of the posterior over data, $\mathbf{x}\sim p^{\rm
post}(\mathbf{x})\propto p(\mathbf{x})r(\mathbf{x})$, in a model that consists
of a diffusion generative model prior $p(\mathbf{x})$ and a black-box
constraint or likelihood function $r(\mathbf{x})$. We state and prove the
asymptotic correctness of a data-free learning objective, relative trajectory
balance, for training a diffusion model that samples from this posterior, a
problem that existing methods solve only approximately or in restricted cases.
Relative trajectory balance arises from the generative flow network perspective
on diffusion models, which allows the use of deep reinforcement learning
techniques to improve mode coverage. Experiments illustrate the broad potential
of unbiased inference of arbitrary posteriors under diffusion priors: in vision
(classifier guidance), language (infilling under a discrete diffusion LLM), and
multimodal data (text-to-image generation). Beyond generative modeling, we
apply relative trajectory balance to the problem of continuous control with a
score-based behavior prior, achieving state-of-the-art results on benchmarks in
offline reinforcement learning. |
The paper proposes Relative Trajectory Balance (RTB), an asymptotically unbiased training objective for training diffusion models to sample from posterior distributions under a diffusion model prior. |
Sampling from posteriors under diffusion priors is crucial in many downstream tasks across vision, language, and reinforcement learning, but existing methods are often approximate or limited in scope. |
RTB leverages the generative flow network perspective on diffusion models and enforces a constraint on the ratio of denoising trajectories under the prior and posterior. The objective can be optimized off-policy, allowing flexible exploration of the posterior. |
RTB achieves competitive classifier-guided image generation with unconditional diffusion priors and improves text-to-image generation under foundation model priors.
RTB shows strong results for text infilling with discrete diffusion language models.
RTB obtains state-of-the-art performance on continuous control benchmarks in offline reinforcement learning. |
RTB relies on simulation-based training, which can be computationally intensive.
The lack of local credit assignment in the RTB objective can lead to high variance gradients. |
diffusion models, posterior sampling, generative flow networks, classifier guidance, offline reinforcement learning |
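For the RTB entry above, the sketch below shows one way a trajectory-level ratio constraint can be turned into a squared log-residual loss. It assumes the constraint takes the form Z * prod p_post(step) = r(x) * prod p_prior(step) along a denoising trajectory; the function name, argument names, and toy numbers are hypothetical.

```python
import numpy as np

def rtb_loss(log_z, post_step_logps, prior_step_logps, log_r):
    """Squared log-ratio residual for a single denoising trajectory.

    log_z            : learned scalar estimate of the log normalizing constant
    post_step_logps  : per-step log-probabilities under the posterior model
    prior_step_logps : per-step log-probabilities under the frozen prior
    log_r            : log of the constraint/likelihood at the final sample
    """
    residual = (log_z + np.sum(post_step_logps)
                - log_r - np.sum(prior_step_logps))
    return float(residual ** 2)

# Toy numbers: the posterior steps are chosen so the constraint holds exactly.
prior_logps = np.array([-1.2, -0.7, -0.3])
log_r, log_z = -0.5, 0.4
post_logps = prior_logps + (log_r - log_z) / len(prior_logps)
balanced = rtb_loss(log_z, post_logps, prior_logps, log_r)
perturbed = rtb_loss(log_z, post_logps + 0.1, prior_logps, log_r)
```

Because the residual depends only on log-probabilities of a given trajectory, the trajectory itself can come from any sampler, which is consistent with the off-policy training mentioned in the entry.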
2405.20853
Report |
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models |
Sijin Chen, Xin Chen, Anqi Pang, Xianfang Zeng, Wei Cheng, Yijun Fu, Fukun Yin, Yanru Wang, Zhibin Wang, Chi Zhang, Jingyi Yu, Gang Yu, Bin Fu, Tao Chen |
The polygon mesh representation of 3D data exhibits great flexibility, fast
rendering speed, and storage efficiency, which is widely preferred in various
applications. However, given its unstructured graph representation, the direct
generation of high-fidelity 3D meshes is challenging. Fortunately, with a
pre-defined ordering strategy, 3D meshes can be represented as sequences, and
the generation process can be seamlessly treated as an auto-regressive problem.
In this paper, we validate that the Neural Coordinate Field (NeurCF), an explicit
coordinate representation with implicit neural embeddings, is a
simple-yet-effective representation for large-scale sequential mesh modeling.
After that, we present MeshXL, a family of generative pre-trained
auto-regressive models, which addresses the process of 3D mesh generation with
modern large language model approaches. Extensive experiments show that MeshXL
is able to generate high-quality 3D meshes, and can also serve as foundation
models for various down-stream applications. |
This paper introduces MeshXL, a family of auto-regressive transformer models for direct generation of high-fidelity 3D meshes using a novel Neural Coordinate Field (NeurCF) representation. |
It addresses the difficulty of generating high-quality 3D meshes, which stems from their unstructured graph representation and the need for accurate spatial and connectivity estimation. |
MeshXL uses NeurCF, an explicit coordinate representation with implicit neural embeddings, and trains its models with a pre-defined ordering strategy for auto-regressive generation. The models are pre-trained on a large dataset of 2.5M meshes from ShapeNet, 3D-FUTURE, Objaverse, and Objaverse-XL. |
MeshXL outperforms prior art in generating high-quality and diverse 3D meshes, as evidenced by quantitative metrics (COV, MMD, 1-NNA, JSD, FID, KID) on the ShapeNet benchmark.
Demonstrates effectiveness in downstream tasks like shape completion and conditional mesh generation from images or text.
Shows improved performance with increasing model size and benefits from large-scale pre-training. |
Inference time is a limitation due to the auto-regressive process.
Future work can explore faster RNN-based methods or multi-token prediction to reduce inference cost. |
3d mesh generation, neural coordinate field, auto-regressive models, generative pre-training, transformer |
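The MeshXL entry above relies on a pre-defined ordering that turns a mesh into a sequence. The sketch below shows one plausible way to do this: coordinates are quantized to integer bins, the corners of each face are sorted, the faces are sorted, and everything is flattened. The bin count, the sorting rule, and the function name are assumptions for illustration and may differ from the ordering used in the paper; sorting corners also discards face orientation, which a real pipeline would likely preserve.

```python
import numpy as np

def mesh_to_sequence(vertices, faces, bins=128):
    """Flatten a triangle mesh into a list of quantized coordinate tokens."""
    v = np.asarray(vertices, dtype=float)
    lo, hi = v.min(), v.max()
    # Map every coordinate to an integer bin in [0, bins - 1].
    q = np.round((v - lo) / (hi - lo) * (bins - 1)).astype(int)
    face_tokens = []
    for face in faces:
        # Sort the three corners so the face has one canonical form.
        corners = sorted(tuple(int(c) for c in q[i]) for i in face)
        face_tokens.append(corners)
    face_tokens.sort()  # canonical order over faces
    return [c for face in face_tokens for corner in face for c in corner]

# A unit square split into two triangles.
square_vertices = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
square_faces = [[0, 1, 2], [1, 3, 2]]
sequence = mesh_to_sequence(square_vertices, square_faces)
```

Each triangle contributes nine tokens (three corners times three coordinates), so the sequence length grows linearly with the face count, which is why inference cost is listed as a limitation of the auto-regressive approach.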
2405.20791
Report |
GS-Phong: Meta-Learned 3D Gaussians for Relightable Novel View Synthesis |
Yumeng He, Yunbo Wang, Xiaokang Yang |
Decoupling the illumination in 3D scenes is crucial for novel view synthesis
and relighting. In this paper, we propose a novel method for representing a
scene illuminated by a point light using a set of relightable 3D Gaussian
points. Inspired by the Blinn-Phong model, our approach decomposes the scene
into ambient, diffuse, and specular components, enabling the synthesis of
realistic lighting effects. To facilitate the decomposition of geometric
information independent of lighting conditions, we introduce a novel bilevel
optimization-based meta-learning framework. The fundamental idea is to view the
rendering tasks under various lighting positions as a multi-task learning
problem, which our meta-learning approach effectively addresses by generalizing
the learned Gaussian geometries not only across different viewpoints but also
across diverse light positions. Experimental results demonstrate the
effectiveness of our approach in terms of training efficiency and rendering
quality compared to existing methods for free-viewpoint relighting. |
This paper introduces Phong-Inspired Gaussian Illumination Decomposition (Phong-GID), a novel method for representing and relighting 3D scenes illuminated by a point light using a set of relightable 3D Gaussian points. |
Decoupling illumination in 3D scenes is crucial for applications like novel view synthesis and relighting, especially under challenging One Light At a Time (OLAT) settings. |
The method decomposes the scene into ambient, diffuse, and specular components using the Blinn-Phong model and employs a bilevel optimization-based meta-learning framework to learn light-independent geometric information. |
Phong-GID demonstrates superior performance in novel view synthesis and relighting compared to existing 3D Gaussian Splatting-based methods on both synthetic and real-world OLAT datasets.
The proposed meta-learning framework effectively learns uniform Gaussian geometries that generalize across diverse viewpoints and light positions.
Ablation studies confirm the effectiveness of the decomposed rendering pipeline, geometry optimization via meta-learning, and introduced geometry and color priors. |
The model's robustness in handling extreme lighting conditions or highly complex scenes requires further investigation.
Future work will focus on extending the model to handle more challenging lighting scenarios and complex scene geometries. |
3d relighting, 3d gaussian splatting, novel view synthesis, meta-learning, blinn-phong model |
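The GS-Phong entry above builds on the Blinn-Phong decomposition into ambient, diffuse, and specular terms. The following is a minimal sketch of the classic Blinn-Phong shading formula for a single point under a point light; it is the textbook model rather than the paper's renderer, and the argument names and sample values are illustrative assumptions.

```python
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def blinn_phong(normal, to_light, to_view, ambient, diffuse, specular, shininess):
    """Shade one point as ambient + diffuse + specular (classic Blinn-Phong)."""
    n = normalize(np.asarray(normal, dtype=float))
    l = normalize(np.asarray(to_light, dtype=float))
    v = normalize(np.asarray(to_view, dtype=float))
    h = normalize(l + v)                      # half vector between light and view
    diff = max(float(n @ l), 0.0)             # Lambertian term
    # No highlight when the surface faces away from the light.
    spec = max(float(n @ h), 0.0) ** shininess if diff > 0 else 0.0
    return ambient + diffuse * diff + specular * spec

ambient = np.array([0.05, 0.05, 0.05])
diffuse = np.array([0.6, 0.3, 0.2])
specular = np.array([0.3, 0.3, 0.3])
facing = blinn_phong([0, 0, 1], [0, 0, 1], [0, 0, 1],
                     ambient, diffuse, specular, shininess=32)
behind = blinn_phong([0, 0, 1], [0, 0, -1], [0, 0, 1],
                     ambient, diffuse, specular, shininess=32)
```

Only the diffuse and specular terms depend on the light direction, which is what makes it possible to keep geometry fixed while moving the light.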
2405.20750
Report |
Diffusion Models Are Innate One-Step Generators |
Bowen Zheng, Tianming Yang |
Diffusion Models (DMs) have achieved great success in image generation and
other fields. By fine sampling through the trajectory defined by the SDE/ODE
solver based on a well-trained score model, DMs can generate remarkable
high-quality results. However, this precise sampling often requires multiple
steps and is computationally demanding. To address this problem, instance-based
distillation methods have been proposed to distill a one-step generator from a
DM by having a simpler student model mimic a more complex teacher model. Yet,
our research reveals an inherent limitation in these methods: the teacher
model, with more steps and more parameters, occupies different local minima
compared to the student model, leading to suboptimal performance when the
student model attempts to replicate the teacher. To avoid this problem, we
introduce a novel distributional distillation method, which uses an exclusive
distributional loss. This method exceeds state-of-the-art (SOTA) results while
requiring significantly fewer training images. Additionally, we show that DMs'
layers are activated differently at different time steps, leading to an
inherent capability to generate images in a single step. Freezing most of the
convolutional layers in a DM during distributional distillation leads to
further performance improvements. Our method achieves the SOTA results on
CIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and
ImageNet 64x64 (FID 1.16) with great efficiency. Most of those results are
obtained with only 5 million training images within 6 hours on 8 A100 GPUs.
This breakthrough not only enhances the understanding of efficient image
generation models but also offers a scalable framework for advancing the state
of the art in various applications. |
This paper introduces GDD, a novel distributional distillation method for training one-step image generators from pre-trained diffusion models, using only a distributional loss (GAN loss) without instance-level supervision. |
Diffusion models (DMs) excel in image generation but suffer from high computational cost due to multi-step sampling. Existing distillation methods are either computationally expensive or yield suboptimal performance. |
The authors first analyze the limitations of instance-based distillation methods, attributing them to the different local minima occupied by the teacher and student models. They then propose GDD, which uses solely a GAN loss to train a one-step generator, initialized from a pre-trained DM, against real data. |
GDD surpasses state-of-the-art (SOTA) results on CIFAR-10, AFHQv2 64x64, FFHQ 64x64, and ImageNet 64x64 with fewer training images.
Analysis reveals differential activation of DM layers across time steps, suggesting innate one-step generation capability.
GDD-I, a variant freezing most convolutional layers during distillation, further improves performance, supporting the innate capability hypothesis. |
Experiments are mainly conducted on low-resolution datasets, and performance on high-resolution datasets needs further investigation.
While the study shows differential layer activation, the specific roles of these layers in multi-step vs. one-step generation remain to be explored. |
diffusion models, image generation, model distillation, generative adversarial networks (gans), one-step generation |
2405.20721
Report |
ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model |
Yufei Wang, Zhihao Li, Lanqing Guo, Wenhan Yang, Alex C. Kot, Bihan Wen |
Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for
novel view synthesis, offering fast rendering speeds and high fidelity.
However, the large number of Gaussians and their associated attributes require
effective compression techniques. Existing methods primarily compress neural
Gaussians individually and independently, i.e., coding all the neural Gaussians
at the same time, with little design for their interactions and spatial
dependence. Inspired by the effectiveness of the context model in image
compression, we propose the first autoregressive model at the anchor level for
3DGS compression in this work. We divide anchors into different levels and the
anchors that are not coded yet can be predicted based on the already coded ones
in all the coarser levels, leading to more accurate modeling and higher coding
efficiency. To further improve the efficiency of entropy coding, e.g., to code
the coarsest level with no already coded anchors, we propose to introduce a
low-dimensional quantized feature as the hyperprior for each anchor, which can
be effectively compressed. Our work pioneers the context model in the anchor
level for 3DGS representation, yielding an impressive size reduction of over
100 times compared to vanilla 3DGS and 15 times compared to the most recent
state-of-the-art work Scaffold-GS, while achieving comparable or even higher
rendering quality. |
This paper proposes ContextGS, a novel autoregressive model for compressing 3D Gaussian Splatting (3DGS) representations by leveraging spatial dependencies among anchor points. |
3DGS enables fast, high-fidelity novel view synthesis but suffers from large storage requirements, necessitating efficient compression techniques. |
ContextGS divides anchors into hierarchical levels, using decoded anchors from coarser levels to predict the distribution of anchors at finer levels. Additionally, it employs a quantized hyperprior feature as an additional prior for each anchor to enhance entropy coding efficiency. |
Achieves a size reduction of about 15x compared to Scaffold-GS and over 100x compared to vanilla 3DGS.
Maintains comparable or even higher rendering quality compared to previous methods.
Demonstrates the effectiveness of anchor-level context modeling and hyperprior features in reducing spatial redundancy. |
Entropy coding process introduces additional computational costs during training and decompression.
Further exploration of anchor position compression is needed for optimal performance. |
3d gaussian splatting, 3dgs compression, context modeling, autoregressive models, novel view synthesis |
2405.20674
Report |
4Diffusion: Multi-view Video Diffusion Model for 4D Generation |
Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao |
Current 4D generation methods have achieved noteworthy efficacy with the aid
of advanced diffusion generative models. However, these methods lack multi-view
spatial-temporal modeling and encounter challenges in integrating diverse prior
knowledge from multiple diffusion models, resulting in inconsistent temporal
appearance and flickers. In this paper, we propose a novel 4D generation
pipeline, namely 4Diffusion, aimed at generating spatial-temporally consistent
4D content from a monocular video. We first design a unified diffusion model
tailored for multi-view video generation by incorporating a learnable motion
module into a frozen 3D-aware diffusion model to capture multi-view
spatial-temporal correlations. After training on a curated dataset, our
diffusion model acquires reasonable temporal consistency and inherently
preserves the generalizability and spatial consistency of the 3D-aware
diffusion model. Subsequently, we propose 4D-aware Score Distillation Sampling
loss, which is based on our multi-view video diffusion model, to optimize 4D
representation parameterized by dynamic NeRF. This aims to eliminate
discrepancies arising from multiple diffusion models, allowing for generating
spatial-temporally consistent 4D content. Moreover, we devise an anchor loss to
enhance the appearance details and facilitate the learning of dynamic NeRF.
Extensive qualitative and quantitative experiments demonstrate that our method
achieves superior performance compared to previous methods. |
This paper proposes 4Diffusion, a novel pipeline for generating 4D content from monocular videos, featuring a unified diffusion model called 4DM for multi-view spatial-temporal consistency. |
Generating high-quality 4D content with spatial-temporal consistency is challenging due to the limitations of integrating knowledge from multiple diffusion models in previous approaches, leading to artifacts like inconsistent appearance and flickers. |
The authors design 4DM by incorporating a learnable motion module into a frozen 3D-aware diffusion model. They then leverage 4DM to optimize dynamic NeRF using a 4D-aware SDS loss and an anchor loss for enhanced appearance details. |
4Diffusion generates 4D content with superior spatial-temporal consistency and motion coherence compared to baseline methods.
The proposed 4DM effectively captures multi-view spatial-temporal correlations even when trained on a small, curated dataset.
Quantitative evaluations using CLIP-I, CLIP-C, FVD, and LPIPS demonstrate the superiority of 4Diffusion over existing techniques. |
The quality of the multi-view video diffusion model is limited by the base model's capability and the scale of the high-quality training data.
The reliance on volumetric rendering in the 4D generation pipeline leads to slow training speeds, demanding exploration of faster 3D and GS techniques. |
4d content generation, diffusion models, multi-view video generation, dynamic nerf, spatial-temporal consistency |
2405.20669
Report |
Fourier123: One Image to High-Quality 3D Object Generation with Hybrid Fourier Score Distillation |
Shuzhou Yang, Yu Wang, Haijie Li, Jiarui Meng, Xiandong Meng, Jian Zhang |
Single image-to-3D generation is pivotal for crafting controllable 3D assets.
Given its underconstrained nature, we leverage geometric priors from a 3D novel
view generation diffusion model and appearance priors from a 2D image
generation method to guide the optimization process. We note that a disparity
exists between the training datasets of 2D and 3D diffusion models, leading to
their outputs showing marked differences in appearance. Specifically, 2D models
tend to deliver more detailed visuals, whereas 3D models produce consistent yet
over-smooth results across different views. Hence, we optimize a set of 3D
Gaussians using 3D priors in spatial domain to ensure geometric consistency,
while exploiting 2D priors in the frequency domain through Fourier transform
for higher visual quality. This 2D-3D hybrid Fourier Score Distillation
objective function (dubbed hy-FSD), can be integrated into existing 3D
generation methods, yielding significant performance improvements. With this
technique, we further develop an image-to-3D generation pipeline to create
high-quality 3D objects within one minute, named Fourier123. Extensive
experiments demonstrate that Fourier123 excels in efficient generation with
rapid convergence speed and visual-friendly generation results. |
This paper proposes Fourier123, an efficient image-to-3D generation pipeline that leverages both spatial and frequency domain information to generate high-quality 3D objects within one minute. |
Single image-to-3D generation is crucial for creating controllable 3D assets, but existing methods struggle to balance efficiency and visual quality. |
The paper introduces hybrid Fourier Score Distillation (hy-FSD) which uses a 3D diffusion model for geometric consistency in the spatial domain and a 2D diffusion model for high-quality appearance in the frequency domain. Fourier123 initializes with a large 3D reconstruction model and then optimizes using hy-FSD. |
hy-FSD significantly improves the performance of existing optimization-based 3D generation methods.
Fourier123 generates high-quality 3D objects with reliable structures and elegant appearances.
Fourier123 achieves a good balance between generation quality and speed, producing results within one minute on a single NVIDIA 4090 GPU. |
The method may occasionally encounter generation failures due to the inherent randomness of the task.
Future work could focus on improving the robustness and generalization ability of the method. |
image-to-3d generation, 3d gaussian splatting, score distillation sampling, diffusion models, frequency domain analysis |
2405.20510
Report |
Physically Compatible 3D Object Modeling from a Single Image |
Minghao Guo, Bohan Wang, Pingchuan Ma, Tianyuan Zhang, Crystal Elaine Owens, Chuang Gan, Joshua B. Tenenbaum, Kaiming He, Wojciech Matusik |
We present a computational framework that transforms single images into 3D
physical objects. The visual geometry of a physical object in an image is
determined by three orthogonal attributes: mechanical properties, external
forces, and rest-shape geometry. Existing single-view 3D reconstruction methods
often overlook this underlying composition, presuming rigidity or neglecting
external forces. Consequently, the reconstructed objects fail to withstand
real-world physical forces, resulting in instability or undesirable deformation
-- diverging from their intended designs as depicted in the image. Our
optimization framework addresses this by embedding physical compatibility into
the reconstruction process. We explicitly decompose the three physical
attributes and link them through static equilibrium, which serves as a hard
constraint, ensuring that the optimized physical shapes exhibit desired
physical behaviors. Evaluations on a dataset collected from Objaverse
demonstrate that our framework consistently enhances the physical realism of 3D
models over existing methods. The utility of our framework extends to practical
applications in dynamic simulations and 3D printing, where adherence to
physical compatibility is paramount. |
This paper proposes a computational framework that reconstructs physically plausible 3D objects from single images by incorporating physical compatibility constraints. |
Existing single-view 3D reconstruction methods often neglect physical principles, leading to objects that exhibit instability or unrealistic deformation under real-world forces. This limits their practical utility in applications like simulation and 3D printing. |
The framework explicitly decomposes the object's geometry into mechanical properties, external forces, and rest-shape geometry, linked through static equilibrium constraints. It then optimizes the rest-shape geometry using implicit differentiation to ensure the object aligns with the input image while adhering to physical laws. |
The method improves the physical compatibility of 3D models generated by various single-view reconstruction techniques.
Objects generated using this framework exhibit enhanced stability, reduced stress, and greater fidelity to the input image under simulated gravity.
The framework enables the generation of objects with diverse physical behaviors from the same image by varying material properties. |
The framework currently relies on predefined material properties and external forces, limiting its automation.
Future work includes exploring differentiable mesh conversion for seamless integration with pre-trained reconstruction models and extending the approach to dynamic object reconstruction from videos. |
3d reconstruction, physical simulation, static equilibrium, implicit differentiation, fabrication-aware design |
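The static-equilibrium constraint and implicit differentiation described in the entry above can be illustrated with a toy sketch. This is not the paper's solver. It assumes a single one-dimensional elastic bar under a fixed load, whose stiffness is `ea / rest_len`, and all names (`equilibrium_length`, `dx_drest`, `fit_rest_length`, `ea`, `load`) are hypothetical. The rest length is optimized so that the deformed length, not the rest length, matches a target.

```python
# Toy 1D analogue (assumed setup, not the paper's method): a bar with rest
# length r and axial stiffness ea / r carries a fixed load. Static equilibrium
# is F(x, r) = ea * (x - r) / r - load = 0, where x is the deformed length.

def equilibrium_length(rest_len, ea=50.0, load=10.0):
    """Solve F(x, r) = 0 for x (closed form for this toy model)."""
    return rest_len * (1.0 + load / ea)

def dx_drest(x, rest_len, ea=50.0):
    """Implicit differentiation: dx/dr = -(dF/dx)^-1 * dF/dr."""
    dF_dx = ea / rest_len
    dF_dr = -ea * x / rest_len ** 2
    return -dF_dr / dF_dx

def fit_rest_length(target_len, rest_len=1.0, lr=0.2, steps=200):
    """Gradient descent on the rest length so the deformed bar hits target_len."""
    for _ in range(steps):
        x = equilibrium_length(rest_len)
        # Chain rule through the equilibrium constraint.
        grad = 2.0 * (x - target_len) * dx_drest(x, rest_len)
        rest_len -= lr * grad
    return rest_len

rest = fit_rest_length(target_len=1.0)
print(rest, equilibrium_length(rest))
```

The optimized rest length comes out shorter than the target because the load stretches the bar. The same pattern applies when the equilibrium has to be solved numerically: the optimizer differentiates through the equilibrium condition rather than through the solver's iterations.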
2405.20343
Report |
Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image |
Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, Kaisheng Ma |
In this work, we introduce Unique3D, a novel image-to-3D framework for
efficiently generating high-quality 3D meshes from single-view images,
featuring state-of-the-art generation fidelity and strong generalizability.
Previous methods based on Score Distillation Sampling (SDS) can produce
diversified 3D results by distilling 3D knowledge from large 2D diffusion
models, but they usually suffer from long per-case optimization time and
inconsistency issues. Recent works address the problem and generate better 3D
results either by finetuning a multi-view diffusion model or training a fast
feed-forward model. However, they still lack intricate textures and complex
geometries due to inconsistency and limited generated resolution. To
simultaneously achieve high fidelity, consistency, and efficiency in single
image-to-3D, we propose a novel framework Unique3D that includes a multi-view
diffusion model with a corresponding normal diffusion model to generate
multi-view images with their normal maps, a multi-level upscale process to
progressively improve the resolution of generated orthographic multi-views, as
well as an instant and consistent mesh reconstruction algorithm called ISOMER,
which fully integrates the color and geometric priors into mesh results.
Extensive experiments demonstrate that our Unique3D significantly outperforms
other image-to-3D baselines in terms of geometric and textural details. |
Unique3D is a novel image-to-3D framework that efficiently generates high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. |
Previous methods suffer from long optimization times, inconsistencies, and limitations in generated resolution, hindering their ability to produce intricate textures and complex geometries. Unique3D aims to address these challenges and achieve high fidelity, consistency, and efficiency in single-image 3D generation. |
Unique3D uses a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images and normal maps. It then employs a multi-level upscale process to improve resolution and introduces ISOMER, an instant and consistent mesh reconstruction algorithm that integrates color and geometric priors into the final mesh. |
Unique3D significantly outperforms existing image-to-3D baselines in terms of geometric and textural details, as demonstrated through extensive experiments.
The method achieves high resolution and intricate details in both geometry and material, surpassing previous approaches.
Unique3D generates high-fidelity, diverse, and multi-view consistent meshes from single-view wild images within 30 seconds. |
The multi-view prediction model may produce less satisfactory predictions for skewed or non-perspective input images.
The geometric coloring algorithm currently does not support texture maps.
Future work aims to enhance the robustness of the multi-view prediction model by training on a more extensive and diverse dataset and to incorporate texture map support in the coloring algorithm. |
image-to-3d, 3d mesh generation, diffusion models, mesh reconstruction, isomer |
2405.20340
Report |
MotionLLM: Understanding Human Behaviors from Human Motions and Videos |
Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang |
This study delves into the realm of multi-modality (i.e., video and motion
modalities) human behavior understanding by leveraging the powerful
capabilities of Large Language Models (LLMs). Diverging from recent LLMs
designed for video-only or motion-only understanding, we argue that
understanding human behavior necessitates joint modeling from both videos and
motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics
and semantics effectively. In light of this, we present MotionLLM, a
straightforward yet effective framework for human motion understanding,
captioning, and reasoning. Specifically, MotionLLM adopts a unified
video-motion training strategy that leverages the complementary advantages of
existing coarse video-text data and fine-grained motion-text data to glean rich
spatial-temporal insights. Furthermore, we collect a substantial dataset,
MoVid, comprising diverse videos, motions, captions, and instructions.
Additionally, we propose MoVid-Bench, with careful manual annotations,
for better evaluation of human behavior understanding on video and motion.
Extensive experiments show the superiority of MotionLLM in captioning,
spatial-temporal comprehension, and reasoning. |
This paper introduces MotionLLM, a unified framework for understanding human behaviors from both video and motion data, bridging the gap between these modalities and language. |
Existing LLM methods for human behavior understanding focus on either video or motion, failing to leverage the complementary advantages of both. Joint modeling is crucial for capturing nuanced body dynamics and semantics. |
MotionLLM employs a two-stage training strategy: 1) Modality translation to project motion and video data into linguistic space using trainable translators. 2) Motion-video unified instruction tuning to fine-tune both translators and the LLM using a new dataset, MoVid, containing paired video-motion-text data. |
MotionLLM significantly outperforms previous methods in both motion and video understanding benchmarks.
Ablation studies show that integrating motion data improves video understanding, and vice versa, demonstrating the effectiveness of joint modeling.
MotionLLM exhibits strong spatial-temporal comprehension and reasoning abilities for human behaviors, paving the way for applications like fitness coaching. |
The video encoder's limited capacity restricts the amount of video information processed.
Future work could explore higher-capacity video encoders and investigate potential negative impacts of LLM advancements. |
human behavior understanding, large language models, multi-modality learning, video understanding, motion analysis |
2405.20339
Report |
Visual Perception by Large Language Model's Weights |
Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun |
Existing Multimodal Large Language Models (MLLMs) follow the paradigm that
perceives visual information by aligning visual features with the input space
of Large Language Models (LLMs), and concatenating visual tokens with text
tokens to form a unified sequence input for LLMs. These methods demonstrate
promising results on various vision-language tasks but are limited by the high
computational effort due to the extended input sequence resulting from the
involvement of visual tokens. In this paper, instead of input space alignment,
we propose a novel parameter space alignment paradigm that represents visual
information as model weights. For each input image, we use a vision encoder to
extract visual features, convert features into perceptual weights, and merge
the perceptual weights with LLM's weights. In this way, the input of LLM does
not require visual tokens, which reduces the length of the input sequence and
greatly improves efficiency. Following this paradigm, we propose VLoRA with the
perceptual weights generator. The perceptual weights generator is designed to
convert visual features to perceptual weights with low-rank property,
exhibiting a form similar to LoRA. The experimental results show that our VLoRA
achieves comparable performance on various benchmarks for MLLMs, while
significantly reducing the computational costs for both training and inference.
The code and models will be made open-source. |
This paper proposes VLoRA, a novel parameter space alignment paradigm for Multimodal Large Language Models (MLLMs) that enhances efficiency by representing visual information as model weights instead of using visual tokens. |
Existing MLLMs, based on input space alignment with visual tokens, suffer from high computational costs due to increased input sequence length, especially for high-resolution images. VLoRA addresses this inefficiency by eliminating the need for visual tokens in LLM input. |
VLoRA uses a vision encoder to extract visual features from an image and then converts these features into perceptual weights using a perceptual weights generator. These weights, designed with a low-rank property similar to LoRA, are directly merged with the LLM's weights, enabling visual perception without extra input tokens. |
VLoRA achieves comparable performance to state-of-the-art MLLMs on benchmarks like MMBench, ScienceQA, HallusionBench, and MMMU.
It significantly reduces computational overhead, requiring only 8% of the FLOPs of LLaVA-v1.5 for inference.
Ablation studies demonstrate the impact of different components, such as the type of LLM weights integrated and the rank of perceptual weights. |
The current vision encoder, CLIP, might not be optimal for converting features to model weights, demanding exploration of more suitable encoders.
Using separate perceptual weights generators for each weight type may limit inter-weight correlation, suggesting a potential improvement by generating all weight types from a single generator. |
multimodal large language models, parameter space alignment, perceptual weights, low-rank adaptation, computational efficiency |
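The parameter-space merge described in the VLoRA entry above can be sketched in a few lines. This is a minimal illustration, assuming the perceptual weights take the LoRA-like form of a product of two thin matrices that is added to an existing weight matrix. The function names and the tiny matrices are hypothetical, and in the paper the factors are produced from visual features by a learned generator, which is not modeled here.

```python
# Minimal sketch (assumed form): merged = W + A @ B, where A is d x r and
# B is r x k with small rank r, so no visual tokens enter the input sequence.

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_perceptual_weights(w, a, b):
    """Add the low-rank update a @ b to the weight matrix w."""
    delta = matmul(a, b)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical 3 x 3 weight matrix and rank-1 factors standing in for the
# output of a perceptual weights generator.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[1.0], [2.0], [3.0]]   # d x r
b = [[0.1, 0.0, 0.0]]       # r x k
merged = merge_perceptual_weights(w, a, b)
print(merged)
```

The two factors hold d*r + r*k numbers instead of the d*k a full update would need. The merged matrix has the same shape as the original, so the forward pass runs on text tokens only.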
2405.20337
Report |
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving |
Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu |
Understanding the evolution of 3D scenes is important for effective
autonomous driving. While conventional methods model scene development with the
motion of individual instances, world models emerge as a generative framework
to describe the general scene dynamics. However, most existing methods adopt an
autoregressive framework to perform next-token prediction, which suffers from
inefficiency in modeling long-term temporal evolutions. To address this, we
propose a diffusion-based 4D occupancy generation model, OccSora, to simulate
the development of the 3D world for autonomous driving. We employ a 4D scene
tokenizer to obtain compact discrete spatial-temporal representations for 4D
occupancy input and achieve high-quality reconstruction for long-sequence
occupancy videos. We then learn a diffusion transformer on the spatial-temporal
representations and generate 4D occupancy conditioned on a trajectory prompt.
We conduct extensive experiments on the widely used nuScenes dataset with Occ3D
occupancy annotations. OccSora can generate 16-second videos with authentic 3D layout
and temporal consistency, demonstrating its ability to understand the spatial
and temporal distributions of driving scenes. With trajectory-aware 4D
generation, OccSora has the potential to serve as a world simulator for the
decision-making of autonomous driving. Code is available at:
https://github.com/wzzheng/OccSora. |
This paper proposes OccSora, a diffusion-based 4D occupancy generation model that simulates the development of 3D worlds for autonomous driving, conditioned on a trajectory prompt. |
Understanding the evolution of 3D scenes is crucial for effective autonomous driving. Existing methods struggle to efficiently model long-term temporal evolutions. |
The approach utilizes a 4D scene tokenizer to compress 4D occupancy data into compact representations. Then, a diffusion transformer learns from these representations and generates 4D occupancy conditioned on trajectory information. |
OccSora achieves high-quality reconstruction for long-sequence occupancy videos.
The model generates realistic 16-second videos with authentic 3D layout and temporal consistency.
OccSora exhibits the ability to generate diverse scenes conditioned on different input trajectories. |
The granularity of voxel data limits the level of detail in the generated scenes.
Inconsistent details for moving objects suggest a need for larger and more diverse training data. |
autonomous driving, world models, 4d occupancy, diffusion models, trajectory generation |
2405.20336
Report |
RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text |
Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan |
In this work, we introduce a challenging task for simultaneously generating
3D holistic body motions and singing vocals directly from textual lyrics
inputs, advancing beyond existing works that typically address these two
modalities in isolation. To facilitate this, we first collect the RapVerse
dataset, a large dataset containing synchronous rapping vocals, lyrics, and
high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate
the extent to which scaling autoregressive multimodal transformers across
language, audio, and motion can enhance the coherent and realistic generation
of vocals and whole-body human motions. For modality unification, a
vector-quantized variational autoencoder is employed to encode whole-body
motion sequences into discrete motion tokens, while a vocal-to-unit model is
leveraged to obtain quantized audio tokens preserving content, prosodic
information, and singer identity. By jointly performing transformer modeling on
these three modalities in a unified way, our framework ensures a seamless and
realistic blend of vocals and human motions. Extensive experiments demonstrate
that our unified generation framework not only produces coherent and realistic
singing vocals alongside human motions directly from textual inputs but also
rivals the performance of specialized single-modality generation systems,
establishing new benchmarks for joint vocal-motion generation. The project page
is available for research purposes at https://vis-www.cs.umass.edu/RapVerse. |
This paper introduces a novel framework for the simultaneous generation of 3D whole-body motions and singing vocals directly from textual lyrics, aiming to create more immersive and realistic digital interactions. |
This endeavor is crucial for enhancing virtual performances, interactive gaming, and virtual avatar realism by creating a more expressive and nuanced communication of emotions, intentions, and context in digital content. |
The authors introduce 'RapVerse,' a large-scale dataset with lyrics, vocals, and 3D motions. They employ VQVAEs to represent motion as discrete tokens and a Vocal2unit model for quantized audio tokens. A transformer-based architecture then jointly models these modalities for unified generation. |
The proposed framework generates realistic singing vocals and human motions directly from text, achieving temporal alignment between the two modalities.
The model rivals the performance of specialized single-modality generation systems, demonstrating its effectiveness in joint generation.
Using compositional VQVAEs for motion encoding, particularly separate ones for face, body, and hand, is crucial for capturing detailed facial expressions, leading to more realistic motion synthesis. |
The current dataset, 'RapVerse,' is limited to rap music, and expanding to other music genres is left for future work.
Future work could explore multi-performer audio and motion generation, such as virtual live bands, for broader applications. |
text-to-speech, text-to-motion, multimodal generation, deep learning, computer vision |
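The RapVerse entry above relies on vector quantization to turn continuous motion features into discrete tokens. The core lookup step can be sketched as follows. This is a generic nearest-codebook assignment, not the authors' implementation, and the codebook values and feature vectors are made up for illustration.

```python
# Generic vector-quantization lookup: each feature vector is replaced by the
# index of its nearest codebook entry (squared Euclidean distance).

def quantize(vectors, codebook):
    """Return one token id per input vector."""
    tokens = []
    for v in vectors:
        dists = [sum((vi - ci) ** 2 for vi, ci in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Hypothetical 2D codebook with three entries and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
tokens = quantize(features, codebook)
print(tokens)
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder. The resulting token ids are what a transformer can then model alongside text and audio tokens.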
2405.20334
Report |
VividDream: Generating 3D Scene with Ambient Dynamics |
Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y. Feng, Jia-Bin Huang |
We introduce VividDream, a method for generating explorable 4D scenes with
ambient dynamics from a single input image or text prompt. VividDream first
expands an input image into a static 3D point cloud through iterative
inpainting and geometry merging. An ensemble of animated videos is then
generated using video diffusion models with quality refinement techniques and
conditioned on renderings of the static 3D scene from the sampled camera
trajectories. We then optimize a canonical 4D scene representation using an
animated video ensemble, with per-video motion embeddings and visibility masks
to mitigate inconsistencies. The resulting 4D scene enables free-view
exploration of a 3D scene with plausible ambient scene dynamics. Experiments
demonstrate that VividDream can provide human viewers with compelling 4D
experiences generated based on diverse real images and text prompts. |
VividDream: a novel method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. |
Current research on 4D generation primarily focuses on individual objects, lacking the ability to create comprehensive and immersive 4D scenes with ambient motion. |
The method consists of three stages: 1) Expanding an initial 3D point cloud via iterative inpainting and geometry merging, 2) Generating an ensemble of animated videos using diffusion models conditioned on renderings of the static scene, 3) Optimizing a 4D scene representation using the animated videos, addressing inconsistencies with visibility masking and per-video motion embeddings. |
Generates compelling 4D scene experiences with plausible ambient dynamics from real images and text prompts.
Overcomes the limitations of single-video reconstruction by utilizing multi-view animation and mitigating inconsistencies.
Enables free-view exploration of the generated 4D scenes, offering a more immersive experience compared to static 3D. |
Reliance on a series of successful processes in 3D scene generation and video generation can lead to quality degradation if any stage fails (e.g., inaccurate depth estimation).
Limited control over scene motion generation, particularly for non-realistic images, highlighting the need for more advanced video generation models. |
4d scene generation, ambient dynamics, text-to-3d, video diffusion models, multi-view animation |
2405.20330
Report |
4DHands: Reconstructing Interactive Hands in 4D with Transformers |
Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Yebin Liu, Wei Jing, Qi Yan, Qianying Wang, Hongwen Zhang |
In this paper, we introduce 4DHands, a robust approach to recovering
interactive hand meshes and their relative movement from monocular inputs. Our
approach addresses two major limitations of previous methods: lacking a unified
solution for handling various hand image inputs and neglecting the positional
relationship of two hands within images. To overcome these challenges, we
develop a transformer-based architecture with novel tokenization and feature
fusion strategies. Specifically, we propose a Relation-aware Two-Hand
Tokenization (RAT) method to embed positional relation information into the
hand tokens. In this way, our network can handle both single-hand and two-hand
inputs and explicitly leverage relative hand positions, facilitating the
reconstruction of intricate hand interactions in real-world scenarios. As such
tokenization indicates the relative relationship of two hands, it also supports
more effective feature fusion. To this end, we further develop a
Spatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D
with attention and decode them into 3D hand meshes and relative temporal
movements. The efficacy of our approach is validated on several benchmark
datasets. The results on in-the-wild videos and real-world scenarios
demonstrate the superior performances of our approach for interactive hand
reconstruction. More video results can be found on the project page:
https://4dhands.github.io. |
4DHands, a robust method for reconstructing interactive hand meshes and their relative motion from monocular images, addressing limitations of previous approaches in handling diverse hand inputs and capturing inter-hand relationships. |
Accurate and stable 4D hand mesh recovery is crucial for applications like VR/AR, HCI, robotics, and embodied AI, particularly in real-world scenarios with complex hand interactions. |
Transformer-based architecture featuring (1) Relation-aware Two-Hand Tokenization (RAT) to embed positional information into hand tokens, enabling unified handling of single/two-hand inputs and capturing relative hand positions; (2) Spatio-temporal Interaction Reasoning (SIR) module for fusing 4D hand features and decoding them into 3D meshes and temporal movements. |
Outperforms state-of-the-art methods on InterHand2.6M and DexYCB datasets for both single and two-hand mesh reconstruction.
Shows superior stability and accuracy on in-the-wild datasets (HIC, ARCTIC, RenderIH) compared to previous methods.
Achieves robust 4D hand recovery even with occlusions and motion blur by effectively fusing temporal information. |
Performance slightly degrades on in-the-wild datasets compared to InterHand2.6M due to the domain gap.
Future work includes exploring hand-object interactions and incorporating hand gestures for more comprehensive understanding. |
4d hand mesh recovery, monocular reconstruction, transformer, hand interaction, spatio-temporal reasoning |
2405.20327
Report |
GECO: Generative Image-to-3D within a SECOnd |
Chen Wang, Jiatao Gu, Xiaoxiao Long, Yuan Liu, Lingjie Liu |
3D generation has seen remarkable progress in recent years. Existing
techniques, such as score distillation methods, produce notable results but
require extensive per-scene optimization, impacting time efficiency.
Alternatively, reconstruction-based approaches prioritize efficiency but
compromise quality due to their limited handling of uncertainty. We introduce
GECO, a novel method for high-quality 3D generative modeling that operates
within a second. Our approach addresses the prevalent issues of uncertainty and
inefficiency in current methods through a two-stage approach. In the initial
stage, we train a single-step multi-view generative model with score
distillation. Then, a second-stage distillation is applied to address the
challenge of view inconsistency from the multi-view prediction. This two-stage
process ensures a balanced approach to 3D generation, optimizing both quality
and efficiency. Our comprehensive experiments demonstrate that GECO achieves
high-quality image-to-3D generation with an unprecedented level of efficiency. |
This paper introduces GECO, a novel method for high-quality 3D generative modeling that operates within a second, addressing uncertainty and inefficiency issues in existing methods. |
Generating 3D assets is crucial for various applications, but existing methods are either time-consuming (score distillation) or compromise quality (reconstruction-based). GECO aims to bridge this gap, enabling fast and high-quality 3D generation. |
GECO utilizes a two-stage distillation approach: 1) Training a single-step multi-view generative model with score distillation from a pre-trained multi-view diffusion model. 2) Addressing view inconsistency through a second-stage distillation, jointly finetuning the multi-view generator and a pretrained 3D reconstruction model. |
GECO achieves high-quality 3D generation within a second, surpassing previous feed-forward baselines in visual quality, particularly in unseen views.
Quantitative comparisons on the GSO dataset demonstrate GECO's superior performance in PSNR, SSIM, and LPIPS compared to existing methods, including those relying on multi-step diffusion sampling.
GECO exhibits diversity, generating varied 3D models from different random seeds for the same input image, highlighting its generative capabilities. |
The training process involves two stages, which could be simplified in future work.
The quality of generated 3D models is limited by the consistency of multi-step sampling results from multi-view diffusion models. |
3d generation, score distillation, gaussian splatting, image-to-3d, generative modeling |
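The GECO entry above reports PSNR among its metrics. For readers unfamiliar with it, PSNR follows directly from the mean squared error between a reference and a rendered image. The sketch below uses the standard definition on flat lists of pixel values. The function name and the `max_value` default are illustrative choices, not taken from the paper's evaluation code.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on pixels in [0, 1] gives an MSE of about 0.01.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Higher PSNR means a closer match. SSIM and LPIPS, the other two metrics named in the entry, measure structural and perceptual similarity and are not reproduced here.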
2405.20325
Report |
MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion |
Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang |
Despite impressive advancements in diffusion-based video editing models in
altering video attributes, there has been limited exploration into modifying
motion information while preserving the original protagonist's appearance and
background. In this paper, we propose MotionFollower, a lightweight
score-guided diffusion model for video motion editing. To introduce conditional
controls to the denoising process, MotionFollower leverages two of our proposed
lightweight signal controllers, one for poses and the other for appearances,
both of which consist of convolution blocks without involving heavy attention
calculations. Further, we design a score guidance principle based on a
two-branch architecture, including the reconstruction and editing branches,
which significantly enhances the modeling capability of texture details and
complicated backgrounds. Concretely, we enforce several consistency
regularizers and losses during the score estimation. The resulting gradients
thus inject appropriate guidance to the intermediate latents, forcing the model
to preserve the original background details and protagonists' appearances
without interfering with the motion modification. Experiments demonstrate the
competitive motion editing ability of MotionFollower qualitatively and
quantitatively. Compared with MotionEditor, the most advanced motion editing
model, MotionFollower achieves an approximately 80% reduction in GPU memory
while delivering superior motion editing performance and exclusively supporting
large camera movements and actions. |
MotionFollower, a lightweight score-guided diffusion model for video motion editing that transfers motion from a target video to a source video while preserving the source's background, protagonist's appearance, and camera movement. |
Existing video editing models mainly focus on attribute-level editing and struggle to modify motion information while preserving other video details. MotionFollower addresses this gap by enabling motion editing while maintaining fidelity to the source video. |
MotionFollower employs two lightweight signal controllers (Pose Controller and Reference Controller) for efficient pose and appearance control. It also introduces a novel score guidance principle with a two-branch architecture (reconstruction and editing) to enforce consistency and preserve background and foreground details. |
MotionFollower achieves accurate motion editing and appearance preservation, outperforming competitors in qualitative comparisons.
Quantitative results demonstrate superior single-frame quality and video fidelity compared to state-of-the-art methods, with an 80% reduction in GPU memory compared to MotionEditor.
The model effectively handles large camera movements and complex backgrounds, demonstrating robustness and versatility in motion editing. |
MotionFollower struggles with background inpainting when the source video contains occlusions of small, distinct objects.
Future work includes exploring explicit inpainting adaptors to address background recovery in challenging scenarios. |
video motion editing, diffusion models, score guidance, appearance preservation, camera movement |
2405.20324
Report |
Don't drop your samples! Coherence-aware training benefits Conditional diffusion |
Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, David Picard |
Conditional diffusion models are powerful generative models that can leverage
various types of conditional information, such as class labels, segmentation
masks, or text captions. However, in many real-world scenarios, conditional
information may be noisy or unreliable due to human annotation errors or weak
alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a
novel method that integrates coherence in conditional information into
diffusion models, allowing them to learn from noisy annotations without
discarding data. We assume that each data point has an associated coherence
score that reflects the quality of the conditional information. We then
condition the diffusion model on both the conditional information and the
coherence score. In this way, the model learns to ignore or discount the
conditioning when the coherence is low. We show that CAD is theoretically sound
and empirically effective on various conditional generation tasks. Moreover, we
show that leveraging coherence generates realistic and diverse samples that
respect conditional information better than models trained on cleaned datasets
where samples with low coherence have been discarded. |
The paper introduces Coherence-Aware Diffusion (CAD), a novel method for training conditional diffusion models that incorporates a coherence score to address the issue of noisy or unreliable conditional information, leading to improved generation quality and adherence to conditions. |
Training conditional diffusion models often relies on large, noisy datasets with misaligned image-condition pairs. Existing filtering methods discard valuable data, hindering performance. This paper introduces a method that learns from the samples that filtering would otherwise discard, improving generation quality. |
The proposed CAD method estimates a coherence score, reflecting the alignment between an image and its condition. This score is then used to condition the diffusion model alongside the original condition, enabling it to learn from both well-aligned and misaligned pairs. The authors also propose Coherence-Aware Classifier-Free Guidance (CA-CFG), refining CFG using coherence scores for enhanced image quality. |
CAD significantly outperforms baselines in terms of FID, achieving a 15-point improvement in text-to-image generation, while maintaining comparable CLIP scores.
User studies overwhelmingly favor CAD-generated images, indicating superior quality and prompt adherence.
Incorporating coherence scores improves semantic segmentation, enabling better object shape reconstruction and scene understanding, even with noisy or incomplete segmentation maps. |
The success of CAD depends heavily on the quality of coherence score estimation, which needs further investigation.
Future work includes exploring more robust and reliable methods for obtaining coherence scores to enhance CAD's effectiveness and generalizability. |
conditional image generation, diffusion models, coherence score, text-to-image synthesis, semantic segmentation |
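The CAD record above describes conditioning the model on both the condition and a coherence score, and a coherence-aware variant of classifier-free guidance. A minimal sketch of that idea follows. The names `toy_denoiser` and `ca_cfg`, the scalar inputs, and the choice of coherence 0 as the "ignore the condition" branch are assumptions made for illustration, not the authors' implementation.

```python
def toy_denoiser(x, cond, coherence):
    """Hypothetical stand-in for a coherence-conditioned noise predictor.

    The condition's influence is scaled by the coherence score in [0, 1],
    so coherence 0 means the condition is ignored entirely.
    """
    return 0.1 * x + coherence * cond


def ca_cfg(x, cond, guidance_scale, high=1.0, low=0.0):
    """Classifier-free-guidance-style combination across coherence levels.

    eps = eps_low + w * (eps_high - eps_low). Both branches see the same
    condition; only the coherence score differs (an assumption).
    """
    eps_high = toy_denoiser(x, cond, high)
    eps_low = toy_denoiser(x, cond, low)
    return eps_low + guidance_scale * (eps_high - eps_low)
```

With `guidance_scale=1.0` the result equals the high-coherence prediction, and larger scales push further along the direction from the low-coherence to the high-coherence prediction.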
2405.20323
Report |
$\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving |
Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang |
Photorealistic 3D reconstruction of street scenes is a critical technique for
developing real-world simulators for autonomous driving. Despite the efficacy
of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting
(3DGS) emerges as a promising direction due to its faster speed and more
explicit representation. However, most existing street 3DGS methods require
tracked 3D vehicle bounding boxes to decompose the static and dynamic elements
for effective reconstruction, limiting their applications for in-the-wild
scenarios. To facilitate efficient 3D scene reconstruction without costly
annotations, we propose a self-supervised street Gaussian
($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from
4D consistency. We represent each scene with 3D Gaussians to preserve the
explicitness and further accompany them with a spatial-temporal field network
to compactly model the 4D dynamics. We conduct extensive experiments on the
challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our
$\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic
scenes and achieves the best performance without using 3D annotations. Code is
available at: https://github.com/nnanhuang/S3Gaussian/. |
This paper proposes $S^3$Gaussian, the first self-supervised method to decompose dynamic and static 3D Gaussians in street scenes without manual annotations, for efficient 3D scene reconstruction. |
Photorealistic 3D reconstruction of street scenes is crucial for autonomous driving simulators, and while 3D Gaussian Splatting (3DGS) is promising for its speed and explicitness, existing methods often require costly 3D bounding box annotations. |
The method uses 3D Gaussians and a novel spatial-temporal field network. This network, with a multi-resolution Hexplane encoder and a multi-head Gaussian decoder, captures 4D dynamics and deforms the Gaussians, enabling self-supervised scene decomposition. |
$S^3$Gaussian achieves state-of-the-art rendering quality in scene reconstruction and novel view synthesis on Waymo-Open dataset.
It effectively decomposes static and dynamic scenes without 3D annotations.
The method surpasses previous approaches in reconstructing distant dynamic objects and capturing scene details. |
Modeling objects at high speeds is challenging due to the high variance in deformation fields and sparse views.
Future work includes addressing the limitations in reconstructing high-speed dynamic scenes. |
3d scene reconstruction, autonomous driving, gaussian splatting, self-supervised learning, dynamic scenes |
2405.20320
Report |
Improving the Training of Rectified Flows |
Sangyun Lee, Zinan Lin, Giulia Fanti |
Diffusion models have shown great promise for image and video generation, but
sampling from state-of-the-art models requires expensive numerical integration
of a generative ODE. One approach for tackling this problem is rectified flows,
which iteratively learn smooth ODE paths that are less susceptible to
truncation error. However, rectified flows still require a relatively large
number of function evaluations (NFEs). In this work, we propose improved
techniques for training rectified flows, allowing them to compete with
knowledge distillation methods even in the low NFE setting. Our main insight is
that under realistic settings, a single iteration of the Reflow algorithm for
training rectified flows is sufficient to learn nearly straight trajectories;
hence, the current practice of using multiple Reflow iterations is unnecessary.
We thus propose techniques to improve one-round training of rectified flows,
including a U-shaped timestep distribution and LPIPS-Huber premetric. With
these techniques, we improve the FID of the previous 2-rectified flow by up to
72% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\times$64, our improved
rectified flow outperforms the state-of-the-art distillation methods such as
consistency distillation and progressive distillation in both one-step and
two-step settings and rivals the performance of improved consistency training
(iCT) in FID. Code is available at https://github.com/sangyun884/rfpp. |
This paper introduces improved training techniques for rectified flows, enabling them to achieve competitive performance with knowledge distillation methods in the low function evaluation (NFE) regime, particularly for one- and two-step generation. |
Rectified flows offer advantages over knowledge distillation methods, such as generalizability to arbitrary distributions, support for inversion, likelihood evaluation, and flexible sample quality control, making them a promising alternative. |
The authors observe that the optimal 2-rectified flow generally exhibits near-zero trajectory curvature. Building upon this, they propose improved training techniques including a U-shaped timestep distribution to focus on challenging timesteps and an LPIPS-Huber premetric to enhance perceptual similarity. |
The improved 2-rectified flow++ outperforms state-of-the-art distillation methods in the 1-2 NFE regime on CIFAR-10 and ImageNet 64x64.
2-rectified flow++ achieves substantial FID reductions of up to 72% compared to vanilla 2-rectified flows.
The study demonstrates the potential computational efficiency of Reflow compared to other distillation methods. |
While showing promise, 2-rectified flow++ doesn’t yet outperform the best consistency models like iCT.
The training process for 2-rectified flow++ is slower than previous rectified flows due to the computational overhead from the LPIPS loss. |
rectified flows, diffusion models, generative modeling, knowledge distillation, low function evaluation |
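The rectified-flow record above mentions a U-shaped timestep distribution and a Huber-style premetric. The sketch below illustrates both on plain Python lists. The specific density (proportional to cosh(a·(t − 0.5))), the constants `a` and `c`, and the function names are assumptions chosen for illustration. The LPIPS term is omitted, so this is not the paper's exact loss.

```python
import math
import random


def sample_u_shaped_t(rng, a=4.0):
    """Draw t in [0, 1] from a density proportional to cosh(a * (t - 0.5)).

    This symmetric U shape is an illustrative assumption; it puts more mass
    near t = 0 and t = 1 than near the middle. Sampling uses the inverse CDF.
    """
    u = rng.random()
    return 0.5 + math.asinh((2.0 * u - 1.0) * math.sinh(a / 2.0)) / a


def interpolate(x0, x1, t):
    """Straight-line point x_t = (1 - t) * x0 + t * x1 used by rectified flows."""
    return [(1.0 - t) * p + t * q for p, q in zip(x0, x1)]


def pseudo_huber(pred, target, c=0.03):
    """Pseudo-Huber distance sqrt(||pred - target||^2 + c^2) - c."""
    sq = sum((p - q) ** 2 for p, q in zip(pred, target))
    return math.sqrt(sq + c * c) - c
```

A training step under these assumptions would draw `t` with `sample_u_shaped_t(random.Random(0))`, form `interpolate(x0, x1, t)`, and compare the predicted velocity with `x1 - x0` using `pseudo_huber`.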
2405.20310
Report |
A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction |
Jianghao Shen, Nan Xue, Tianfu Wu |
Learning 3D scene representation from a single-view image is a long-standing
fundamental problem in computer vision, with the inherent ambiguity in
predicting contents unseen from the input view. Built on the recently proposed
3D Gaussian Splatting (3DGS), the Splatter Image method has made promising
progress on fast single-image novel view synthesis via learning a single 3D
Gaussian for each pixel based on the U-Net feature map of an input image.
However, it has limited expressive power to represent occluded components that
are not observable in the input view. To address this problem, this paper
presents a Hierarchical Splatter Image method in which a pixel is worth more
than one 3D Gaussian. Specifically, each pixel is represented by a parent 3D
Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are
learned as done in the vanilla Splatter Image. Child 3D Gaussians are learned
via a lightweight Multi-Layer Perceptron (MLP) which takes as input the
projected image features of a parent 3D Gaussian and the embedding of a target
camera view. Both parent and child 3D Gaussians are learned end-to-end in a
stage-wise way. Conditioning jointly on the input image features as seen by the
parent Gaussians and on the target camera position facilitates learning to
allocate child Gaussians to "see the unseen", recovering the occluded details
that are often missed by parent Gaussians.
In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D
datasets with state-of-the-art performance obtained, especially showing
promising capabilities of reconstructing occluded contents in the input view. |
This paper introduces Hierarchical Splatter Image, a novel method for single-view 3D reconstruction that enhances the existing Splatter Image method by employing a hierarchy of parent-child 3D Gaussians to represent each pixel. |
The importance stems from addressing the limitations of conventional single-view 3D reconstruction techniques, particularly in representing occluded structures not visible in the input view. This hierarchical representation aims to improve the accuracy and reliability of 3D reconstruction from a single image. |
The methodology involves a two-stage learning process. Initially, parent 3D Gaussians are learned similarly to the vanilla Splatter Image. Subsequently, child 3D Gaussians are learned using lightweight MLPs, taking inputs from the parent Gaussian features and target camera view embeddings to recover occluded details. |
The proposed method achieves state-of-the-art performance on four single-image 3D reconstruction benchmarks (ShapeNet-SRN Chairs & Cars, CO3D Hydrants & Teddybears).
It demonstrates superior reconstruction of occluded content compared to the baseline Splatter Image method.
The approach maintains comparable model complexity to Splatter Image with a negligible increase in computational overhead. |
The performance slightly degrades when using relative camera positions instead of world coordinates.
Future work may explore incorporating richer input view information within the parent Gaussian features to improve relative camera pose handling. |
3d reconstruction, single-view reconstruction, 3d gaussian splatting, novel view synthesis, hierarchical representation |
2405.20305
Report |
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models |
Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee |
We introduce PlausiVL, a large video-language model for anticipating action
sequences that are plausible in the real world. While significant efforts have
been made towards anticipating future actions, prior approaches do not take
into account the aspect of plausibility in an action sequence. To address this
limitation, we explore the generative capability of a large video-language
model and further develop its understanding of plausibility in an
action sequence by introducing two objective functions, a counterfactual-based
plausible action sequence learning loss and a long-horizon action repetition
loss. We utilize temporal logical constraints as well as verb-noun action pair
logical constraints to create implausible/counterfactual action sequences and
use them to train the model with plausible action sequence learning loss. This
loss helps the model to differentiate between plausible and implausible
action sequences and also helps the model to learn implicit temporal cues
crucial for the task of action anticipation. The long-horizon action repetition
loss puts a higher penalty on the actions that are more prone to repetition
over a longer temporal window. With this penalization, the model is able to
generate diverse, plausible action sequences. We evaluate our approach on two
large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the
task of action anticipation. |
Introduces PlausiVL, a Video-Language Model (VLM) for anticipating plausible future action sequences in videos by incorporating temporal logic and reducing action repetition. |
Action anticipation is crucial for AI agents to understand and react to their environment, but current methods struggle to generate plausible and diverse sequences of actions. |
PlausiVL uses a Q-former to embed videos and align them with text embeddings in a large language model. It is trained with two novel losses: (1) Plausible Action Sequence Learning Loss, which uses counterfactuals based on temporal logic and verb-noun constraints to distinguish plausible sequences, and (2) Long-Horizon Action Repetition Loss, which penalizes repeated actions over longer timespans. |
PlausiVL outperforms existing VLMs and other action anticipation methods on the Ego4D and EPIC-Kitchens datasets.
Ablation studies confirm that both novel losses contribute to the model's improved performance in generating plausible and diverse action sequences.
PlausiVL demonstrates robustness to long-tail distributions and generalizability to unseen data. |
The model may still hallucinate implausible sequences, which warrants further investigation.
Future work could explore incorporating additional modalities, such as audio, to enhance the model's understanding of the scene. |
action anticipation, video-language models, temporal logic, plausibility, action repetition |
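The PlausiVL record above describes building counterfactual sequences from temporal constraints and penalizing repeated actions. The toy sketch below shows one way to do each on lists of action strings. The swap-based counterfactual, the count-based penalty, and the `window` parameter are illustrative assumptions, not the paper's loss definitions.

```python
def make_counterfactual(sequence, before, after):
    """Swap two actions so that `after` occurs before `before`.

    `before` must precede `after` in a plausible sequence (a temporal
    constraint); the swapped copy violates it and can serve as a negative.
    """
    i, j = sequence.index(before), sequence.index(after)
    if i > j:
        raise ValueError("input sequence already violates the constraint")
    flipped = list(sequence)
    flipped[i], flipped[j] = flipped[j], flipped[i]
    return flipped


def repetition_penalty(sequence, window=3):
    """Count actions that repeat an action seen in the previous `window` steps."""
    penalty = 0
    for k, action in enumerate(sequence):
        if action in sequence[max(0, k - window):k]:
            penalty += 1
    return penalty
```

For example, swapping "crack egg" and "whisk egg" in a cooking sequence yields an ordering that breaks the precedence constraint, and a longer `window` flags repeats that a shorter one would miss.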
2405.20283
Report |
TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes |
Minghao Guo, Bohan Wang, Kaiming He, Wojciech Matusik |
We present TetSphere splatting, an explicit, Lagrangian representation for
reconstructing 3D shapes with high-quality geometry. In contrast to
conventional object reconstruction methods which predominantly use Eulerian
representations, including both neural implicit (e.g., NeRF, NeuS) and explicit
representations (e.g., DMTet), and often struggle with high computational
demands and suboptimal mesh quality, TetSphere splatting utilizes an underused
but highly effective geometric primitive -- tetrahedral meshes. This approach
directly yields superior mesh quality without relying on neural networks or
post-processing. It deforms multiple initial tetrahedral spheres to accurately
reconstruct the 3D shape through a combination of differentiable rendering and
geometric energy optimization, resulting in significant computational
efficiency. Serving as a robust and versatile geometry representation,
TetSphere splatting seamlessly integrates into diverse applications, including
single-view 3D reconstruction and image-/text-to-3D content generation.
Experimental results demonstrate that TetSphere splatting outperforms existing
representations, delivering faster optimization speed, enhanced mesh quality,
and reliable preservation of thin structures. |
This paper introduces TetSphere Splatting (Tet-Splatting), a novel geometry representation for reconstructing 3D shapes using an explicit, Lagrangian approach based on deforming tetrahedral meshes. |
Existing methods for 3D shape reconstruction, including neural implicit representations and Eulerian approaches, often suffer from high computational demands and suboptimal mesh quality. Tet-Splatting aims to address these limitations by providing fast optimization, enhanced mesh quality, and robust handling of thin structures. |
Tet-Splatting represents 3D shapes using a collection of deformed tetrahedral spheres. It reconstructs the target shape by optimizing the positions of the tetrahedra vertices through differentiable rendering and geometric energy minimization, including bi-harmonic energy for smoothness and local injectivity for element orientation. |
Tet-Splatting achieves superior mesh quality compared to state-of-the-art methods on the Google Scanned Objects dataset.
It demonstrates faster optimization speed and reduced memory usage, particularly beneficial for image-to-3D and text-to-3D generation.
Tet-Splatting effectively handles shapes with complex topologies and thin structures. |
The current implementation does not guarantee topology preservation during the union of tetrahedral spheres.
Future work could explore incorporating direct 3D supervision with volumetric data. |
3d reconstruction, lagrangian representation, tetrahedral mesh, mesh quality, differentiable rendering |
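The TetSphere record above mentions a local injectivity term tied to element orientation. One standard way to check orientation is the signed volume of each tetrahedron, sketched below. Treating a non-positive signed volume as a violation, and the helper names, are assumptions for illustration. The paper's actual energy formulation is not reproduced here.

```python
def signed_volume(a, b, c, d):
    """Signed volume of tetrahedron (a, b, c, d): det([b-a, c-a, d-a]) / 6."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return det / 6.0


def inverted_tets(vertices, tets):
    """Indices of tetrahedra whose signed volume is not strictly positive."""
    return [i for i, (p, q, r, s) in enumerate(tets)
            if signed_volume(vertices[p], vertices[q], vertices[r], vertices[s]) <= 0.0]
```

During deformation, a check like `inverted_tets(vertices, tets)` returning an empty list indicates that no element has flipped or collapsed under this orientation convention.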
2405.20282
Report |
SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow |
Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, Ming-Hsuan Yang |
Semantic segmentation and semantic image synthesis are two representative
tasks in visual perception and generation. While existing methods consider them
as two distinct tasks, we propose a unified diffusion-based framework (SemFlow)
and model them as a pair of reverse problems. Specifically, motivated by
rectified flow theory, we train an ordinary differential equation (ODE) model
to transport between the distributions of real images and semantic masks. As
the training objective is symmetric, samples belonging to the two distributions,
images and semantic masks, can be effortlessly transferred reversibly. For
semantic segmentation, our approach solves the contradiction between the
randomness of diffusion outputs and the uniqueness of segmentation results. For
image synthesis, we propose a finite perturbation approach to enhance the
diversity of generated results without changing the semantic categories.
Experiments show that our SemFlow achieves competitive results on semantic
segmentation and semantic image synthesis tasks. We hope this simple framework
will motivate people to rethink the unification of low-level and high-level
vision. Project page: https://github.com/wang-chaoyang/SemFlow. |
This paper proposes SemFlow, a unified diffusion-based framework for semantic segmentation and semantic image synthesis, modeling them as a pair of reverse problems using rectified flow. |
This work bridges the gap between traditionally distinct methodologies for semantic segmentation (discriminative models) and semantic image synthesis (generative models). |
SemFlow leverages rectified flow, an ODE framework, to learn the bi-directional mapping between image and semantic mask distributions. It introduces pseudo masks, bi-directional training, and a finite perturbation strategy to enhance synthesis diversity. |
SemFlow achieves competitive semantic segmentation results compared to discriminative models while using fewer inference steps.
It demonstrates promising performance on semantic image synthesis, outperforming some specialist models in FID and LPIPS.
The finite perturbation method enables multi-modal image generation from a single semantic layout. |
There is still a performance gap in semantic segmentation accuracy compared to state-of-the-art discriminative models.
Future work could explore incorporating stronger priors or guidance mechanisms within the unified framework. |
semantic segmentation, semantic image synthesis, diffusion models, rectified flow, deep learning |
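The SemFlow record above describes an ODE that transports between images and semantic masks in both directions. The sketch below integrates a velocity field with Euler steps forward (image to mask) and in reverse (mask to image). The constant velocity field stands in for a trained network and, like the function name and the two-element "image", is an assumption for illustration.

```python
def euler_transport(x, velocity, steps, reverse=False):
    """Integrate dx/dt = velocity(x, t) with Euler steps.

    Forward runs t from 0 to 1; reverse runs t from 1 to 0.
    """
    dt = 1.0 / steps
    sign = -1.0 if reverse else 1.0
    x = list(x)
    for k in range(steps):
        t = 1.0 - k * dt if reverse else k * dt
        v = velocity(x, t)
        x = [xi + sign * dt * vi for xi, vi in zip(x, v)]
    return x


# Toy straight-line field: velocity is (mask - image) everywhere.
image = [0.2, 0.8]
mask = [1.0, 0.0]
straight = lambda x, t: [m - i for m, i in zip(mask, image)]

predicted_mask = euler_transport(image, straight, steps=4)
recovered_image = euler_transport(mask, straight, steps=4, reverse=True)
```

Because the toy field is constant along a straight path, a handful of steps already lands on the endpoint, which is the property that straight trajectories are meant to provide.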
2405.20279
Report |
CV-VAE: A Compatible Video VAE for Latent Generative Video Models |
Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan |
Spatio-temporal compression of videos, utilizing networks such as Variational
Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other
video generative models. For instance, many LLM-like video models learn the
distribution of discrete tokens derived from 3D VAEs within the VQVAE
framework, while most diffusion-based video models capture the distribution of
continuous latents extracted by 2D VAEs without quantization. The temporal
compression is simply realized by uniform frame sampling, which results in
unsmooth motion between consecutive frames. Currently, the research community
lacks a commonly used continuous video (3D) VAE for latent diffusion-based video
models. Moreover, since current diffusion-based approaches
are often implemented using pre-trained text-to-image (T2I) models, directly
training a video VAE without considering the compatibility with existing T2I
models will result in a latent space gap between them, which will take huge
computational resources for training to bridge the gap even with the T2I models
as initialization. To address this issue, we propose a method for training a
video VAE of latent video models, namely CV-VAE, whose latent space is
compatible with that of a given image VAE, e.g., image VAE of Stable Diffusion
(SD). The compatibility is achieved by the proposed novel latent space
regularization, which involves formulating a regularization loss using the
image VAE. Benefiting from the latent space compatibility, video models can be
trained seamlessly from pre-trained T2I or video models in a truly
spatio-temporally compressed latent space, rather than simply sampling video
frames at equal intervals. With our CV-VAE, existing video models can generate
four times more frames with minimal finetuning. Extensive experiments are
conducted to demonstrate the effectiveness of the proposed video VAE. |
This paper introduces CV-VAE, a novel video Variational Autoencoder (VAE) designed to be compatible with pre-trained image and video models like Stable Diffusion, addressing the lack of a commonly used continuous 3D VAE for latent diffusion-based video models. |
Current video generation models often rely on uniform frame sampling for temporal compression, leading to unsmooth motion. This work aims to enable the generation of smoother, higher-FPS videos by providing a truly spatio-temporally compressed continuous latent space. |
The authors propose a novel latent space regularization method to ensure compatibility between the video VAE and pre-trained models, minimizing distribution shifts. They also introduce an efficient 2D+3D architecture for the video VAE, leveraging pre-trained weights and incorporating 3D convolutions for temporal modeling. |
CV-VAE achieves state-of-the-art image and video reconstruction quality while maintaining compatibility with existing diffusion models.
Integrating CV-VAE into pre-trained video models like SVD significantly improves video generation quality, producing smoother motion and higher FPS with minimal finetuning.
Ablation studies validate the effectiveness of the proposed latent space regularization and mapping functions in improving video reconstruction and generation. |
The performance of CV-VAE is limited by the channel dimension of the latent space, which is constrained by the compatibility requirement with existing models.
Future work could explore higher-dimensional latent spaces and investigate the impact on reconstruction and generation quality. |
video generation, variational autoencoder (vae), latent space, stable diffusion, temporal compression |
2405.20224
Report |
EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images |
Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, Yonghong Tian |
3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D
scene reconstruction and novel view synthesis. However, its training heavily
depends on high-quality, sharp images and accurate camera poses. Fulfilling
these requirements can be challenging in non-ideal real-world scenarios, where
motion-blurred images are commonly encountered in high-speed moving cameras or
low-light environments that require long exposure times. To address these
challenges, we introduce Event Stream Assisted Gaussian Splatting
(EvaGaussians), a novel approach that integrates event streams captured by an
event camera to assist in reconstructing high-quality 3D-GS from blurry images.
Capitalizing on the high temporal resolution and dynamic range offered by the
event camera, we leverage the event streams to explicitly model the formation
process of motion-blurred images and guide the deblurring reconstruction of
3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion
trajectories during the exposure time, our method can robustly facilitate the
acquisition of high-fidelity novel views with intricate texture details. We
comprehensively evaluated our method and compared it with previous
state-of-the-art deblurring rendering methods. Both qualitative and
quantitative comparisons demonstrate that our method surpasses existing
techniques in restoring fine details from blurry images and producing
high-fidelity novel views. |
This paper introduces EvaGaussians, a novel framework that integrates event streams from an event camera to reconstruct high-quality 3D Gaussian Splats from motion-blurred images, enabling real-time, high-fidelity novel view synthesis. |
3D Gaussian Splatting (3D-GS), while efficient in 3D scene reconstruction and novel view synthesis, heavily relies on sharp images and accurate camera poses, which are often absent in real-world scenarios with motion blur. |
The method leverages event streams to model motion blur and guide deblurring reconstruction. It uses the EDI model for initial camera trajectory and point cloud estimation. Then, it jointly optimizes 3D-GS parameters and camera trajectories during exposure time, guided by blur and event reconstruction losses. |
Outperforms state-of-the-art deblurring rendering methods on synthetic and real-world datasets.
Demonstrates superior performance in recovering intricate details and color accuracy from motion-blurred images.
Enables high-fidelity real-time novel view synthesis. |
May face challenges with extremely intricate textures and severe blur.
Potential for misuse in surveillance applications, raising privacy concerns. |
3d gaussian splatting, event cameras, motion deblurring, novel view synthesis, 3d reconstruction |
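The EvaGaussians record above says the method models how motion-blurred images are formed and uses a blur reconstruction loss. A common way to express blur formation is to average sharp images rendered along the camera trajectory during the exposure, as sketched below on flat pixel lists. The averaging model, the mean-squared-error loss, and the function names are assumptions for illustration. The event-based terms are not covered.

```python
def synthesize_blur(sharp_frames):
    """Average per-pixel intensities of sharp frames sampled during the exposure."""
    n = len(sharp_frames)
    return [sum(pixels) / n for pixels in zip(*sharp_frames)]


def blur_loss(sharp_frames, observed_blurry):
    """Mean squared error between the synthesized and observed blurry image."""
    synth = synthesize_blur(sharp_frames)
    return sum((s - o) ** 2 for s, o in zip(synth, observed_blurry)) / len(synth)
```

In a joint optimization, the sharp frames would come from rendering the scene at poses along the estimated trajectory, so lowering this loss adjusts both the scene parameters and the trajectory.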
2405.20222
Report |
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model |
Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng |
We present MOFA-Video, an advanced controllable image animation method that
generates video from the given image using various additional controllable
signals (such as human landmark references, manual trajectories, or even
another provided video) or their combinations. This is different from previous
methods, which can only work on a specific motion domain or show weak control
abilities with diffusion prior. To achieve our goal, we design several
domain-aware motion field adapters (i.e., MOFA-Adapters) to control the
generated motions in the video generation pipeline. For MOFA-Adapters, we
consider the temporal motion consistency of the video and generate the dense
motion flow from the given sparse control conditions first, and then, the
multi-scale features of the given image are warped as a guided feature for
stable video diffusion generation. We naively train two motion adapters for the
manual trajectories and the human landmarks individually since they both
contain sparse information about the control. After training, the MOFA-Adapters
in different domains can also work together for more controllable video
generation. Project Page: https://myniuuu.github.io/MOFA_Video/ |
Presents MOFA-Video, a controllable image animation method that generates videos from images using various controllable signals (e.g., landmarks, trajectories) or their combinations. |
Overcomes limitations of previous methods that either focus on specific object categories or exhibit weak control abilities with diffusion priors, enabling controllable animation of in-the-wild images. |
Designs domain-aware Motion Field Adapters (MOFA-Adapters) for different motion domains, which generate dense motion fields from sparse control signals and warp image features to guide video diffusion generation. |
Achieves fine-grained control over object and camera motion with handcrafted trajectories, outperforming DragNUWA in controllability and visual quality.
Enables portrait animation from audio using facial landmarks, surpassing StyleHEAT and SadTalker in identity preservation, artifact reduction, and motion naturalness.
Allows combining multiple MOFA-Adapters for complex animations, such as controlling facial expressions and background motion simultaneously. |
Limited ability to generate content significantly different from the input image due to the video diffusion model's training data.
May produce visual artifacts like blurriness or structure loss under large motion guidance. |
image animation, video generation, controllable generation, motion field adaptation, video diffusion models |
2405.20216
Report |
Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback |
Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee |
The generation of high-quality human images through text-to-image (T2I)
methods is a significant yet challenging task. Distinct from general image
generation, human image synthesis must satisfy stringent criteria related to
human pose, anatomy, and alignment with textual prompts, making it particularly
difficult to achieve realistic results. Recent advancements in T2I generation
based on diffusion models have shown promise, yet challenges remain in meeting
human-specific preferences. In this paper, we introduce a novel approach
tailored specifically for human image generation utilizing Direct Preference
Optimization (DPO). Specifically, we introduce an efficient method for
constructing a specialized DPO dataset for training human image generation
models without the need for costly human feedback. We also propose a modified
loss function that enhances the DPO training process by minimizing artifacts
and improving image fidelity. Our method demonstrates its versatility and
effectiveness in generating human images, including personalized text-to-image
generation. Through comprehensive evaluations, we show that our approach
significantly advances the state of human image generation, achieving superior
results in terms of natural anatomies, poses, and text-image alignment. |
Presents HG-DPO, a novel method to enhance human image generation in text-to-image models by leveraging Direct Preference Optimization (DPO) |
Addresses the limitations of existing T2I models in generating high-quality human images that meet complex human preferences regarding anatomy, pose, and alignment with text prompts |
Proposes a two-pronged approach:
1. Constructs a specialized DPO dataset using AI feedback (PickScore metric) to efficiently generate preferred and non-preferred image pairs.
2. Introduces a modified loss function (statistic matching loss) during DPO training to minimize artifacts and improve image fidelity. |
Generates human images with more natural anatomies and poses compared to baselines.
Demonstrates superior alignment with text prompts, effectively capturing user intent.
Adaptable to other human-centric applications, such as personalized text-to-image generation (e.g., improving InstantBooth model). |
Acknowledges a trade-off between increased image quality and potential decrease in diversity.
Limited impact on enhancing fine anatomical details (e.g., fingers). |
text-to-image generation, human image synthesis, direct preference optimization, diffusion models, ai feedback |
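The HG-DPO record above describes building preferred and non-preferred image pairs from an AI scorer instead of human labels. The sketch below picks the highest- and lowest-scored candidates for a prompt. The `scorer` callable is a hypothetical stand-in for a metric such as PickScore, and the best-versus-worst selection rule is an assumption for illustration.

```python
def build_preference_pair(prompt, images, scorer):
    """Return (winner, loser): the highest- and lowest-scored images for a prompt.

    `scorer(prompt, image)` is a hypothetical callable returning a float,
    where a higher value means the image is preferred.
    """
    if len(images) < 2:
        raise ValueError("need at least two candidate images")
    ranked = sorted(images, key=lambda image: scorer(prompt, image))
    return ranked[-1], ranked[0]
```

Repeating this over many prompts yields a preference dataset without manual annotation. The statistic matching loss mentioned in the record is not sketched here.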
2405.20204
Report |
Jina CLIP: Your CLIP Model Is Also Your Text Retriever |
Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao |
Contrastive Language-Image Pretraining (CLIP) is widely used to train models
to align images and texts in a common embedding space by mapping them to
fixed-sized vectors. These models are key to multimodal information retrieval
and related tasks. However, CLIP models generally underperform in text-only
tasks compared to specialized text models. This creates inefficiencies for
information retrieval systems that keep separate embeddings and models for
text-only and multimodal tasks. We propose a novel, multi-task contrastive
training method to address this issue, which we use to train the jina-clip-v1
model to achieve state-of-the-art performance on both text-image and
text-text retrieval tasks. |
The paper proposes Jina CLIP, a novel multi-task contrastive training method and model that achieves state-of-the-art performance on both text-image and text-text retrieval tasks. |
CLIP models usually underperform in text-only tasks compared to specialized text models, creating inefficiencies for information retrieval systems. Jina CLIP addresses this by enabling a single model to perform well in both modalities. |
The methodology involves a three-stage training process: (1) aligning image and short text representations with text pair training, (2) introducing longer, synthetic image captions, and (3) fine-tuning with hard negatives for improved text encoding. |
Jina CLIP achieves comparable performance to EVA-CLIP on the cross-modal CLIP Benchmark.
The model's text encoder performs on par with specialized text models on MTEB Benchmark tasks.
It significantly outperforms other CLIP models in text-only tasks, demonstrating the effectiveness of the multi-task training. |
The model is currently limited to English-language texts due to resource constraints.
Future work will focus on extending the model to multilingual contexts. |
clip, embeddings, multimodal, retrieval, contrastive learning |
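The multi-task contrastive training described in the Jina CLIP entry above can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss over cosine similarities and an unweighted sum of a text-text term and a text-image term. The function names, the temperature value, and the equal weighting are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, targets, temperature=0.05):
    """Mean cross-entropy of matching queries[i] to targets[i] against all targets."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def multi_task_loss(text_a, text_b, captions, images, temperature=0.05):
    """Hypothetical joint objective: text-pair term plus caption-image term."""
    return (info_nce(text_a, text_b, temperature)
            + info_nce(captions, images, temperature))
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized; a single encoder trained on both terms is pushed to serve both retrieval settings.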
2405.20155
Report |
MotionDreamer: Zero-Shot 3D Mesh Animation from Video Diffusion Models |
Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer |
Animation techniques bring digital 3D worlds and characters to life. However,
manual animation is tedious and automated techniques are often specialized to
narrow shape classes. In our work, we propose a technique for automatic
re-animation of arbitrary 3D shapes based on a motion prior extracted from a
video diffusion model. Unlike existing 4D generation methods, we focus solely
on the motion, and we leverage an explicit mesh-based representation compatible
with existing computer-graphics pipelines. Furthermore, our utilization of
diffusion features enhances accuracy of our motion fitting. We analyze efficacy
of these features for animation fitting and we experimentally validate our
approach for two different diffusion models and four animation models. Finally,
we demonstrate that our time-efficient zero-shot method achieves a superior
performance re-animating a diverse set of 3D shapes when compared to existing
techniques in a user study. The project website is located at
https://lukas.uzolas.com/MotionDreamer. |
This paper introduces a novel zero-shot method for animating arbitrary 3D meshes using pre-trained video diffusion models (VDMs), leveraging semantic features extracted from the VDMs for accurate motion fitting. |
Manual animation is time-consuming and existing automated methods are limited to specific shapes. This method offers a fast, class-agnostic approach to re-animate static 3D objects. |
The method involves automatically texturing the input mesh, generating motion with a VDM conditioned on the rendered mesh image, and optimizing mesh animation parameters to match the semantic features between the animated mesh and the generated video. |
User study shows a significant preference for the proposed method over existing techniques in terms of motion naturalness, visual quality, and prompt adherence.
Quantitative evaluation on a human motion dataset demonstrates superior pose fitting accuracy compared to using RGB features and competitive performance with a state-of-the-art human pose estimator.
Ablation study confirms the benefits of single-view texturing, semantic feature utilization, and regularization losses. |
The method relies on single-view supervision, limiting its ability to accurately resolve motion-in-depth and handle occlusions.
The quality of motion generated by current VDMs can affect the final animation output, highlighting the need for improved VDMs and potential rejection heuristics. |
3d animation, video diffusion models, motion fitting, semantic features, zero-shot learning |
2405.20141
Report |
OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation |
Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum |
The advent of Vision Language Models (VLMs) transformed image understanding
from closed-set classifications to dynamic image-language interactions,
enabling open-vocabulary segmentation. Despite this flexibility, VLMs often
fall behind closed-set classifiers in accuracy due to their reliance on
ambiguous image captions and lack of domain-specific knowledge. We, therefore,
introduce a new task, domain adaptation for open-vocabulary segmentation,
enhancing VLMs with domain-specific priors while preserving their
open-vocabulary nature. Existing adaptation methods, when applied to
segmentation tasks, improve performance on training queries but can reduce VLM
performance on zero-shot text inputs. To address this shortcoming, we propose
an approach that combines parameter-efficient prompt tuning with a
triplet-loss-based training strategy. This strategy is designed to enhance
open-vocabulary generalization while adapting to the visual domain. Our results
outperform other parameter-efficient adaptation strategies in open-vocabulary
segment classification tasks across indoor and outdoor datasets. Notably, our
approach is the only one that consistently surpasses the original VLM on
zero-shot queries. Our adapted VLMs can be plug-and-play integrated into
existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU
on ADE20K, and OpenMask3D by +4.1% AP on ScanNet++ Offices without any changes
to the methods. |
The paper introduces a new task called "domain adaptation for open-vocabulary segmentation," aiming to improve language-queried object segmentation. |
Current Vision Language Models (VLMs), while enabling open-vocabulary segmentation, lag behind domain-specific models in accuracy. This task is important for applications like robotics, where VLMs need to adapt to specific environments while retaining open-vocabulary understanding. |
The paper proposes OpenDAS, a method combining parameter-efficient prompt tuning with a triplet-loss-based training strategy to adapt CLIP-based models for better text and image crop matching. |
OpenDAS outperforms existing parameter-efficient adaptation strategies in open-vocabulary segment classification tasks.
OpenDAS consistently surpasses the original VLM (CLIP) on zero-shot queries, unlike other methods.
Integration of OpenDAS into existing OVS pipelines improves performance, as demonstrated by a +6.0% mIoU increase on ADE20K and +4.1% AP increase on ScanNet++ Offices. |
All evaluated methods rely on annotated ground-truth segmentation, which can be expensive to obtain.
Prompt tuning, while efficient, shows limitations in generalizing to novel queries compared to robust fine-tuning. |
open-vocabulary segmentation, domain adaptation, prompt tuning, triplet loss, vision language models |
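The triplet-loss-based strategy named in the OpenDAS entry above can be sketched in its generic form. The sketch assumes cosine distance, a hinge with a fixed margin, and an image-crop embedding as the anchor with a matching and a non-matching text embedding. The margin value and the choice of anchor are illustrative assumptions; the entry does not specify how OpenDAS mines its triplets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that wants the positive closer to the anchor than the negative."""
    dist_pos = 1.0 - cosine(anchor, positive)
    dist_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, dist_pos - dist_neg + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so well-separated triplets stop contributing gradients.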
2405.20084
Report |
Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach |
Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal |
Human pose estimation is a key task in computer vision with various
applications such as activity recognition and interactive systems. However, the
lack of consistency in the annotated skeletons across different datasets poses
challenges in developing universally applicable models. To address this
challenge, we propose a novel approach integrating multi-teacher knowledge
distillation with a unified skeleton representation. Our networks are jointly
trained on the COCO and MPII datasets, containing 17 and 16 keypoints,
respectively. We demonstrate enhanced adaptability by predicting an extended
set of 21 keypoints, 4 (COCO) and 5 (MPII) more than original annotations,
improving cross-dataset generalization. Our joint models achieved an average
accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a
single dataset and evaluated on both. Moreover, we also evaluate all 21
predicted points by our two models by reporting an AP of 66.84 and 72.75 on the
Halpe dataset. This highlights the potential of our technique to address one of
the most pressing challenges in pose estimation research and application - the
inconsistency in skeletal annotations. |
This paper proposes a novel framework for unifying human pose estimation across different datasets by integrating multi-teacher knowledge distillation with a unified skeleton representation. |
This approach addresses the challenge of inconsistent annotated skeletons across datasets, which limits the development of universally applicable pose estimation models. |
The proposed method utilizes multi-teacher knowledge distillation to train a student network on a unified dataset combining the MPII and COCO datasets. The student network learns to predict a superset of 21 keypoints, encompassing all unique keypoints from both datasets, using a combination of conditional keypoint loss and distillation losses. |
The unified model demonstrates enhanced cross-dataset generalization compared to models trained on individual datasets.
The model successfully predicts an extended set of 21 keypoints, including those not present in the original annotations of each dataset.
Evaluation on the Halpe dataset confirms the model's ability to accurately predict the extended keypoint set. |
Potential performance disparities due to dataset size imbalance and hyperparameter settings.
Future work includes exploring techniques to extend ground-truth annotations using the unified model and investigating active learning strategies. |
human pose estimation, knowledge distillation, cross-dataset learning, unified skeleton representation, keypoint detection |
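The conditional keypoint loss mentioned in the pose estimation entry above can be pictured as a masked regression loss: a sample from COCO or MPII supervises only the unified keypoints that its source dataset annotates. The sketch below assumes coordinate regression with a squared error and a boolean mask per sample. The real method may operate on heatmaps, and the position of each dataset's keypoints within the 21-point skeleton is hypothetical here.

```python
NUM_UNIFIED_KEYPOINTS = 21  # size of the unified skeleton stated in the entry

def conditional_keypoint_loss(pred, target, annotated):
    """Mean squared error over annotated keypoints only.

    pred, target: lists of (x, y) pairs, one per unified keypoint.
    annotated: list of bools, True where the source dataset labels the keypoint.
    """
    total, count = 0.0, 0
    for (px, py), (tx, ty), has_label in zip(pred, target, annotated):
        if has_label:
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count if count else 0.0
```

Keypoints without ground truth contribute nothing to this term, which is where distillation from the dataset-specific teachers would have to supply the missing supervision.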
2405.20067
Report |
N-Dimensional Gaussians for Fitting of High Dimensional Functions |
Stavros Diolatzis, Tobias Zirr, Alexandr Kuznetsov, Georgios Kopanas, Anton Kaplanyan |
In the wake of many new ML-inspired approaches for reconstructing and
representing high-quality 3D content, recent hybrid and explicitly learned
representations exhibit promising performance and quality characteristics.
However, their scaling to higher dimensions is challenging, e.g. when
accounting for dynamic content with respect to additional parameters such as
material properties, illumination, or time. In this paper, we tackle these
challenges for explicit representations based on Gaussian mixture models.
With our solutions, we arrive at efficient fitting of compact N-dimensional
Gaussian mixtures and enable efficient evaluation at render time: For fast
fitting and evaluation, we introduce a high-dimensional culling scheme that
efficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For
adaptive refinement yet compact representation, we introduce a loss-adaptive
density control scheme that incrementally guides the use of additional capacity
towards missing details. With these tools we can for the first time represent
complex appearance that depends on many input dimensions beyond position or
viewing angle within a compact, explicit representation optimized in minutes
and rendered in milliseconds. |
Presents a novel method for fitting and evaluating compact N-dimensional Gaussian Mixture Models (GMMs) to represent high-dimensional functions in computer graphics. |
Addresses the limitations of existing hybrid and explicit representations in scaling to higher dimensions for complex appearance modeling with many input parameters (e.g., material, lighting, time). |
Introduces an unconstrained N-dimensional adaptive Gaussian mixture representation. Employs a Locality Sensitive Hashing-inspired culling scheme for fast fitting and evaluation. Develops a loss-adaptive density control scheme for optimizer-controlled refinement. |
Achieves high-quality global illumination of synthetic scenes with variable lighting and materials in minutes.
Successfully captures and reconstructs complex view-dependent effects in novel view synthesis.
Outperforms implicit and hybrid neural rendering methods in quality and training time for scenes with high-dimensional anisotropy. |
Overfitting to sparse viewpoints in real-world captures remains a challenge.
Exploring more compact/sparse parameterizations for higher-dimensional data could improve storage efficiency. |
gaussian mixture models, high-dimensional data, rendering, explicit representations, locality sensitive hashing |
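To make the N-dimensional Gaussian mixture in the entry above concrete, the sketch below evaluates an unnormalized mixture at a query point. It assumes each component stores a mean, a lower-triangular factor `chol` of its precision matrix (precision = chol · cholᵀ), and a scalar value. This parameterization and all names are illustrative assumptions; the paper's culling and density control schemes are not reproduced.

```python
import math

def gaussian_weight(x, mean, chol):
    """Return exp(-0.5 * ||chol^T (x - mean)||^2) for a lower-triangular chol."""
    n = len(x)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # y = chol^T diff; entries chol[i][j] with i < j are zero and skipped.
    y = [sum(chol[i][j] * diff[i] for i in range(j, n)) for j in range(n)]
    return math.exp(-0.5 * sum(v * v for v in y))

def evaluate_mixture(x, components):
    """Sum value * weight over components given as (mean, chol, value) tuples."""
    return sum(value * gaussian_weight(x, mean, chol)
               for mean, chol, value in components)
```

Factoring the precision matrix keeps it positive semi-definite by construction and avoids a matrix inverse at evaluation time, which matters as the number of input dimensions grows.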
2405.20031
Report |
Structure Gaussian SLAM with Manhattan World Hypothesis |
Shuhong Liu, Heng Zhou, Liuzhuozheng Li, Yun Liu, Tianchen Deng, Yiming Zhou, Mingrui Li |
Gaussian SLAM systems have made significant advancements in improving the
efficiency and fidelity of real-time reconstructions. However, these systems
often encounter incomplete reconstructions in complex indoor environments,
characterized by substantial holes due to unobserved geometry caused by
obstacles or limited view angles. To address this challenge, we present
Manhattan Gaussian SLAM (MG-SLAM), an RGB-D system that leverages the Manhattan
World hypothesis to enhance geometric accuracy and completeness. By seamlessly
integrating fused line segments derived from structured scenes, MG-SLAM ensures
robust tracking in textureless indoor areas. Moreover, the extracted lines and
planar surface assumption allow strategic interpolation of new Gaussians in
regions of missing geometry, enabling efficient scene completion. Extensive
experiments conducted on both synthetic and real-world scenes demonstrate that
these advancements enable our method to achieve state-of-the-art performance,
marking a substantial improvement in the capabilities of Gaussian SLAM systems. |
Presents MG-SLAM, a novel RGB-D Gaussian SLAM system that leverages the Manhattan World hypothesis for enhanced geometric accuracy and completeness in complex indoor environments. |
Gaussian SLAM systems often struggle with incomplete reconstructions in complex indoor environments due to unobserved geometry. This paper addresses this challenge by incorporating structural information. |
Integrates fused line segments for robust tracking in textureless areas and utilizes the Manhattan World assumption to interpolate new Gaussians in regions of missing geometry, enabling efficient scene completion. |
Achieves state-of-the-art tracking accuracy with up to 50% lower ATE compared to Gaussian baselines.
Significantly improves scene completeness by effectively filling in gaps and holes in the reconstruction, particularly on structured surfaces like floors and ceilings.
Provides high-fidelity reconstruction, achieving 5dB enhancement in PSNR on real-world scenes, surpassing existing Gaussian SLAM methods. |
Scene completion strategy primarily focuses on large structured surfaces and may not generalize well to complex objects.
Future work includes exploring more sophisticated methods for interpolating unobserved geometry in complex indoor environments. |
slam, gaussian slam, manhattan world assumption, scene completion, line segment features |
2405.19996
Report |
DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild |
Honghao Fu, Yufei Wang, Wenhan Yang, Bihan Wen |
Image quality assessment (IQA) plays a critical role in selecting
high-quality images and guiding compression and enhancement methods in a series
of applications. The blind IQA, which assesses the quality of in-the-wild
images containing complex authentic distortions without reference images, poses
greater challenges. Existing methods are limited to modeling a uniform
distribution with local patches and are bothered by the gap between low and
high-level visions (caused by widely adopted pre-trained classification
networks). In this paper, we propose a novel IQA method called diffusion
priors-based IQA (DP-IQA), which leverages the prior knowledge from the
pre-trained diffusion model with its excellent powers to bridge semantic gaps
in the perception of the visual quality of images. Specifically, we use
pre-trained stable diffusion as the backbone, extract multi-level features from
the denoising U-Net during the upsampling process at a specified timestep, and
decode them to estimate the image quality score. The text and image adapters
are adopted to mitigate the domain gap for downstream tasks and correct the
information loss caused by the variational autoencoder bottleneck. Finally, we
distill the knowledge in the above model into a CNN-based student model,
significantly reducing the parameter count to enhance applicability, with the student
model, surprisingly, performing similarly to or even better than the teacher model.
Experimental results demonstrate that our DP-IQA achieves state-of-the-art
results on various in-the-wild datasets with better generalization capability,
which shows the superiority of our method in global modeling and utilizing the
hierarchical feature clues of diffusion for evaluating image quality. |
This paper presents DP-IQA, a novel blind image quality assessment method that leverages diffusion model priors for evaluating in-the-wild images, addressing the limitations of patch-based methods and the lack of low-level priors in previous approaches. |
Blind image quality assessment (BIQA) for in-the-wild images is crucial for various applications but challenging due to the complex and diverse distortions in real-world images. |
DP-IQA utilizes a pre-trained stable diffusion model as its backbone, extracting multi-level features from the denoising U-Net. It incorporates text prompts, text adapters, and image adapters to enhance feature representation and mitigate domain gaps. Furthermore, a student model based on EfficientNet is trained via knowledge distillation to improve efficiency. |
DP-IQA achieves state-of-the-art performance on four in-the-wild IQA datasets (CLIVE, KonIQ, LIVEFB, SPAQ).
The method exhibits superior generalization capability compared to existing BIQA models, as demonstrated by cross-dataset evaluations.
Knowledge distillation from the DP-IQA teacher model to the EfficientNet-based student model effectively reduces parameters while maintaining competitive performance. |
The performance of DP-IQA may be limited for images with ambiguous scenes or objects due to the potential for insufficient training data.
Further investigation is needed to understand the occasional significant deviations between student model predictions and teacher model predictions. |
image quality assessment, blind image quality assessment, diffusion models, knowledge distillation, in-the-wild images |
2405.19957
Report |
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting |
Qiaowei Miao, Yawei Luo, Yi Yang |
As text-conditioned diffusion models (DMs) achieve breakthroughs in image,
video, and 3D generation, the research community's focus has shifted to the
more challenging task of text-to-4D synthesis, which introduces a temporal
dimension to generate dynamic 3D objects. In this context, we identify Score
Distillation Sampling (SDS), a widely used technique for text-to-3D synthesis,
as a significant hindrance to text-to-4D performance due to its Janus-faced and
texture-unrealistic problems coupled with high computational costs. In this
paper, we propose Pixel-Level Alignments for
Text-to-4D Gaussian Splatting (PLA4D), a novel method that
utilizes text-to-video frames as explicit pixel alignment targets to generate
static 3D objects and inject motion into them. Specifically, we introduce Focal
Alignment to calibrate camera poses for rendering and GS-Mesh Contrastive
Learning to distill geometry priors from rendered image contrasts at the pixel
level. Additionally, we develop Motion Alignment using a deformation network to
drive changes in Gaussians and implement Reference Refinement for smooth 4D
object surfaces. These techniques enable 4D Gaussian Splatting to align
geometry, texture, and motion with generated videos at the pixel level.
Compared to previous methods, PLA4D produces synthesized outputs with better
texture details in less time and effectively mitigates the Janus-faced problem.
PLA4D is fully implemented using open-source models, offering an accessible,
user-friendly, and promising direction for 4D digital content creation. Our
project page: https://github.com/MiaoQiaowei/PLA4D.github.io. |
This paper introduces PLA4D, a novel text-to-4D generation framework that utilizes text-to-video frames as explicit pixel alignment targets to overcome limitations of Score Distillation Sampling (SDS) in existing methods. |
Text-to-4D synthesis is a challenging task with significant potential in various applications, but existing methods suffer from issues like the Janus-face problem, unrealistic textures, and high computational costs due to reliance on SDS. |
PLA4D employs a three-stage pipeline: (1) text-to-video generation using an open-source model, (2) frame-to-3D generation via Focal Alignment and GS-Mesh Contrastive Learning for texture and geometry alignment, and (3) 3D-to-4D generation using Motion Alignment and Reference Refinement for injecting motion while preserving surface quality. |
PLA4D generates 4D objects with superior texture details, accurate geometry, and coherent motion compared to previous methods.
The framework effectively mitigates the Janus-face problem by aligning geometry and texture with generated video frames at the pixel level.
PLA4D achieves significantly faster generation times compared to SDS-based methods, reducing training time from hours to around ten minutes. |
The motion range of generated 4D objects is limited by the capabilities of current open-source text-to-video generation models.
The image-to-mesh model used for initialization has limitations in reconstructing certain targets. |
text-to-4d synthesis, 3d gaussian splatting, pixel-level alignment, video generation, deformation network |
2405.19931
Report |
Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks |
Xiaoyu Wu, Jiaru Zhang, Yang Hua, Bohan Lyu, Hao Wang, Tao Song, Haibing Guan |
Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement,
significantly reducing training costs and enabling personalized AI
applications. However, we explore the training dynamics of DMs and observe an
unanticipated phenomenon: during the training process, image fidelity initially
improves, then unexpectedly deteriorates with the emergence of noisy patterns,
only to recover later with severe overfitting. We term the stage with generated
noisy patterns the corruption stage. To understand this corruption stage, we
begin by theoretically modeling the one-shot fine-tuning scenario, and then
extend this modeling to more general cases. Through this modeling, we identify
the primary cause of this corruption stage: a narrowed learning distribution
inherent in the nature of few-shot fine-tuning. To tackle this, we apply
Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly
broaden the learned distribution, and show that the learning target of the
BNNs can be naturally regarded as an expectation of the diffusion loss and a
further regularization with the pretrained DMs. This approach is highly
compatible with current few-shot fine-tuning methods in DMs and does not
introduce any extra inference costs. Experimental results demonstrate that our
method significantly mitigates corruption, and improves the fidelity, quality
and diversity of the generated images in both object-driven and subject-driven
generation tasks. |
This paper identifies and addresses the "corruption stage" phenomenon in few-shot fine-tuning of Diffusion Models (DMs), where image fidelity deteriorates due to noisy patterns during training. |
Few-shot fine-tuning is crucial for personalized AI applications, but the corruption stage hinders its effectiveness. This research improves the quality and diversity of generated images in such settings. |
The paper theoretically models the fine-tuning process, revealing that a narrowed learning distribution causes the corruption. It proposes using Bayesian Neural Networks (BNNs) to implicitly broaden this distribution, enhancing model robustness. |
BNNs significantly mitigate the corruption stage, improving image fidelity and quality as measured by various metrics.
The method enhances generation diversity due to the inherent randomness of BNNs.
Applying BNNs generalizes well across different DM architectures, training iterations, and numbers of training images. |
The added randomness from BNNs might slow down the fine-tuning process.
Learning intricate image details could be slightly hampered with limited training iterations. Future work could explore mitigating these limitations. |
diffusion models, few-shot fine-tuning, bayesian neural networks, image generation, corruption stage |
2405.19899
Report |
Open-Set Domain Adaptation for Semantic Segmentation |
Seun-An Choe, Ah-Hyung Shin, Keon-Hee Park, Jinwoo Choi, Gyeong-Moon Park |
Unsupervised domain adaptation (UDA) for semantic segmentation aims to
transfer the pixel-wise knowledge from the labeled source domain to the
unlabeled target domain. However, current UDA methods typically assume a shared
label space between source and target, limiting their applicability in
real-world scenarios where novel categories may emerge in the target domain. In
this paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation
(OSDA-SS) for the first time, where the target domain includes unknown classes.
We identify two major problems in the OSDA-SS scenario as follows: 1) the
existing UDA methods struggle to predict the exact boundary of the unknown
classes, and 2) they fail to accurately predict the shape of the unknown
classes. To address these issues, we propose Boundary and Unknown Shape-Aware
open-set domain adaptation, coined BUS. Our BUS can accurately discern the
boundaries between known and unknown classes in a contrastive manner using a
novel dilation-erosion-based contrastive loss. In addition, we propose
OpenReMix, a new domain mixing augmentation method that guides our model to
effectively learn domain and size-invariant features for improving the shape
detection of the known and unknown classes. Through extensive experiments, we
demonstrate that our proposed BUS effectively detects unknown classes in the
challenging OSDA-SS scenario compared to the previous methods by a large
margin. The code is available at https://github.com/KHU-AGI/BUS. |
This paper introduces Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS), addressing the problem of adapting models to target domains with unknown classes. |
Current UDA methods assume shared label spaces, limiting real-world applicability where novel categories can emerge in the target domain. |
The paper proposes BUS, a novel method that utilizes a Dilation-Erosion-based Contrastive (DECON) loss to improve boundary prediction and OpenReMix, a domain mixing augmentation for size-invariant feature learning. |
BUS significantly outperforms previous UDA and OSDA methods on benchmark datasets (GTA5→Cityscapes and SYNTHIA→Cityscapes).
DECON loss effectively distinguishes between known and unknown classes at boundaries, improving private class IoU by ~40.79%.
OpenReMix enhances shape prediction for both known and unknown classes, boosting common class mIoU by ~8.65%. |
Performance reliance on pseudo-labeling, potentially leading to degradation if model calibration is poor.
Future work can explore alternative approaches beyond pseudo-labeling to enhance robustness. |
unsupervised domain adaptation, semantic segmentation, open-set learning, domain mixing augmentation, contrastive learning |
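The dilation-erosion idea behind the DECON loss in the BUS entry above can be illustrated on a binary mask of an unknown-class region. Dilating and eroding the mask and taking the difference isolates a band around the boundary, where known and unknown pixels can then be contrasted. The 3x3 neighborhood, the border handling, and the function names are illustrative assumptions; the entry does not state how the contrastive pairs are sampled.

```python
def _morph(mask, combine):
    """Apply `combine` (max or min) over each pixel's 3x3 neighborhood."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Neighborhoods are clipped at the image border.
            window = [mask[rr][cc]
                      for rr in range(max(0, r - 1), min(height, r + 2))
                      for cc in range(max(0, c - 1), min(width, c + 2))]
            out[r][c] = combine(window)
    return out

def dilate(mask):
    return _morph(mask, max)

def erode(mask):
    return _morph(mask, min)

def boundary_band(mask):
    """Pixels gained by dilation or lost by erosion: a band around the edge."""
    grown, shrunk = dilate(mask), erode(mask)
    return [[g - s for g, s in zip(grown_row, shrunk_row)]
            for grown_row, shrunk_row in zip(grown, shrunk)]
```

In this picture the eroded mask is the confident interior of the unknown region, and the pixels added by dilation are its immediate known-class surroundings.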
2405.19876
Report |
IReNe: Instant Recoloring in Neural Radiance Fields |
Alessio Mazzucchelli, Adrian Garcia-Garcia, Elena Garces, Fernando Rivas-Manzaneque, Francesc Moreno-Noguer, Adrian Penate-Sanchez |
Advances in NeRFs have allowed for 3D scene reconstructions and novel view
synthesis. Yet, efficiently editing these representations while retaining
photorealism is an emerging challenge. Recent methods face three primary
limitations: they're slow for interactive use, lack precision at object
boundaries, and struggle to ensure multi-view consistency. We introduce IReNe
to address these limitations, enabling swift, near real-time color editing in
NeRF. Leveraging a pre-trained NeRF model and a single training image with
user-applied color edits, IReNe swiftly adjusts network parameters in seconds.
This adjustment allows the model to generate new scene views, accurately
representing the color changes from the training image while also controlling
object boundaries and view-specific effects. Object boundary control is
achieved by integrating a trainable segmentation module into the model. The
process gains efficiency by retraining only the weights of the last network
layer. We observed that neurons in this layer can be classified into those
responsible for view-dependent appearance and those contributing to diffuse
appearance. We introduce an automated classification approach to identify these
neuron types and exclusively fine-tune the weights of the diffuse neurons. This
further accelerates training and ensures consistent color edits across
different views. A thorough validation on a new dataset, with edited object
colors, shows significant quantitative and qualitative advancements over
competitors, accelerating speeds by 5x to 500x. |
IReNe is a novel approach for near real-time color editing of pre-trained NeRFs using a single user-edited image. |
Existing NeRF color editing techniques are slow, lack precision at object boundaries, and struggle to ensure multi-view consistency, limiting their practical application. |
IReNe achieves fast editing by: 1) Integrating a trainable segmentation module for object boundary control. 2) Selectively fine-tuning only the last layer of the color MLP. 3) Automatically classifying and exclusively fine-tuning diffuse appearance neurons. |
Significantly faster editing compared to state-of-the-art methods (5 seconds vs. 1 minute to 2 hours).
Improved accuracy, particularly at object boundaries, reducing color bleeding.
Enhanced multi-view consistency, ensuring uniform color edits across different viewpoints. |
Reliance on external editing tools like Photoshop for complete editing.
Occasional suboptimal performance of the soft segmentation model.
Future work: Explore in-built editing tools and address indirect illumination from edited objects. |
nerf, color editing, 3d scene editing, neural rendering, interactive editing |
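The selective fine-tuning in the IReNe entry above (updating only the last-layer weights attached to diffuse neurons) amounts to a masked gradient step. The sketch below assumes plain SGD, a weight matrix with one row per output color channel and one column per last-layer input neuron, and a precomputed boolean mask marking diffuse neurons. The optimizer, the learning rate, and the way the mask is obtained are illustrative assumptions.

```python
def masked_sgd_step(weights, grads, is_diffuse, lr=0.1):
    """Update only the columns whose input neuron is flagged as diffuse.

    weights, grads: lists of rows, one row per output channel.
    is_diffuse: list of bools, one per input neuron (column).
    """
    return [
        [w - lr * g if trainable else w
         for w, g, trainable in zip(weight_row, grad_row, is_diffuse)]
        for weight_row, grad_row in zip(weights, grads)
    ]
```

Freezing the view-dependent columns leaves specular behavior untouched while the color edit is learned, and it shrinks the number of parameters updated per step.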
2405.19854
Report |
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection |
Fangyi Chen, Han Zhang, Zhantao Yang, Hao Chen, Kai Hu, Marios Savvides |
Open-vocabulary object detection (OVD) requires solid modeling of the
region-semantic relationship, which could be learned from massive region-text
pairs. However, such data is limited in practice due to significant annotation
costs. In this work, we propose RTGen to generate scalable open-vocabulary
region-text pairs and demonstrate its capability to boost the performance of
open-vocabulary object detection. RTGen includes both text-to-region and
region-to-text generation processes on scalable image-caption data. The
text-to-region generation is powered by image inpainting, directed by our
proposed scene-aware inpainting guider for overall layout harmony. For
region-to-text generation, we perform multiple region-level image captioning
with various prompts and select the best matching text according to CLIP
similarity. To facilitate detection training on region-text pairs, we also
introduce a localization-aware region-text contrastive loss that learns object
proposals tailored with different localization qualities. Extensive experiments
demonstrate that our RTGen can serve as a scalable, semantically rich, and
effective source for open-vocabulary object detection and continue to improve
the model performance when more data is utilized, delivering superior
performance compared to the existing state-of-the-art methods. |
This paper proposes RTGen, a novel framework for generating open-vocabulary region-text pairs from image-caption pairs, to enhance open-vocabulary object detection (OVD). |
Region-text pairs are crucial for training OVD models but are limited and expensive to annotate. RTGen addresses this by providing a scalable method for generating these pairs. |
RTGen employs two processes: 1) Text-to-region generation using a novel Scene-Aware Inpainting Guider (SAIG) and an inpainting model. 2) Region-to-text generation using a captioning model and CLIP similarity for selection. It further introduces a Localization-Aware Region-Text Contrastive Loss (LART) for effective OVD training. |
RTGen effectively boosts OVD performance, achieving state-of-the-art results on OV-COCO and OV-LVIS benchmarks.
The generated region-text pairs demonstrate scalability, with performance consistently improving as more data is used.
SAIG effectively allocates phrases and boxes for inpainting, leading to higher-quality generated data compared to random allocation or grounding methods. |
The current generation pipeline relies on multiple models and processes, which can be computationally intensive.
Future work could explore improving the efficiency of the generation process and applying RTGen to other open-vocabulary tasks. |
open-vocabulary object detection, region-text generation, scene-aware inpainting, contrastive learning, image captioning |
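The region-to-text step described above (several candidate captions per region, with the best match kept according to CLIP similarity) can be sketched as follows. This is a minimal sketch, assuming the region and caption embeddings have already been produced by a CLIP-style encoder; the function name `select_best_caption` and the toy vectors are hypothetical and do not come from the paper.

```python
import numpy as np

def select_best_caption(region_emb, caption_embs, captions):
    """Return the caption with the highest cosine similarity to the region."""
    r = region_emb / np.linalg.norm(region_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = c @ r                      # cosine similarity per candidate caption
    best = int(np.argmax(sims))
    return captions[best], sims

# Toy stand-ins for real CLIP embeddings (assumption for illustration only).
captions = ["a red car", "a brown dog", "a wooden table"]
region_emb = np.array([0.9, 0.1, 0.0])
caption_embs = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
best_caption, sims = select_best_caption(region_emb, caption_embs, captions)
```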
2405.19783
Report |
Instruction-Guided Visual Masking |
Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan |
Instruction following is crucial in contemporary LLMs. However, when extended
to the multimodal setting, it often suffers from misalignment between specific
textual instruction and targeted local region of an image. To achieve more
accurate and nuanced multimodal instruction following, we introduce
Instruction-guided Visual Masking (IVM), a new versatile visual grounding model
that is compatible with diverse multimodal models, such as LMMs and robot models.
By constructing visual masks for instruction-irrelevant regions, IVM-enhanced
multimodal models can effectively focus on task-relevant image regions to
better align with complex instructions. Specifically, we design a visual
masking data generation pipeline and create an IVM-Mix-1M dataset with 1
million image-instruction pairs. We further introduce a new learning technique,
Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training
that prioritizes high-quality data samples. Experimental results on generic
multimodal tasks such as VQA and embodied robotic control demonstrate the
versatility of IVM, which as a plug-and-play tool, significantly boosts the
performance of diverse multimodal models, yielding new state-of-the-art results
across challenging multimodal benchmarks. Code is available at
https://github.com/2toinf/IVM. |
This paper proposes Instruction-guided Visual Masking (IVM), a plug-and-play visual grounding model to enhance multimodal instruction following by masking out instruction-irrelevant image regions. |
Existing large multimodal models often struggle to accurately localize targeted image regions relevant to specific textual instructions, leading to misinterpretations and hallucinations. |
The authors create the IVM-Mix-1M dataset with 1 million image-instruction pairs using an LLM-empowered Mixture of Expert pipeline and manual annotations. They then train the IVM model using a Discriminator-Weighted Supervised Learning (DWSL) framework to prioritize high-quality data samples. |
IVM significantly improves the performance of both commercial chatbots (e.g., GPT4-V) and open-sourced LMMs (e.g., LLaVA) on challenging multimodal benchmarks like V*Bench, EgoThink, and POPE.
IVM-enhanced models outperform baselines on referring expression comprehension benchmarks, demonstrating strong visual grounding capabilities.
Real-world robotic control experiments show that IVM enhances the robustness and generalization of language-conditioned behavior cloning agents in the presence of distractions. |
IVM introduces additional parameters and computational overhead compared to end-to-end training methods focused solely on downstream tasks.
The quality of the IVM model depends on the accuracy of the generated labels and the effectiveness of the DWSL framework in filtering out inaccuracies. |
visual grounding, multimodal instruction following, large multimodal models, robotic control, discriminator-weighted supervised learning |
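The core idea of masking instruction-irrelevant regions before the image reaches a downstream model can be illustrated with a small sketch. It assumes a per-pixel relevance map is already available (in IVM this would come from the trained grounding model); the function name `apply_visual_mask`, the threshold and the fill value are hypothetical choices, not details from the paper.

```python
import numpy as np

def apply_visual_mask(image, relevance, threshold=0.5, fill=0.0):
    """Keep pixels whose relevance is at least `threshold`; fill the rest."""
    keep = relevance >= threshold                 # (H, W) boolean mask
    return np.where(keep[..., None], image, fill) # broadcast over channels

# Toy 2x2 RGB image and relevance map (illustrative values only).
image = np.ones((2, 2, 3))
relevance = np.array([[0.9, 0.1],
                      [0.6, 0.4]])
masked = apply_visual_mask(image, relevance)
```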
2405.19751
Report |
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization |
Wenxuan Liu, Sai Qian Zhang |
Diffusion Transformers (DiTs) have recently gained substantial attention in
both industrial and academic fields for their superior visual generation
capabilities, outperforming traditional diffusion models that use U-Net.
However, the enhanced performance of DiTs also comes with high parameter counts
and implementation costs, seriously restricting their use on resource-limited
devices such as mobile phones. To address these challenges, we introduce the
Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training
quantization method that utilizes 4-bit floating-point (FP) precision on both
weights and activations for DiT inference. Compared to fixed-point quantization
(e.g., INT8), FP quantization, complemented by our proposed clipping range
selection mechanism, naturally aligns with the data distribution within DiT,
resulting in a minimal quantization error. Furthermore, HQ-DiT also implements
a universal identity mathematical transform to mitigate the serious
quantization error caused by the outliers. The experimental results demonstrate
that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with
negligible impact on performance. Our approach marks the first instance where
both weights and activations in DiTs are quantized to just 4 bits, with only a
0.12 increase in sFID on ImageNet. |
This paper introduces HQ-DiT, an efficient post-training quantization method using 4-bit floating-point precision for both weights and activations in Diffusion Transformers (DiTs), enabling deployment on resource-limited devices. |
DiTs offer superior visual generation but are computationally expensive, hindering their deployment on devices like mobile phones. Model quantization is crucial to reduce these computational demands. |
The authors study data distribution in DiTs and employ random Hadamard transforms to mitigate outlier impact on quantization. They propose a method to select optimal floating-point formats based on data distribution and utilize GPTQ for weight quantization and MinMax for activation quantization. |
HQ-DiT achieves comparable performance to full-precision models with 4-bit quantization.
The method outperforms other quantization approaches like SmoothQuant and FPQ, especially at lower bitwidths (4-bit).
HQ-DiT enables a 5.09x speedup and 2.13x memory saving compared to the full-precision model. |
The paper primarily focuses on image generation and doesn't explore other DiT applications.
Evaluation is limited to ImageNet; further validation on diverse datasets is needed. |
diffusion models, model quantization, floating-point quantization, diffusion transformers, post-training quantization |
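The "identity mathematical transform" mentioned above relies on a simple fact: for an orthogonal matrix Q, (XQ)(QᵀW) = XW, so activations and weights can be rotated before quantization without changing the layer output. The sketch below illustrates only this identity with a random Hadamard matrix; it does not reproduce HQ-DiT's FP4 format selection or GPTQ step, and all variable names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 8
signs = rng.choice([-1.0, 1.0], size=n)
Q = np.diag(signs) @ hadamard(n)   # random sign flips keep Q orthogonal

X = rng.normal(size=(4, n))        # toy activations
X[:, 3] *= 50.0                    # one outlier channel
W = rng.normal(size=(n, 5))        # toy weights

X_rot = X @ Q                      # rotation spreads the outlier across channels
W_rot = Q.T @ W                    # weights absorb the inverse rotation
```

Because the product is unchanged, quantization can be applied to `X_rot` and `W_rot`, whose value ranges are typically flatter than those of `X` and `W`.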
2405.19745
Report |
GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis |
Boming Zhao, Yuan Li, Ziyu Sun, Lin Zeng, Yujun Shen, Rui Ma, Yinda Zhang, Hujun Bao, Zhaopeng Cui |
Forecasting future scenarios in dynamic environments is essential for
intelligent decision-making and navigation, a challenge yet to be fully
realized in computer vision and robotics. Traditional approaches like video
prediction and novel-view synthesis either lack the ability to forecast from
arbitrary viewpoints or to predict temporal dynamics. In this paper, we
introduce GaussianPrediction, a novel framework that empowers 3D Gaussian
representations with dynamic scene modeling and future scenario synthesis in
dynamic environments. GaussianPrediction can forecast future states from any
viewpoint, using video observations of dynamic scenes. To this end, we first
propose a 3D Gaussian canonical space with deformation modeling to capture the
appearance and geometry of dynamic scenes, and integrate the lifecycle property
into Gaussians for irreversible deformations. To make the prediction feasible
and efficient, a concentric motion distillation approach is developed by
distilling the scene motion with key points. Finally, a Graph Convolutional
Network is employed to predict the motions of key points, enabling the
rendering of photorealistic images of future scenarios. Our framework shows
outstanding performance on both synthetic and real-world datasets,
demonstrating its efficacy in predicting and rendering future environments. |
This paper introduces GaussianPrediction, a novel framework that leverages 3D Gaussian representations for modeling dynamic scenes and synthesizing future scenarios from arbitrary viewpoints, using video observations. |
Predicting future scenarios in dynamic environments, including dense motion forecasting and visualization from any viewpoint, is crucial for intelligent systems in computer vision and robotics. |
The framework builds a 3D Gaussian canonical space with deformation modeling and lifecycle properties to capture scene dynamics and irreversible deformations. It employs concentric motion distillation with key points to efficiently predict scene motion using a Graph Convolutional Network (GCN). Finally, it renders photorealistic images of future scenarios from novel viewpoints. |
GaussianPrediction outperforms existing NeRF-based and Gaussian-based methods in novel view synthesis of dynamic scenes.
It demonstrates superior performance in short-term future scenario synthesis, showcasing more realistic and coherent predictions.
The framework effectively handles complex motions and irreversible deformations, such as cutting or splitting objects. |
The model's reliance on input observations for motion prediction without pre-training limits its capacity for long-term forecasting.
Inaccuracies in camera poses and timestamps in real-world datasets pose challenges for quantitative evaluation of prediction results. |
novel view synthesis, dynamic scene modeling, motion prediction, 3d gaussian representations, graph convolutional network |
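The use of a Graph Convolutional Network over key points can be illustrated with a single standard GCN layer (symmetric-normalized adjacency with self-loops). This is a generic sketch, not the paper's architecture: the graph, feature layout and weight matrix below are hypothetical.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Three key points connected in a chain; features are toy per-point motion vectors.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.array([[0.1, 0.0, 0.0],
              [0.0, 0.2, 0.0],
              [0.0, 0.0, 0.3]])
W = np.eye(3)                                  # identity weights for illustration
H_next = gcn_layer(H, A, W)
```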
2405.19726
Report |
Streaming Video Diffusion: Online Video Editing with Diffusion Models |
Feng Chen, Zhen Yang, Bohan Zhuang, Qi Wu |
We present a novel task called online video editing, which is designed to
edit streaming frames while maintaining temporal consistency. Unlike
existing offline video editing assuming all frames are pre-established and
accessible, online video editing is tailored to real-life applications such as
live streaming and online chat, requiring (1) fast continual step inference,
(2) long-term temporal modeling, and (3) zero-shot video editing capability. To
solve these issues, we propose Streaming Video Diffusion (SVDiff), which
incorporates the compact spatial-aware temporal recurrence into off-the-shelf
Stable Diffusion and is trained with the segment-level scheme on large-scale
long videos. This simple yet effective setup allows us to obtain a single model
that is capable of executing a broad range of videos and editing each streaming
frame with temporal coherence. Our experiments indicate that our model can edit
long, high-quality videos with remarkable results, achieving a real-time
inference speed of 15.2 FPS at a resolution of 512x512. |
This paper proposes Streaming Video Diffusion (SVDiff), an online video editing method that edits streaming video frames with temporal consistency using a novel compact spatial-aware temporal memory. |
Online video editing, crucial for live streaming and online chat, demands real-time processing of video frames while maintaining temporal coherence. Existing offline editing methods are ill-suited for this task due to their reliance on pre-established frames and limitations in handling long video sequences. |
SVDiff integrates a compact spatial-aware temporal memory into Stable Diffusion. This memory is recursively updated with each incoming frame to capture both spatial and temporal information. The model is trained on long videos by splitting them into short segments while propagating the temporal memory between them. |
SVDiff generates high-quality, long videos with strong adherence to edit prompts while preserving temporal consistency.
It outperforms baseline models adapted for online video editing in terms of both qualitative results and quantitative metrics (CLIP, user study).
The method achieves real-time inference speed (15.2 FPS at 512x512 resolution) due to its efficient memory usage and recurrent design. |
The current implementation might struggle to accurately detect shot changes in videos exceeding 2 minutes due to training-inference discrepancies.
Future work will focus on mitigating the influence of the training-inference gap to better handle long videos with complex scene transitions. |
video editing, streaming processing, diffusion models, temporal consistency, real-time |
2405.19712
Report |
HINT: Learning Complete Human Neural Representations from Limited Viewpoints |
Alessandro Sanvito, Andrea Ramazzina, Stefanie Walz, Mario Bijelic, Felix Heide |
No augmented application is possible without animated humanoid avatars. At
the same time, generating human replicas from real-world monocular hand-held or
robotic sensor setups is challenging due to the limited availability of views.
Previous work showed the feasibility of virtual avatars but required the
presence of 360 degree views of the targeted subject. To address this issue, we
propose HINT, a NeRF-based algorithm able to learn a detailed and complete
human model from limited viewing angles. We achieve this by introducing a
symmetry prior, regularization constraints, and training cues from large human
datasets. In particular, we introduce a sagittal plane symmetry prior to the
appearance of the human, directly supervise the density function of the human
model using explicit 3D body modeling, and leverage a co-learned human
digitization network as additional supervision for the unseen angles. As a
result, our method can reconstruct complete humans even from a few viewing
angles, increasing performance by more than 15% PSNR compared to previous
state-of-the-art algorithms. |
This paper proposes HINT, a NeRF-based algorithm that reconstructs a complete, animatable human model from limited viewing angles using symmetry priors, regularization constraints, and cues from large human datasets. |
Crucial for generating realistic human avatars in augmented applications and for creating counterfactual examples in robotics and autonomous navigation, especially when data from limited viewpoints is common. |
Combines a NeRF-based background model with an SDF-based human model. Leverages symmetry constraints, direct SDF supervision using a 3D body model, and a co-trained human digitization network to infer information for occluded areas. |
Reconstructs complete humans even from sparse viewpoints, enabling novel view synthesis and pose generation.
Outperforms state-of-the-art methods by more than 15% PSNR and 34% LPIPS.
Demonstrates the effectiveness of direct SDF supervision over Eikonal loss in limited viewpoint scenarios. |
Relies on pre-trained models (SMPL, segmentation, depth estimation) which might impact performance.
Limited evaluation on highly dynamic scenes with complex occlusions. |
human modeling, neural radiance fields, nerf, view synthesis, data augmentation |
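The sagittal-plane symmetry prior can be sketched as a loss that compares the predicted appearance at a point with the appearance at its mirror image. This is a minimal sketch, assuming the body is in a canonical space where the sagittal plane is x = 0; the function names and the toy appearance functions are hypothetical stand-ins for the learned model.

```python
import numpy as np

def mirror_sagittal(points):
    """Reflect canonical-space points across the x = 0 plane (assumed sagittal plane)."""
    mirrored = points.copy()
    mirrored[:, 0] *= -1.0
    return mirrored

def symmetry_loss(appearance_fn, points):
    """Mean squared difference between appearance at points and at their mirrors."""
    diff = appearance_fn(points) - appearance_fn(mirror_sagittal(points))
    return float(np.mean(diff ** 2))

def symmetric_appearance(p):
    # Toy appearance that depends on |x| only, hence left-right symmetric.
    return np.stack([np.abs(p[:, 0]), p[:, 1], p[:, 2]], axis=1)

def asymmetric_appearance(p):
    # Toy appearance that depends on signed x, hence not symmetric.
    return p.copy()

pts = np.array([[0.2, 1.0, 0.1],
                [-0.4, 0.5, 0.3]])
loss_sym = symmetry_loss(symmetric_appearance, pts)
loss_asym = symmetry_loss(asymmetric_appearance, pts)
```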
2405.19708
Report |
Text Guided Image Editing with Automatic Concept Locating and Forgetting |
Jia Li, Lijie Hu, Zhixian He, Jingfeng Zhang, Tianhang Zheng, Di Wang |
With the advancement of image-to-image diffusion models guided by text,
significant progress has been made in image editing. However, a persistent
challenge remains in seamlessly incorporating objects into images based on
textual instructions, without relying on extra user-provided guidance. Text and
images are inherently distinct modalities, bringing out difficulties in fully
capturing the semantic intent conveyed through language and accurately
translating that into the desired visual modifications. Therefore, text-guided
image editing models often produce generations with residual object attributes
that do not fully align with human expectations. To address this challenge, the
models should comprehend the image content effectively enough to avoid a disconnect
between the provided textual editing prompts and the actual modifications made
to the image. In our paper, we propose a novel method called Locate and Forget
(LaF), which effectively locates potential target concepts in the image for
modification by comparing the syntactic trees of the target prompt and scene
descriptions in the input image, intending to forget their existence clues in
the generated image. Compared to the baselines, our method demonstrates its
superiority in text-guided image editing tasks both qualitatively and
quantitatively. |
This paper presents Locate and Forget (LaF), a novel method for improving text-guided image editing in diffusion models by addressing the challenge of accurately locating and modifying specific concepts within complex image scenes based on textual instructions. |
Existing text-guided image editing models often struggle to accurately align textual instructions with visual modifications, leading to edits that may not fully reflect user intent. LaF aims to overcome this limitation by leveraging scene descriptions to precisely locate target concepts and guide the diffusion model to selectively forget those concepts during the denoising process. |
LaF employs a two-step process: 1) Concept Location: The method generates a scene description of the input image and compares its syntactic tree to the input text prompt to identify the specific concepts targeted for editing. 2) Concept Forgetting: During the denoising steps of the diffusion process, LaF utilizes negative guidance based on the identified concepts, enabling the model to selectively forget or remove those concepts from the generated image. |
LaF demonstrates superior performance in aligning generated images with textual instructions, as evidenced by higher CLIP-T scores compared to baseline methods.
The method exhibits a good balance between editing fidelity and visual quality, achieving competitive Inception Scores while effectively modifying target concepts.
Human preference studies confirm that LaF produces more desirable editing outcomes, with users rating it higher in terms of alignment, fidelity, consistency, and overall preference. |
One limitation is the difficulty in precisely controlling numerical attributes, such as object counts or sizes, during the editing process.
Further research is needed to extend LaF's capabilities to handle more complex editing scenarios, such as multi-object interactions or edits requiring nuanced spatial reasoning. |
text-guided image editing, diffusion models, concept forgetting, scene understanding, multi-modal learning |
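Negative guidance during denoising can be sketched by extending classifier-free guidance with a term that pushes the noise prediction away from the concept to be forgotten. The formula below is one common form of negative guidance and is an assumption, not necessarily LaF's exact formulation; the function name and the weights `w_target` and `w_forget` are hypothetical.

```python
import numpy as np

def guided_noise(eps_uncond, eps_target, eps_forget, w_target=7.5, w_forget=2.0):
    """Move toward the target prompt and away from the concept to forget."""
    return (eps_uncond
            + w_target * (eps_target - eps_uncond)
            - w_forget * (eps_forget - eps_uncond))

# Toy noise predictions standing in for the outputs of a denoising network.
eps_uncond = np.zeros((2, 2))
eps_target = np.ones((2, 2))
eps_forget = 2.0 * np.ones((2, 2))
eps = guided_noise(eps_uncond, eps_target, eps_forget)
```

Setting `w_forget=0.0` recovers plain classifier-free guidance, which makes the effect of the forgetting term easy to isolate.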
2405.19707
Report |
DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark |
Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, Huaxiong Li |
Recently, video generation techniques have advanced rapidly. Given the
popularity of video content on social media platforms, these models intensify
concerns about the spread of fake information. Therefore, there is a growing
demand for detectors capable of distinguishing fake AI-generated videos from real
ones and mitigating the potential harm caused by fake information. However, the lack
of large-scale datasets from the most advanced video generators poses a barrier
to the development of such detectors. To address this gap, we introduce the
first AI-generated video detection dataset, GenVideo. It features the following
characteristics: (1) a large volume of videos, including over one million
AI-generated and real videos collected; (2) a rich diversity of generated
content and methodologies, covering a broad spectrum of video categories and
generation techniques. We conducted extensive studies of the dataset and
proposed two evaluation methods tailored for real-world-like scenarios to
assess the detectors' performance: the cross-generator video classification
task assesses the generalizability of trained detectors on generators; the
degraded video classification task evaluates the robustness of detectors to
handle videos that have degraded in quality during dissemination. Moreover, we
introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to
enhance the detectors by identifying AI-generated videos through the analysis
of inconsistencies in temporal and spatial dimensions. Our extensive
experiments demonstrate DeMamba's superior generalizability and robustness on
GenVideo compared to existing detectors. We believe that the GenVideo dataset
and the DeMamba module will significantly advance the field of AI-generated
video detection. Our code and dataset will be available at
https://github.com/chenhaoxing/DeMamba. |
This paper introduces GenVideo, the first large-scale dataset for AI-generated video detection, featuring over a million videos and diverse generation techniques. It also proposes DeMamba, a plug-and-play module that enhances video detectors by identifying spatial-temporal inconsistencies. |
The rapid advancement of video generation techniques raises concerns about the spread of misinformation. Existing datasets lack the scale and diversity to train robust detectors, hindering efforts to mitigate potential harm. |
GenVideo is built by collecting real videos from established datasets and generating fake videos using various state-of-the-art techniques. DeMamba leverages a structured state-space model to analyze local inconsistencies across video frames. |
DeMamba significantly improves the performance of existing detectors in cross-generator generalization tasks.
The proposed method exhibits strong robustness against video degradation like compression and watermarking.
Ablation studies confirm the importance of DeMamba's components, zone size, and scanning order. |
The training efficiency of DeMamba remains suboptimal, requiring further exploration for lightweight design.
Future work includes expanding GenVideo with more diverse and challenging AI-generated video content. |
ai-generated video detection, misinformation detection, video generation, dataset, spatial-temporal inconsistency |
2405.19671
Report |
GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction |
Haodong Xiang, Xinghui Li, Xiansong Lai, Wanting Zhang, Zhichao Liao, Kai Cheng, Xueping Liu |
Recently, 3D Gaussian Splatting (3DGS) has revolutionized neural rendering
with its high-quality rendering and real-time speed. However, when it comes to
indoor scenes with a significant number of textureless areas, 3DGS yields
incomplete and noisy reconstruction results due to the poor initialization of
the point cloud and under-constrained optimization. Inspired by the continuity
of signed distance field (SDF), which naturally has advantages in modeling
surfaces, we present a unified optimizing framework integrating neural SDF with
3DGS. This framework incorporates a learnable neural SDF field to guide the
densification and pruning of Gaussians, enabling Gaussians to accurately model
scenes even with poorly initialized point clouds. At the same time, the geometry
represented by Gaussians improves the efficiency of the SDF field by piloting
its point sampling. Additionally, we regularize the optimization with normal
and edge priors to eliminate geometry ambiguity in textureless areas and
improve the details. Extensive experiments in ScanNet and ScanNet++ show that
our method achieves state-of-the-art performance in both surface reconstruction
and novel view synthesis. |
This paper introduces GaussianRoom, a novel 3D reconstruction framework that integrates neural Signed Distance Fields (SDF) with 3D Gaussian Splatting (3DGS) to enhance the reconstruction of indoor scenes, particularly in textureless areas. |
Existing methods like 3DGS struggle with incomplete reconstructions in indoor scenes with vast textureless regions, while SDF-based methods, though accurate, are computationally expensive. GaussianRoom addresses these limitations by leveraging the strengths of both approaches. |
The framework employs a mutually beneficial learning strategy: SDF guides the distribution of Gaussian primitives to align with the scene surface, while 3DGS aids in efficient point sampling for the SDF. Additionally, it incorporates monocular normal priors and edge priors to improve geometry reconstruction in textureless areas and enhance detail rendering. |
GaussianRoom outperforms state-of-the-art methods in geometry reconstruction metrics like accuracy, completion, and F-score on both ScanNet and ScanNet++ datasets.
The method exhibits superior rendering quality compared to existing Gaussian-based methods, evident from improvements in SSIM, PSNR, and LPIPS metrics.
Ablation studies confirm the effectiveness of each individual module, particularly the SDF guidance for Gaussian distribution, Gaussian-guided sampling for SDF, and the use of normal and edge priors. |
The neural SDF optimization, although more efficient than some NeRF-based methods, is still computationally more demanding than 3DGS, presenting a bottleneck for training time.
Future work could focus on improving the efficiency of MLP-based neural SDF to accelerate the overall training process. |
3d reconstruction, 3d gaussian splatting, neural signed distance fields, indoor scenes, textureless areas |
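SDF-guided pruning can be sketched as keeping only Gaussians whose centers lie close to the zero level set of the signed distance field. This is a minimal sketch under stated assumptions: an analytic sphere SDF stands in for the learned neural SDF, and the function names and the threshold are hypothetical rather than taken from the paper.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Analytic SDF of a sphere, used here in place of a learned neural SDF."""
    return np.linalg.norm(points, axis=1) - radius

def prune_by_sdf(centers, sdf_fn, threshold=0.1):
    """Keep Gaussian centers whose absolute SDF value is below `threshold`."""
    keep = np.abs(sdf_fn(centers)) < threshold
    return centers[keep], keep

centers = np.array([[1.0, 0.0, 0.0],    # on the surface
                    [0.0, 1.05, 0.0],   # near the surface
                    [0.0, 0.0, 2.0],    # floating far outside
                    [0.1, 0.0, 0.0]])   # deep inside
kept, mask = prune_by_sdf(centers, sphere_sdf)
```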
2405.19657
Report |
Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian |
Wei Sun, Qi Zhang, Yanzhao Zhou, Qixiang Ye, Jianbin Jiao, Yuan Li |
3D Gaussian splatting has demonstrated impressive performance in real-time
novel view synthesis. However, achieving successful reconstruction from RGB
images generally requires multiple input views captured under static
conditions. To address the challenge of sparse input views, previous approaches
have incorporated depth supervision into the training of 3D Gaussians to
mitigate overfitting, using dense predictions from pretrained depth networks as
pseudo-ground truth. Nevertheless, depth predictions from monocular depth
estimation models inherently exhibit significant uncertainty in specific areas.
Relying solely on pixel-wise L2 loss may inadvertently incorporate detrimental
noise from these uncertain areas. In this work, we introduce a novel method to
supervise the depth distribution of 3D Gaussians, utilizing depth priors with
integrated uncertainty estimates. To address these localized errors in depth
predictions, we integrate a patch-wise optimal transport strategy to complement
traditional L2 loss in depth supervision. Extensive experiments conducted on
the LLFF, DTU, and Blender datasets demonstrate that our approach, UGOT,
achieves superior novel view synthesis and consistently outperforms
state-of-the-art methods. |
Introduces UGOT, an Uncertainty-guided Optimal Transport approach for depth supervision in sparse-view 3D Gaussian splatting for novel view synthesis. |
Addresses the challenge of overfitting and geometric inaccuracies in 3D Gaussian splatting with sparse input views, particularly focusing on the limitations of traditional pixel-wise depth supervision. |
Leverages depth priors with integrated uncertainty estimates from generative diffusion models to guide depth optimization and employs a patch-wise optimal transport strategy to align the depth distribution of Gaussian splats with the depth prior. |
Achieves state-of-the-art results on LLFF, DTU, and Blender datasets, demonstrating superior novel view synthesis quality compared to existing methods.
Effectively mitigates the impact of noisy or uncertain depth estimations, leading to more accurate and robust 3D scene reconstruction from sparse views.
Maintains the real-time rendering capabilities of 3D Gaussian splatting while significantly improving the quality of reconstruction in sparse-view scenarios. |
Limited performance improvement in reconstructing untextured backgrounds and voids due to inherent limitations of 3D Gaussian splatting.
Reliance on pre-trained monocular depth estimation models, which may introduce biases or inaccuracies depending on the training data and domain. |
novel view synthesis, 3d gaussian splatting, depth supervision, optimal transport, uncertainty estimation |
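The difference between pixel-wise L2 and a patch-wise optimal transport loss can be illustrated in one dimension, where the Wasserstein-1 distance between two equal-size sample sets is the mean absolute difference of their sorted values. The sketch below compares depth distributions per patch; it is an assumption-laden illustration that omits the uncertainty weighting and does not reproduce UGOT's exact formulation. The function names and patch size are hypothetical.

```python
import numpy as np

def patch_w1(depth_a, depth_b):
    """1D Wasserstein-1 distance between two equal-size depth patches."""
    a = np.sort(depth_a.ravel())
    b = np.sort(depth_b.ravel())
    return float(np.mean(np.abs(a - b)))

def patchwise_ot_loss(rendered, prior, patch=2):
    """Average the per-patch distance over non-overlapping patches."""
    h, w = rendered.shape
    losses = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            losses.append(patch_w1(rendered[i:i + patch, j:j + patch],
                                   prior[i:i + patch, j:j + patch]))
    return float(np.mean(losses))

rendered = np.array([[1.0, 2.0],
                     [3.0, 4.0]])
shuffled = np.array([[4.0, 3.0],
                     [2.0, 1.0]])      # same depth distribution, different pixels
shifted = rendered + 0.5               # distribution shifted by 0.5

l2_shuffled = float(np.mean((rendered - shuffled) ** 2))
ot_shuffled = patchwise_ot_loss(rendered, shuffled)
ot_shifted = patchwise_ot_loss(rendered, shifted)
```

The shuffled patch incurs a large L2 penalty but zero transport cost, which is the sense in which a distribution-level loss tolerates localized pixel errors.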
2405.19614
Report |
TAMBRIDGE: Bridging Frame-Centered Tracking and 3D Gaussian Splatting for Enhanced SLAM |
Peifeng Jiang, Hong Liu, Xia Li, Ti Wang, Fabian Zhang, Joachim M. Buhmann |
The limited robustness of 3D Gaussian Splatting (3DGS) to motion blur and
camera noise, along with its poor real-time performance, restricts its
application in robotic SLAM tasks. Upon analysis, the primary causes of these
issues are the density of views with motion blur and the cumulative errors in
dense pose estimation from calculating losses based on noisy original images
and rendering results, which increase the difficulty of 3DGS rendering
convergence. Thus, a cutting-edge 3DGS-based SLAM system is introduced,
leveraging the efficiency and flexibility of 3DGS to achieve real-time
performance while remaining robust against sensor noise, motion blur, and the
challenges posed by long-session SLAM. Central to this approach is the Fusion
Bridge module, which seamlessly integrates tracking-centered ORB Visual
Odometry with mapping-centered online 3DGS. Precise pose initialization is
enabled by this module through joint optimization of re-projection and
rendering loss, as well as strategic view selection, enhancing rendering
convergence in large-scale scenes. Extensive experiments demonstrate
state-of-the-art rendering quality and localization accuracy, positioning this
system as a promising solution for real-world robotics applications that
require stable, near-real-time performance. Our project is available at
https://ZeldaFromHeaven.github.io/TAMBRIDGE/ |
This paper introduces a novel 3DGS-based SLAM system called TAMBRIDGE that enhances the convergence of online 3DGS by incorporating a plug-and-play Fusion Bridge module. This module integrates tracking-centered ORB Visual Odometry with mapping-centered online 3DGS, enabling precise pose initialization and strategic viewpoint selection. |
This addresses the limitations of existing 3DGS-based SLAM systems, which struggle with real-time performance and robustness against sensor noise and motion blur, especially in long-duration robotic tasks. |
The system employs a four-module structure: a Tracking-centered Frontend Module, a Tracking-centered Global Optimization Module, a Plug and Play Fusion Bridge Module, and an Online 3DGS Backend Module. The Fusion Bridge module is crucial, filtering keyframes, jointly optimizing rendering poses with border masks, and minimizing reprojection and rendering errors. |
TAMBRIDGE achieves state-of-the-art rendering quality and localization accuracy, comparable to SplaTAM but significantly faster.
The system consistently maintains near real-time performance (>5 FPS) even in long-session robotic tasks, outperforming existing NeRF-based and 3DGS-based methods.
Ablation studies highlight the importance of the Fusion Bridge module in bridging the gap between the tracking and mapping paradigms, thereby improving the accuracy and quality of the reconstruction. |
The Viewpoint Selection in the Fusion Bridge relies on manual thresholds and lacks self-learning, potentially limiting its adaptability.
The evaluation primarily focuses on the TUM RGB-D dataset. Expanding to more datasets and exploring alternative SLAM frontends could further validate its generalizability. |
slam, 3d gaussian splatting, robotics perception, real-time performance, sensor noise robustness |
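The re-projection term used in joint pose optimization can be sketched with a pinhole camera model: 3D landmarks are transformed into the camera frame, projected to pixels, and compared with the observed keypoints. This is a generic sketch, not TAMBRIDGE's implementation; the function names, intrinsics and points are hypothetical, and the rendering loss is not included.

```python
import numpy as np

def project(points_cam, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points to pixel coordinates."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def reprojection_error(points_world, R, t, observed_px, intrinsics):
    """Mean pixel distance between projected landmarks and observations."""
    points_cam = points_world @ R.T + t
    pred = project(points_cam, *intrinsics)
    return float(np.mean(np.linalg.norm(pred - observed_px, axis=1)))

intrinsics = (100.0, 100.0, 50.0, 50.0)          # fx, fy, cx, cy (toy values)
points_world = np.array([[0.0, 0.0, 2.0],
                         [1.0, 0.0, 2.0]])
observed_px = np.array([[50.0, 50.0],
                        [100.0, 50.0]])

err_true_pose = reprojection_error(points_world, np.eye(3), np.zeros(3),
                                   observed_px, intrinsics)
err_bad_pose = reprojection_error(points_world, np.eye(3), np.array([0.1, 0.0, 0.0]),
                                  observed_px, intrinsics)
```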
2405.19609
Report |
SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations |
Yujiao Jiang, Qingmin Liao, Zhaolong Wang, Xiangru Lin, Zongqing Lu, Yuxi Zhao, Hanqing Wei, Jingrui Ye, Yu Zhang, Zhijing Shao |
Recovering photorealistic and drivable full-body avatars is crucial for
numerous applications, including virtual reality, 3D games, and tele-presence.
Most methods, whether reconstruction or generation, require large numbers of
human motion sequences and corresponding textured meshes. To easily learn a
drivable avatar, a reasonable parametric body model with unified topology is
paramount. However, existing human body datasets either have images or textured
models and lack parametric models which fit clothes well. We propose a new
parametric model SMPLX-Lite-D, which can fit detailed geometry of the scanned
mesh while maintaining stable geometry in the face, hand and foot regions. We
present SMPLX-Lite dataset, the most comprehensive clothing avatar dataset with
multi-view RGB sequences, keypoints annotations, textured scanned meshes, and
textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a
conditional variational autoencoder model that takes human pose and facial
keypoints as input, and generates a photorealistic drivable human avatar. |
This paper introduces SMPLX-Lite, a comprehensive dataset for photorealistic and drivable avatar research, and proposes SMPLX-Lite-D, a new parametric model optimized for fitting detailed clothing geometry while maintaining facial and hand fidelity. |
Creating realistic, animatable avatars is crucial for various applications, but existing datasets often lack detailed clothing models or parametric representations suitable for driving. |
The authors capture a multi-view dataset of 5 subjects performing 15 actions, reconstruct textured meshes, fit SMPLX-Lite-D models, and train a conditional variational autoencoder (CVAE) to generate avatars from pose and facial keypoints. |
The SMPLX-Lite dataset offers multi-view images, 3D keypoints, textured scanned meshes, and fitted SMPLX-Lite-D models with textures, enabling comprehensive research in drivable avatars.
The SMPLX-Lite-D model, derived from SMPL-X, simplifies vertex fitting for clothing while retaining high-fidelity facial and hand representation.
The trained CVAE model effectively generates photorealistic avatars driven by pose and facial keypoints, outperforming baselines in novel view and pose synthesis. |
The current dataset, while extensive, could benefit from further diversity in action sequences and clothing styles.
The proposed driving algorithm, while effective, can be improved to enhance generalization capabilities and facial expression control. |
drivable avatars, dataset, 3d human reconstruction, parametric models, conditional variational autoencoder |
2405.19450
Report |
FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining |
Dong Li, Yidi Liu, Xueyang Fu, Senyan Xu, Zheng-Jun Zha |
Image deraining aims to remove rain streaks from rainy images and restore
clear backgrounds. Recent work employing the Fourier transform has proved
effective for image deraining, because the transform acts as an effective
frequency prior for capturing rain streaks. However, although low and high
frequencies in images depend on each other, these Fourier-based methods
rarely exploit the correlation between different frequencies to couple their
learning procedures, limiting the full utilization of frequency information
for image deraining. Meanwhile, the recently emerged Mamba technique has
shown effectiveness and efficiency for modeling
correlation in various domains (e.g., spatial, temporal), and we argue that
introducing Mamba into its unexplored Fourier spaces to correlate different
frequencies would help improve image deraining. This motivates us to propose a
new framework termed FourierMamba, which performs image deraining with Mamba in
the Fourier space. Owing to the unique arrangement of frequency orders in
Fourier space, the core of FourierMamba lies in the scanning encoding of
different frequencies, where the low-to-high frequency ordering differs
between the spatial dimension (not arranged along an axis) and the channel
dimension (arranged along an axis). Therefore, we design FourierMamba to
correlate Fourier
space information in the spatial and channel dimensions with distinct designs.
Specifically, in the spatial dimension Fourier space, we introduce the zigzag
coding to scan the frequencies to rearrange the orders from low to high
frequencies, thereby orderly correlating the connections between frequencies;
in the channel dimension Fourier space with arranged orders of frequencies in
axis, we can directly use Mamba to perform frequency correlation and improve
the channel information representation. |
Proposes FourierMamba, a novel image deraining framework that leverages Mamba, a type of State Space Model, to correlate different frequencies within the Fourier domain, improving rain streak removal and image restoration. |
Previous Fourier-based deraining methods fail to fully utilize frequency information by neglecting correlations between different frequencies, limiting their effectiveness. |
FourierMamba employs a multi-scale U-Net architecture with Fourier Residual State-Space Blocks (FRSSB). These blocks implement: 1) **Fourier Spatial Interaction SSM:** utilizes zigzag-based scanning methods to correlate frequencies in the spatial dimension of Fourier space, addressing directional sensitivity limitations, and 2) **Fourier Channel Evolution SSM:** directly applies Mamba to correlate ordered frequencies in the channel dimension of Fourier space, improving global feature representation. |
FourierMamba achieves state-of-the-art performance on benchmark datasets (Rain100H, Rain100L, Test2800, Test1200), outperforming existing methods in both PSNR and SSIM metrics.
Ablation studies demonstrate the effectiveness of key components like Fourier Spatial/Channel SSMs, Fourier priors, and proposed zigzag scanning methods.
Qualitative analysis highlights FourierMamba's superior performance in removing rain streaks and restoring image details, particularly in complex and severe rain conditions. |
The reliance on fixed scanning patterns may limit adaptability to varying rain characteristics.
Exploring adaptive scanning strategies based on rain density and direction could further enhance deraining performance. |
image deraining, fourier transform, state space models, mamba, frequency correlation |
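For the FourierMamba entry above, the zigzag scan that reorders a 2D frequency grid into a 1D low-to-high sequence can be illustrated with a minimal sketch. It assumes the lowest frequency sits at the top-left corner (a JPEG-style layout) and walks the anti-diagonals in alternating directions. The function names `zigzag_order` and `zigzag_flatten` are hypothetical, and the scanning variants actually used in the paper may differ.

```python
def zigzag_order(height, width):
    """Return (row, col) pairs in JPEG-style zigzag order.

    Cells on the same anti-diagonal (row + col == s) share a similar
    frequency magnitude, so visiting diagonals in increasing s goes
    roughly from low to high frequency.
    """
    order = []
    for s in range(height + width - 1):
        diag = [(i, s - i) for i in range(height) if 0 <= s - i < width]
        if s % 2 == 0:
            diag.reverse()  # alternate direction on every other diagonal
        order.extend(diag)
    return order


def zigzag_flatten(grid):
    """Flatten a 2D list into a 1D sequence following the zigzag order."""
    return [grid[i][j] for i, j in zigzag_order(len(grid), len(grid[0]))]


# Toy 3x3 grid whose entries are numbered in zigzag order.
grid = [
    [0, 1, 5],
    [2, 4, 6],
    [3, 7, 8],
]
sequence = zigzag_flatten(grid)
```

The resulting 1D sequence is what a sequence model such as Mamba would consume; in the channel dimension no such reordering is needed, since the entry states that the frequencies are already ordered along the axis.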
2405.19335
Report |
X-VILA: Cross-Modality Alignment for Large Language Model |
Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin |
We introduce X-VILA, an omni-modality model designed to extend the
capabilities of large language models (LLMs) by incorporating image, video, and
audio modalities. By aligning modality-specific encoders with LLM inputs and
diffusion decoders with LLM outputs, X-VILA achieves cross-modality
understanding, reasoning, and generation. To facilitate this cross-modality
alignment, we curate an effective interleaved any-to-any modality
instruction-following dataset. Furthermore, we identify a significant problem
with the current cross-modality alignment method, which results in visual
information loss. To address the issue, we propose a visual alignment mechanism
with a visual embedding highway module. We then introduce a resource-efficient
recipe for training X-VILA, which exhibits proficiency in any-to-any modality
conversation, surpassing previous approaches by large margins. X-VILA also
showcases emergent properties across modalities even in the absence of similar
training data. The project will be made open-source. |
Introduces X-VILA, an omni-modality LLM that integrates image, video, and audio modalities to achieve cross-modality understanding, reasoning, and generation. |
Extends LLM capabilities beyond text, enabling multi-modal conversations and content generation. |
Aligns modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs. Utilizes a novel two-phase alignment mechanism (textual and visual) and introduces a visual embedding highway (VEH) to preserve visual details. |
Achieves any-to-any modality (X-to-X) conversation, surpassing previous approaches.
Demonstrates emergent properties, like long-context cross-modality generation and unseen cross-modality abilities (e.g., image-to-audio).
Shows significant improvements in visual correspondence on X-to-X alignment benchmarks compared to state-of-the-art methods. |
Further performance improvement is possible across VLM benchmarks.
Exploring other techniques beyond VEH to further improve visual alignment. |
multi-modality, large language model, cross-modality alignment, visual embedding highway, x-to-x generation |
2405.19331
Report |
NPGA: Neural Parametric Gaussian Avatars |
Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner |
The creation of high-fidelity, digital versions of human heads is an
important stepping stone in the process of further integrating virtual
components into our everyday lives. Constructing such avatars is a challenging
research problem, due to a high demand for photo-realism and real-time
rendering performance. In this work, we propose Neural Parametric Gaussian
Avatars (NPGA), a data-driven approach to create high-fidelity, controllable
avatars from multi-view video recordings. We build our method around 3D
Gaussian Splatting for its highly efficient rendering and to inherit the
topological flexibility of point clouds. In contrast to previous work, we
condition our avatars' dynamics on the rich expression space of neural
parametric head models (NPHM), instead of mesh-based 3DMMs. To this end, we
distill the backward deformation field of our underlying NPHM into forward
deformations which are compatible with rasterization-based rendering. All
remaining fine-scale, expression-dependent details are learned from the
multi-view videos. To increase the representational capacity of our avatars, we
augment the canonical Gaussian point cloud using per-primitive latent features
which govern its dynamic behavior. To regularize this increased dynamic
expressivity, we propose Laplacian terms on the latent features and predicted
dynamics. We evaluate our method on the public NeRSemble dataset, demonstrating
that NPGA significantly outperforms the previous state-of-the-art avatars on
the self-reenactment task by 2.6 PSNR. Furthermore, we demonstrate accurate
animation capabilities from real-world monocular videos. |
This paper proposes Neural Parametric Gaussian Avatars (NPGA), a method for creating high-fidelity, controllable avatars from multi-view videos by leveraging the expressive power of Neural Parametric Head Models (NPHMs). |
Creating realistic and controllable digital avatars is crucial for various applications like VR, AR, and the metaverse. |
The method distills the backward deformation field of a pre-trained NPHM into a forward deformation field compatible with 3D Gaussian Splatting (3DGS). This prior guides the avatar's motion, while per-Gaussian latent features and a detail network capture fine-grained details and appearance changes. The approach uses a cycle-consistency loss for distillation and optimizes the avatar using a photometric loss with Laplacian regularization. |
NPGA outperforms previous state-of-the-art methods on self-reenactment by a significant margin (2.6 PSNR improvement).
The method enables accurate cross-reenactment, transferring expressions from one person to the avatar.
The avatars can be animated from monocular RGB videos, demonstrating applicability outside controlled environments. |
The controllability is limited by the underlying NPHM, restricting animation of regions like the neck and torso.
The data-driven nature limits the avatar to expressions observed in the training data. |
avatar creation, 3d gaussian splatting, neural parametric head model, facial reenactment, multi-view video |
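For the NPGA entry above, the idea of distilling a backward deformation field into a forward one with a cycle-consistency loss can be sketched as follows. This is a toy illustration only: the deformation functions are simple translations, `cycle_consistency_loss` and the other names are hypothetical, and the exact loss used in the paper is not given in the entry. The assumption is that a canonical point pushed through the forward field and then pulled back through the (fixed) backward field should return to where it started.

```python
def cycle_consistency_loss(points, code, forward, backward):
    """Mean squared error between x and backward(forward(x, code), code)."""
    total = 0.0
    for p in points:
        recovered = backward(forward(p, code), code)
        total += sum((a - b) ** 2 for a, b in zip(recovered, p))
    return total / len(points)


# Toy deformation: the expression code shifts points along a fixed offset.
OFFSET = (1.0, 0.0, 0.0)


def toy_backward(p, code):
    """Stand-in for the pre-trained backward field (posed -> canonical)."""
    return tuple(a - code * o for a, o in zip(p, OFFSET))


def good_forward(p, code):
    """A forward field that exactly inverts toy_backward."""
    return tuple(a + code * o for a, o in zip(p, OFFSET))


def identity_forward(p, code):
    """A poor forward field that ignores the expression code."""
    return p


points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
loss_good = cycle_consistency_loss(points, 0.5, good_forward, toy_backward)
loss_bad = cycle_consistency_loss(points, 0.5, identity_forward, toy_backward)
```

In this sketch the loss is zero only when the forward field inverts the backward one, which is the property that makes the distilled field usable with rasterization-based rendering.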
2405.19326
Report |
Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models |
Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, Lingyun Sun |
In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation
for part search and localization in objects, a new paradigm for 3D
segmentation that transcends the limitations of previous category-specific 3D
semantic segmentation, 3D instance segmentation, and open-vocabulary 3D
segmentation. We design a simple baseline method, Reasoning3D, with the
capability to understand and execute complex commands for (fine-grained)
segmentation of specific parts of 3D meshes with contextual awareness and reasoned
answers for interactive segmentation. Specifically, Reasoning3D leverages an
off-the-shelf pre-trained 2D segmentation network, powered by Large Language
Models (LLMs), to interpret user input queries in a zero-shot manner. Previous
research has shown that extensive pre-training endows foundation models with
prior world knowledge, enabling them to comprehend complex commands, a
capability we can harness to "segment anything" in 3D with limited 3D datasets
(source efficient). Experimentation reveals that our approach is generalizable
and can effectively localize and highlight parts of 3D objects (in 3D mesh)
based on implicit textual queries, including articulated 3D objects and
real-world scanned data. Our method can also generate natural language
explanations corresponding to these 3D models and the decomposition. Moreover,
our training-free approach allows rapid deployment and serves as a viable
universal baseline for future research on part-level 3D (semantic) object
understanding in various fields including robotics, object manipulation, part
assembly, autonomous driving applications, augmented reality and virtual reality
(AR/VR), and medical applications. The code, the model weights, the deployment
guide, and the evaluation protocol are available at:
http://tianrun-chen.github.io/Reason3D/ |
Introduces Zero-Shot 3D Reasoning Segmentation for part localization in 3D objects using natural language, going beyond traditional 3D segmentation limitations. |
Enables flexible and intuitive interaction with 3D objects using natural language, beneficial for robotics, AR/VR, and other fields. |
Leverages pre-trained 2D reasoning segmentation networks and LLMs to interpret user queries, rendering 3D objects from multiple viewpoints for 2D processing and fusing the results back into 3D. |
Achieves competitive performance on open-vocabulary 3D segmentation benchmarks.
Successfully segments parts of 3D objects based on implicit textual queries.
Provides natural language explanations for segmentation results. |
Comprehensive benchmarking and user studies are needed for further evaluation.
Optimizing viewpoint selection could improve performance. |
3d segmentation, reasoning segmentation, 3d part understanding, large language models, large vision-language models |
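For the Reasoning3D entry above, the last step (fusing per-view 2D segmentation results back onto the 3D mesh) can be sketched as a per-face vote. This is an assumed fusion rule, not one taken from the entry: each rendered view reports, for the faces it can see, whether they fall inside the 2D mask, and a face is kept when its share of positive votes reaches a threshold. The name `fuse_view_masks` and the 0.5 default are hypothetical.

```python
def fuse_view_masks(num_faces, views, threshold=0.5):
    """Select mesh faces by voting across rendered views.

    `views` is a list of dicts mapping a visible face id to True/False
    (inside or outside the 2D mask in that view). Faces never seen by
    any view are left unselected.
    """
    seen = [0] * num_faces
    hits = [0] * num_faces
    for view in views:
        for face_id, in_mask in view.items():
            seen[face_id] += 1
            if in_mask:
                hits[face_id] += 1
    return [f for f in range(num_faces)
            if seen[f] > 0 and hits[f] / seen[f] >= threshold]


# Three toy views of a five-face mesh; face 4 is never visible.
views = [
    {0: True, 1: True, 2: False},
    {1: True, 2: False, 3: True},
    {0: False, 1: True, 3: False},
]
selected = fuse_view_masks(num_faces=5, views=views)
strict = fuse_view_masks(num_faces=5, views=views, threshold=0.6)
```

Normalizing by the number of views that see a face, rather than by the total number of views, avoids penalizing faces that are occluded from most viewpoints.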
2405.19315
Report |
Matryoshka Query Transformer for Large Vision-Language Models |
Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang |
Large Vision-Language Models (LVLMs) typically encode an image into a fixed
number of visual tokens (e.g., 576) and process these tokens with a language
model. Despite their strong performance, LVLMs face challenges in adapting to
varying computational constraints. This raises the question: can we achieve
flexibility in the number of visual tokens to suit different tasks and
computational resources? We answer this with an emphatic yes. Inspired by
Matryoshka Representation Learning, we introduce the Matryoshka Query
Transformer (MQT), capable of encoding an image into m visual tokens during
inference, where m can be any number up to a predefined maximum. This is
achieved by employing a query transformer with M latent query tokens to
compress the visual embeddings. During each training step, we randomly select m
<= M latent query tokens and train the model using only these first m tokens,
discarding the rest. Combining MQT with LLaVA, we train a single model once,
and flexibly and drastically reduce the number of inference-time visual tokens
while maintaining similar or better performance compared to training
independent models for each number of tokens. Our model, MQT-LLAVA, matches
LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens
instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) only
sacrifices 2.4 points of performance on MMBench. On certain tasks such as
ScienceQA and MMMU, we can even go down to only 2 visual tokens with
performance drops of just 3% and 6% each. Our exploration of the trade-off
between the accuracy and computational cost brought about by the number of
visual tokens facilitates future research to achieve the best of both worlds. |
Introduces Matryoshka Query Transformer (MQT), enabling flexible selection of visual token numbers in Large Vision-Language Models (LVLMs) at inference time, adapting to computational constraints. |
Existing LVLMs use a fixed number of visual tokens, posing challenges for tasks with varying computational resources or requiring different levels of visual granularity. |
Trains a query transformer with a Matryoshka structure, randomly dropping tail tokens during training to enable inference with any number of tokens up to a predefined maximum. |
Achieves comparable or superior performance to fixed-token models.
Matches LLaVA-1.5 performance on 11 benchmarks with less than half the tokens.
Exhibits different token sensitivity across tasks, with some remaining robust even with drastically reduced tokens. |
Current maximum token number at inference limited to 256.
Future work to explore exceeding training token limits during inference. |
vision-language models, matryoshka representation learning, query transformer, elastic inference, computational efficiency |
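For the Matryoshka Query Transformer entry above, the tail-token dropping used at each training step can be sketched in a few lines. The entry states that m <= M latent query tokens are chosen at random and only the first m are used; the uniform sampling over 1..M below is an assumption, and the helper names are hypothetical. Strings stand in for learned query vectors.

```python
import random


def sample_num_tokens(max_tokens, rng):
    """Pick m from 1..max_tokens (uniform sampling is an assumption)."""
    return rng.randint(1, max_tokens)


def keep_first_m(query_tokens, m):
    """Matryoshka-style truncation: keep the first m queries, drop the tail."""
    if not 1 <= m <= len(query_tokens):
        raise ValueError("m must be between 1 and the number of queries")
    return query_tokens[:m]


rng = random.Random(0)
M = 256
latent_queries = [f"q{i}" for i in range(M)]  # stand-ins for learned vectors

# One simulated training step: sample m, then use only the first m queries.
m = sample_num_tokens(M, rng)
kept = keep_first_m(latent_queries, m)

# At inference time the budget can be chosen freely, e.g. 16 tokens.
inference_tokens = keep_first_m(latent_queries, 16)
```

Because every prefix of the query list is trained, any prefix length up to M is a valid token budget at inference time, which is what gives the nested, Matryoshka-like structure.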
2405.19237
Report |
ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning |
Ruchika Chavhan, Da Li, Timothy Hospedales |
While large-scale text-to-image diffusion models have demonstrated impressive
image-generation capabilities, there are significant concerns about their
potential misuse for generating unsafe content, violating copyright, and
perpetuating societal biases. Recently, the text-to-image generation community
has begun addressing these concerns by editing or unlearning undesired concepts
from pre-trained models. However, these methods often involve data-intensive
and inefficient fine-tuning or utilize various forms of token remapping,
rendering them susceptible to adversarial jailbreaks. In this paper, we present
a simple and effective training-free approach, ConceptPrune, wherein we first
identify critical regions within pre-trained models responsible for generating
undesirable concepts, thereby facilitating straightforward concept unlearning
via weight pruning. Experiments across a range of concepts including artistic
styles, nudity, object erasure, and gender debiasing demonstrate that target
concepts can be efficiently erased by pruning a tiny fraction, approximately
0.12% of total weights, enabling multi-concept erasure and robustness against
various white-box and black-box adversarial attacks. |
This paper introduces ConceptPrune, a training-free method for concept editing in pre-trained diffusion models by identifying and pruning skilled neurons responsible for generating undesired concepts. |
This method addresses the risks of large-scale text-to-image models generating unsafe content, violating copyright, and perpetuating societal biases by providing a more efficient and robust alternative to existing concept editing and unlearning techniques. |
ConceptPrune identifies skilled neurons in Feed-Forward Networks (FFNs) of diffusion models by comparing the importance scores of neuron activations for target and reference concepts using a pruning strategy inspired by Wanda. These skilled neurons are then pruned to eliminate the undesired concept. |
ConceptPrune effectively erases undesired concepts like artistic styles, nudity, and objects, as demonstrated by quantitative metrics and qualitative examples.
ConceptPrune exhibits strong robustness to both white-box and black-box adversarial attacks aimed at circumventing concept erasure.
Concept-generating neurons are localized to a very compact subspace, suggesting efficient concept editing with minimal impact on overall model performance. |
There might be some degree of interference when erasing fine-grained classes or concepts.
Erasing a large number of objects simultaneously may degrade the overall image generation quality. |
concept editing, diffusion models, model pruning, adversarial robustness, text-to-image generation |
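For the ConceptPrune entry above, the comparison of Wanda-style importance scores between a target and a reference concept can be sketched as follows. Wanda scores a weight by its magnitude times the L2 norm of the corresponding input feature. The selection rule used here (prune a weight if it is among the top fraction of target scores in its row and scores higher for the target than for the reference) is an assumption for illustration, and `prune_skilled` is a hypothetical name.

```python
import math


def wanda_scores(weights, activations):
    """Wanda-style score: |W[i][j]| times the L2 norm of input feature j."""
    n_in = len(weights[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def prune_skilled(weights, target_acts, reference_acts, top_fraction):
    """Zero weights that matter for the target concept more than the reference."""
    s_t = wanda_scores(weights, target_acts)
    s_r = wanda_scores(weights, reference_acts)
    pruned = [row[:] for row in weights]  # leave the original untouched
    for i, row in enumerate(weights):
        k = max(1, int(top_fraction * len(row)))
        top = sorted(range(len(row)), key=lambda j: s_t[i][j], reverse=True)[:k]
        for j in top:
            if s_t[i][j] > s_r[i][j]:
                pruned[i][j] = 0.0
    return pruned


# Toy FFN layer (2 outputs, 3 inputs) and toy input activations.
weights = [[1.0, -2.0, 0.5], [0.3, 0.1, -4.0]]
target_acts = [[0.0, 3.0, 0.0], [0.0, 4.0, 0.0]]  # target excites feature 1
reference_acts = [[1.0, 0.0, 1.0]]                # reference uses 0 and 2
pruned = prune_skilled(weights, target_acts, reference_acts, top_fraction=0.34)
```

In this toy case only the weights reading from feature 1 are removed, so the paths used by the reference concept stay intact.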
2405.19209
Report |
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos |
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal |
Video-language understanding models have focused on short video clips and often
struggle with long-form video understanding tasks. Recently, many long
video-language understanding approaches have leveraged the reasoning
capabilities of Large Language Models (LLMs) to perform long video QA,
transforming videos into densely sampled frame captions, and asking LLMs to
respond to text queries over captions. However, the frames used for captioning
are often redundant and contain irrelevant information, making dense sampling
inefficient, and ignoring the fact that video QA requires varying levels of
granularity, with some video segments being highly relevant to the question
(needing more fine-grained detail) while others are less relevant. Thus,
these LLM-based approaches are prone to missing information and operate on
large numbers of irrelevant captions, lowering both performance and efficiency.
To address these issues, we introduce VideoTree, a query-adaptive and
hierarchical framework for long-video understanding with LLMs. VideoTree
dynamically extracts query-related information from a video and builds a
tree-based representation for LLM reasoning. First, VideoTree adaptively
selects frames for captioning by iteratively clustering frames based on their
visual features and scoring clusters using their relevance to the query.
Second, it organizes visual clusters into a query-adaptive and hierarchical
tree structure; the tree encodes varying levels of granularity, with higher
resolution on relevant segments. Finally, VideoTree produces an answer by
traversing the tree's keyframes and passing their captions to an LLM answerer.
Our method improves both reasoning accuracy and efficiency compared to existing
methods: VideoTree achieves a 7.0%, 2.2%, and 2.7% accuracy gain over baselines
on the EgoSchema, NExT-QA, and IntentQA benchmarks, respectively, while
reducing inference time by 40%. |
This paper introduces VideoTree, a query-adaptive and hierarchical framework that enhances long-video understanding with Large Language Models (LLMs) by dynamically building a tree-based video representation. |
Existing LLM-based long-video understanding methods suffer from inefficiencies and inaccuracies due to redundant frame information, lack of query adaptation, and inability to capture hierarchical video structure. |
VideoTree uses a three-step process: (1) Adaptive Breadth Expansion: Clusters frames based on visual features and relevance to the query, (2) Relevance-Guided Depth Expansion: Explores relevant clusters in a coarse-to-fine manner to extract detailed information, (3) LLM-based Reasoning: Leverages captions from selected keyframes for question answering. |
VideoTree achieves state-of-the-art accuracy on EgoSchema, NExT-QA, and IntentQA benchmarks, outperforming previous methods by significant margins.
The method demonstrates both improved accuracy and efficiency, requiring fewer captions than uniform sampling baselines to achieve comparable or better performance.
Qualitative analysis reveals VideoTree's ability to effectively identify and focus on query-relevant video segments while filtering out irrelevant information. |
The effectiveness of VideoTree relies on the accuracy of the VLM captioner used.
While training-free, VideoTree includes hyperparameters, though experiments demonstrate its robustness even with suboptimal settings. |
long-form video understanding, large language models, query-adaptive representation, hierarchical video representation, video question answering |
2405.19035
Report |
A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation |
Niclas Vödisch, Kürsat Petek, Markus Käppeler, Abhinav Valada, Wolfram Burgard |
A key challenge for the widespread application of learning-based models for
robotic perception is to significantly reduce the required amount of annotated
training data while achieving accurate predictions. This is essential not only
to decrease operating costs but also to speed up deployment time. In this work,
we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by
exploiting the groundwork paved by visual foundation models. We leverage
descriptive image features from such a model to train two lightweight network
heads for semantic segmentation and object boundary detection, using very few
annotated training samples. We then merge their predictions via a novel fusion
module that yields panoptic maps based on normalized cut. To further enhance
the performance, we utilize self-training on unlabeled images selected by a
feature-driven similarity scheme. We underline the relevance of our approach by
applying PASTEL to important robot perception use cases from autonomous
driving and agricultural robotics. In extensive experiments, we demonstrate
that PASTEL significantly outperforms previous methods for label-efficient
segmentation even when using fewer annotations. The code of our work is
publicly available at http://pastel.cs.uni-freiburg.de. |
This paper introduces PASTEL, a novel approach for label-efficient panoptic segmentation that leverages the descriptive image features from the DINOv2 foundation model. |
Reducing the dependency on large, densely annotated datasets for training segmentation models is crucial for lowering operational costs and speeding up deployment, particularly in robotics. |
PASTEL employs a frozen DINOv2 backbone for feature extraction and trains lightweight heads for semantic segmentation and object boundary detection using very few annotated images. It then merges the predictions through a novel fusion module based on normalized cut and refines performance via self-training on unlabeled, visually similar images. |
PASTEL achieves state-of-the-art performance on Cityscapes, Pascal VOC, and PhenoBench datasets using significantly fewer labeled images than previous methods (as few as 10 annotated images).
The method effectively leverages self-training on unlabeled data to further improve segmentation accuracy.
PASTEL can be used as a plugin to generate pseudo-labels, rendering conventional densely supervised models label-efficient. |
The current method struggles to assign the same instance ID to different parts of the same object when occlusion is present, leading to over-segmentation.
All semantic classes must be present in the few labeled training images, limiting applicability in some scenarios. |
panoptic segmentation, label-efficient learning, foundation models, dinov2, self-training |
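For the PASTEL entry above, the selection of unlabeled images for self-training by feature similarity can be illustrated with a minimal sketch. It assumes each image is summarized by one feature vector (for instance, a pooled backbone feature) and ranks unlabeled images by their highest cosine similarity to any labeled image. The ranking rule, the function names, and the toy vectors are all assumptions; the entry does not specify the exact scheme.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_similar(labeled_feats, unlabeled_feats, k):
    """Return the names of the k unlabeled images closest to the labeled set."""
    scored = [(max(cosine(f, g) for g in labeled_feats), name)
              for name, f in unlabeled_feats.items()]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy 2D features: one labeled image and three unlabeled candidates.
labeled = [(1.0, 0.0)]
unlabeled = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 1.0)}
chosen = select_similar(labeled, unlabeled, k=2)
```

The selected images would then receive pseudo-labels from the current model and be added to the training set in the self-training stage.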
2405.18991
Report |
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture |
Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang |
This paper presents EasyAnimate, an advanced method for video generation that
leverages the power of transformer architecture for high-performance outcomes.
We have expanded the DiT framework originally designed for 2D image synthesis
to accommodate the complexities of 3D video generation by incorporating a
motion module block. It is used to capture temporal dynamics, thereby ensuring
the production of consistent frames and seamless motion transitions. The motion
module can be adapted to various DiT baseline methods to generate video with
different styles. It can also generate videos with different frame rates and
resolutions during both training and inference phases, suitable for both images
and videos. Moreover, we introduce slice VAE, a novel approach to condense the
temporal axis, facilitating the generation of long duration videos. Currently,
EasyAnimate exhibits the proficiency to generate videos with 144 frames. We
provide a holistic ecosystem for video production based on DiT, encompassing
aspects such as data pre-processing, VAE training, DiT models training (both
the baseline model and LoRA model), and end-to-end video inference. Code is
available at: https://github.com/aigc-apps/EasyAnimate. We are continuously
working to enhance the performance of our method. |
Introduces EasyAnimate, a high-performance AI video generation pipeline based on transformer architecture, featuring a motion module for smooth transitions and adaptable frame/resolution settings. |
Addresses limitations in existing video generation models like poor quality, limited length, and unnatural movement. |
Expands the DiT framework with a motion module, slice VAE for long video generation, and a three-stage training process using image and video data. |
Achieves high-performance video generation with consistent frames and smooth motion.
Generates videos with different frame rates and resolutions, suitable for both images and videos.
Enables long-duration video generation (up to 144 frames currently). |
Video quality still being improved.
Further exploration on motion module design for enhanced motion generation. |
video generation, transformer, motion module, slice vae, dit |
2405.18937
Report |
Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding |
Junjie Fei, Mahmoud Ahmed, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny |
While 3D MLLMs have achieved significant progress, they are restricted to
object and scene understanding and struggle to understand 3D spatial structures
at the part level. In this paper, we introduce Kestrel, representing a novel
approach that empowers 3D MLLMs with part-aware understanding, enabling better
interpretation and segmentation grounding of 3D objects at the part level.
Despite its significance, the current landscape lacks tasks and datasets that
endow and assess this capability. Therefore, we propose two novel tasks: (1)
Part-Aware Point Grounding, in which the model is tasked with directly
predicting a part-level segmentation mask based on user instructions, and (2)
Part-Aware Point Grounded Captioning, in which the model provides a detailed
caption that includes part-level descriptions and their corresponding masks. To
support learning and evaluation for these tasks, we introduce the 3DCoMPaT
Grounded Instructions Dataset
(3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point
cloud-instruction-segmentation mask triplets, is used to evaluate MLLMs'
ability of part-aware segmentation grounding. 3DCoMPaT-GRIN Grounded Caption,
containing 107k part-aware point cloud-instruction-grounded caption triplets,
assesses both MLLMs' part-aware language comprehension and segmentation
grounding capabilities. Our introduced tasks, dataset, and Kestrel represent a
preliminary effort to bridge the gap between human cognition and 3D MLLMs,
i.e., the ability to perceive and engage with the environment at both global
and part levels. Extensive experiments on the 3DCoMPaT-GRIN show that Kestrel
can generate user-specified segmentation masks, a capability not present in any
existing 3D MLLM. Kestrel thus establishes a benchmark for evaluating the
part-aware language comprehension and segmentation grounding of 3D objects.
Project page at https://feielysia.github.io/Kestrel.github.io/ |
This paper introduces Kestrel, a 3D Multimodal Large Language Model (MLLM) that understands and grounds objects at the part level. |
Existing 3D MLLMs struggle to understand 3D structures at the part level, limiting their ability to interact with the environment in a nuanced way, like humans. |
The paper introduces two new tasks: (1) part-aware point grounding - predicting part-level segmentation masks based on user instructions, and (2) part-aware point grounded captioning - generating detailed captions with part-level descriptions and corresponding segmentation masks. A new dataset, 3DCoMPaT-GRIN, is created for these tasks. Kestrel incorporates a 3D segmentation grounding module to enable part-level understanding. |
Kestrel significantly outperforms baseline models in part-aware point grounding, demonstrating accurate part and material localization.
Kestrel excels in part-aware point grounded captioning, generating detailed descriptions and accurately grounding mentioned parts.
Ablation studies show the importance of LoRA rank and the choice of projection layer in Kestrel's performance. |
Current annotation in 3DCoMPaT-GRIN is limited to part and material masks and could be extended to include more part-level attributes.
Future work aims to extend the part-aware segmentation grounding capability beyond single objects to enhance interaction with the 3D world. |
3d vision-language models, part-aware understanding, segmentation grounding, 3d point cloud understanding, multimodal learning |
2405.18897
Report |
MLAE: Masked LoRA Experts for Parameter-Efficient Fine-Tuning |
Junjie Wang, Guangjing Yang, Wentao Chen, Huahui Yi, Xiaohu Wu, Qicheng Lao |
In response to the challenges posed by the extensive parameter updates
required for full fine-tuning of large-scale pre-trained models,
parameter-efficient fine-tuning (PEFT) methods, exemplified by Low-Rank
Adaptation (LoRA), have emerged. LoRA simplifies the fine-tuning process but
may still struggle with a certain level of redundancy in low-rank matrices and
limited effectiveness from merely increasing their rank. To address these
issues, a natural idea is to enhance the independence and diversity of the
learning process for the low-rank matrices. Therefore, we propose Masked LoRA
Experts (MLAE), an innovative approach that applies the concept of masking to
PEFT. Our method incorporates a cellular decomposition strategy that transforms
a low-rank matrix into independent rank-1 submatrices, or "experts", thus
enhancing independence. Additionally, we introduce a binary mask matrix that
selectively activates these experts during training to promote more diverse and
anisotropic learning, based on expert-level dropout strategies. Our
investigations reveal that this selective activation not only enhances
performance but also fosters a more diverse acquisition of knowledge with a
marked decrease in parameter similarity among MLAE, significantly boosting the
quality of the model while barely increasing the parameter count. Remarkably,
MLAE achieves new SOTA performance with an average accuracy score of 78.8% on
the VTAB-1k benchmark and 90.9% on the FGVC benchmark, demonstrating superior
performance. Our code is available at https://github.com/jie040109/MLAE. |
The paper proposes Masked LoRA Experts (MLAE), a novel parameter-efficient fine-tuning method that applies masking to enhance the independence and diversity of learning in low-rank matrices. |
Existing parameter-efficient fine-tuning methods, particularly those based on Low-Rank Adaptation (LoRA), struggle with redundancy and limited effectiveness in improving model quality. MLAE addresses these limitations by promoting diverse and independent learning in low-rank matrices. |
MLAE decomposes low-rank matrices into rank-1 submatrices, treating them as independent experts. It then introduces a mask matrix with adaptive coefficients, applying it to the decomposed matrix to selectively activate experts during training. This selective activation, implemented through expert-level dropout, enhances diversity and reduces redundancy. |
MLAE achieves state-of-the-art performance on the VTAB-1k benchmark with an average accuracy of 78.8% and on the FGVC benchmark with 90.9% accuracy.
The method demonstrates significantly reduced parameter similarity compared to vanilla LoRA, indicating enhanced independence among learned experts.
Feature attention map visualizations reveal that different MLAE experts focus on distinct feature areas within the same block, highlighting the diversity and complementarity of their representations. |
The optimal probability of stochastic masking varies across datasets, necessitating dataset-specific tuning.
Future work could explore metrics to determine optimal masking probabilities based on dataset characteristics or training performance, and investigate layer-wise optimal probabilities. |
parameter-efficient fine-tuning, low-rank adaptation (lora), masking strategies, vision transformers (vit), transfer learning |
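The decomposition described in the MLAE record above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a low-rank update is written as a sum of rank-1 "experts", and a binary mask drawn by expert-level dropout decides which experts contribute. The function names, the toy matrices, and the drop probability are all hypothetical, and the adaptive coefficients mentioned in the summary are left out.

```python
import random

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def masked_lora_delta(b_cols, a_rows, mask):
    """Sum the rank-1 experts b_i a_i^T, each gated by mask[i] (0 or 1)."""
    d_out, d_in = len(b_cols[0]), len(a_rows[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for b, a, m in zip(b_cols, a_rows, mask):
        if m == 0:
            continue  # this expert is switched off for the current step
        expert = outer(b, a)
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += m * expert[i][j]
    return delta

def sample_expert_mask(rank, drop_prob, rng):
    """Expert-level dropout: drop each expert independently with drop_prob."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(rank)]

# Toy example (assumed values): rank 2, d_out = 3, d_in = 2.
b_cols = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
a_rows = [[1.0, 2.0], [0.0, 4.0]]

full = masked_lora_delta(b_cols, a_rows, [1, 1])        # equals the usual B @ A
only_first = masked_lora_delta(b_cols, a_rows, [1, 0])  # second expert masked
mask = sample_expert_mask(rank=2, drop_prob=0.5, rng=random.Random(0))
```

With an all-ones mask the sum reduces to the ordinary LoRA product, so masking only changes which rank-1 terms receive gradient on a given step.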
2405.18852
Report |
LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping |
Nikhil Gosala, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Paulo Drews-Jr, Wolfram Burgard, Abhinav Valada |
Semantic Bird's Eye View (BEV) maps offer a rich representation with strong
occlusion reasoning for various decision making tasks in autonomous driving.
However, most BEV mapping approaches employ a fully supervised learning
paradigm that relies on large amounts of human-annotated BEV ground truth data.
In this work, we address this limitation by proposing the first unsupervised
representation learning approach to generate semantic BEV maps from a monocular
frontal view (FV) image in a label-efficient manner. Our approach pretrains the
network to independently reason about scene geometry and scene semantics using
two disjoint neural pathways in an unsupervised manner and then finetunes it
for the task of semantic BEV mapping using only a small fraction of labels in
the BEV. We achieve label-free pretraining by exploiting spatial and temporal
consistency of FV images to learn scene geometry while relying on a novel
temporal masked autoencoder formulation to encode the scene representation.
Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that
our approach performs on par with the existing state-of-the-art approaches
while using only 1% of BEV labels and no additional labeled data. |
LetsMap is the first unsupervised representation learning framework to predict semantic Bird's Eye View (BEV) maps from monocular front-view images in a label-efficient manner. |
Semantic BEV maps are essential for autonomous driving but most current approaches rely heavily on large, annotated datasets which are difficult and time-consuming to create. |
The framework uses two disentangled neural pathways: one for scene geometry modeling using implicit fields and another for scene representation learning using a novel temporal masked autoencoder. These pathways are pretrained in an unsupervised manner and then fine-tuned for semantic BEV mapping using a small fraction of labeled data. |
LetsMap outperforms most existing fully-supervised and self-supervised methods on KITTI-360 using only 1% of BEV labels.
On the nuScenes dataset, LetsMap achieves comparable performance to most fully-supervised baselines despite the challenge of dynamic scenes.
Ablation studies demonstrate the contributions of individual components, including the importance of pretraining and the effectiveness of the temporal masked autoencoder. |
The implicit field formulation assumes a static scene, limiting performance in highly dynamic environments.
The reliance on photometric loss for supervision makes the model sensitive to varying lighting and occlusions. |
unsupervised representation learning, semantic bev mapping, scene understanding, autonomous driving, label-efficient learning |
2405.18842
Report |
Descriptive Image Quality Assessment in the Wild |
Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Tianfan Xue, Chao Dong |
With the rapid advancement of Vision Language Models (VLMs), VLM-based Image
Quality Assessment (IQA) seeks to describe image quality linguistically to
align with human expression and capture the multifaceted nature of IQA tasks.
However, current methods are still far from practical usage. First, prior works
focus narrowly on specific sub-tasks or settings, which do not align with
diverse real-world applications. Second, their performance is sub-optimal due
to limitations in dataset coverage, scale, and quality. To overcome these
challenges, we introduce Depicted image Quality Assessment in the Wild
(DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that
encompasses both assessment and comparison tasks, brief and detailed responses,
full-reference and non-reference scenarios. We introduce a
ground-truth-informed dataset construction approach to enhance data quality,
and scale up the dataset to 495K under the brief-detail joint framework.
Consequently, we construct a comprehensive, large-scale, and high-quality
dataset, named DQ-495K. We also retain image resolution during training to
better handle resolution-related quality issues, and estimate a confidence
score that is helpful to filter out low-quality responses. Experimental results
demonstrate that DepictQA-Wild significantly outperforms traditional
score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in
distortion identification, instant rating, and reasoning tasks. Our advantages
are further confirmed by real-world applications including assessing the
web-downloaded images and ranking model-processed images. Datasets and code
will be released at https://depictqa.github.io/depictqa-wild/. |
This paper introduces DepictQA-Wild, a multi-functional VLM-based Image Quality Assessment (IQA) model that handles a wide range of IQA tasks and overcomes limitations of previous models in functionality and performance. |
Existing VLM-based IQA models are limited to specific sub-tasks and exhibit sub-optimal performance due to limitations in dataset coverage, scale, and quality. DepictQA-Wild addresses these limitations to provide a more practical and versatile IQA solution. |
The authors define a multi-functional IQA task paradigm encompassing assessment and comparison, brief and detailed responses, and full-reference and non-reference scenarios. They construct a large-scale, high-quality dataset, DQ-495K, using a ground-truth-informed generation approach. The model is trained while retaining image resolution and incorporates confidence estimation. |
DepictQA-Wild significantly outperforms traditional score-based IQA methods, prior VLM-based IQA models, and GPT-4V in various tasks, including distortion identification, instant rating, and reasoning.
The model shows strong generalization ability, achieving high accuracy even in out-of-distribution settings.
DepictQA-Wild demonstrates its practicality in real-world applications, such as assessing web-downloaded images and ranking model-processed images. |
The model's fine-grained abilities requiring high-level perception skills need further improvement.
The task paradigm can be extended to include comparisons among images with different contents and incorporate image aesthetics. |
image quality assessment, vision language models, multi-functional iqa, large-scale dataset, deep learning |
2405.18840
Report |
Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation |
Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yaoming Wang, Lingxi Xie, Qi Tian, Wei Shen |
Open-vocabulary semantic segmentation seeks to label each pixel in an image
with arbitrary text descriptions. Vision-language foundation models, especially
CLIP, have recently emerged as powerful tools for acquiring open-vocabulary
capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction
ability often suffers from three issues: 1) high computational cost, 2) misalignment
between the two inherent modalities of CLIP, and 3) degraded generalization
ability on unseen categories. To address these issues, we propose H-CLIP, a
symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in
hyperspherical space for both of the two CLIP modalities. Specifically, the
PEFT strategy is achieved by a series of efficient block-diagonal learnable
transformation matrices and a dual cross-relation communication module among
all learnable matrices. Since the PEFT strategy is conducted symmetrically to
the two CLIP modalities, the misalignment between them is mitigated.
Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder
according to the hyperspherical energy principle, i.e., minimizing
hyperspherical energy during fine-tuning preserves the intrinsic structure of
the original parameter space, to prevent the destruction of the generalization
ability offered by the CLIP text encoder. Extensive evaluations across various
benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic
segmentation results while only requiring updating approximately 4% of the
total parameters of CLIP. |
This paper introduces H-CLIP, a symmetric parameter-efficient fine-tuning (PEFT) strategy for CLIP, enhancing open-vocabulary semantic segmentation by addressing limitations of existing methods. |
Fine-tuning CLIP for pixel-level prediction often leads to high computational costs, misalignment between CLIP's modalities, and reduced generalization ability on unseen categories. H-CLIP aims to tackle these challenges. |
H-CLIP utilizes a partial orthogonal fine-tuning strategy in hyperspherical space, employing block-diagonal learnable transformation matrices. Orthogonal constraints are applied to CLIP's text encoder to preserve generalization. A dual cross-relation communication module facilitates alignment between modalities and layers. |
H-CLIP achieves state-of-the-art open-vocabulary semantic segmentation results on multiple benchmarks.
It achieves this while only updating approximately 4% of CLIP's total parameters, demonstrating its efficiency.
Ablation studies confirm the individual contributions of partial orthogonal fine-tuning and dual cross-relation communication. |
The performance of H-CLIP is still dependent on the design of the block dimension.
Further exploration of more effective communication mechanisms within H-CLIP is a potential avenue for improvement. |
open-vocabulary semantic segmentation, clip, parameter-efficient fine-tuning, hyperspherical energy, dual cross-relation communication |
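The H-CLIP record above relies on block-diagonal transformation matrices and an orthogonality constraint. The sketch below only illustrates the two properties that make this attractive. It is not H-CLIP itself: the 2x2 rotation blocks, the angles, and the toy weight vectors are assumptions, and the dual cross-relation communication module is not modeled. An orthogonal transform applied to all weight vectors leaves their pairwise angles unchanged, and a block-diagonal layout needs far fewer parameters than a dense matrix.

```python
import math

def rotation_block(theta):
    """A 2x2 rotation matrix, the simplest orthogonal block."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def apply_block_diagonal(blocks, vec):
    """Multiply vec by a block-diagonal matrix without materializing it."""
    out, offset = [], 0
    for block in blocks:
        size = len(block)
        chunk = vec[offset:offset + size]
        for row in block:
            out.append(sum(r * x for r, x in zip(row, chunk)))
        offset += size
    return out

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two hypothetical weight vectors in a 4-dimensional space.
blocks = [rotation_block(0.3), rotation_block(-1.1)]
w1 = [1.0, 2.0, 0.5, -1.0]
w2 = [0.0, 1.0, 3.0, 2.0]

before = cosine(w1, w2)
after = cosine(apply_block_diagonal(blocks, w1), apply_block_diagonal(blocks, w2))

dense_params = 4 * 4                              # full 4x4 transform
block_params = sum(len(b) ** 2 for b in blocks)   # two 2x2 blocks
```

Because `before` and `after` agree up to rounding, any quantity defined through pairwise angles between weight vectors, such as hyperspherical energy, is unaffected by this kind of transform.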
2405.18831
Report |
Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks |
Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis |
As interest in "reformulating" the 3D Visual Question Answering (VQA) problem
in the context of foundation models grows, it is imperative to assess how these
new paradigms influence existing closed-vocabulary datasets. In this case
study, we evaluate the zero-shot performance of foundational models (GPT-4
Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and
ScanQA. We provide an investigation to contextualize the performance of
GPT-based agents relative to traditional modeling approaches. We find that
GPT-based agents without any fine-tuning perform on par with the closed
vocabulary approaches. Our findings corroborate recent results that "blind"
models establish a surprisingly strong baseline in closed-vocabulary settings.
We demonstrate that agents benefit significantly from scene-specific vocabulary
via in-context textual grounding. By presenting a preliminary comparison with
previous baselines, we hope to inform the community's ongoing efforts to refine
multi-modal 3D benchmarks. |
This paper presents a case study evaluating the zero-shot performance of GPT-4 Vision and GPT-4 on established 3D VQA benchmarks (3D-VQA and ScanQA) to understand how these foundational models impact existing closed-vocabulary datasets. |
With growing interest in adapting 3D VQA for foundation models, it's crucial to understand how these models perform on existing benchmarks and how they compare to traditional approaches. |
The study uses GPT-4V for captioning scene meshes, GPT-4 Turbo to answer questions based on these captions, and compares their performance to existing baselines on ScanQA and 3D-VQA. They investigate different captioning schemes (open-vocabulary and vocabulary-grounded) and analyze the impact of different parameters like frame sample rate and batch size. |
Finetuning-free GPT agents perform surprisingly well, achieving scores within 10% of meticulously crafted DNN-based baselines on ScanQA.
Blind GPT agents (without visual input) demonstrate surprisingly robust performance, highlighting the power of language priors and 'common sense'.
GPT-4V benefits significantly from scene-specific vocabulary during captioning, indicating the importance of grounded language descriptions. |
The study primarily focuses on zero-shot performance and doesn't explore finetuning GPT models on these specific datasets.
While the study analyzes the impact of several parameters, more comprehensive exploration of prompt engineering and visual grounding techniques could further improve results. |
3d visual question answering, gpt-4, gpt-4 vision, foundation models, zero-shot learning |
2405.18801
Report |
SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation |
Zhenbei Wu, Qiang Wang, Jie Yang |
The scarcity of free-hand sketches presents a challenging problem. Despite the
emergence of some large-scale sketch datasets, these datasets primarily consist
of sketches at the single-object level. There continues to be a lack of
large-scale paired datasets for scene sketches. In this paper, we propose a
self-supervised method for scene sketch generation that does not rely on any
existing scene sketch, enabling the transformation of single-object sketches
into scene sketches. To accomplish this, we introduce a method for vector
sketch captioning and sketch semantic expansion. Additionally, we design a
sketch generation network that incorporates a fusion of multi-modal perceptual
constraints, suitable for application in zero-shot image-to-sketch downstream
task, demonstrating state-of-the-art performance through experimental
validation. Finally, leveraging our proposed sketch-to-sketch generation
method, we contribute a large-scale dataset centered around scene sketches,
comprising highly semantically consistent "text-sketch-image" triplets. Our
research confirms that this dataset can significantly enhance the capabilities
of existing models in sketch-based image retrieval and sketch-controlled image
synthesis tasks. We will make our dataset and code publicly available. |
This paper proposes a self-supervised method for generating scene sketches from single-object sketches, without relying on existing scene sketch datasets. |
Scene sketches are crucial for understanding human visual comprehension and fine-grained design, but current datasets are limited. This method addresses the scarcity of scene sketch data. |
The method uses vector sketch captioning to extract semantic information from single-object sketches, expands it using a large image description dataset, and then generates scene sketches using a multi-modal fusion approach with text, image, and sketch constraints. |
The method successfully generates scene sketches from single-object sketches, outperforming existing methods in zero-shot image-to-sketch generation.
A large-scale dataset "SketchTriplet" is created, containing 1,000,000 "text-sketch-image" triplets with high semantic consistency.
Retraining existing models with SketchTriplet significantly improves performance in sketch-based image retrieval and sketch-controlled image synthesis tasks. |
The current method doesn't offer control over transparency in the generated sketches.
The generated sketches are limited to a single style. |
scene sketch generation, sketch-to-sketch, self-supervised learning, multi-modal fusion, dataset creation |
2405.18784
Report |
LP-3DGS: Learning to Prune 3D Gaussian Splatting |
Zhaoliang Zhang, Tianchen Song, Yongjae Lee, Li Yang, Cheng Peng, Rama Chellappa, Deliang Fan |
Recently, 3D Gaussian Splatting (3DGS) has become one of the mainstream
methodologies for novel view synthesis (NVS) due to its high quality and fast
rendering speed. However, as a point-based scene representation, 3DGS
potentially generates a large number of Gaussians to fit the scene, leading to
high memory usage. Improvements that have been proposed require either an
empirical and preset pruning ratio or importance score threshold to prune the
point cloud. Such a hyperparameter requires multiple rounds of training to
optimize and achieve the maximum pruning ratio, while maintaining the rendering
quality for each scene. In this work, we propose learning-to-prune 3DGS
(LP-3DGS), where a trainable binary mask is applied to the importance score
that can find optimal pruning ratio automatically. Instead of using the
traditional straight-through estimator (STE) method to approximate the binary
mask gradient, we redesign the masking function to leverage the Gumbel-Sigmoid
method, making it differentiable and compatible with the existing training
process of 3DGS. Extensive experiments have shown that LP-3DGS consistently
produces a good balance that is both efficient and high quality. |
This paper proposes LP-3DGS, a method for learning to prune Gaussian points in 3D Gaussian Splatting (3DGS) for novel view synthesis. |
Existing 3DGS pruning techniques require manual tuning of the pruning ratio, which is time-consuming and potentially suboptimal. LP-3DGS aims to automate this process and find the optimal pruning ratio for each scene. |
LP-3DGS utilizes a trainable binary mask, activated by the Gumbel-Sigmoid function, to determine which Gaussians to prune. This mask is applied to existing importance scores or directly to Gaussian parameters. The method integrates this mask learning into the 3DGS training process. |
LP-3DGS automatically finds optimal pruning ratios for various scenes, eliminating the need for manual parameter sweeping.
The method achieves comparable or better rendering quality with significantly smaller model sizes compared to baselines.
LP-3DGS, using a Gumbel-Sigmoid activated mask, outperforms STE-based mask techniques in terms of pruning ratio and rendering quality. |
The final rendering quality depends on the effectiveness of the chosen importance score.
Future work could explore alternative importance metrics or combine multiple metrics for better pruning. |
novel view synthesis, 3d gaussian splatting, model compression, pruning, gumbel-sigmoid |
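The Gumbel-Sigmoid masking mentioned in the LP-3DGS record above can be illustrated with a small standalone sketch. It uses the standard Gumbel-Sigmoid relaxation (a sigmoid over a logit perturbed by the difference of two Gumbel samples, divided by a temperature). The temperature, the 0.5 keep threshold, the toy scores, and the way the soft mask multiplies the importance score are assumptions for illustration, not the paper's exact formulation.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample; u is clamped to keep the logs finite."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_sigmoid(logit, tau, rng):
    """Soft, differentiable stand-in for a binary mask entry, in (0, 1)."""
    noise = sample_gumbel(rng) - sample_gumbel(rng)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))

def soft_prune(scores, logits, tau=0.5, threshold=0.5, seed=0):
    """Apply a soft mask to importance scores and derive keep/prune flags."""
    rng = random.Random(seed)
    soft = [gumbel_sigmoid(l, tau, rng) for l in logits]
    masked_scores = [s * m for s, m in zip(scores, soft)]
    keep = [m > threshold for m in soft]
    return soft, masked_scores, keep

# Hypothetical importance scores and mask logits for three Gaussians.
scores = [0.9, 0.2, 0.5]
logits = [100.0, -100.0, 0.0]
soft, masked_scores, keep = soft_prune(scores, logits)
```

Strongly positive or negative logits give a mask that is effectively 1 or 0 regardless of the noise, while a logit near zero leaves the decision to the sampled noise. In a real training loop the logits would be learned parameters and the soft mask would carry gradients.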
2405.18762
Report |
Inpaint Biases: A Pathway to Accurate and Unbiased Image Generation |
Jiyoon Myung, Jihyeon Park |
This paper examines the limitations of advanced text-to-image models in
accurately rendering unconventional concepts which are scarcely represented or
absent in their training datasets. We identify how these limitations not only
confine the creative potential of these models but also pose risks of
reinforcing stereotypes. To address these challenges, we introduce the Inpaint
Biases framework, which employs user-defined masks and inpainting techniques to
enhance the accuracy of image generation, particularly for novel or
inaccurately rendered objects. Through experimental validation, we demonstrate
how this framework significantly improves the fidelity of generated images to
the user's intent, thereby expanding the models' creative capabilities and
mitigating the risk of perpetuating biases. Our study contributes to the
advancement of text-to-image models as unbiased, versatile tools for creative
expression. |
This paper introduces the Inpaint Biases framework to improve the accuracy of text-to-image models in rendering unconventional concepts. |
Current text-to-image models struggle to depict concepts not well-represented in their training data, limiting creativity and potentially reinforcing stereotypes. |
The framework utilizes user-defined masks, the Segment Anything Model (SAM) for segmentation, Large Language Models (LLMs) for prompt refinement, and inpainting techniques to correct specific areas of generated images. |
The framework successfully rendered unconventional concepts like a chocolate river and a polka-dotted cat.
Quantitative analysis using CLIP scores confirmed improved alignment between inpainted images and the desired prompts.
The framework demonstrates potential in mitigating bias and enhancing the creative capacity of text-to-image models. |
The framework currently requires user intervention for mask generation, limiting its autonomy.
Future research could explore automated bias detection and correction by the model itself. |
text-to-image synthesis, bias mitigation, inpainting, generative ai, segment anything model (sam) |
2405.18750
Report |
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback |
Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang |
Diffusion-based text-to-video (T2V) models have achieved significant success
but continue to be hampered by the slow sampling speed of their iterative
sampling processes. To address the challenge, consistency models have been
proposed to facilitate fast inference, albeit at the cost of sample quality. In
this work, we aim to break the quality bottleneck of a video consistency model
(VCM) to achieve both fast and high-quality video generation. We
introduce T2V-Turbo, which integrates feedback from a mixture of differentiable
reward models into the consistency distillation (CD) process of a pre-trained
T2V model. Notably, we directly optimize rewards associated with single-step
generations that arise naturally from computing the CD loss, effectively
bypassing the memory constraints imposed by backpropagating gradients through
an iterative sampling process. Remarkably, the 4-step generations from our
T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and
Pika. We further conduct human evaluations to corroborate the results,
validating that the 4-step generations from our T2V-Turbo are preferred over
the 50-step DDIM samples from their teacher models, representing more than a
tenfold acceleration while improving video generation quality. |
The paper introduces T2V-Turbo, a text-to-video model that integrates reward feedback from a mixture of differentiable reward models, including a video-text model, during consistency distillation to achieve both fast and high-quality video generation. |
This work addresses the limitations of existing diffusion-based text-to-video models, which are often slow and struggle to align with human preferences. |
T2V-Turbo leverages reward feedback from an image-text reward model and a video-text reward model during the consistency distillation process, optimizing single-step generations to improve visual quality and text-video alignment. |
Achieves state-of-the-art results on the VBench benchmark with only 4 inference steps, surpassing even proprietary models like Gen-2 and Pika.
Human evaluations show a preference for 4-step T2V-Turbo generations over 50-step samples from teacher models, indicating significant acceleration and quality improvement.
Ablation studies demonstrate the importance of both image-text and video-text reward models in enhancing video generation. |
Limited availability of open-sourced video-text reward models specifically trained to reflect human preferences.
Potential for misuse of realistic synthetic videos, requiring safeguards and ethical guidelines for responsible development and deployment. |
text-to-video generation, diffusion models, consistency distillation, reward models, human evaluation |
2405.18715
Report |
NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild |
Weining Ren, Zihan Zhu, Boyang Sun, Jiaqi Chen, Marc Pollefeys, Songyou Peng |
Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing
photorealistic views from multi-view images of static scenes, but face
challenges in dynamic, real-world environments with distractors like moving
objects, shadows, and lighting changes. Existing methods manage controlled
environments and low occlusion ratios but fall short in render quality,
especially under high occlusion scenarios. In this paper, we introduce NeRF
On-the-go, a simple yet effective approach that enables the robust synthesis of
novel views in complex, in-the-wild scenes from only casually captured image
sequences. Delving into uncertainty, our method not only efficiently eliminates
distractors, even when they are predominant in captures, but also achieves a
notably faster convergence speed. Through comprehensive experiments on various
scenes, our method demonstrates a significant improvement over state-of-the-art
techniques. This advancement opens new avenues for NeRF in diverse and dynamic
real-world applications. |
This paper introduces NeRF On-the-go, a method for robustly synthesizing novel views from casually captured images in dynamic scenes by effectively removing distractors. |
Existing NeRF methods struggle with dynamic, real-world environments containing distractors (moving objects, changing lighting, etc.), limiting their practical applications. |
The method leverages pre-trained DINOv2 features for uncertainty prediction, utilizes a structural similarity loss to enhance uncertainty optimization, and incorporates the predicted uncertainty into a decoupled NeRF training strategy. |
NeRF On-the-go achieves high-fidelity novel view synthesis even in complex, in-the-wild scenes with varying distractor ratios.
The method significantly outperforms state-of-the-art techniques on both synthetic and real-world datasets.
NeRF On-the-go demonstrates significantly faster convergence speed compared to prior art. |
The method faces challenges in predicting accurate uncertainty for regions with strong view-dependent effects.
The performance degrades with sparse training views. |
neural radiance fields, novel view synthesis, distractor removal, uncertainty estimation, dinov2 features |
2405.18679
Report |
Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain |
Juntao Zhang, Kun Bian, Peng Cheng, Wenbo An, Jianning Liu, Jun Zhou |
In recent years, State Space Models (SSMs) with efficient hardware-aware
designs, known as the Mamba deep learning models, have made significant
progress in modeling long sequences such as language understanding. Therefore,
building efficient and general-purpose visual backbones based on SSMs is a
promising direction. Compared to traditional convolutional neural networks
(CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM)
methods is not yet fully competitive. To enable SSMs to process image data,
ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D
local dependencies, thereby weakening the model's ability to interpret spatial
relationships from a global perspective. We use Fast Fourier Transform (FFT) to
obtain the spectrum of the feature map and add it to the original feature map,
enabling ViM to model a unified visual representation in both frequency and
spatial domains. The introduction of frequency domain information enables ViM
to have a global receptive field during scanning. We propose a novel model
called Vim-F, which employs pure Mamba encoders and scans in both the frequency
and spatial domains. Moreover, we question the necessity of position embedding
in ViM and remove it accordingly in Vim-F, which helps to fully utilize the
efficient long-sequence modeling capability of ViM. Finally, we redesign a
patch embedding for Vim-F, leveraging a convolutional stem to capture more
local correlations, further improving the performance of Vim-F. Code is
available at: https://github.com/yws-wxs/Vim-F. |
This paper proposes Vim-F(H), a novel visual backbone based on State Space Models (SSMs) that incorporates frequency domain scanning and a hybrid patch embedding to enhance the model's ability to capture global spatial relationships and local dependencies. |
Vision Mamba (ViM) methods, while promising for modeling long sequences, are not yet fully competitive with traditional CNNs and ViTs due to their limitations in processing 2D image data and capturing global spatial relationships. |
The authors introduce frequency domain scanning using Fast Fourier Transform (FFT) to provide a global receptive field during scanning. Additionally, they design a hybrid patch embedding with overlapping and non-overlapping convolutions for better capturing local correlations. These improvements are implemented based on the Vim model, resulting in Vim-F(H). |
Vim-F(H) significantly outperforms the baseline Vim model on ImageNet-1K classification, achieving 1.3% and 0.8% higher accuracy for Vim-Ti-F(H) and Vim-S-F(H) respectively.
The frequency domain scanning effectively reduces the model's reliance on positional embeddings while maintaining a global receptive field.
Vim-F(H) achieves competitive results compared to advanced CNNs, ViTs, and ViMs on object detection and instance segmentation tasks using Mask R-CNN on the COCO dataset. |
The effectiveness of the proposed method for ViMs with hybrid encoders has not been fully studied.
Further investigation is needed to explore more complex spatial relationships in the frequency domain. |
vision mamba, state space models, frequency domain scanning, patch embedding, computer vision |
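As a rough illustration of the frequency-domain step described for Vim-F (taking the FFT spectrum of a feature map and adding it to the original feature map), here is a minimal NumPy sketch. It assumes the amplitude of an orthonormal 2D FFT is what gets added, with no learned weighting; the function name `add_amplitude_spectrum` is hypothetical and this is not the authors' implementation.

```python
import numpy as np


def add_amplitude_spectrum(feature_map: np.ndarray) -> np.ndarray:
    """Add the 2D FFT amplitude spectrum to a (C, H, W) feature map.

    Assumption: the amplitude (magnitude) of the spectrum is used and it is
    added with unit weight; the source only says the spectrum is added.
    """
    # FFT over the two spatial axes, channel by channel.
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1), norm="ortho")
    amplitude = np.abs(spectrum)  # real-valued, same shape as the input
    return feature_map + amplitude


if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=(3, 8, 8))
    y = add_amplitude_spectrum(x)
    print(y.shape)  # same shape as x
```

Every entry of the spectrum depends on all spatial positions, which is one way to read the claim that frequency information gives the scan a global receptive field.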
2405.18677
Report |
Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering |
Ido Sobol, Chenfeng Xu, Or Litany |
Generating realistic images from arbitrary views based on a single source
image remains a significant challenge in computer vision, with broad
applications ranging from e-commerce to immersive virtual experiences. Recent
advancements in diffusion models, particularly the Zero-1-to-3 model, have been
widely adopted for generating plausible views, videos, and 3D models. However,
these models still struggle with inconsistencies and implausibility in new
views generation, especially for challenging changes in viewpoint. In this
work, we propose Zero-to-Hero, a novel test-time approach that enhances view
synthesis by manipulating attention maps during the denoising process of
Zero-1-to-3. By drawing an analogy between the denoising process and stochastic
gradient descent (SGD), we implement a filtering mechanism that aggregates
attention maps, enhancing generation reliability and authenticity. This process
improves geometric consistency without requiring retraining or significant
computational resources. Additionally, we modify the self-attention mechanism
to integrate information from the source view, reducing shape distortions.
These processes are further supported by a specialized sampling schedule.
Experimental results demonstrate substantial improvements in fidelity and
consistency, validated on a diverse set of out-of-distribution objects. |
Zero-to-Hero is a test-time technique that addresses view synthesis artifacts in Zero-1-to-3 through attention map manipulation, enhancing realism and consistency. |
Generating realistic images from single source images at arbitrary views is challenging, and existing diffusion models like Zero-1-to-3 have limitations in generating plausible and consistent novel views. |
Draws an analogy between denoising and SGD, implementing an attention map filtering mechanism (iterative aggregation and averaging) for robust view generation, enhanced by mutual self-attention for shape guidance and a specialized sampling schedule. |
Substantial improvement in fidelity and consistency of generated novel views.
Significant improvement across appearance and shape evaluation metrics (PSNR, SSIM, LPIPS, IoU).
Robustness to random noise and ability to mitigate artifacts observed in the baseline model. |
Performance limited by the pre-trained model's capabilities.
Attention filtering, while enhancing realism, may limit generation diversity. |
novel view synthesis, diffusion models, attention mechanism, test-time refinement, computer vision |
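The attention map filtering described for Zero-to-Hero (aggregating and averaging attention maps) can be sketched as follows. This is a simplified illustration that assumes the maps come from several noisy passes at the same denoising step and are averaged with equal weight; the function names are hypothetical and the actual schedule used by the authors is not reproduced here.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def filtered_attention(score_samples: np.ndarray) -> np.ndarray:
    """Average attention maps over repeated passes.

    score_samples: (R, N, N) attention logits from R passes (assumed to
    differ only in the sampled noise). Returns one (N, N) attention map.
    """
    maps = softmax(score_samples, axis=-1)  # one map per pass
    return maps.mean(axis=0)                # equal-weight aggregation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(5, 4, 4))
    print(filtered_attention(scores).sum(axis=-1))  # each row sums to 1
```

Averaging row-normalized maps keeps each row a valid distribution, so the filtered map can replace a single-pass map without renormalizing.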
2405.18654
Report |
Mitigating Object Hallucination via Data Augmented Contrastive Tuning |
Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö. Arık, Tomas Pfister |
Despite their remarkable progress, Multimodal Large Language Models (MLLMs)
tend to hallucinate factually inaccurate information. In this work, we address
object hallucinations in MLLMs, where information is offered about an object
that is not present in the model input. We introduce a contrastive tuning
method that can be applied to a pretrained off-the-shelf MLLM for mitigating
hallucinations while preserving its general vision-language capabilities. For a
given factual token, we create a hallucinated token through generative data
augmentation by selectively altering the ground-truth information. The proposed
contrastive tuning is applied at the token level to improve the relative
likelihood of the factual token compared to the hallucinated one. Our thorough
evaluation confirms the effectiveness of contrastive tuning in mitigating
hallucination. Moreover, the proposed contrastive tuning is simple, fast, and
requires minimal training with no additional overhead at inference. |
Introduces a contrastive tuning method for mitigating object hallucinations in Multimodal Large Language Models (MLLMs) while preserving their general vision-language capabilities. |
Object hallucination, where MLLMs generate descriptions of objects not present in the input, hinders their reliability and widespread use. |
Generative data augmentation is used to create hallucinated responses by altering ground-truth objects. Contrastive tuning is then applied at the token level to improve the likelihood of factual tokens compared to hallucinated ones. A KL-divergence constraint ensures the MLLM retains its original performance in general vision-language tasks. |
HALVA substantially reduces hallucination in image descriptions compared to the base LLaVA model, matching or exceeding the performance of other methods.
HALVA significantly improves performance on discriminative tasks related to object attributes, presence, and relations, surpassing existing methods.
Contrastive tuning retains or improves the performance of the base LLaVA model on standard vision-language benchmarks, unlike other methods that degrade general task ability. |
The current work primarily focuses on mitigating object hallucinations. More research is needed to address other forms of hallucinations in MLLMs.
Future work includes generalizing the proposed generative data augmentation and contrastive tuning to other foundation models with accessible weights. |
multimodal large language models, hallucination mitigation, contrastive tuning, generative data augmentation, vision-language tasks |
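To make the token-level contrastive idea from the HALVA report concrete, the sketch below scores one factual token against one hallucinated token and adds a KL term toward the original model. The exact loss used in the paper is not given in this report, so the formula here (negative log of the factual probability relative to the pair) and all function names are assumptions for illustration only.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def contrastive_token_loss(logits: np.ndarray, factual_id: int,
                           hallucinated_id: int) -> float:
    """Assumed form: -log(p_fact / (p_fact + p_hall)) at one position."""
    logp = log_softmax(logits)
    lf, lh = logp[factual_id], logp[hallucinated_id]
    return float(np.logaddexp(lf, lh) - lf)


def kl_divergence(ref_logits: np.ndarray, tuned_logits: np.ndarray) -> float:
    """KL(ref || tuned); keeps the tuned model close to the base model."""
    ref_logp = log_softmax(ref_logits)
    tuned_logp = log_softmax(tuned_logits)
    return float(np.sum(np.exp(ref_logp) * (ref_logp - tuned_logp)))


if __name__ == "__main__":
    base = np.array([2.0, 2.0, 0.0])   # factual and hallucinated tied
    tuned = np.array([4.0, 2.0, 0.0])  # factual token now preferred
    print(contrastive_token_loss(base, 0, 1),
          contrastive_token_loss(tuned, 0, 1),
          kl_divergence(base, tuned))
```

A training objective would combine the two terms with a weighting coefficient, which is left out here because the report does not state it.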
2405.18616
Report |
Wavelet-Based Image Tokenizer for Vision Transformers |
Zhenhai Zhu, Radu Soricut |
Non-overlapping patch-wise convolution is the default image tokenizer for all
state-of-the-art vision Transformer (ViT) models. Even though many ViT variants
have been proposed to improve its efficiency and accuracy, little research on
improving the image tokenizer itself has been reported in the literature. In
this paper, we propose a new image tokenizer based on wavelet transformation.
We show that ViT models with the new tokenizer achieve both higher training
throughput and better top-1 precision for the ImageNet validation set. We
present a theoretical analysis on why the proposed tokenizer improves the
training throughput without any change to ViT model architecture. Our analysis
suggests that the new tokenizer can effectively handle high-resolution images
and is naturally resistant to adversarial attack. Furthermore, the proposed
image tokenizer offers a fresh perspective on important new research directions
for ViT-based model design, such as image tokens on a non-uniform grid for
image understanding. |
This paper proposes a novel image tokenizer for Vision Transformer (ViT) models based on wavelet transformation, replacing the conventional patch-wise convolution. |
This is crucial as it addresses the limitations of existing patch-convolution tokenizers in handling high-resolution images and their vulnerability to adversarial attacks. The proposed method offers higher efficiency and improved accuracy. |
The method leverages the wavelet transformation's ability to compress redundant image information. It introduces pixel-space token embedding using wavelet coefficients and utilizes block sparse projection to map them to semantically meaningful lower-dimensional embeddings. |
ViT models with the wavelet-based tokenizer achieve higher training throughput due to reduced embedding size and efficient handling of high-resolution images.
The models demonstrate better top-1 precision on the ImageNet validation set compared to those using patch-convolution tokenizers.
The inherent properties of wavelet transformation make the tokenizer naturally resistant to adversarial attacks. |
The paper primarily focuses on image classification, and further investigation is needed to evaluate the tokenizer's performance on other vision tasks like object detection and semantic segmentation.
Future work includes exploring the use of non-uniform image partitioning guided by the sparsity of wavelet coefficients to further enhance the tokenizer's efficiency. |
vision transformer, image tokenizer, wavelet transformation, image compression, adversarial robustness |
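The idea of building pixel-space tokens from wavelet coefficients, as described for the wavelet-based tokenizer, can be illustrated with a one-level 2D Haar transform. The report does not say which wavelet or how many levels the authors use, so Haar, the single level, and the names `haar2d` and `haar_tokens` are assumptions; the block sparse projection is not shown.

```python
import numpy as np


def haar2d(img: np.ndarray):
    """One-level orthonormal 2D Haar transform of an (H, W) image.

    H and W must be even. Returns four (H/2, W/2) arrays: the low-pass
    approximation and three detail bands.
    """
    a, b = img[0::2, 0::2], img[0::2, 1::2]  # top-left, top-right of each 2x2
    c, d = img[1::2, 0::2], img[1::2, 1::2]  # bottom-left, bottom-right
    approx = (a + b + c + d) / 2.0
    detail_cols = (a - b + c - d) / 2.0      # change across columns
    detail_rows = (a + b - c - d) / 2.0      # change across rows
    detail_diag = (a - b - c + d) / 2.0      # diagonal change
    return approx, detail_cols, detail_rows, detail_diag


def haar_tokens(img: np.ndarray) -> np.ndarray:
    """Stack the four bands into one 4-dim token per 2x2 pixel block."""
    bands = haar2d(img)
    return np.stack([band.ravel() for band in bands], axis=-1)


if __name__ == "__main__":
    image = np.random.default_rng(0).normal(size=(8, 8))
    print(haar_tokens(image).shape)  # (16, 4)
```

In smooth regions the detail bands are close to zero, which is the redundancy a wavelet tokenizer can exploit to shrink the embedding.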
2405.18525
Report |
REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment |
Haonan Han, Rui Yang, Huan Liao, Jiankai Xing, Zunnan Xu, Xiaoming Yu, Junwei Zha, Xiu Li, Wanhua Li |
Traditional image-to-3D models often struggle with scenes containing multiple
objects due to biases and occlusion complexities. To address this challenge, we
present REPARO, a novel approach for compositional 3D asset generation from
single images. REPARO employs a two-step process: first, it extracts individual
objects from the scene and reconstructs their 3D meshes using off-the-shelf
image-to-3D models; then, it optimizes the layout of these meshes through
differentiable rendering techniques, ensuring coherent scene composition. By
integrating optimal transport-based long-range appearance loss term and
high-level semantic loss term in the differentiable rendering, REPARO can
effectively recover the layout of 3D assets. The proposed method can
significantly enhance object independence, detail accuracy, and overall scene
coherence. Extensive evaluation of multi-object scenes demonstrates that our
REPARO offers a comprehensive approach to address the complexities of
multi-object 3D scene generation from single images. |
REPARO is a novel two-step approach for generating compositional 3D assets from single images by first reconstructing individual objects and then refining their layout through differentiable rendering with a long-range appearance loss and a high-level semantic loss. |
Existing image-to-3D models struggle to accurately reconstruct multi-object scenes due to center bias and occlusion complexities, making it challenging to generate interactive and realistic multi-object environments. |
REPARO first extracts and reconstructs 3D meshes for individual objects from a single image. Then, it uses differentiable rendering with optimal transport-based long-range appearance loss and high-level semantic loss to optimize the layout of these meshes, ensuring a coherent scene composition. |
REPARO significantly outperforms existing image-to-3D models in generating compositional 3D scenes, as demonstrated by quantitative metrics (CLIP score, PSNR, SSIM, LPIPS) on the GSO dataset.
The use of long-range appearance loss with optimal transport enables REPARO to effectively align the layout of reconstructed objects with the reference image.
A user study confirms that REPARO generates more realistic multi-object 3D assets compared to baseline models, as evidenced by its higher preference score. |
The evaluation of multi-object 3D assets reveals a discrepancy between quantitative and qualitative results, suggesting the need for improved evaluation methods in future research.
The current implementation of REPARO relies on pre-trained 2D foundation models for segmentation, inpainting, and depth estimation, which could potentially limit its generalization ability. |
3d scene generation, compositional 3d assets, differentiable rendering, layout alignment, optimal transport |
2405.18524
Report |
Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures |
Hongjun Wu, Li Xiao, Xingkuo Zhang, Yining Miao |
Knowledge distillation is commonly employed to compress neural networks,
reducing the inference costs and memory footprint. In the scenario of
homogenous architecture, feature-based methods have been widely validated for
their effectiveness. However, in scenarios where the teacher and student models
are of heterogeneous architectures, the inherent differences in feature
representation significantly degrade the performance of these methods. Recent
studies have highlighted that low-frequency components constitute the majority
of image features. Motivated by this, we propose a Low-Frequency
Components-based Contrastive Knowledge Distillation (LFCC) framework that
significantly enhances the performance of feature-based distillation between
heterogeneous architectures. Specifically, we design a set of multi-scale
low-pass filters to extract the low-frequency components of intermediate
features from both the teacher and student models, aligning them in a compact
space to overcome architectural disparities. Moreover, leveraging the intrinsic
pairing characteristic of the teacher-student framework, we design an
innovative sample-level contrastive learning framework that adeptly
restructures the constraints of within-sample feature similarity and
between-sample feature divergence into a contrastive learning task. This
strategy enables the student model to capitalize on intra-sample feature
congruence while simultaneously enhancing the discrimination of features among
disparate samples. Consequently, our LFCC framework accurately captures the
commonalities in feature representation across heterogeneous architectures.
Extensive evaluations and empirical analyses across three architectures (CNNs,
Transformers, and MLPs) demonstrate that LFCC achieves superior performance on
the challenging benchmarks of ImageNet-1K and CIFAR-100. All codes will be
publicly available. |
This paper proposes LFCC, a Low-Frequency Components-based Contrastive Knowledge Distillation framework to improve feature-based knowledge distillation in heterogeneous architectures. |
Feature-based distillation often underperforms in heterogeneous settings due to significant differences in feature representations between architectures. This limits the potential teacher-student pairings and hinders knowledge transfer. |
LFCC uses multi-scale low-pass filters to extract and align low-frequency components of teacher and student features in a compact space. It also employs sample-level contrastive learning to enhance feature discrimination between different samples. |
LFCC outperforms state-of-the-art methods on ImageNet-1K and CIFAR-100 for most teacher-student pairings.
The method effectively identifies commonalities in feature representations across diverse architectures.
Ablation studies confirm the contribution of each component in LFCC. |
Logit-based methods still outperform feature-based methods on small datasets like CIFAR-100, especially with Transformer or MLP students.
Future work could explore alternative low-pass filter designs and contrastive learning strategies. |
knowledge distillation, heterogeneous architectures, low-frequency components, contrastive learning, feature alignment |
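The two LFCC ingredients named in the report, low-pass filtering of features and sample-level contrastive learning, can be sketched in a few lines. This is a simplified stand-in: average pooling plays the role of the low-pass filter, an InfoNCE-style loss pairs each student sample with the same teacher sample, and teacher and student features are assumed to already share a shape. None of these choices is confirmed by the report, and the function names are hypothetical.

```python
import numpy as np


def low_pass(feat: np.ndarray, k: int) -> np.ndarray:
    """Average-pool a (B, C, H, W) feature map with a k x k window.

    Assumption: average pooling stands in for the paper's low-pass filters.
    H and W must be divisible by k.
    """
    b, c, h, w = feat.shape
    return feat.reshape(b, c, h // k, k, w // k, k).mean(axis=(3, 5))


def sample_contrastive_loss(student: np.ndarray, teacher: np.ndarray,
                            temperature: float = 0.1) -> float:
    """InfoNCE-style loss: sample i of the student should match sample i
    of the teacher and differ from the other samples in the batch."""
    s = student.reshape(len(student), -1)
    t = teacher.reshape(len(teacher), -1)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = s @ t.T / temperature                   # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8, 8, 8))
    student = teacher + 0.1 * rng.normal(size=teacher.shape)
    # Multi-scale: sum the loss over several pooling sizes.
    total = sum(sample_contrastive_loss(low_pass(student, k),
                                        low_pass(teacher, k))
                for k in (2, 4))
    print(total)
```

In a real heterogeneous setting the student and teacher features would first need a projection to a common shape, which this sketch leaves out.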
2405.18515
Report |
Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication |
Yunuo Chen, Tianyi Xie, Zeshun Zong, Xuan Li, Feng Gao, Yin Yang, Ying Nian Wu, Chenfanfu Jiang |
Existing diffusion-based text-to-3D generation methods primarily focus on
producing visually realistic shapes and appearances, often neglecting the
physical constraints necessary for downstream tasks. Generated models
frequently fail to maintain balance when placed in physics-based simulations or
3D printed. This balance is crucial for satisfying user design intentions in
interactive gaming, embodied AI, and robotics, where stable models are needed
for reliable interaction. Additionally, stable models ensure that 3D-printed
objects, such as figurines for home decoration, can stand on their own without
requiring additional supports. To fill this gap, we introduce Atlas3D, an
automatic and easy-to-implement method that enhances existing Score
Distillation Sampling (SDS)-based text-to-3D tools. Atlas3D ensures the
generation of self-supporting 3D models that adhere to physical laws of
stability under gravity, contact, and friction. Our approach combines a novel
differentiable simulation-based loss function with physically inspired
regularization, serving as either a refinement or a post-processing module for
existing frameworks. We verify Atlas3D's efficacy through extensive generation
tasks and validate the resulting 3D models in both simulated and real-world
environments. |
Atlas3D is a novel method that integrates physics-based constraints into existing text-to-3D generation models, enabling the generation of self-supporting 3D models suitable for simulation and 3D printing. |
Existing text-to-3D methods often neglect physical plausibility, resulting in models that lack standability. This is crucial for applications like gaming, robotics, and 3D printing where stability is essential. |
Atlas3D incorporates differentiable physics simulations and physically-inspired regularizations into the generation process. It leverages standability and stable equilibrium loss functions during training to ensure generated models are self-supporting. This method can be integrated into existing text-to-3D frameworks as a refinement or post-processing step. |
Atlas3D generates self-supporting 3D models that remain stable in physics simulations, outperforming baseline models in stability tests.
The generated models exhibit robustness to perturbations, successfully standing even with small initial rotations.
The method's effectiveness is validated through real-world 3D printing, with printed models demonstrating superior standability compared to those generated without physics constraints. |
The optimization process currently allows for a large degree of freedom in mesh vertex adjustments, potentially leading to undesirable distortions.
The current framework focuses on SDS-based methods. Future work could explore generalizing to non-SDS or non-diffusion based methods. |
text-to-3d generation, physics-based simulation, 3d printing, stable equilibrium, differentiable rendering |
2405.18428
Report |
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention |
Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang |
Diffusion models with large-scale pre-training have achieved significant
success in the field of visual content generation, particularly exemplified by
Diffusion Transformers (DiT). However, DiT models have faced challenges with
scalability and quadratic complexity efficiency. In this paper, we aim to
leverage the long sequence modeling capability of Gated Linear Attention (GLA)
Transformers, expanding its applicability to diffusion models. We introduce
Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable
solution with minimal parameter overhead, following the DiT design, but
offering superior efficiency and effectiveness. In addition to better
performance than DiT, DiG-S/2 exhibits 2.5x higher training speed than
DiT-S/2 and saves 75.7% GPU memory at a resolution of 1792 x 1792.
Moreover, we analyze the scalability of DiG across a variety of computational
complexity. DiG models, with increased depth/width or augmentation of input
tokens, consistently exhibit decreasing FID. We further compare DiG with other
subquadratic-time diffusion models. With the same model size, DiG-XL/2 is
4.2x faster than the recent Mamba-based diffusion model at a 1024
resolution, and is 1.8x faster than DiT with CUDA-optimized
FlashAttention-2 under the 2048 resolution. All these results demonstrate its
superior efficiency among the latest diffusion models. Code is released at
https://github.com/hustvl/DiG. |
Introduces Diffusion Gated Linear Attention Transformers (DiG), a more efficient and effective alternative to Diffusion Transformers (DiT) for visual content generation. |
Addresses the scalability and quadratic complexity limitations of existing Diffusion Transformer (DiT) models in image generation. |
Leverages the long sequence modeling capability of Gated Linear Attention (GLA) Transformers within a diffusion model framework, closely following the design of DiT. |
Achieves better performance than DiT with significantly faster training (2.5x) and reduced memory footprint (75.7% reduction).
Demonstrates strong scalability with consistent FID improvement as model depth/width or input tokens increase.
Outperforms other subquadratic-time diffusion models in terms of speed, being 4.2x faster than Mamba-based models and 1.8x faster than DiT with FlashAttention-2. |
Exploration of alternative linear attention mechanisms for further efficiency gains.
Investigating the application of DiG to other generative modeling tasks beyond image generation. |
diffusion models, image generation, gated linear attention, transformers, scalability |
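For readers unfamiliar with the gated linear attention that DiG builds on, the sketch below shows the recurrent form commonly used to describe it: a matrix-valued state is decayed by a per-step gate, updated with the outer product of key and value, and read out with the query. This is a plain sequential NumPy version for clarity, assuming a per-key-dimension gate; it is not the hardware-efficient kernel, and the function name is hypothetical.

```python
import numpy as np


def gated_linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                           alpha: np.ndarray) -> np.ndarray:
    """Causal gated linear attention in recurrent form.

    q, k:  (T, d_k) queries and keys
    v:     (T, d_v) values
    alpha: (T, d_k) gates in [0, 1], one per key dimension (assumed layout)
    Cost grows linearly with the sequence length T.
    """
    seq_len, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.empty((seq_len, d_v))
    for t in range(seq_len):
        # Decay the running state, then write the new key-value pair.
        state = alpha[t][:, None] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state  # read-out for position t
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
    v = rng.normal(size=(6, 3))
    gates = rng.uniform(0.5, 1.0, size=(6, 4))
    print(gated_linear_attention(q, k, v, gates).shape)  # (6, 3)
```

With all gates set to 1 the recurrence reduces to unnormalized causal linear attention, which is a convenient sanity check; the state has a fixed size, which is where the memory savings over quadratic attention come from.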
2405.18425
Report |
ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention |
Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang |
Recently, linear complexity sequence modeling networks have achieved modeling
capabilities similar to Vision Transformers on a variety of computer vision
tasks, while using fewer FLOPs and less memory. However, their advantage in
terms of actual runtime speed is not significant. To address this issue, we
introduce Gated Linear Attention (GLA) for vision, leveraging its superior
hardware-awareness and efficiency. We propose direction-wise gating to capture
1D global context through bidirectional modeling and a 2D gating locality
injection to adaptively inject 2D local details into 1D global context. Our
hardware-aware implementation further merges forward and backward scanning into
a single kernel, enhancing parallelism and reducing memory cost and latency.
The proposed model, ViG, offers a favorable trade-off in accuracy, parameters,
and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer
and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only
27% of the parameters and 20% of the FLOPs, running 2x faster on
224 x 224 images. At 1024 x 1024 resolution, ViG-T uses 5.2x
fewer FLOPs, saves 90% GPU memory, runs 4.8x faster, and achieves 20.7%
higher top-1 accuracy than DeiT-T. These results position ViG as an efficient
and scalable solution for visual representation learning. Code is available at
https://github.com/hustvl/ViG. |
This paper introduces ViG, a novel vision backbone network leveraging Gated Linear Attention (GLA) for efficient and accurate visual representation learning. |
Existing methods like Vision Transformers, while effective, suffer from quadratic complexity, hindering their application to high-resolution images. Linear complexity alternatives often lack global context or face practical efficiency challenges. ViG addresses these limitations by combining the efficiency of linear complexity with global receptive field capture. |
ViG introduces three key innovations: a) a Bidirectional Gated Linear Attention (BiGLA) layer for capturing global 1D context, b) a direction-wise gating mechanism within BiGLA to select context from different directions, and c) a 2D gating locality injection mechanism to integrate 2D local information. A hardware-aware implementation further boosts efficiency. |
ViG achieves superior accuracy and parameter efficiency compared to non-hierarchical and hierarchical models on ImageNet.
In downstream tasks like object detection and semantic segmentation, ViG consistently outperforms ViT and VRWKV with lower computational cost.
ViG exhibits superior resolution extrapolation capability, outperforming ViT, Vim, VRWKV, and ResNet50 in accuracy as image resolution increases. |
While ViG with hardware-aware implementation demonstrates efficiency improvements, it remains marginally slower than DeiT for small 224x224 images, requiring further optimization.
Future work will explore adapting and extending ViG for other vision tasks beyond classification, detection, and segmentation. |
vision transformer, gated linear attention, linear complexity, global receptive field, visual representation learning |
2405.18424
Report |
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting |
Qihang Zhang, Yinghao Xu, Chaoyang Wang, Hsin-Ying Lee, Gordon Wetzstein, Bolei Zhou, Ceyuan Yang |
Scene image editing is crucial for entertainment, photography, and
advertising design. Existing methods solely focus on either 2D individual
object or 3D global scene editing. This results in a lack of a unified approach
to effectively control and manipulate scenes at the 3D level with different
levels of granularity. In this work, we propose 3DitScene, a novel and unified
scene editing framework leveraging language-guided disentangled Gaussian
Splatting that enables seamless editing from 2D to 3D, allowing precise control
over scene composition and individual objects. We first incorporate 3D
Gaussians that are refined through generative priors and optimization
techniques. Language features from CLIP then introduce semantics into 3D
geometry for object disentanglement. With the disentangled Gaussians, 3DitScene
allows for manipulation at both the global and individual levels,
revolutionizing creative expression and empowering control over scenes and
objects. Experimental results demonstrate the effectiveness and versatility of
3DitScene in scene image editing. Code and online demo can be found at our
project homepage: https://zqh0253.github.io/3DitScene/. |
3DitScene is a novel scene editing framework leveraging language-guided disentangled Gaussian Splatting, enabling seamless 2D-to-3D editing and granular control over scene composition and objects. |
Existing scene editing methods are limited to either 2D object or 3D global scene manipulation, lacking a unified approach for precise control at different levels. |
3DitScene refines 3D Gaussians projected from a single image using generative priors and optimization, then distills CLIP language features for object disentanglement. |
Enables simultaneous 2D and 3D editing, including object manipulation and novel view synthesis.
Outperforms baselines in user studies and qualitative comparisons regarding editing flexibility and consistency.
Disentangled scene representation improves optimization by allowing object-level layout augmentation. |
Object manipulation evaluation is challenging due to varying coordinate systems across methods.
Further exploration of user interaction methods for intuitive scene manipulation. |
image editing, 3d scene generation, gaussian splatting, language guidance, scene disentanglement |
2405.18416
Report |
3D StreetUnveiler with Semantic-Aware 2DGS |
Jingwei Xu, Yikai Wang, Yiqun Zhao, Yanwei Fu, Shenghua Gao |
Unveiling an empty street from crowded observations captured by in-car
cameras is crucial for autonomous driving. However, removing all temporary
static objects, such as stopped vehicles and standing pedestrians, presents a
significant challenge. Unlike object-centric 3D inpainting, which relies on
thorough observation in a small scene, street scenes involve long trajectories
that differ from previous 3D inpainting tasks. The camera-centric moving
environment of captured videos further complicates the task due to the limited
degree and time duration of object observation. To address these obstacles, we
introduce StreetUnveiler to reconstruct an empty street. StreetUnveiler learns
a 3D representation of the empty street from crowded observations. Our
representation is based on the hard-label semantic 2D Gaussian Splatting (2DGS)
for its scalability and ability to identify Gaussians to be removed. We inpaint
rendered image after removing unwanted Gaussians to provide pseudo-labels and
subsequently re-optimize the 2DGS. Given its temporal continuous movement, we
divide the empty street scene into observed, partial-observed, and unobserved
regions, which we propose to locate through a rendered alpha map. This
decomposition helps us to minimize the regions that need to be inpainted. To
enhance the temporal consistency of the inpainting, we introduce a novel
time-reversal framework to inpaint frames in reverse order and use later frames
as references for earlier frames to fully utilize the long-trajectory
observations. Our experiments conducted on the street scene dataset
successfully reconstructed a 3D representation of the empty street. The mesh
representation of the empty street can be extracted for further applications.
Project page and more visualizations can be found at:
https://streetunveiler.github.io |
StreetUnveiler is a novel method for reconstructing an empty 3D street scene from in-car camera videos by removing temporary static objects like cars and pedestrians. |
Reconstructing empty streets matters for autonomous driving because it enables realistic simulation of empty street environments. The task is seldom studied because of the challenges in handling long camera trajectories and the lack of ground-truth data for training. |
Uses hard-label semantic 2D Gaussian Splatting (2DGS) for scene representation and proposes a time-reversal inpainting framework to maintain consistency across different viewpoints in long video sequences. |
Achieves accurate removal of static objects from street scenes, reconstructing empty environments with high fidelity.
Outperforms existing 3D inpainting methods in terms of appearance quality, as measured by LPIPS and FID scores.
Successfully extracts a clean and realistic mesh of the empty street scene using TSDF fusion. |
Relies on the accuracy of the 2D semantic segmentation model for reliable object removal.
Computational cost grows linearly with the number of video frames due to the per-frame inpainting process. |
3d scene reconstruction, street scene understanding, object removal, gaussian splatting, time-reversal inpainting |
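The time-reversal inpainting order described for StreetUnveiler (inpaint frames in reverse and use later frames as references for earlier ones) can be sketched as a small driver loop. The `inpaint` callable is a hypothetical placeholder for whatever reference-guided 2D inpainting model is used; only the ordering logic is illustrated, and using the single next frame as the reference is an assumption.

```python
from typing import Callable, List, Optional, TypeVar

Frame = TypeVar("Frame")


def inpaint_in_reverse(
    frames: List[Frame],
    inpaint: Callable[[Frame, Optional[Frame]], Frame],
) -> List[Frame]:
    """Inpaint frames from last to first.

    Each frame is inpainted with the already-inpainted next frame as its
    reference; the final frame has no reference (None).
    """
    results: List[Optional[Frame]] = [None] * len(frames)
    reference: Optional[Frame] = None
    for i in range(len(frames) - 1, -1, -1):
        results[i] = inpaint(frames[i], reference)
        reference = results[i]  # later frame guides the earlier one
    return results  # returned in the original temporal order


if __name__ == "__main__":
    def fake_inpaint(frame: str, ref: Optional[str]) -> str:
        # Stand-in for a real model: just tag the frame as processed.
        return frame.upper()

    print(inpaint_in_reverse(["f0", "f1", "f2"], fake_inpaint))
```

Because each step depends on the result of the following frame, the loop is inherently sequential, which matches the report's note that cost grows linearly with the number of frames.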
2405.18415
Report |
Why are Visually-Grounded Language Models Bad at Image Classification? |
Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy |
Image classification is one of the most fundamental capabilities of machine
vision intelligence. In this work, we revisit the image classification task
using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We
find that existing proprietary and public VLMs, despite often using CLIP as a
vision encoder and having many more parameters, significantly underperform CLIP
on standard image classification benchmarks like ImageNet. To understand the
reason, we explore several hypotheses concerning the inference algorithms,
training objectives, and data processing in VLMs. Our analysis reveals that the
primary cause is data-related: critical information for image classification is
encoded in the VLM's latent space but can only be effectively decoded with
enough training data. Specifically, there is a strong correlation between the
frequency of class exposure during VLM training and instruction-tuning and the
VLM's performance in those classes; when trained with sufficient data, VLMs can
match the accuracy of state-of-the-art classification models. Based on these
findings, we enhance a VLM by integrating classification-focused datasets into
its training, and demonstrate that the enhanced classification performance of
the VLM transfers to its general capabilities, resulting in an improvement of
11.8% on the newly collected ImageWikiQA dataset. |
This paper investigates the use of visually-grounded language models (VLMs) for image classification and finds that they significantly underperform compared to specialized image classifiers like CLIP, despite often using CLIP as a vision encoder. |
Image classification is a fundamental aspect of machine vision, and understanding why VLMs struggle with this task is crucial for improving their overall visual intelligence and enabling them to tackle more complex visual tasks like visual question answering. |
The authors evaluate various VLMs on standard image classification benchmarks and analyze their performance in different settings, exploring hypotheses related to inference algorithms, training objectives, and data used during VLM training. |
VLMs exhibit significantly lower accuracy in image classification compared to CLIP models, even when provided with class names as context.
The primary cause of poor classification performance in VLMs is attributed to the training data, specifically the insufficient exposure to diverse classes and lack of classification-focused data.
Integrating classification-focused datasets into VLM training enhances both their classification accuracy and general capabilities, leading to improved performance on tasks like visual question answering. |
The study is limited by the computational cost of evaluating all possible VLMs and datasets, focusing on two representative VLM architectures and four datasets.
Future work could explore zero-shot methods to decode the classification information encoded in the VLM's latent space without extensive fine-tuning. |
visually-grounded language models, image classification, vlm analysis, data-centric ai, visual question answering |
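The report above (2405.18415) describes a strong correlation between how often a class appears in VLM training data and the VLM's accuracy on that class. A minimal sketch of how such a correlation could be measured is shown below. The frequency and accuracy values are hypothetical toy numbers, not results from the paper, and using Pearson correlation on log-frequency is an assumption made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: how often each class occurs in the training set,
# and the model's accuracy on that class. Not taken from the paper.
class_frequency = [3, 10, 50, 200, 1000]
class_accuracy = [0.10, 0.25, 0.40, 0.70, 0.90]

# Class frequencies span orders of magnitude, so correlate on a log scale.
log_frequency = [math.log10(f) for f in class_frequency]
r = pearson(log_frequency, class_accuracy)
print(f"correlation between log-frequency and accuracy: {r:.3f}")
```

A value of `r` close to 1 on real per-class statistics would be consistent with the data-centric explanation given in the report.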
2405.18407
Report |
Phased Consistency Model |
Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Hongsheng Li, Xiaogang Wang |
The consistency model (CM) has recently made significant progress in
accelerating the generation of diffusion models. However, its application to
high-resolution, text-conditioned image generation in the latent space (a.k.a.,
LCM) remains unsatisfactory. In this paper, we identify three key flaws in the
current design of LCM. We investigate the reasons behind these limitations and
propose the Phased Consistency Model (PCM), which generalizes the design space
and addresses all identified limitations. Our evaluations demonstrate that PCM
significantly outperforms LCM across 1-16 step generation settings. While PCM
is specifically designed for multi-step refinement, its 1-step generation
results are superior or comparable to those of previous state-of-the-art
methods designed specifically for 1-step generation. Furthermore, we show that PCM's
methodology is versatile and applicable to video generation, enabling us to
train the state-of-the-art few-step text-to-video generator. More details are
available at https://g-u-n.github.io/projects/pcm/. |
This paper proposes Phased Consistency Model (PCM), generalizing the design of Latent Consistency Models (LCM) to accelerate high-resolution text-conditioned image and video generation in latent diffusion models. |
LCMs, aimed at accelerating diffusion model generation, are limited in quality and efficiency for high-resolution, text-conditioned synthesis. PCM tackles these limitations. |
PCM phases the ODE trajectory into sub-trajectories, enforcing self-consistency within each, allowing for deterministic sampling. It removes CFG from distillation to improve controllability and introduces an adversarial loss to enhance low-step generation. |
PCM significantly outperforms LCM across 1-16 step generation settings, achieving state-of-the-art few-step generation.
PCM achieves superior or comparable 1-step generation quality compared to specialized 1-step methods.
PCM's methodology is successfully applied to video generation, resulting in state-of-the-art few-step text-to-video synthesis. |
While improved, generation quality can be unstable at very low step counts (e.g., one-step).
Future work includes exploring architectural improvements for enhanced efficiency and control. |
generative models, diffusion models, consistency models, text-to-image synthesis, text-to-video synthesis |
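The PCM report above (2405.18407) describes splitting the ODE trajectory into sub-trajectories and enforcing self-consistency within each one. The sketch below illustrates only the bookkeeping behind that idea: which phase a timestep belongs to and which edge it should map to. Equal-width phases, the 1000-step range, and all function names are assumptions for illustration, not details confirmed by the report.

```python
def phase_edges(num_train_steps, num_phases):
    """Split [0, num_train_steps] into equal-width phases; return ascending edges."""
    if num_phases < 1 or num_train_steps % num_phases != 0:
        raise ValueError("num_train_steps must be a positive multiple of num_phases")
    width = num_train_steps // num_phases
    return [k * width for k in range(num_phases + 1)]


def phase_target(t, edges):
    """Return the lower edge of the phase that contains timestep t.

    Within one phase, every noisy state is trained to map to the state at
    this edge, rather than all the way to timestep 0.
    """
    for lo, hi in zip(edges, edges[1:]):
        if lo < t <= hi:
            return lo
    raise ValueError("t must satisfy edges[0] < t <= edges[-1]")


def sampling_schedule(edges):
    """Deterministic multi-step sampling visits the edges from noisy to clean."""
    return list(reversed(edges))


edges = phase_edges(1000, 4)
print(edges)                      # phase boundaries
print(phase_target(999, edges))   # target edge for a very noisy timestep
print(sampling_schedule(edges))   # one model call per phase
```

With a single phase this reduces to mapping every timestep directly to 0, which is the ordinary consistency-model setting that the report says PCM generalizes.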
2405.18406
Report |
RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives |
Jaehong Yoon, Shoubin Yu, Mohit Bansal |
Recent video generative models primarily rely on carefully written text
prompts for specific tasks, like inpainting or style editing. They require
labor-intensive textual descriptions for input videos, hindering their
flexibility to adapt personal/raw videos to user specifications. This paper
proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video
generative framework that supports multiple video editing capabilities such as
removal, addition, and modification, through a unified pipeline. RACCooN
consists of two principal stages: Video-to-Paragraph (V2P) and
Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video
scenes in well-structured natural language, capturing both the holistic context
and focused object details. Subsequently, in the P2V stage, users can
optionally refine these descriptions to guide the video diffusion model,
enabling various modifications to the input video, such as removing, changing
subjects, and/or adding new objects. The proposed approach stands out from
other methods through several significant contributions: (1) RACCooN suggests a
multi-granular spatiotemporal pooling strategy to generate well-structured
video descriptions, capturing both the broad context and object details without
requiring complex human annotations, simplifying precise video content editing
based on text for users. (2) Our video generative model incorporates
auto-generated narratives or instructions to enhance the quality and accuracy
of the generated content. It supports the addition of video objects,
inpainting, and attribute modification within a unified framework, surpassing
existing video editing and inpainting benchmarks. The proposed framework
demonstrates impressive versatile capabilities in video-to-paragraph
generation, video content editing, and can be incorporated into other SoTA
video generative models for further enhancement. |
Presents RACCooN, a user-friendly video-to-paragraph-to-video framework enabling video content editing (removal, addition, modification) via auto-generated narratives, eliminating the need for manual text prompts. |
Existing video editing models require labor-intensive textual descriptions of videos, limiting flexibility and user-friendliness for personal video editing. |
Two stages: (1) Video-to-Paragraph (V2P): Uses a multimodal LLM with multi-granular spatiotemporal pooling to generate detailed video descriptions capturing holistic context and object details. (2) Paragraph-to-Video (P2V): Leverages user-modified auto-generated descriptions to guide a video diffusion model for editing (adding, removing, changing objects) via inpainting. |
Achieves up to 9.4% improvement in human evaluations for V2P compared to baselines, demonstrating superior video description quality.
Outperforms previous video editing methods with a relative 49.7% improvement in FVD, indicating better video quality and adherence to textual instructions.
Demonstrates the ability to enhance state-of-the-art video generation models by providing detailed auto-generated prompts. |
Performance depends on the quality of employed pre-trained backbones (LLM, inpainting model, video diffusion model).
Potential for inaccuracies or hallucinations in generated text outputs, inheriting biases from training data. |
video editing, video generation, video captioning, multimodal learning, large language models |
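The RACCooN report above (2405.18406) quotes a relative 49.7% improvement in FVD. Because FVD is a lower-is-better metric, a relative improvement is the reduction divided by the baseline score. The sketch below shows that arithmetic; the two scores are placeholder numbers chosen to produce 49.7%, not the paper's actual FVD values.

```python
def relative_improvement(baseline, ours, lower_is_better=True):
    """Fractional improvement of `ours` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    gain = (baseline - ours) if lower_is_better else (ours - baseline)
    return gain / baseline


# Placeholder scores for illustration only (not from the paper).
baseline_fvd = 200.0
our_fvd = 100.6
print(f"{relative_improvement(baseline_fvd, our_fvd):.1%}")
```

For higher-is-better metrics such as human preference rates, passing `lower_is_better=False` flips the sign of the gain.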
2405.18361
Report |
Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? |
Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang |
Rapid advancements in Autonomous Driving (AD) tasks have driven a significant
shift toward end-to-end approaches, particularly in the utilization of
vision-language models (VLMs) that integrate robust logical reasoning and
cognitive abilities to enable comprehensive end-to-end planning. However, these
VLM-based approaches tend to integrate 2D vision tokenizers and a large
language model (LLM) for ego-car planning, which lack 3D geometric priors as a
cornerstone of reliable planning. Naturally, this observation raises a critical
concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our
evaluation of current VLM-based methods across 3D object detection, vectorized
map construction, and environmental captioning suggests that the answer is,
unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable
autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D
tokenizers, which connect LLM with a one-layer linear projector. This simple
yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D
physical world, enabling it to simultaneously process high-resolution
multi-view images and employ spatiotemporal modeling. Despite its simplicity,
Atlas demonstrates superior performance in both 3D detection and ego planning
tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable
autonomous driving. The code and datasets will be released. |
This paper introduces Atlas, a 3D-tokenized Large Language Model (LLM) framework for reliable autonomous driving, which addresses the limitations of 2D-tokenized LLMs in accurately perceiving 3D environments. |
Accurately perceiving the 3D environment is crucial for reliable autonomous driving planning. Existing VLM-based approaches often rely on 2D tokenizers, which lack the inherent 3D geometric priors. |
The authors replace 2D vision tokenizers with DETR-style 3D perception models (StreamPETR and TopoMLP) as 3D tokenizers, connecting them to an LLM (Vicuna) via a linear projector. The model is evaluated on the nuScenes dataset for various tasks like 3D detection, lane detection, and planning. |
2D-tokenized LLMs show significantly lower performance than task-specific models in 3D perception tasks like object detection and lane detection.
Atlas, with 3D tokenizers, achieves superior performance in both 3D perception and open-loop planning, surpassing state-of-the-art BEV-based methods.
The study highlights the importance of 3D priors, resolution, and temporal modeling in autonomous driving, demonstrating the effectiveness of using 3D tokenizers in VLM-based approaches. |
The model is only evaluated on the open-loop nuScenes dataset and needs further testing in closed-loop environments.
The paper lacks direct performance comparison with other VLM-based AD methods due to the unavailability of their code. |
autonomous driving, vision-language models, 3d perception, motion planning, large language models |
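The Atlas report above (2405.18361) connects the 3D tokenizers to the LLM through a one-layer linear projector. The sketch below shows what such a projector computes: each query token from the 3D perception model is mapped by one affine transform into the LLM's embedding width. The dimensions, weights, and function name are hypothetical, and plain Python lists stand in for tensors.

```python
def linear_project(tokens, weight, bias):
    """Apply y = W x + b to every token.

    tokens: list of N vectors, each of length d_in (3D query features)
    weight: d_out rows, each of length d_in
    bias:   vector of length d_out
    Returns N vectors of length d_out (LLM embedding width).
    """
    d_in = len(weight[0])
    projected = []
    for token in tokens:
        if len(token) != d_in:
            raise ValueError("token width does not match weight matrix")
        projected.append(
            [sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(weight, bias)]
        )
    return projected


# Toy example: two 3-dimensional tokens projected to 4 dimensions.
tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
weight = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
bias = [0, 0, 0, 0.5]
for vec in linear_project(tokens, weight, bias):
    print(vec)
```

Keeping the connector to a single linear layer means the 3D tokenizer has to supply the geometric structure, which matches the report's emphasis on 3D priors.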
2405.18326
Report |
VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers |
Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang |
Video try-on stands as a promising area for its tremendous real-world
potential. Prior works are limited to transferring product clothing images onto
person videos with simple poses and backgrounds, while underperforming on
casually captured videos. Recently, Sora revealed the scalability of Diffusion
Transformer (DiT) in generating lifelike videos featuring real-world scenarios.
Inspired by this, we explore and propose the first DiT-based video try-on
framework for practical in-the-wild applications, named VITON-DiT.
Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal
denoising DiT, and an identity preservation ControlNet. To faithfully recover
the clothing details, the extracted garment features are fused with the
self-attention outputs of the denoising DiT and the ControlNet. We also
introduce novel random selection strategies during training and an Interpolated
Auto-Regressive (IAR) technique at inference to facilitate long video
generation. Unlike existing attempts that require the laborious and restrictive
construction of a paired training dataset, severely limiting their scalability,
VITON-DiT alleviates this by relying solely on unpaired human dance videos and
a carefully designed multi-stage training strategy. Furthermore, we curate a
challenging benchmark dataset to evaluate the performance of casual video
try-on. Extensive experiments demonstrate the superiority of VITON-DiT in
generating spatio-temporal consistent try-on results for in-the-wild videos
with complicated human poses. |
This paper proposes VITON-DiT, the first Diffusion Transformer (DiT)-based video try-on network capable of generating temporally consistent try-on videos in real-world scenarios with complex poses and backgrounds. |
Existing video try-on methods are limited to product images and short video generation, and they struggle with complex scenes. This work leverages the power of DiT for realistic and scalable video try-on. |
VITON-DiT integrates a spatio-temporal denoising DiT, a garment extractor, and an ID ControlNet connected by an attention fusion mechanism. It also employs a random selection training strategy and an Interpolated Auto-Regressive (IAR) technique for long video generation. |
VITON-DiT outperforms previous state-of-the-art methods in terms of spatio-temporal consistency on a challenging benchmark dataset.
The proposed attention fusion mechanism effectively preserves garment details during video generation.
The model demonstrates strong data scalability, with performance improving as the quantity and quality of unpaired training data increase. |
While demonstrating strong performance on complex scenes, VITON-DiT's quantitative scores on product clothing images are slightly lower than some baselines trained on similar paired datasets.
The computational cost of DiT-based models remains high, presenting challenges for real-time applications. |
video try-on, diffusion models, diffusion transformers, unpaired learning, computer vision |
2405.18304
Report |
Multi-modal Generation via Cross-Modal In-Context Learning |
Amandeep Kumar, Muzammal Naseer, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal |
In this work, we study the problem of generating novel images from complex
multimodal prompt sequences. While existing methods achieve promising results
for text-to-image generation, they often struggle to capture fine-grained
details from lengthy prompts and maintain contextual coherence within prompt
sequences. Moreover, they often result in misaligned image generation for
prompt sequences featuring multiple objects. To address this, we propose a
Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that
generates novel images from complex multimodal prompt sequences by leveraging
the combined capabilities of large language models (LLMs) and diffusion models.
Our MGCC comprises a novel Cross-Modal Refinement module to explicitly learn
cross-modal dependencies between the text and image in the LLM embedding space,
and a contextual object grounding module to generate object bounding boxes
specifically targeting scenes with multiple objects. Our MGCC demonstrates a
diverse range of multimodal capabilities, like novel image generation, the
facilitation of multimodal dialogue, and generation of texts. Experimental
evaluations on two benchmark datasets demonstrate the effectiveness of our
method. On the Visual Story Generation (VIST) dataset with multimodal inputs, our
MGCC achieves a CLIP Similarity score of 0.652, compared to 0.641 for the SOTA GILL.
Similarly, on Visual Dialogue Context (VisDial), which has lengthy dialogue
sequences, our MGCC achieves an impressive CLIP score of 0.660, largely
outperforming the existing SOTA method, which scores 0.645. Code:
https://github.com/VIROBO-15/MGCC |
This paper introduces MGCC, a novel method for generating images from complex multimodal prompt sequences, addressing limitations of existing text-to-image models in capturing fine-grained details and maintaining contextual coherence. |
Existing methods struggle to generate accurate images from lengthy prompts or sequences, particularly in capturing fine-grained details and maintaining context, especially with multiple objects. MGCC aims to overcome these limitations. |
MGCC leverages LLMs and diffusion models with two key components: a Cross-Modal Refinement Module to learn cross-modal dependencies in LLM embedding space, and a contextual object grounding module to generate bounding boxes for precise object control in generated images. |
MGCC achieves state-of-the-art performance on VIST and VisDial datasets, demonstrating its ability to handle lengthy, complex multimodal prompts.
The Cross-Modal Refinement Module significantly improves image quality and alignment with prompts by learning cross-modal dependencies.
The contextual object grounding module enhances object details and count accuracy in generated images. |
The model's performance with very short dialogues needs improvement.
Future work includes exploring alternative prompting strategies for contextual object grounding to further enhance control and flexibility. |
multimodal generation, cross-modal learning, in-context learning, object grounding, text-to-image synthesis |
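The MGCC report above (2405.18304) compares methods by CLIP similarity scores such as 0.652 versus 0.641. Assuming that score is the cosine similarity between the CLIP embedding of a generated image and that of its reference image, averaged over the dataset, it could be computed as sketched below. The vectors are toy stand-ins, not real CLIP embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def mean_similarity(generated, reference):
    """Average pairwise similarity over a dataset of embedding pairs."""
    scores = [cosine_similarity(g, r) for g, r in zip(generated, reference)]
    return sum(scores) / len(scores)


# Toy embeddings standing in for CLIP image features.
generated = [[1.0, 0.0], [1.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 0.0]]
print(round(mean_similarity(generated, reference), 3))
```

Because the score is bounded by 1, differences in the second decimal place, like those quoted in the report, are typical of this metric.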
2405.18295
Report |
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention |
Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan |
In real-life scenarios, humans seek out objects in the 3D world to fulfill
their daily needs or intentions. This inspires us to introduce 3D intention
grounding, a new task in 3D object detection employing RGB-D, based on human
intention, such as "I want something to support my back". Closely related, 3D
visual grounding focuses on understanding human reference. To achieve detection
based on human intention, it relies on humans to observe the scene, reason out
the target that aligns with their intention ("pillow" in this case), and
finally provide a reference to the AI system, such as "A pillow on the couch".
Instead, 3D intention grounding challenges AI agents to automatically observe,
reason and detect the desired target solely based on human intention. To tackle
this challenge, we introduce the new Intent3D dataset, consisting of 44,990
intention texts associated with 209 fine-grained classes from 1,042 scenes of
the ScanNet dataset. We also establish several baselines based on different
language-based 3D object detection models on our benchmark. Finally, we propose
IntentNet, our unique approach, designed to tackle this intention-based
detection problem. It focuses on three key aspects: intention understanding,
reasoning to identify object candidates, and cascaded adaptive learning that
leverages the intrinsic priority logic of different losses for multiple
objective optimization. |
This paper introduces 3D Intention Grounding (3D-IG), a new task for detecting desired objects in 3D scenes using human intention expressed in free-form text, moving beyond traditional 3D visual grounding reliant on specific object references. |
3D-IG addresses the limitations of existing 3D object detection methods that rely on explicit object references, aiming to enable AI agents to automatically reason and detect targets based solely on human intention, crucial for real-world scenarios where providing specific instructions might be challenging. |
The authors create a new dataset, Intent3D, containing 44,990 intention texts linked to 209 object classes from 1,042 ScanNet scenes. They establish baselines using existing language-based 3D object detection methods and propose a novel method, IntentNet, incorporating candidate box matching, verb-object alignment, and cascaded adaptive learning for improved intention understanding and object detection. |
IntentNet significantly outperforms all baseline methods on the Intent3D benchmark, demonstrating the effectiveness of its components in understanding intention and detecting targets.
Existing expert models, primarily designed for referential language, struggle with the nuanced nature of intention language.
LLM-based models, while showing potential, currently face challenges in 3D visual grounding and exhibit limitations due to hallucinations and data scarcity. |
The reliance on GPT-4 for intention text generation introduces potential subjectivity based on its training data and limits the scalability of dataset creation.
The current work focuses on single-intention scenarios. Future work could explore grounding multiple intentions within a single scene, increasing task complexity. |
3d object detection, intention grounding, visual grounding, 3d vision, language and vision |
2405.18172
Report |
AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario |
Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, Bingbing Ni |
While image-based virtual try-on has made significant strides, emerging
approaches still fall short of delivering high-fidelity and robust fitting
images across various scenarios, as their models suffer from issues of
ill-fitted garment styles and quality degradation during the training process,
not to mention the lack of support for various combinations of attire.
Therefore, we first propose a lightweight, scalable operator known as Hydra
Block for attire combinations. This is achieved through a parallel attention
mechanism that facilitates the feature injection of multiple garments from
conditionally encoded branches into the main network. Secondly, to
significantly enhance the model's robustness and expressiveness in real-world
scenarios, we evolve its potential across diverse settings by synthesizing the
residuals of multiple models, as well as implementing a mask region boost
strategy to overcome the instability caused by information leakage in existing
models. Equipped with the above design, AnyFit surpasses all baselines on
high-resolution benchmarks and real-world data by a large gap, excelling in
producing well-fitting garments replete with photorealistic and rich details.
Furthermore, AnyFit's impressive performance on high-fidelity virtual try-ons
in any scenario from any image paves a new path for future research within the
fashion community. |
Introduces AnyFit, a novel image-based virtual try-on method that excels in generating high-fidelity, robust outfit combinations across diverse scenarios. |
Existing VTON methods fall short in producing realistic and detailed try-on images, especially for multiple garments and complex real-world scenes. |
AnyFit introduces HydraNet with parallelized attention for multi-garment encoding and employs Prior Model Evolution (merging weights of multiple pre-trained models) and Adaptive Mask Boost (mask augmentation and adaptive elongation) for enhanced robustness. |
AnyFit significantly outperforms previous state-of-the-art methods on benchmarks like VITON-HD and DressCode, as well as on a challenging proprietary dataset.
HydraNet enables accurate and scalable multi-garment try-ons, effectively handling transitions between garments.
Prior Model Evolution and Adaptive Mask Boost significantly improve the robustness of the generated try-on images, particularly in complex real-world scenarios. |
AnyFit may exhibit instability in generating complex hand structures, reflecting limitations of the underlying text-to-image model.
Text-based control of try-on style, while showing promise, remains an area for further development. |
virtual try-on, vton, diffusion models, image generation, multi-condition generation |
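The AnyFit report above (2405.18172) describes Prior Model Evolution as merging the weights of multiple pre-trained models, and the abstract speaks of synthesizing the residuals of multiple models. One common way to do this kind of merge is to add scaled weight differences to a base model, sketched below. Whether AnyFit uses exactly this rule is not stated in the report, so the formula, coefficients, and names here are assumptions.

```python
def merge_residuals(base, variants, coefficients):
    """Return base + sum_i alpha_i * (variant_i - base), per parameter.

    base:         dict mapping parameter name -> list of floats
    variants:     list of dicts with the same keys and shapes as `base`
    coefficients: one scaling factor alpha_i per variant
    """
    if len(variants) != len(coefficients):
        raise ValueError("need one coefficient per variant model")
    merged = {}
    for name, base_values in base.items():
        merged_values = list(base_values)  # copy so `base` stays untouched
        for variant, alpha in zip(variants, coefficients):
            for i, (v, b) in enumerate(zip(variant[name], base_values)):
                merged_values[i] += alpha * (v - b)
        merged[name] = merged_values
    return merged


# Toy weights: a base model and two fine-tuned variants (hypothetical values).
base = {"w": [1.0, 2.0]}
variant_a = {"w": [2.0, 2.0]}
variant_b = {"w": [1.0, 4.0]}
print(merge_residuals(base, [variant_a, variant_b], [0.5, 0.5]))
```

No training is needed for a merge like this, which is why it is attractive as an initialization step before fine-tuning.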
2405.18163
Report |
NegGS: Negative Gaussian Splatting |
Artur Kasymov, Bartosz Czekaj, Marcin Mazur, Jacek Tabor, Przemysław Spurek |
One of the key advantages of 3D rendering is its ability to simulate
intricate scenes accurately. One of the most widely used methods for this
purpose is Gaussian Splatting, a novel approach that is known for its rapid
training and inference capabilities. In essence, Gaussian Splatting involves
incorporating data about the 3D objects of interest into a series of Gaussian
distributions, each of which can then be depicted in 3D in a manner analogous
to traditional meshes. It is regrettable that the use of Gaussians in Gaussian
Splatting is currently somewhat restrictive due to their perceived linear
nature. In practice, 3D objects are often composed of complex curves and highly
nonlinear structures. This issue can to some extent be alleviated by employing
a multitude of Gaussian components to reflect the complex, nonlinear structures
accurately. However, this approach results in a considerable increase in time
complexity. This paper introduces the concept of negative Gaussians, which are
interpreted as items with negative colors. The rationale behind this approach
is based on the density distribution created by dividing the probability
density functions (PDFs) of two Gaussians, which we refer to as Diff-Gaussian.
Such a distribution can be used to approximate structures such as donut and
moon-shaped datasets. Experimental findings indicate that the application of
these techniques enhances the modeling of high-frequency elements with rapid
color transitions. Additionally, it improves the representation of shadows. To
the best of our knowledge, this is the first paper to extend the simple
ellipsoid shapes of Gaussian Splatting to more complex nonlinear structures. |
Introduces Negative Gaussian Splatting (NegGS), using "negative Gaussians" with negative colors to represent complex 3D scenes, enhancing detail and shadow representation in Gaussian Splatting. |
Addresses limitations of Gaussian Splatting in modeling intricate curves and non-linear structures present in real-world 3D objects, enhancing rendering quality, particularly in scenes with small elements and varying lighting. |
Extends the Gaussian Splatting algorithm by incorporating negative Gaussians into the optimization process, allowing for more complex shapes by strategically canceling out portions of positive Gaussians. |
NegGS achieves superior rendering quality compared to existing methods on datasets with complex lighting and small details (e.g., Tanks and Temples).
It effectively models high-frequency elements with rapid color and light transitions, as seen in results on synthetic datasets.
The method accurately approximates shadows, particularly for smaller elements, thanks to the use of negative Gaussian components. |
While effective for specific regions with complex details, NegGS yields comparable results to Gaussian Splatting on simpler shapes.
The study doesn't directly employ Diff-Gaussian distributions, instead integrating negative Gaussians separately, leaving room for further exploration of direct Diff-Gaussian implementation. |
3d rendering, gaussian splatting, negative gaussians, diff-gaussian distribution, shadow rendering |
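The NegGS report above (2405.18163) says negative Gaussians let the model form more complex shapes by canceling out portions of positive Gaussians, including donut-like structures. The toy sketch below illustrates that cancellation with a radial profile: a wide positive Gaussian minus a narrower negative one at the same center leaves a ring. This is only a radial intensity illustration with assumed widths, not the paper's renderer or its Diff-Gaussian definition.

```python
import math


def gaussian(r, sigma):
    """Unnormalized isotropic Gaussian evaluated at distance r from its center."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def ring_profile(r, sigma_pos=1.0, sigma_neg=0.5):
    """Wide positive Gaussian plus a narrow Gaussian with negative weight."""
    return gaussian(r, sigma_pos) - gaussian(r, sigma_neg)


# The profile is zero at the center, rises to a peak, then decays: a donut.
for r in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"r={r:.1f}  value={ring_profile(r):.4f}")
```

Approximating the same ring with positive Gaussians alone would take many small components placed around the circle, which is the cost the abstract points to.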
2405.18156
Report |
VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation |
Qilin Wang, Zhengkai Jiang, Chengming Xu, Jiangning Zhang, Yabiao Wang, Xinyi Zhang, Yun Cao, Weijian Cao, Chengjie Wang, Yanwei Fu |
Human image animation involves generating a video from a static image by
following a specified pose sequence. Current approaches typically adopt a
multi-stage pipeline that separately learns appearance and motion, which often
leads to appearance degradation and temporal inconsistencies. To address these
issues, we propose VividPose, an innovative end-to-end pipeline based on Stable
Video Diffusion (SVD) that ensures superior temporal stability. To enhance the
retention of human identity, we propose an identity-aware appearance controller
that integrates additional facial information without compromising other
appearance details such as clothing texture and background. This approach
ensures that the generated videos maintain high fidelity to the identity of the
human subject, preserving key facial features across various poses. To
accommodate diverse human body shapes and hand movements, we introduce a
geometry-aware pose controller that utilizes both dense rendering maps from
SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and
shape in the generated videos, providing a robust framework capable of handling
a wide range of body shapes and dynamic hand movements. Extensive qualitative
and quantitative experiments on the UBCFashion and TikTok benchmarks
demonstrate that our method achieves state-of-the-art performance. Furthermore,
VividPose exhibits superior generalization capabilities on our proposed
in-the-wild dataset. Codes and models will be available. |
Introduces VividPose, a novel end-to-end human image animation pipeline based on Stable Video Diffusion (SVD) that enhances temporal consistency and handles diverse body shapes and hand movements. |
Existing methods often lead to appearance degradation, temporal inconsistencies, and shape misalignment in generated videos. |
VividPose leverages SVD with an identity-aware appearance controller (integrating facial information for identity retention) and a geometry-aware pose controller (using dense rendering maps from SMPL-X and sparse skeleton maps for accurate pose and shape alignment). |
VividPose achieves state-of-the-art results in temporal consistency, visual fidelity, and generalization ability on UBCFashion and TikTok benchmarks.
The identity-aware appearance controller significantly improves facial identity retention during animation.
The geometry-aware pose controller ensures accurate body shape generation and effectively handles complex hand movements. |
The reliance on pretrained models like SVD and SMPL-X may limit the flexibility in handling novel or highly stylized human appearances.
Future work includes exploring more efficient training and inference strategies to enhance the practical applicability of VividPose. |
human image animation, stable video diffusion, identity-aware appearance control, geometry-aware pose control, smpl-x |
2405.18132
Report |
EG4D: Explicit Generation of 4D Object without Score Distillation |
Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, Houqiang Li |
In recent years, the increasing demand for dynamic 3D assets in design and
gaming applications has given rise to powerful generative pipelines capable of
synthesizing high-quality 4D objects. Previous methods generally rely on the score
distillation sampling (SDS) algorithm to infer the unseen views and motion of
4D objects, thus leading to unsatisfactory results with defects like
over-saturation and the Janus problem. Therefore, inspired by recent progress of
video diffusion models, we propose to optimize a 4D representation by
explicitly generating multi-view videos from one input image. However, it is
far from trivial to handle practical challenges faced by such a pipeline,
including dramatic temporal inconsistency, inter-frame geometry and texture
diversity, and semantic defects brought by video generation results. To address
these issues, we propose EG4D, a novel multi-stage framework that generates
high-quality and consistent 4D assets without score distillation. Specifically,
collaborative techniques and solutions are developed, including an attention
injection strategy to synthesize temporal-consistent multi-view videos, a
robust and efficient dynamic reconstruction method based on Gaussian Splatting,
and a refinement stage with diffusion prior for semantic restoration. The
qualitative results and user preference study demonstrate that our framework
outperforms the baselines in generation quality by a considerable margin. Code
will be released at https://github.com/jasongzy/EG4D. |
This paper proposes EG4D, a novel multi-stage framework that explicitly generates 4D videos from a single image and then reconstructs consistent and high-quality 4D assets without relying on score distillation sampling. |
Previous 4D generation methods suffer from issues like over-saturation and the Janus problem due to their reliance on score distillation sampling. EG4D overcomes these limitations by leveraging the power of video diffusion models for explicit 4D video generation and reconstruction. |
EG4D employs a three-stage pipeline: 1) View and Dynamic Generation: utilizes Stable Video Diffusion (SVD) and SV3D with an attention injection mechanism to generate temporally consistent multi-view videos. 2) Coarse Reconstruction: optimizes a 4D Gaussian Splatting (4D-GS) representation with color transformation to address texture inconsistencies. 3) Diffusion Refinement: leverages image-to-image diffusion models to enhance semantic details and refine the 4D representation. |
EG4D generates 4D assets with superior image-4D alignment and more realistic 3D appearance compared to baselines.
Quantitative results demonstrate that EG4D achieves the highest CLIP-I score, indicating higher semantic similarity between rendered images and the reference image.
User study confirms an overwhelming preference for 4D objects generated by EG4D, highlighting its advantage in overall quality, view consistency, 3D appearance, and motion realism. |
Limited capability of base image-to-video models and the consistency-motion trade-off in attention injection restrict the generation of high-dynamic motions.
Inaccurate camera pose conditioning in the multi-view diffusion model impacts reconstruction quality. Future work can explore advanced video diffusion models and adaptive camera pose techniques. |
4d generation, video diffusion models, gaussian splatting, attention injection, diffusion refinement |
2405.18029
Report |
Are Image Distributions Indistinguishable to Humans Indistinguishable to Classifiers? |
Zebin You, Xinyu Zhang, Hanzhong Guo, Jingdong Wang, Chongxuan Li |
The ultimate goal of generative models is to characterize the data
distribution perfectly. For image generation, common metrics of visual quality
(e.g., FID), and the truthlikeness of generated images to the human eyes seem
to suggest that we are close to achieving it. However, through distribution
classification tasks, we find that, in the eyes of classifiers parameterized by
neural networks, the strongest diffusion models are still far from this goal.
Specifically, classifiers consistently and effortlessly distinguish between
real and generated images in various settings. Further, we observe an
intriguing discrepancy: classifiers can identify differences between diffusion
models with similar performance (e.g., U-ViT-H vs. DiT-XL), but struggle to
differentiate between the smallest and largest models in the same family (e.g.,
EDM2-XS vs. EDM2-XXL), whereas humans exhibit the opposite tendency. As an
explanation, our comprehensive empirical study suggests that, unlike humans,
classifiers tend to classify images through edge and high-frequency components.
We believe that our methodology can serve as a probe to understand how
generative models work and inspire further thought on how existing models can
be improved and how the abuse of such models can be prevented. |
This paper investigates the discrepancy between the perceived high quality of images generated by diffusion models and their actual distribution mismatch with real images, as revealed by neural network classifiers. |
This work is important because it challenges the assumption that low FID scores and visually appealing results equate to accurate distribution learning in generative models. |
The authors propose "distribution classification tasks" where classifiers are trained to distinguish between real images and those generated by various diffusion models. They analyze classification accuracy across different model architectures, dataset combinations, cropping strategies, and frequency components. |
Classifiers consistently achieve high accuracy in distinguishing real from generated images across various settings, even with limited training data and when using self-supervised features.
Classifiers are more sensitive to inductive biases of different diffusion models than humans, excelling at distinguishing models with similar FID scores but different architectures, while struggling with models within the same family.
Classifiers primarily rely on edge information and high-frequency components for classification, maintaining high accuracy even when only a small portion of the image or specific frequency bands are available. |
Findings are based on specific datasets and model architectures, potentially limiting generalizability.
The paper might unintentionally encourage the development of more sophisticated image generation techniques that could be misused for creating harder-to-detect deepfakes. |
diffusion models, generative models, image generation, distribution classification, frequency analysis |
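The frequency analysis mentioned in the entry above (classifiers staying accurate when only specific frequency bands are available) depends on splitting a signal into low- and high-frequency parts. The following is a minimal sketch of such a split on a single image row, using a naive discrete Fourier transform from the Python standard library. The function names, the `cutoff` parameter, and the hard band boundary are assumptions for illustration; the entry does not specify the filter the authors used.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_bands(x, cutoff):
    """Split x into a low band (|frequency| <= cutoff) and the remaining high band.

    The hard cutoff is a hypothetical choice made for this sketch.
    """
    n = len(x)
    spectrum = dft(x)
    # min(k, n - k) is the distance of bin k from the zero-frequency bin.
    low = [spectrum[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high = [spectrum[k] - low[k] for k in range(n)]
    return idft(low), idft(high)


# A row of pixels with one sharp edge in the middle.
row = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
low_band, high_band = split_bands(row, cutoff=0)
```

With `cutoff=0` the low band keeps only the mean intensity, so everything describing the edge ends up in the high band. A probe in the spirit of the entry would train one classifier on each band and compare accuracies.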
2405.18025
Report |
Unveiling the Power of Diffusion Features For Personalized Segmentation and Retrieval |
Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, Gal Chechik |
Personalized retrieval and segmentation aim to locate specific instances
within a dataset based on an input image and a short description of the
reference instance. While supervised methods are effective, they require
extensive labeled data for training. Recently, self-supervised foundation
models have been introduced to these tasks showing comparable results to
supervised methods. However, a significant flaw in these models is evident:
they struggle to locate a desired instance when other instances within the same
class are presented. In this paper, we explore text-to-image diffusion models
for these tasks. Specifically, we propose a novel approach called PDM for
Personalized Features Diffusion Matching, that leverages intermediate features
of pre-trained text-to-image models for personalization tasks without any
additional training. PDM demonstrates superior performance on popular retrieval
and segmentation benchmarks, outperforming even supervised methods. We also
highlight notable shortcomings in current instance and segmentation datasets
and propose new benchmarks for these tasks. |
This paper presents PDM, a novel zero-shot approach leveraging pre-trained Stable Diffusion features for personalized image retrieval and segmentation. |
Personalized retrieval and segmentation are important for various applications, but existing methods struggle to differentiate instances within the same class. This work explores the untapped potential of text-to-image diffusion models for these tasks. |
PDM extracts both appearance and semantic features from a specific layer and block within Stable Diffusion. Appearance similarity is calculated using a dot product between masked reference and target feature maps, while semantic similarity utilizes a score map between the class name token and target semantic features. These similarities are combined for retrieval ranking and segmentation. |
PDM outperforms state-of-the-art self-supervised and supervised methods on personalized image segmentation benchmarks, demonstrating its ability to accurately segment specific instances.
For personalized retrieval, PDM surpasses existing self-supervised and weakly-supervised techniques, achieving comparable results to supervised approaches, even on challenging benchmarks with multiple instances per class.
The authors introduce new benchmarks (PerMIR and PerMIS) for personalized retrieval and segmentation with multiple instances from the same object class, addressing limitations in current datasets. |
PDM relies on image inversion for feature extraction, making its performance dependent on the quality of image reconstruction.
Future work can explore optimizing the speed and efficiency of the feature extraction process. |
personalized image retrieval, personalized image segmentation, text-to-image diffusion models, stable diffusion, zero-shot learning |
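The PDM entry above describes appearance similarity as a dot product between masked reference features and target features, which is then combined with a semantic score. The sketch below illustrates that idea on toy feature vectors. Taking the maximum over reference locations and blending the two scores with a fixed `weight` are assumptions made for this example; the entry does not state how the authors aggregate or combine the similarities.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def appearance_scores(ref_feats, ref_mask, tgt_feats):
    """Score each target location against the masked reference features.

    ref_feats / tgt_feats: one feature vector per spatial location.
    ref_mask: 1 where the reference instance is present, 0 elsewhere.
    Max-aggregation over reference locations is a hypothetical choice.
    """
    masked = [f for f, m in zip(ref_feats, ref_mask) if m]
    if not masked:
        raise ValueError("reference mask selects no locations")
    return [max(dot(t, r) for r in masked) for t in tgt_feats]


def combine(appearance, semantic, weight=0.5):
    """Blend appearance and semantic scores (the weighting is an assumption)."""
    return [weight * a + (1.0 - weight) * s for a, s in zip(appearance, semantic)]


# Toy example: two reference locations, only the first belongs to the instance.
ref_feats = [[1.0, 0.0], [0.0, 1.0]]
ref_mask = [1, 0]
tgt_feats = [[0.9, 0.1], [0.0, 1.0]]
semantic = [0.8, 0.8]  # both target locations match the class name equally

appearance = appearance_scores(ref_feats, ref_mask, tgt_feats)
combined = combine(appearance, semantic)
```

Both target locations score the same semantically, so only the appearance term separates the location that resembles the reference instance from the other one. This is the same-class ambiguity the entry says existing methods struggle with.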
2405.17991
Report |
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections |
Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng |
Large language models (LLMs) have recently emerged as powerful tools for
tackling many language-processing tasks. Despite their success, training and
fine-tuning these models is still far too computationally and memory intensive.
In this paper, we identify and characterise the important components needed for
effective model convergence using gradient descent. In doing so we find that
the intermediate activations used to implement backpropagation can be
excessively compressed without incurring any degradation in performance. This
result leads us to a cheap and memory-efficient algorithm for both fine-tuning
and pre-training LLMs. The proposed algorithm simply divides the tokens up into
smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace
during the forward pass. These features are then coarsely reconstructed during
the backward pass to implement the update rules. We confirm the effectiveness
of our algorithm as being complementary to many state-of-the-art PEFT methods
on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for
fine-tuning LLaMA and show competitive performance against other
memory-efficient pre-training methods on the large-scale C4 dataset. |
This paper proposes VeLoRA, a novel memory-efficient training and fine-tuning algorithm for large neural networks, especially LLMs, by compressing intermediate activations using fixed rank-1 projections of sub-tokens. |
Training and fine-tuning large language models (LLMs) demand significant computational and memory resources, hindering broader accessibility and research. This work addresses this bottleneck by compressing intermediate activations for efficient gradient computation, enabling training with limited memory. |
VeLoRA divides input tokens into smaller sub-tokens and projects them onto a fixed one-dimensional subspace during the forward pass using a single, cheaply initialized projection vector. During backpropagation, a coarse reconstruction is performed for gradient calculation, significantly reducing the memory footprint. |
VeLoRA improves performance on VTAB-1k by 1.5 percentage points while lowering memory requirements compared to full fine-tuning and outperforms existing PEFT methods in terms of memory efficiency and/or accuracy.
On the GLUE benchmark using RoBERTa-Base, VeLoRA achieves the best overall results with significant memory improvements, outperforming both LoRA and GaLore.
VeLoRA demonstrates superior performance compared to QLoRA when fine-tuning LLaMA models on the Alpaca dataset, achieving higher accuracy while further reducing the memory footprint. |
The current study primarily focuses on Transformer models. Further research is needed to assess VeLoRA's applicability and effectiveness on other deep learning architectures, such as CNNs, RNNs, and SSMs.
While VeLoRA significantly reduces the memory footprint, the training time remains a challenge. Future work could explore techniques to further accelerate the training process without compromising accuracy. |
large language models, memory-efficient training, parameter-efficient fine-tuning (peft), activation compression, gradient sparsification |
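The VeLoRA entry above describes splitting each token into sub-tokens, projecting every sub-token onto a fixed one-dimensional subspace in the forward pass, and coarsely reconstructing it in the backward pass. A minimal sketch of that compress/reconstruct pair follows. The unit-norm projection vector `v`, the sub-token size, and the function names are assumptions for illustration; the entry does not say how the authors initialize the vector.

```python
import math


def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def compress(token, sub_size, v):
    """Split a token into sub-tokens and store one scalar per sub-token."""
    assert len(token) % sub_size == 0 and len(v) == sub_size
    coeffs = []
    for start in range(0, len(token), sub_size):
        sub = token[start:start + sub_size]
        coeffs.append(sum(a * b for a, b in zip(sub, v)))  # projection onto v
    return coeffs


def reconstruct(coeffs, v):
    """Coarsely rebuild the token: each sub-token becomes coeff * v."""
    out = []
    for c in coeffs:
        out.extend(c * b for b in v)
    return out


v = unit([1.0, 1.0])            # fixed projection vector (hypothetical choice)
token = [2.0, 2.0, 3.0, 1.0]    # one activation vector of width 4
coeffs = compress(token, sub_size=2, v=v)
approx = reconstruct(coeffs, v)
```

Here the 4-value token is stored as 2 scalars. The first sub-token lies in the span of `v` and is recovered exactly, while the second is only approximated, which is what makes the reconstruction coarse.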
2405.17965
Report |
AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization |
Junjie Shentu, Matthew Watson, Noura Al Moubayed |
With the unprecedented performance being achieved by text-to-image (T2I)
diffusion models, T2I customization further empowers users to tailor the
diffusion model to new concepts absent in the pre-training dataset, termed
subject-driven generation. Moreover, extracting several new concepts from a
single image enables the model to learn multiple concepts, and simultaneously
decreases the difficulties of training data preparation, urging the
disentanglement of multiple concepts to be a new challenge. However, existing
models for disentanglement commonly require pre-determined masks or retain
background elements. To this end, we propose an attention-guided method,
AttenCraft, for multiple concept disentanglement. In particular, our method
leverages self-attention and cross-attention maps to create accurate masks for
each concept within a single initialization step, omitting any required mask
preparation by humans or other models. The created masks are then applied to
guide the cross-attention activation of each target concept during training and
achieve concept disentanglement. Additionally, we introduce Uniform sampling
and Reweighted sampling schemes to alleviate the non-synchronicity of feature
acquisition from different concepts, and improve generation quality. Our method
outperforms baseline models in terms of image-alignment, and behaves comparably
on text-alignment. Finally, we showcase the applicability of AttenCraft to more
complicated settings, such as an input image containing three concepts. The
project is available at https://github.com/junjie-shentu/AttenCraft. |
This paper introduces AttenCraft, a novel method for disentangling multiple concepts from a single image in text-to-image customization, enabling subject-driven generation with multiple concepts learned from a single image. |
Current subject-driven text-to-image models primarily focus on images with a single new concept, neglecting the efficiency offered by extracting multiple concepts from a single image. AttenCraft addresses this limitation, facilitating customization with reduced data preparation demands. |
AttenCraft utilizes self-attention and cross-attention maps to generate accurate masks for each concept within a single initialization step, eliminating the need for manual labeling or specialized segmentation models. These masks guide cross-attention during training to disentangle concepts. The paper further introduces Uniform and Reweighted sampling schemes to enhance feature learning synchronicity across concepts. |
AttenCraft achieves superior image-alignment scores compared to baseline models, demonstrating effective concept disentanglement.
The method maintains comparable text-alignment scores with other disentangling models, indicating its ability to balance image reconstruction and editability.
AttenCraft's applicability extends to more complex scenarios, effectively disentangling up to three concepts from a single input image. |
The reliance on attention maps for mask creation makes AttenCraft susceptible to feature omission, especially if the pre-trained model struggles to differentiate visually similar concepts.
Future work could explore incorporating techniques to refine mask creation, minimizing the risk of feature omission and further enhancing the disentanglement capability. |
text-to-image generation, subject-driven generation, concept disentanglement, attention mechanism, diffusion models |
2405.17958
Report |
FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes |
Yunsong Wang, Tianxin Huang, Hanlin Chen, Gim Hee Lee |
Empowering 3D Gaussian Splatting with generalization ability is appealing.
However, existing generalizable 3D Gaussian Splatting methods are largely
confined to narrow-range interpolation between stereo images due to their heavy
backbones, thus lacking the ability to accurately localize 3D Gaussians and
support free-view synthesis across wide view range. In this paper, we present a
novel framework FreeSplat that is capable of reconstructing geometrically
consistent 3D scenes from long sequence input towards free-view
synthesis. Specifically, we first introduce Low-cost Cross-View Aggregation
achieved by constructing adaptive cost volumes among nearby views and
aggregating features using a multi-scale structure. Subsequently, we present
the Pixel-wise Triplet Fusion to eliminate redundancy of 3D Gaussians in
overlapping view regions and to aggregate features observed across multiple
views. Additionally, we propose a simple but effective free-view training
strategy that ensures robust view synthesis across broader view range
regardless of the number of views. Our empirical results demonstrate
state-of-the-art novel view synthesis performance in both novel view rendered
color maps quality and depth maps accuracy across different numbers of input
views. We also show that FreeSplat performs inference more efficiently and can
effectively reduce redundant Gaussians, offering the possibility of
feed-forward large scene reconstruction without depth priors. |
Presents FreeSplat, a novel framework for generalizable 3D Gaussian splatting that reconstructs geometrically consistent 3D scenes from long image sequences, enabling free view synthesis. |
Existing generalizable 3D Gaussian splatting methods are limited to narrow-range interpolation between stereo images, lacking the ability to accurately localize 3D Gaussians and support free view synthesis across wide view ranges. |
Introduces Low-cost Cross-View Aggregation for efficient feature extraction and matching using CNNs and adaptive cost volumes. Employs Pixel-wise Triplet Fusion to eliminate redundant 3D Gaussians and aggregate multi-view features. Proposes a Free-View Training strategy for robust view synthesis across broader view ranges. |
Achieves state-of-the-art novel view synthesis performance on ScanNet, outperforming previous methods in color image quality and depth map accuracy.
Demonstrates efficient inference and significant reduction in redundant Gaussians, enabling large scene reconstruction.
Shows superior zero-shot transfer results on Replica for view interpolation and depth estimation. |
GPU requirements become expensive when inputting extremely long image sequences.
Unsupervised depth estimation scheme leads to a gap in 3D reconstruction accuracy compared to methods with 3D supervision or RGB-D input. |
3d gaussian splatting, novel view synthesis, free view synthesis, indoor scene reconstruction, unsupervised depth estimation |
2405.17933
Report |
ToonCrafter: Generative Cartoon Interpolation |
Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong |
We introduce ToonCrafter, a novel approach that transcends traditional
correspondence-based cartoon video interpolation, paving the way for generative
interpolation. Traditional methods, which implicitly assume linear motion and
the absence of complicated phenomena like dis-occlusion, often struggle with
the exaggerated non-linear and large motions with occlusion commonly found in
cartoons, resulting in implausible or even failed interpolation results. To
overcome these limitations, we explore the potential of adapting live-action
video priors to better suit cartoon interpolation within a generative
framework. ToonCrafter effectively addresses the challenges faced when applying
live-action video motion priors to generative cartoon interpolation. First, we
design a toon rectification learning strategy that seamlessly adapts
live-action video priors to the cartoon domain, resolving the domain gap and
content leakage issues. Next, we introduce a dual-reference-based 3D decoder to
compensate for lost details due to the highly compressed latent prior spaces,
ensuring the preservation of fine details in interpolation results. Finally, we
design a flexible sketch encoder that empowers users with interactive control
over the interpolation results. Experimental results demonstrate that our
proposed method not only produces visually convincing and more natural
dynamics, but also effectively handles dis-occlusion. The comparative
evaluation demonstrates the notable superiority of our approach over existing
competitors. |
This paper presents ToonCrafter, a novel generative cartoon interpolation framework that leverages live-action video priors to overcome limitations of traditional correspondence-based methods. |
Traditional methods struggle with exaggerated, non-linear motions and dis-occlusion common in cartoons, resulting in implausible or inaccurate interpolation. |
The framework adapts a pre-trained image-conditioned video diffusion model using: (1) toon rectification learning to bridge the domain gap, (2) a dual-reference 3D decoder to enhance detail preservation, and (3) a sketch encoder for user control. |
Significantly outperforms state-of-the-art cartoon interpolation methods in quantitative and qualitative comparisons.
Effectively handles challenging cases with large non-linear motions and dis-occlusions.
Allows for user control over interpolation through sparse sketch input. |
Reliance on a pre-trained video diffusion model limits flexibility.
Future work includes exploring higher-resolution generation and more sophisticated user control mechanisms. |
cartoon animation, video interpolation, generative models, diffusion models, motion synthesis |
2405.17927
Report |
The Evolution of Multimodal Model Architectures |
Shakti N. Wadekar, Abhishek Chaurasia, Aman Chadha, Eugenio Culurciello |
This work uniquely identifies and characterizes four prevalent multimodal
model architectural patterns in the contemporary multimodal landscape.
Systematically categorizing models by architecture type facilitates monitoring
of developments in the multimodal domain. Distinct from recent survey papers
that present general information on multimodal architectures, this research
conducts a comprehensive exploration of architectural details and identifies
four specific architectural types. The types are distinguished by their
respective methodologies for integrating multimodal inputs into the deep neural
network model. The first two types (Type A and B) deeply fuse multimodal
inputs within the internal layers of the model, whereas the following two types
(Type C and D) facilitate early fusion at the input stage. Type-A employs
standard cross-attention, whereas Type-B utilizes custom-designed layers for
modality fusion within the internal layers. On the other hand, Type-C utilizes
modality-specific encoders, while Type-D leverages tokenizers to process the
modalities at the model's input stage. The identified architecture types aid
the monitoring of any-to-any multimodal model development. Notably, Type-C and
Type-D are currently favored in the construction of any-to-any multimodal
models. Type-C, distinguished by its non-tokenizing multimodal model
architecture, is emerging as a viable alternative to Type-D, which utilizes
input-tokenizing techniques. To assist in model selection, this work highlights
the advantages and disadvantages of each architecture type based on data and
compute requirements, architecture complexity, scalability, simplification of
adding modalities, training objectives, and any-to-any multimodal generation
capability. |
This paper identifies and characterizes four prevalent multimodal model architectural patterns (Type A, B, C, and D) in the contemporary multimodal landscape. |
Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain and aids in model selection for various tasks. |
The authors conduct a comprehensive exploration of architectural details in existing multimodal models, focusing on their methodologies for integrating multimodal inputs into deep neural networks. They categorize these methods into four distinct types based on the fusion strategy (deep or early) and the specific mechanisms employed. |
Type-C and Type-D are currently favored in the construction of any-to-any multimodal models.
Type-C, distinguished by its non-tokenizing approach, is emerging as a viable alternative to Type-D, which relies on input tokenization.
The choice between different types depends on factors like data and compute requirements, architecture complexity, scalability, and any-to-any modality generation capability. |
The list of models provided, while comprehensive, is not exhaustive.
Future work can investigate the potential of State Space Models (SSMs) for any-to-any multimodal tasks. |
multimodal learning, model architectures, deep fusion, early fusion, any-to-any modality |
2405.17913
Report |
OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision |
Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang |
Open-Vocabulary Detection (OVD) aims to detect objects from novel categories
beyond the base categories on which the detector is trained. However, existing
open-vocabulary detectors trained on known category data tend to assign higher
confidence to trained categories and confuse novel categories with background.
To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR
with Denoising text Query training and open-world
Unknown Objects supervision. Specifically, we introduce a
wildcard matching method that enables the detector to learn from pairs of
unknown objects recognized by the open-world detector and text embeddings with
general semantics, mitigating the confidence bias between base and novel
categories. Additionally, we propose a denoising text query training strategy
that synthesizes additional noisy query-box pairs from open-world unknown
objects to train the detector through contrastive learning, enhancing its
ability to distinguish novel objects from the background. We conducted
extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks,
achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel
categories respectively, without the need for additional training data. Models
and code are released at https://github.com/xiaomoguhz/OV-DQUO |
This paper presents OV-DQUO, an open-vocabulary object detection framework that leverages open-world unknown object supervision and denoising text query training to address the confidence bias issue in detecting novel categories. |
Existing open-vocabulary detectors, while performing well on known categories, exhibit lower confidence when detecting novel categories, often confusing them with background. This significantly limits their ability to generalize to unseen objects. |
OV-DQUO uses an open-world detector to generate proposals for potential unknown objects. It then leverages wildcard matching to associate these proposals with general semantic embeddings, enabling the detector to learn from them. Further, it employs a denoising text query training strategy with synthesized noisy data to improve distinguishing novel objects from the background. Lastly, it introduces a region of query interest selection mechanism that combines objectness and region-text similarity for improved proposal selection. |
OV-DQUO achieves state-of-the-art results on the OV-COCO and OV-LVIS benchmarks, surpassing existing methods by a significant margin.
The framework effectively mitigates the confidence bias issue, demonstrating a more balanced confidence distribution between base and novel categories.
OV-DQUO shows strong cross-dataset generalization capabilities, as demonstrated by its performance on the Objects365 dataset. |
The integration of open-world detection and open-vocabulary detection within a unified end-to-end framework remains underexplored and presents an avenue for future work.
Further investigation is needed to address the issue of false positive detections arising from similarities between category text embeddings. |
open-vocabulary detection, open-world detection, confidence bias, wildcard matching, denoising text query training |
2405.17891
Report |
A Refined 3D Gaussian Representation for High-Quality Dynamic Scene Reconstruction |
Bin Zhang, Bi Zeng, Zexin Peng |
In recent years, Neural Radiance Fields (NeRF) has revolutionized
three-dimensional (3D) reconstruction with its implicit representation.
Building upon NeRF, 3D Gaussian Splatting (3D-GS) has departed from the
implicit representation of neural networks and instead directly represents
scenes as point clouds with Gaussian-shaped distributions. While this shift has
notably elevated the rendering quality and speed of radiance fields, it has
inevitably led to a significant increase in memory usage. Additionally,
effectively rendering dynamic scenes in 3D-GS has emerged as a pressing
challenge. To address these concerns, this paper proposes a refined 3D Gaussian
representation for high-quality dynamic scene reconstruction. Firstly, we use a
deformable multi-layer perceptron (MLP) network to capture the dynamic offset
of Gaussian points and express the color features of points through hash
encoding and a tiny MLP to reduce storage requirements. Subsequently, we
introduce a learnable denoising mask coupled with denoising loss to eliminate
noise points from the scene, thereby further compressing the 3D Gaussian model.
Finally, motion noise of points is mitigated through static constraints and
motion consistency constraints. Experimental results demonstrate that our
method surpasses existing approaches in rendering quality and speed, while
significantly reducing the memory usage associated with 3D-GS, making it highly
suitable for various tasks such as novel view synthesis and dynamic mapping. |
This paper introduces a novel dynamic scene rendering framework that leverages a hybrid representation of hash encoding, deformation fields, and 3D Gaussians, along with denoising masks and motion consistency constraints to mitigate noise and improve rendering quality. |
Accurate and efficient rendering of dynamic scenes is crucial for various applications like AR, VR, and 3D content creation. Existing methods struggle to balance rendering quality, speed, and memory usage, particularly for dynamic scenes. |
The framework employs deformation fields to model dynamic offsets of Gaussian points, utilizes hash encoding with a tiny MLP for compact color representation, and introduces a learnable denoising mask to filter out noise points. Static and motion consistency constraints are incorporated to ensure accurate learning of dynamic offsets and consistent motion. |
The method achieves state-of-the-art performance on the NeRF-DS dataset for dynamic scene rendering.
It significantly reduces memory usage compared to existing 3D Gaussian Splatting-based methods while maintaining high rendering quality.
The framework demonstrates superior performance compared to NeRF-based approaches on synthetic datasets, particularly in preserving structural details and achieving higher PSNR and SSIM values. |
The combination of hash encoding and a tiny MLP might not fully capture high-frequency color details, potentially leading to less-detailed rendering in certain cases.
Inaccuracies in pose estimation within real-world datasets could result in blurring artifacts in rendered images. |
dynamic scene rendering, 3d gaussian splatting, deformation fields, hash encoding, denoising mask |
2405.17873
Report |
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization |
Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang |
Diffusion models have achieved significant visual generation quality.
However, their significant computational and memory costs pose challenges for
their application on resource-constrained mobile devices or even desktop GPUs.
Recent few-step diffusion models reduce the inference time by reducing the
denoising steps. However, their memory consumption is still excessive.
Post-Training Quantization (PTQ) replaces high bit-width FP representations with
low-bit integer values (INT4/8), which is an effective and efficient technique
to reduce the memory cost. However, when applied to few-step diffusion models,
existing quantization methods face challenges in preserving both the image
quality and text alignment. To address this issue, we propose a
mixed-precision quantization framework, MixDQ. First, we design a specialized
BOS-aware quantization method for highly sensitive text embedding quantization.
Then, we conduct metric-decoupled sensitivity analysis to measure the
sensitivity of each layer. Finally, we develop an integer-programming-based
method to conduct bit-width allocation. While existing quantization methods
fall short at W8A8, MixDQ achieves W8A8 without performance loss, and W4A8
with negligible visual degradation. Compared with FP16, we achieve 3-4x
reduction in model size and memory cost, and 1.45x latency speedup. |
This paper introduces MixDQ, a mixed-precision quantization method for memory-efficient few-step text-to-image diffusion models, addressing limitations of existing methods in preserving visual quality and text alignment. |
Few-step diffusion models, while fast, have large memory footprints, hindering deployment on memory-constrained devices. Existing quantization methods struggle to maintain quality and alignment in these models, especially in the challenging one-step setting. |
MixDQ employs three key components: (1) BOS-aware quantization to handle outlier values in text embeddings, (2) Metric-decoupled sensitivity analysis to separately assess impact on content and quality, (3) Integer-programming-based bit-width allocation for optimal mixed-precision configuration. |
MixDQ achieves W3.66A16 and W4A8 quantization for one-step SDXL-turbo with negligible performance degradation, while baselines struggle at W8A8.
It achieves a 3-4x reduction in model size and memory, and a 1.45x latency speedup compared to FP16 on Nvidia GPUs.
Ablation studies demonstrate the effectiveness of each component, with MixDQ outperforming baselines across fidelity (FID), alignment (CLIP Score), and human preference (ImageReward). |
MixDQ can be further improved by exploring specialized quantization techniques for other sensitive layers.
Future work can explore combining MixDQ with advanced quantization techniques like Adaround and quantization-aware training. |
diffusion models, quantization, text-to-image generation, mixed precision, model compression |
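The bit-width allocation step described for MixDQ can be illustrated with a minimal sketch. It assumes each layer has a size and a measured sensitivity per candidate bit-width, and it uses exhaustive search as a stand-in for an integer-programming solver. The function name `allocate_bits`, the candidate bit-widths, and all numbers are hypothetical and not taken from the paper.

```python
from itertools import product


def allocate_bits(sizes, sensitivity, budget_bits, choices=(2, 4, 8)):
    """Pick one bit-width per layer, minimizing total sensitivity under a size budget.

    sizes[i]: number of weights in layer i (hypothetical values).
    sensitivity[i][b]: quality drop if layer i is stored with b bits (hypothetical).
    budget_bits: maximum total storage in bits.
    Exhaustive search stands in for an integer-programming solver.
    """
    best_cost, best_plan = float("inf"), None
    for plan in product(choices, repeat=len(sizes)):
        total = sum(n * b for n, b in zip(sizes, plan))
        if total > budget_bits:
            continue  # plan violates the memory budget
        cost = sum(sensitivity[i][b] for i, b in enumerate(plan))
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost


# Toy example: three layers, average budget of 4 bits per weight.
sizes = [100, 400, 100]
sensitivity = [
    {2: 9.0, 4: 3.0, 8: 0.1},   # small, sensitive layer
    {2: 1.0, 4: 0.3, 8: 0.05},  # large, robust layer
    {2: 5.0, 4: 1.0, 8: 0.1},
]
plan, cost = allocate_bits(sizes, sensitivity, budget_bits=sum(sizes) * 4)
print(plan, cost)
```

In this toy setting the search keeps the small sensitive layers at 8 bits and pushes the large robust layer down to 2 bits, which is the kind of trade-off a mixed-precision allocation is meant to find.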
2405.17871
Report |
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment |
Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo |
Existing image-text modality alignment in Vision Language Models (VLMs)
treats each text token equally in an autoregressive manner. Despite being
simple and effective, this method results in sub-optimal cross-modal alignment
by over-emphasizing the text tokens that are less correlated with or even
contradictory with the input images. In this paper, we advocate for assigning
distinct contributions for each text token based on its visual correlation.
Specifically, we show that, by contrasting image inputs, the difference in
prediction logits on each text token provides strong guidance on visual
correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet
effective re-weighting strategy that prioritizes training visually correlated
tokens. Our experimental results demonstrate that CAL consistently improves
different types of VLMs across different resolutions and model sizes on various
benchmark datasets. Importantly, our method incurs minimal additional
computational overhead, rendering it highly efficient compared to alternative
data scaling strategies. Codes are available at
https://github.com/foundation-multimodal-models/CAL. |
This paper introduces Contrastive Alignment (CAL), a simple yet effective token re-weighting strategy for Vision Language Models (VLMs) that prioritizes training on visually correlated text tokens, leading to enhanced image-text modality alignment. |
Existing VLMs treat all text tokens equally during alignment, leading to sub-optimal performance due to the presence of visually irrelevant or contradictory tokens in training data. |
CAL leverages contrastive learning by analyzing the difference in prediction logits of text tokens with and without image inputs. This difference guides the re-weighting process, prioritizing visually correlated tokens during training. |
CAL consistently improves the performance of various VLMs (LLaVA, MiniGemini) across different model sizes and resolutions.
Significant performance gains are observed on various benchmarks, including visual question answering, image captioning, and grounding.
CAL effectively mitigates the negative impact of noisy labels in training data, leading to more robust VLM performance. |
The paper lacks a clear quantitative discrepancy measure between the three kinds of label tokens (visually correlated, irrelevant, contradictory).
The selection of lower and upper bounds for clamping in CAL is currently empirical and could be explored further for adaptability. |
vision language models, image-text alignment, contrastive learning, token re-weighting, multimodal understanding |
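The re-weighting idea behind CAL can be sketched as follows, assuming the weight of each label token is the clamped difference between its logit with and without the image. The clamp bounds `lo` and `hi`, the normalization by the weight sum, and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cal_weights(logits_with_img, logits_without_img, targets, lo=0.0, hi=2.0):
    """Weight each label token by its clamped logit gain when the image is present."""
    weights = []
    for with_img, without_img, t in zip(logits_with_img, logits_without_img, targets):
        delta = with_img[t] - without_img[t]
        weights.append(min(max(delta, lo), hi))
    return weights


def weighted_nll(logits_with_img, targets, weights):
    """Weighted negative log-likelihood, normalized by the total weight."""
    losses = [-log_softmax(l)[t] for l, t in zip(logits_with_img, targets)]
    total = max(sum(weights), 1e-8)  # guard against all-zero weights
    return sum(w * l for w, l in zip(weights, losses)) / total


# Two label tokens over a 3-word vocabulary (hypothetical logits).
with_img = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
without_img = [[0.5, 0.0, 0.0], [0.0, 1.2, 0.0]]
targets = [0, 1]
weights = cal_weights(with_img, without_img, targets)
loss = weighted_nll(with_img, targets, weights)
print(weights, loss)
```

The first token gains a lot from the image and receives the maximum weight; the second does not depend on the image and is effectively ignored in the loss.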
2405.17825
Report |
Diffusion Model Patching via Mixture-of-Prompts |
Seokil Ham, Sangmin Woo, Jin-Young Kim, Hyojun Go, Byeongjun Park, Changick Kim |
We present Diffusion Model Patching (DMP), a simple method to boost the
performance of pre-trained diffusion models that have already reached
convergence, with a negligible increase in parameters. DMP inserts a small,
learnable set of prompts into the model's input space while keeping the
original model frozen. The effectiveness of DMP is not merely due to the
addition of parameters but stems from its dynamic gating mechanism, which
selects and combines a subset of learnable prompts at every step of the
generative process (e.g., reverse denoising steps). This strategy, which we
term "mixture-of-prompts", enables the model to draw on the distinct expertise
of each prompt, essentially "patching" the model's functionality at every step
with minimal yet specialized parameters. Uniquely, DMP enhances the model by
further training on the same dataset on which it was originally trained, even
in a scenario where significant improvements are typically not expected due to
model convergence. Experiments show that DMP significantly enhances the
converged FID of DiT-L/2 on FFHQ 256x256 by 10.38%, achieved with only a 1.43%
parameter increase and 50K additional training iterations. |
Presents Diffusion Model Patching (DMP), a method to enhance pre-trained and converged diffusion models by inserting learnable prompts into the input space and dynamically combining them based on noise levels. |
Addresses the limitations of traditional fine-tuning for converged models and improves performance by introducing stage-specific capabilities. |
Utilizes learnable prompts added to the input space and a dynamic gating mechanism to select and combine prompts based on noise levels during denoising. |
DMP significantly improves FID scores on FFHQ, ImageNet, and MS-COCO datasets compared to baselines.
Further training a converged DiT-L/2 model with DMP achieves a 10.38% FID gain on FFHQ with minimal parameter increase.
Analysis reveals that DMP's success stems from its dynamic gating mechanism, which enables stage-specific prompt utilization. |
The fixed number of input patches limits the flexibility in the number of prompts.
Exploring alternative prompt integration methods while maintaining stable training is a potential future direction. |
diffusion models, prompt tuning, parameter-efficient fine-tuning, image generation, stage-specificity |
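The mixture-of-prompts gating described for DMP can be sketched roughly as below. The sketch assumes the gate scores for the current denoising step are already available (in DMP they would come from a learned gate conditioned on the noise level) and that the mixed prompt is added to the input. The top-k selection rule and all names are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def mixture_of_prompts(prompts, gate_scores, top_k):
    """Combine the top_k highest-scoring prompts, weighted by a softmax over their scores."""
    order = sorted(range(len(prompts)), key=lambda i: gate_scores[i], reverse=True)[:top_k]
    weights = softmax([gate_scores[i] for i in order])
    mixed = [0.0] * len(prompts[0])
    for w, i in zip(weights, order):
        for d, value in enumerate(prompts[i]):
            mixed[d] += w * value
    return mixed, order


def patch_input(x, mixed_prompt):
    """Add the mixed prompt to the (frozen) model's input features."""
    return [a + b for a, b in zip(x, mixed_prompt)]


# Three learnable prompts of dimension 2; scores are hypothetical gate outputs
# for one denoising step.
prompts = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mixed, chosen = mixture_of_prompts(prompts, gate_scores=[2.0, 2.0, -1.0], top_k=2)
patched = patch_input([10.0, 10.0], mixed)
print(chosen, mixed, patched)
```

Because the scores change with the denoising step, a different subset of prompts can be active at each step while the base model stays frozen.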
2405.17815
Report |
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model |
Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang |
In the realm of Multimodal Large Language Models (MLLMs), the vision-language
connector plays a crucial role in linking the pre-trained vision encoders with
Large Language Models (LLMs). Despite its importance, the vision-language
connector has been relatively less explored. In this study, we aim to propose a
strong vision-language connector that enables MLLMs to achieve high accuracy
while maintaining low computation cost. We first reveal the existence of the
visual anchors in Vision Transformer and propose a cost-effective search
algorithm to extract them. Building on these findings, we introduce the Anchor
Former (AcFormer), a novel vision-language connector designed to leverage the
rich prior knowledge obtained from these visual anchors during pretraining,
guiding the aggregation of information. Through extensive experimentation, we
demonstrate that the proposed method significantly reduces computational costs
by nearly two-thirds compared with baseline, while simultaneously outperforming
baseline methods. This highlights the effectiveness and efficiency of AcFormer. |
This paper introduces Anchor Former (AcFormer), a novel vision-language connector for Multimodal Large Language Models (MLLMs) that leverages visual anchors for efficient and accurate information aggregation. |
Existing vision-language connectors in MLLMs either suffer from high computational costs due to redundant visual tokens or exhibit decreased accuracy when using learnable queries as aggregators. AcFormer aims to address these limitations by identifying and utilizing more effective information aggregators. |
The authors analyze visual feature maps and attention maps from pre-trained Vision Transformers to reveal the existence of "visual anchors" crucial for information aggregation. They propose a cost-effective progressive search algorithm to extract these anchors. AcFormer then employs these anchors as Information Aggregators within a cross-attention module to generate a dense visual representation for LLM input. |
AcFormer achieves comparable or superior performance to baseline models with significantly fewer visual tokens (e.g., 145 or 257 compared to 577 in LLaVA-1.5), resulting in reduced computational cost and increased speed.
Ablation studies validate the efficacy of using visual anchors as Information Aggregators compared to pooling, learnable queries, or randomly selected tokens.
Experiments on various benchmarks, including those requiring fine-grained visual perception, demonstrate AcFormer's effectiveness across different tasks. |
The study is limited by computational resources, preventing exploration of larger training datasets and model sizes.
Further theoretical analysis is needed to better understand the emergence and properties of visual anchors. |
multimodal large language models, vision-language connectors, information aggregation, visual anchors, computational efficiency |
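The aggregation step described for AcFormer can be sketched as cross-attention in which a few anchor tokens act as queries over all visual tokens. This sketch omits learned projections and takes the anchor indices as given; the progressive search that finds them is not reproduced here, and `anchor_ids` is a hypothetical input.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out


def aggregate_with_anchors(tokens, anchor_ids):
    """Use selected anchor tokens as queries over all visual tokens."""
    anchors = [tokens[i] for i in anchor_ids]
    return cross_attention(anchors, tokens, tokens)


# Four visual tokens reduced to two aggregated tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
reduced = aggregate_with_anchors(tokens, anchor_ids=[0, 2])
print(len(tokens), "->", len(reduced))
```

The output length equals the number of anchors, which is how the token count passed to the LLM shrinks.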
2405.17811
Report |
Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh |
Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, Wenbo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, Long Quan |
Neural 3D representations, such as Neural Radiance Fields (NeRF), excel at
producing photo-realistic rendering results but lack the flexibility for
manipulation and editing which is crucial for content creation. Previous works
have attempted to address this issue by deforming a NeRF in canonical space or
manipulating the radiance field based on an explicit mesh. However,
manipulating NeRF is not highly controllable and requires a long training and
inference time. With the emergence of 3D Gaussian Splatting (3DGS), extremely
high-fidelity novel view synthesis can be achieved using an explicit
point-based 3D representation with much faster training and rendering speed.
However, there is still a lack of effective means to manipulate 3DGS freely
while maintaining rendering quality. In this work, we aim to tackle the
challenge of achieving manipulable photo-realistic rendering. We propose to
utilize a triangular mesh to manipulate 3DGS directly with self-adaptation.
This approach reduces the need to design various algorithms for different types
of Gaussian manipulation. By utilizing a triangle shape-aware Gaussian binding
and adapting method, we can achieve 3DGS manipulation and preserve
high-fidelity rendering after manipulation. Our approach is capable of handling
large deformations, local manipulations, and soft body simulations while
keeping high-quality rendering. Furthermore, we demonstrate that our method is
also effective with inaccurate meshes extracted from 3DGS. Experiments
conducted demonstrate the effectiveness of our method and its superiority over
baseline approaches. |
This paper proposes Mani-GS, a novel method for manipulating 3D Gaussian Splatting (3DGS) representations using a triangular mesh as a proxy, enabling photo-realistic rendering of manipulated objects. |
Manipulating 3D content while preserving rendering quality is crucial for various applications, including content creation, gaming, and VR/AR. Existing NeRF-based editing methods are either inflexible or computationally expensive. This work addresses these limitations by using 3DGS, which offers high fidelity and fast rendering but lacks efficient manipulation methods. |
Mani-GS first extracts a triangular mesh from 3DGS or a neural surface field. Then, it introduces a triangle shape-aware Gaussian binding strategy, where Gaussians are bound to triangles in a local coordinate system and their attributes are optimized. Finally, mesh manipulation is directly transferred to 3DGS, leading to self-adaptation of Gaussian attributes and achieving manipulable rendering. |
Mani-GS outperforms previous editing methods (NeRF-Editing, SuGaR) in terms of rendering quality, achieving higher PSNR, SSIM, and lower LPIPS on the NeRF Synthetic dataset.
The method supports various manipulations, including large deformations, local manipulations (blending, reposing, elastic deformation), and soft body simulations, all while maintaining high-quality rendering.
Mani-GS exhibits robustness to mesh accuracy and can generate plausible results even with inaccurate meshes extracted from 3DGS. |
Highly non-rigid deformations on the mesh may lead to rendering distortions.
Simulating physics on high-resolution meshes is computationally expensive, and the rendering may suffer from boundary inaccuracies if the extracted mesh has significant discrepancies from the ground truth. |
gaussian splatting, 3dgs manipulation, photo-realistic rendering, mesh-based editing, triangle shape-aware binding |
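The idea of binding a Gaussian to a triangle in a local coordinate system can be illustrated with a simplified position-only sketch. It stores barycentric coordinates plus a signed offset along the triangle normal and recomputes the position after the mesh moves. This is an assumption-laden simplification: the triangle shape-aware adaptation of rotation and scale in Mani-GS is not modeled, and the function names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit_normal(tri):
    a, b, c = tri
    n = cross(sub(b, a), sub(c, a))
    return scale(n, 1.0 / math.sqrt(dot(n, n)))


def bind(point, tri):
    """Return (barycentric coords, signed normal offset) of point relative to tri."""
    a, b, c = tri
    n = unit_normal(tri)
    h = dot(sub(point, a), n)
    p = sub(point, scale(n, h))  # projection onto the triangle plane
    v0, v1, v2 = sub(b, a), sub(c, a), sub(p, a)
    d00, d01, d11 = dot(v0, v0), dot(v0, v1), dot(v1, v1)
    d20, d21 = dot(v2, v0), dot(v2, v1)
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return (1.0 - v - w, v, w), h


def unbind(bary, h, tri):
    """Recover the world-space position from local coordinates and a (deformed) triangle."""
    a, b, c = tri
    u, v, w = bary
    base = add(add(scale(a, u), scale(b, v)), scale(c, w))
    return add(base, scale(unit_normal(tri), h))


# Bind a Gaussian center to a triangle, then rotate the triangle 90 degrees about z.
tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
bary, h = bind((0.25, 0.25, 0.5), tri)
rotated = ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0))
moved = unbind(bary, h, rotated)
print(bary, h, moved)
```

Because the Gaussian is stored relative to the triangle, any edit applied to the mesh carries over to the Gaussian without a separate per-manipulation algorithm.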
2405.17790
Report |
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification |
Weizhen He, Yiheng Deng, Yunfeng Yan, Feng Zhu, Yizhou Wang, Lei Bai, Qingsong Xie, Donglian Qi, Wanli Ouyang, Shixiang Tang |
Human intelligence can retrieve any person according to both visual and
language descriptions. However, the current computer vision community studies
specific person re-identification (ReID) tasks in different scenarios
separately, which limits the applications in the real world. This paper strives
to resolve this problem by proposing a novel instruct-ReID task that requires
the model to retrieve images according to the given image or language
instructions. Instruct-ReID is the first exploration of a general ReID setting,
where 6 existing ReID tasks can be viewed as special cases by assigning
different instructions. To facilitate research in this new instruct-ReID task,
we propose a large-scale OmniReID++ benchmark equipped with diverse data and
comprehensive evaluation methods, e.g., task-specific and task-free evaluation
settings. In the task-specific evaluation setting, gallery sets are categorized
according to specific ReID tasks. We propose a novel baseline model, IRM, with
an adaptive triplet loss to handle various retrieval tasks within a unified
framework. For the task-free evaluation setting, where target person images are
retrieved from task-agnostic gallery sets, we further propose a new method
called IRM++ with novel memory bank-assisted learning. Extensive evaluations of
IRM and IRM++ on the OmniReID++ benchmark demonstrate the superiority of our
proposed methods, achieving state-of-the-art performance on 10 test sets. The
datasets, the model, and the code will be available at
https://github.com/hwz-zju/Instruct-ReID |
This paper proposes a novel Instruct-ReID task, a unified framework for person re-identification that incorporates instructions, encompassing six existing ReID tasks. |
Current ReID methods focus on specific scenarios, leading to high deployment costs and limited performance. Instruct-ReID allows one model to handle multiple tasks, improving efficiency and leveraging diverse data for better performance. |
The paper introduces the OmniReID++ benchmark, extending OmniReID with diverse data and evaluation methods. It proposes two models: IRM with adaptive triplet loss for task-specific evaluation and IRM++ with memory bank contrastive learning for task-free evaluation. |
IRM achieves state-of-the-art results on 10 datasets across 6 ReID tasks under the task-specific evaluation setting.
IRM++ achieves significant improvement over IRM and existing state-of-the-art methods on the task-free evaluation setting.
The paper proposes a novel evaluation metric, mAPτ, considering both identity correctness and instruction consistency, providing a more accurate performance evaluation. |
Domain gaps between synthetic and real datasets in CC-ReID require further investigation.
Selecting appropriate thresholds for the mAPτ metric warrants future research. |
person re-identification, multitask learning, benchmark, instruction-guided retrieval, adaptive triplet loss |
2405.17705
Report |
DC-Gaussian: Improving 3D Gaussian Splatting for Reflective Dash Cam Videos |
Linhan Wang, Kai Cheng, Shuo Lei, Shengkun Wang, Wei Yin, Chenyang Lei, Xiaoxiao Long, Chang-Tien Lu |
We present DC-Gaussian, a new method for generating novel views from
in-vehicle dash cam videos. While neural rendering techniques have made
significant strides in driving scenarios, existing methods are primarily
designed for videos collected by autonomous vehicles. However, these videos are
limited in both quantity and diversity compared to dash cam videos, which are
more widely used across various types of vehicles and capture a broader range
of scenarios. Dash cam videos often suffer from severe obstructions such as
reflections and occlusions on the windshields, which significantly impede the
application of neural rendering techniques. To address this challenge, we
develop DC-Gaussian based on the recent real-time neural rendering technique 3D
Gaussian Splatting (3DGS). Our approach includes an adaptive image
decomposition module to model reflections and occlusions in a unified manner.
Additionally, we introduce illumination-aware obstruction modeling to manage
reflections and occlusions under varying lighting conditions. Lastly, we employ
a geometry-guided Gaussian enhancement strategy to improve rendering details by
incorporating additional geometry priors. Experiments on self-captured and
public dash cam videos show that our method not only achieves state-of-the-art
performance in novel view synthesis, but also accurately reconstructs
captured scenes free of obstructions. |
This paper introduces DC-Gaussian, a novel method for generating novel views from dash cam videos while removing obstructions like reflections and occlusions. |
Dash cam videos are abundant and diverse, offering valuable data for autonomous driving applications. However, existing neural rendering techniques struggle with obstructions common in these videos, hindering their use. |
DC-Gaussian builds upon 3D Gaussian Splatting (3DGS) and incorporates: 1) Adaptive image decomposition to separate background and obstructions. 2) Illumination-aware Obstruction Modeling (IOM) with a Latent Intensity Modulation (LIM) module to handle varying lighting. 3) Geometry-guided Gaussian Enhancement (G3E) to refine geometry using multi-view stereo. |
DC-Gaussian outperforms state-of-the-art methods in novel view synthesis on BDD100K and DCVR datasets.
The method effectively removes obstructions, producing high-fidelity renderings of both background and obstruction layers.
Ablation studies demonstrate the contribution of each proposed module (AD, IOM, LIM, G3E) to the overall performance. |
Currently limited to single-sequence videos.
Future work could explore extending DC-Gaussian to multi-sequence videos for leveraging denser views. |
novel view synthesis, 3d gaussian splatting, dash cam videos, obstruction removal, illumination-aware modeling |
2405.17673
Report |
Fast Samplers for Inverse Problems in Iterative Refinement Models |
Kushagra Pandey, Ruihan Yang, Stephan Mandt |
Constructing fast samplers for unconditional diffusion and flow-matching
models has received much attention recently; however, existing methods for
solving inverse problems, such as super-resolution, inpainting, or deblurring,
still require hundreds to thousands of iterative steps to obtain high-quality
results. We propose a plug-and-play framework for constructing efficient
samplers for inverse problems, requiring only pre-trained diffusion or
flow-matching models. We present Conditional Conjugate Integrators, which
leverage the specific form of the inverse problem to project the respective
conditional diffusion/flow dynamics into a more amenable space for sampling.
Our method complements popular posterior approximation methods for solving
inverse problems using diffusion/flow models. We evaluate the proposed method's
performance on various linear image restoration tasks across multiple datasets,
employing diffusion and flow-matching models. Notably, on challenging inverse
problems like 4x super-resolution on the ImageNet dataset, our method
can generate high-quality samples in as few as 5 conditional sampling steps and
outperforms competing baselines requiring 20-1000 steps. Our code and models
will be publicly available at https://github.com/mandt-lab/CI2RM. |
A plug-and-play framework called Conditional Conjugate Integrators (CCI) for constructing efficient samplers for inverse problems using pre-trained diffusion or flow-matching models. |
Existing methods for solving inverse problems with diffusion/flow models are slow, requiring hundreds to thousands of iterative steps for high-quality results. CCI accelerates these samplers by an order of magnitude. |
CCI leverages the structure of linear inverse problems to project conditional diffusion/flow dynamics into a more amenable space for sampling. It separates linear and non-linear components and parameterizes the transformation by analytically solving the linear coefficients. |
CCI significantly improves sampling efficiency on challenging benchmarks, like super-resolution, inpainting, and Gaussian deblurring.
On 4x super-resolution on ImageNet, CCI achieves better sample quality in 5 steps than baselines in 20-1000 steps.
The method demonstrates a tradeoff between guidance weight and sample quality, allowing control over artifact generation. |
The current implementation relies on an Euler solver; performance could be further improved with advanced solvers.
A more principled framework for non-linear inverse problems needs to be developed. |
diffusion models, flow matching, inverse problems, fast sampling, image restoration |
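For readers unfamiliar with the term, a linear inverse problem of the kind CCI targets has the form y = A x, where A is a known linear degradation. As a minimal illustration, the sketch below models 4x super-resolution with average pooling as A. The choice of average pooling is an assumption for illustration; the paper's exact degradation operators are not reproduced here.

```python
def downsample(image, factor=4):
    """Average-pool a 2D image (list of rows) by `factor`.

    This is a linear operator A, so the observation is y = A x.
    Assumes height and width are divisible by `factor`.
    """
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [image[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# An 8x8 image whose pixel value equals its row index.
image = [[float(i)] * 8 for i in range(8)]
low_res = downsample(image)
print(low_res)
```

Recovering `image` from `low_res` is the inverse problem; the sampler's task is to produce a plausible high-resolution sample consistent with this observation.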
2405.17661
Report |
RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance |
Jiaojiao Fan, Haotian Xue, Qinsheng Zhang, Yongxin Chen |
There is a rapidly growing interest in controlling consistency across
multiple generated images using diffusion models. Among various methods, recent
works have found that simply manipulating attention modules by concatenating
features from multiple reference images provides an efficient approach to
enhancing consistency without fine-tuning. Despite its popularity and success,
few studies have elucidated the underlying mechanisms that contribute to its
effectiveness. In this work, we reveal that the popular approach is a linear
interpolation of image self-attention and cross-attention between synthesized
content and reference features, with a constant rank-1 coefficient. Motivated
by this observation, we find that a rank-1 coefficient is not necessary and
that removing this constraint simplifies the controllable generation mechanism. The resulting algorithm,
which we coin as RefDrop, allows users to control the influence of reference
context in a direct and precise manner. Besides further enhancing consistency
in single-subject image generation, our method also enables more interesting
applications, such as the consistent generation of multiple subjects,
suppressing specific features to encourage more diverse content, and
high-quality personalized video generation by boosting temporal consistency.
Even compared with state-of-the-art image-prompt-based generators, such as
IP-Adapter, RefDrop is competitive in terms of controllability and quality
while avoiding the need to train a separate image encoder for feature injection
from reference images, making it a versatile plug-and-play solution for any
image or video diffusion model. |
This paper introduces RefDrop, a training-free, plug-and-play method designed to provide flexible control over consistency in image and video generation by modifying the self-attention mechanism in diffusion models. |
Controllable consistency in image and video generation is crucial for various applications but remains a challenge for foundational generative models. Existing methods are often limited by computational cost, data requirements, or lack of flexibility. |
The authors reformulate concatenated attention, a popular method for consistency generation, as a linear interpolation scheme. Building upon this, they propose Reference Feature Guidance, a flexible extension that allows for explicit control over the influence of reference images in attention modules. |
RefDrop achieves state-of-the-art results in controlling consistency for single and multi-subject image generation, outperforming baselines like IP-Adapter and BLIPD.
The method enables novel applications such as blending features from multiple images and encouraging diversity in generated images by using negative coefficients.
In video generation, RefDrop significantly improves temporal consistency and stabilizes personalized video generation, effectively reducing flickering and preserving motion. |
The model sometimes struggles to accurately reproduce specific objects in consistent image generation.
Future work could explore using attention masks for more precise control and extending the method to accept clean reference images as input. |
diffusion models, image generation, video generation, consistency control, attention mechanisms |
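The interpolation view of RefDrop can be sketched as a blend of ordinary self-attention with cross-attention to reference features, controlled by a single scalar coefficient `c`. The sketch below omits learned projections and multi-head structure; the function names and toy values are assumptions, and negative `c` corresponds to the diversity-encouraging use mentioned above.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(wi * v[d] for wi, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out


def reference_guided_attention(q, k, v, k_ref, v_ref, c):
    """Blend (1 - c) * self-attention with c * cross-attention to the reference."""
    self_out = attention(q, k, v)
    ref_out = attention(q, k_ref, v_ref)
    return [[(1.0 - c) * s + c * r for s, r in zip(s_row, r_row)]
            for s_row, r_row in zip(self_out, ref_out)]


# One query token, one own key/value, one reference key/value (toy values).
q, k, v = [[1.0, 0.0]], [[1.0, 0.0]], [[2.0, 4.0]]
k_ref, v_ref = [[0.0, 1.0]], [[10.0, 0.0]]
pulled = reference_guided_attention(q, k, v, k_ref, v_ref, c=0.25)
pushed = reference_guided_attention(q, k, v, k_ref, v_ref, c=-0.5)
print(pulled, pushed)
```

Setting `c = 0` recovers plain self-attention, a positive `c` pulls the output toward the reference, and a negative `c` pushes it away.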
2405.17532
Report |
ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance |
Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Yunchao Wei |
Recent text-to-image customization works have been proven successful in
generating images of given concepts by fine-tuning the diffusion models on a
few examples. However, these methods tend to overfit the concepts, resulting in
failure to create the concept under multiple conditions (e.g., the headphone is
missing when generating "a dog wearing a headphone"). Interestingly, we
notice that the base model before fine-tuning exhibits the capability to
compose the base concept with other elements (e.g., "a dog wearing a headphone"),
implying that the compositional ability only disappears after personalization
tuning. Inspired by this observation, we present ClassDiffusion, a simple
technique that leverages a semantic preservation loss to explicitly regulate
the concept space when learning the new concept. Despite its simplicity, this
helps avoid semantic drift when fine-tuning on the target concepts. Extensive
qualitative and quantitative experiments demonstrate that the use of semantic
preservation loss effectively improves the compositional abilities of the
fine-tuned models. In response to the ineffective evaluation of CLIP-T metrics,
we introduce the BLIP2-T metric, a more equitable and effective evaluation metric
for this particular domain. We also provide an in-depth empirical study and
theoretical analysis to better understand the role of the proposed loss.
Lastly, we also extend our ClassDiffusion to personalized video generation,
demonstrating its flexibility. |
This paper introduces ClassDiffusion, a technique to improve the compositional ability of personalized text-to-image generation models by using a semantic preservation loss during fine-tuning. |
Existing personalized text-to-image models often struggle to combine customized concepts with other elements in a prompt due to overfitting during fine-tuning. |
The paper analyzes the semantic drift in text space and cross-attention strength after fine-tuning. It proposes a semantic preservation loss to minimize the semantic drift of personalized concepts from their superclasses, thus retaining the ability to combine them with other elements. |
ClassDiffusion effectively recovers the compositional ability of personalized text-to-image models, as demonstrated by qualitative and quantitative experiments.
The paper introduces BLIP2-T Score as a more equitable and effective evaluation metric for image-text alignment compared to CLIP-T.
ClassDiffusion also demonstrates potential in personalized video generation, showcasing its flexibility. |
The applicability of ClassDiffusion to human-driven personalized generation, particularly for reconstructing human faces, needs further exploration.
Selecting an appropriate center word for objects with combined categories requires experimentation. |
text-to-image generation, personalized image synthesis, compositional generation, semantic preservation, diffusion models |
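A semantic preservation loss of the kind ClassDiffusion describes can be sketched as a distance between the text embedding of the personalized phrase and that of its superclass phrase. The sketch assumes cosine distance and a simple weighted sum with the usual fine-tuning loss; the distance choice, the weight, and the toy embeddings are assumptions rather than the paper's exact definition.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def semantic_preservation_loss(personalized_emb, class_emb):
    """Cosine distance between the personalized phrase and its superclass phrase."""
    return 1.0 - cosine_similarity(personalized_emb, class_emb)


def total_loss(reconstruction_loss, spl, weight=1.0):
    """Fine-tuning objective: reconstruction term plus weighted preservation term.

    `weight` is a hypothetical hyperparameter.
    """
    return reconstruction_loss + weight * spl


# Toy embeddings: no drift versus complete drift from the class phrase.
no_drift = semantic_preservation_loss([3.0, 4.0], [3.0, 4.0])
full_drift = semantic_preservation_loss([1.0, 0.0], [0.0, 1.0])
print(no_drift, full_drift, total_loss(0.2, full_drift, weight=0.5))
```

Keeping this term small during fine-tuning discourages the personalized concept from drifting away from its class, which is the mechanism credited with preserving compositional ability.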
2405.17531
Report |
Evolutive Rendering Models |
Fangneng Zhan, Hanxue Liang, Yifan Wang, Michael Niemeyer, Michael Oechsle, Adam Kortylewski, Cengiz Oztireli, Gordon Wetzstein, Christian Theobalt |
The landscape of computer graphics has undergone significant transformations
with the recent advances of differentiable rendering models. These rendering
models often rely on heuristic designs that may not fully align with the final
rendering objectives. We address this gap by pioneering evolutive
rendering models, a methodology where rendering models possess the ability to
evolve and adapt dynamically throughout the rendering process. In particular,
we present a comprehensive learning framework that enables the optimization of
three principal rendering elements, including the gauge transformations, the
ray sampling mechanisms, and the primitive organization. Central to this
framework is the development of differentiable versions of these rendering
elements, allowing for effective gradient backpropagation from the final
rendering objectives. A detailed analysis of gradient characteristics is
performed to facilitate a stable and goal-oriented evolution of these elements. Our
extensive experiments demonstrate the large potential of evolutive rendering
models for enhancing the rendering performance across various domains,
including static and dynamic scene representations, generative modeling, and
texture mapping. |
Introduces Evolutive Rendering Models (ERMs) that replace heuristic design choices in rendering models with learnable components optimized for specific rendering objectives. |
Traditional rendering models rely on fixed, potentially sub-optimal heuristics. ERMs address this by enabling autonomous adaptation throughout the rendering process, leading to improved performance. |
Introduces differentiable versions of three key rendering elements: gauge transformations, ray sampling, and primitive organization. This allows gradient-based optimization directly from the final rendering objective using a novel relay learning mechanism. |
Evolutive gauge transformations enhance rendering quality in static, dynamic, and generative modeling.
Evolutive ray sampling improves both the efficiency and quality of volumetric rendering.
Evolutive primitive organization, particularly in Gaussian Splatting, leads to faster training, reduced memory footprint, and improved visual details. |
Current work focuses on evolving individual elements; integrating all three remains unexplored.
The added learnable components typically result in increased training time. |
neural rendering, differentiable rendering, gauge transformation, ray sampling, primitive organization |
2405.17472
Report |
FreezeAsGuard: Mitigating Illegal Adaptation of Diffusion Models via Selective Tensor Freezing |
Kai Huang, Wei Gao |
Text-to-image diffusion models can be fine-tuned in custom domains to adapt
to specific user preferences, but such unconstrained adaptability has also been
utilized for illegal purposes, such as forging public figures' portraits and
duplicating copyrighted artworks. Most existing work focuses on detecting the
illegally generated content, but cannot prevent or mitigate illegal
adaptations of diffusion models. Other schemes of model unlearning and
reinitialization, similarly, cannot prevent users from relearning the knowledge
of illegal model adaptation with custom data. In this paper, we present
FreezeAsGuard, a new technique that addresses these limitations and enables
irreversible mitigation of illegal adaptations of diffusion models. The basic
approach is that the model publisher selectively freezes tensors in pre-trained
diffusion models that are critical to illegal model adaptations, to mitigate
the fine-tuned model's representation power in illegal domains but minimize the
impact on legal model adaptations in other domains. Such tensor freezing can be
enforced via APIs provided by the model publisher for fine-tuning, and can motivate
users' adoption due to its computational savings. Experiment results with
datasets in multiple domains show that FreezeAsGuard provides stronger power in
mitigating illegal model adaptations of generating fake public figures'
portraits, while having the minimum impact on model adaptation in other legal
domains. The source code is available at:
https://github.com/pittisl/FreezeAsGuard/ |
This paper introduces FreezeAsGuard, a novel technique to irreversibly mitigate illegal adaptations of text-to-image diffusion models (e.g., generating fake portraits) by selectively freezing critical tensors during fine-tuning. |
Existing methods for mitigating misuse of open-sourced diffusion models, like watermarking and unlearning, are reversible and cannot prevent re-learning illegal knowledge through fine-tuning. |
FreezeAsGuard uses bilevel optimization to train a binary mask indicating which tensors to freeze. This mask is optimized to maximize degradation in illegal domains (e.g., specific public figures) while minimizing impact on performance in innocent domains (e.g., logos, clothes). |
FreezeAsGuard effectively mitigates generating fake portraits, reducing image quality by 14% compared to fully fine-tuned models, making subjects unrecognizable.
It minimally impacts legal adaptations, achieving comparable or better image quality in innocent domains than unlearning methods.
It offers computational benefits, saving up to 48% GPU memory and 21% time during fine-tuning. |
The optimal freezing ratio may vary across different diffusion models and illegal domain scales.
Future work includes exploring other applications of FreezeAsGuard for various generative models. |
diffusion models, generative ai, model misuse, illegal content mitigation, tensor freezing |
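The FreezeAsGuard record above describes a binary mask that marks which tensors stay frozen during fine-tuning. The sketch below only illustrates how such a mask could gate a plain SGD update; it does not reproduce the bilevel optimization that learns the mask. The function `masked_sgd_step`, the tensor names, and the values are hypothetical and chosen for illustration.

```python
# Hypothetical illustration: names and values are not taken from the paper's code.
def masked_sgd_step(params, grads, freeze_mask, lr=0.1):
    """Apply one SGD step, leaving every tensor with mask value 1 untouched."""
    updated = {}
    for name, values in params.items():
        if freeze_mask.get(name, 0) == 1:
            # Frozen tensor: copy the weights through without any update.
            updated[name] = list(values)
        else:
            # Trainable tensor: ordinary gradient descent update.
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
    return updated


params = {"attn.q": [1.0, 2.0], "attn.k": [0.5, -0.5]}
grads = {"attn.q": [1.0, 1.0], "attn.k": [1.0, -1.0]}
freeze_mask = {"attn.q": 1, "attn.k": 0}  # 1 = frozen, 0 = trainable
new_params = masked_sgd_step(params, grads, freeze_mask)
```

Here `attn.q` keeps its pre-trained values while `attn.k` moves along its gradient, which is also why freezing can reduce memory and compute: frozen tensors need no gradients or optimizer state.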
2405.17461
Report |
EMR-Merging: Tuning-Free High-Performance Model Merging |
Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, Wanli Ouyang |
The success of the pretrain-finetune paradigm has brought about the release of
numerous model weights. In this case, merging models finetuned on different
tasks to enable a single model with multi-task capabilities is gaining
increasing attention for its practicability. Existing model merging methods
usually suffer from (1) significant performance degradation or (2) requiring
tuning by additional data or training. In this paper, we rethink and analyze
the existing model merging paradigm. We discover that using a single model's
weights can hardly simulate all the models' performance. To tackle this issue,
we propose Elect, Mask & Rescale-Merging (EMR-Merging). We first (a) elect a
unified model from all the model weights and then (b) generate extremely
lightweight task-specific modulators, including masks and rescalers, to align
the direction and magnitude between the unified model and each specific model,
respectively. EMR-Merging is tuning-free, thus requiring no data availability
or any additional training while showing impressive performance. We find that
EMR-Merging shows outstanding performance compared to existing merging methods
under different classical and newly-established settings, including merging
different numbers of vision models (up to 30), NLP models, PEFT models, and
multi-modal models. |
This paper proposes EMR-Merging, a novel, tuning-free model merging method that combines a unified task vector with lightweight, task-specific modulators (masks and rescalers) to improve the performance of merged models. |
Model merging is important for reducing storage and deployment costs associated with using multiple single-task models. Existing methods suffer from performance degradation or require tuning with additional data or training. |
EMR-Merging first elects a unified task vector from multiple task-specific vectors, maximizing shared sign and magnitude information. Then, it generates task-specific masks to align direction and rescalers to align magnitude with individual task vectors. |
EMR-Merging significantly outperforms existing merging methods on various vision, NLP, PEFT, and multi-modal benchmarks.
The method achieves performance comparable to traditional multi-task learning (MTL) but without requiring additional data or training.
EMR-Merging maintains strong performance even when merging a large number of models (up to 30) on challenging tasks. |
Requires slightly more memory compared to some existing methods due to storing task-specific modulators.
Not directly applicable to models trained from scratch as it relies on the pretrain-finetune paradigm. |
model merging, multi-task learning, parameter efficiency, tuning-free, vision and language models |
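The elect, mask, and rescale steps of EMR-Merging described in the record above can be sketched on flat lists of numbers. This is a minimal sketch under stated assumptions, not the authors' implementation: task vectors are taken to be fine-tuned weights minus pre-trained weights, the elected sign is the sign of the element-wise sum, the elected magnitude is the largest absolute value among entries agreeing with that sign, and the rescaler matches the total absolute magnitude of each task vector. The function names are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def emr_merge(task_vectors):
    """Return (unified vector, per-task masks, per-task rescalers)."""
    n_params = len(task_vectors[0])

    # Elect: per parameter, pick the dominant sign and the largest
    # magnitude among the task vectors that agree with that sign.
    unified = []
    for j in range(n_params):
        column = [tv[j] for tv in task_vectors]
        direction = sign(sum(column))
        agreeing = [abs(v) for v in column if direction != 0 and sign(v) == direction]
        unified.append(direction * max(agreeing) if agreeing else 0.0)

    # Mask: keep only entries whose sign matches the task's own vector.
    masks = [
        [1 if tv[j] * unified[j] > 0 else 0 for j in range(n_params)]
        for tv in task_vectors
    ]

    # Rescale: match the total absolute magnitude of each task vector.
    rescalers = []
    for tv, mask in zip(task_vectors, masks):
        numerator = sum(abs(v) for v in tv)
        denominator = sum(abs(m * u) for m, u in zip(mask, unified))
        rescalers.append(numerator / denominator if denominator > 0 else 1.0)
    return unified, masks, rescalers


def reconstruct(unified, mask, rescaler):
    """Approximate one task vector from the shared vector and its modulators."""
    return [rescaler * m * u for m, u in zip(mask, unified)]


task_vectors = [[1.0, -2.0, 0.5], [3.0, 1.0, -0.5]]
unified, masks, rescalers = emr_merge(task_vectors)
task0_approx = reconstruct(unified, masks[0], rescalers[0])
```

Only the unified vector is stored at full size; each task adds a 1-bit mask per parameter and a single scalar, which is what makes the modulators lightweight.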
2405.17430
Report |
Matryoshka Multimodal Models |
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee |
Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in
visual-linguistic reasoning. These models first embed images into a fixed large
number of visual tokens and then feed them into a Large Language Model (LLM).
However, this design causes an excessive number of tokens for dense visual
scenarios such as high-resolution images and videos, leading to great
inefficiency. While token pruning/merging methods do exist, they produce a
single length output for each image and do not afford flexibility in trading
off information density vs. efficiency. Inspired by the concept of Matryoshka
Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent
visual content as nested sets of visual tokens that capture information across
multiple coarse-to-fine granularities. Our approach offers several unique
benefits for LMMs: (1) One can explicitly control the visual granularity per
test instance during inference, e.g., adjusting the number of tokens used to
represent an image based on the anticipated complexity or simplicity of the
content; (2) M3 provides a framework for analyzing the granularity needed for
existing datasets, where we find that COCO-style benchmarks only need around 9
visual tokens to obtain accuracy similar to that of using all 576 tokens; (3)
Our approach provides a foundation to explore the best trade-off between
performance and visual token length at sample level, where our investigation
reveals that a large gap exists between the oracle upper bound and current
fixed-scale representations. |
This paper presents M3: Matryoshka Multimodal Models, a novel approach that enhances the efficiency and adaptability of Large Multimodal Models (LMMs) by representing visual content as nested sets of tokens with varying granularities. |
Current LMMs often struggle with the computational demands of high-resolution images and videos due to their reliance on a fixed and large number of visual tokens. |
M3 leverages a Matryoshka doll-like structure to encode visual information at multiple levels of detail, enabling flexible control over the number of visual tokens used during inference based on factors like content complexity and efficiency constraints. This is achieved by training the LMM to predict the next token in the text sequence based on a hierarchy of visual token sets derived from CLIP visual features, where coarser token sets are subsets of finer ones. |
M3 maintains or improves upon the performance of baseline LMMs while using significantly fewer tokens, especially in scenarios involving dense visual information like document understanding.
Analysis of M3's performance across different visual token scales reveals biases in existing vision-language datasets, suggesting that many benchmarks can achieve comparable results with far fewer tokens than currently used.
A significant gap exists between the oracle upper bound (i.e., the best possible performance achievable with the fewest tokens) and the model's actual performance at specific scales, highlighting the potential for further optimization. |
The paper lacks an effective visual token predictor that could dynamically select the optimal token scale for each input, bridging the gap between oracle performance and current results.
The study primarily focuses on image and video understanding tasks, leaving exploration of its applicability to other domains like 3D understanding or audio-visual tasks for future work. |
large multimodal models, token reduction, adaptive representation learning, vision-language reasoning, efficiency optimization |
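The nested, coarse-to-fine token sets in the M3 record above can be illustrated with a small pooling example. The sketch assumes the 576 visual tokens form a 24 x 24 grid and that each coarser scale is obtained by 2 x 2 average pooling; the pooling operator and the function names are assumptions made for illustration, and the record itself only states that the sets are nested.

```python
def average_pool_2x2(grid):
    """Halve a square grid of token vectors by averaging each 2x2 block."""
    size = len(grid)
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, size, 2):
        row = []
        for c in range(0, size, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([sum(tok[d] for tok in block) / 4.0 for d in range(dim)])
        pooled.append(row)
    return pooled


def nested_token_sets(grid):
    """Return grids from finest to coarsest, pooling while the side is even."""
    scales = [grid]
    while len(scales[-1]) > 1 and len(scales[-1]) % 2 == 0:
        scales.append(average_pool_2x2(scales[-1]))
    return scales


# Toy 24x24 grid of 2-dimensional "tokens" (row index, column index).
fine_grid = [[[float(r), float(c)] for c in range(24)] for r in range(24)]
scales = nested_token_sets(fine_grid)
token_counts = [len(g) ** 2 for g in scales]
```

Under these assumptions the token counts are 576, 144, 36, and 9, so the coarsest scale corresponds to the roughly 9 tokens that the abstract reports as sufficient for COCO-style benchmarks.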
2405.17429
Report |
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction |
Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu |
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and
semantics of the surrounding scene and is an important task for the robustness
of vision-centric autonomous driving. Most existing methods employ dense grids
such as voxels as scene representations, which ignore the sparsity of occupancy
and the diversity of object scales and thus lead to unbalanced allocation of
resources. To address this, we propose an object-centric representation to
describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian
represents a flexible region of interest and its semantic features. We
aggregate information from images through the attention mechanism and
iteratively refine the properties of 3D Gaussians including position,
covariance, and semantics. We then propose an efficient Gaussian-to-voxel
splatting method to generate 3D occupancy predictions, which only aggregates
the neighboring Gaussians for a certain position. We conduct extensive
experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental
results demonstrate that GaussianFormer achieves comparable performance with
state-of-the-art methods with only 17.8% - 24.8% of their memory consumption.
Code is available at: https://github.com/huang-yh/GaussianFormer. |
This paper proposes GaussianFormer, a novel approach for 3D semantic occupancy prediction that leverages an object-centric representation based on 3D semantic Gaussians. |
Existing voxel and BEV-based methods for 3D occupancy prediction suffer from redundancy due to their grid-based nature, leading to inefficient resource allocation. GaussianFormer addresses this by using sparse 3D Gaussians to flexibly represent regions of interest, improving efficiency and capturing fine-grained details. |
GaussianFormer employs a transformer architecture with self-encoding, image cross-attention, and refinement modules to iteratively learn meaningful 3D Gaussians from multi-view images. An efficient Gaussian-to-voxel splatting module then generates dense 3D occupancy predictions. |
GaussianFormer achieves comparable performance to state-of-the-art methods on nuScenes and KITTI-360 datasets for multi-view and monocular 3D semantic occupancy prediction.
GaussianFormer demonstrates superior efficiency compared to existing methods, reducing memory consumption by 75.2% - 82.2% while maintaining competitive latency.
The ablation study validates the effectiveness of individual components in GaussianFormer, including the refinement strategy, sparse convolution, and deep supervision. |
The performance of GaussianFormer, although comparable, is slightly lower than some state-of-the-art methods, suggesting room for improvement in representation accuracy or hyperparameter tuning.
GaussianFormer requires a large number of Gaussians for satisfactory performance, which could be further optimized by exploring alternative strategies to represent empty space. |
3d occupancy prediction, 3d gaussian splatting, autonomous driving, object-centric representation, vision-based perception |
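The Gaussian-to-voxel splatting step in the GaussianFormer record above aggregates only nearby Gaussians for each query position. The sketch below shows that idea under simplifying assumptions: covariances are axis-aligned (a per-axis scale instead of a full covariance matrix), neighbors are selected with a fixed radius, and contributions are summed. The dictionary layout and function names are hypothetical.

```python
import math


def gaussian_weight(point, mean, scale):
    """Unnormalized density of an axis-aligned 3D Gaussian at `point`."""
    squared = sum(((p - m) / s) ** 2 for p, m, s in zip(point, mean, scale))
    return math.exp(-0.5 * squared)


def splat_to_voxel(point, gaussians, radius=2.0):
    """Sum the semantic vectors of Gaussians centered within `radius` of `point`."""
    n_classes = len(gaussians[0]["semantics"])
    out = [0.0] * n_classes
    for g in gaussians:
        # Skip distant Gaussians so each voxel only touches its neighbors.
        if math.dist(point, g["mean"]) > radius:
            continue
        w = gaussian_weight(point, g["mean"], g["scale"])
        for k in range(n_classes):
            out[k] += w * g["semantics"][k]
    return out


gaussians = [
    {"mean": (0.0, 0.0, 0.0), "scale": (1.0, 1.0, 1.0), "semantics": [1.0, 0.0]},
    {"mean": (10.0, 0.0, 0.0), "scale": (1.0, 1.0, 1.0), "semantics": [0.0, 1.0]},
]
at_center = splat_to_voxel((0.0, 0.0, 0.0), gaussians)
off_center = splat_to_voxel((1.0, 0.0, 0.0), gaussians)
```

Restricting the sum to neighbors is what keeps the cost proportional to the number of occupied regions rather than to the full voxel grid.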
2405.17421
Report |
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds |
Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, Kostas Daniilidis |
We introduce 4D Motion Scaffolds (MoSca), a neural information processing
system designed to reconstruct and synthesize novel views of dynamic scenes
from monocular videos captured casually in the wild. To address such a
challenging and ill-posed inverse problem, we leverage prior knowledge from
foundational vision models and lift the video data to a novel Motion Scaffold
(MoSca) representation, which compactly and smoothly encodes the underlying
motions / deformations. The scene geometry and appearance are then disentangled
from the deformation field, and are encoded by globally fusing the Gaussians
anchored onto the MoSca and optimized via Gaussian Splatting. Additionally,
camera poses can be seamlessly initialized and refined during the dynamic
rendering process, without the need for other pose estimation tools.
Experiments demonstrate state-of-the-art performance on dynamic rendering
benchmarks. |
Introduces 4D Motion Scaffolds (MoSca), a system for reconstructing and synthesizing novel views of dynamic scenes from casual monocular videos. |
Addresses the challenging and ill-posed inverse problem of reconstructing dynamic scenes from limited information in casual videos. |
Leverages pretrained vision models for initial priors, lifts video data to a compact Motion Scaffold representation encoding deformations, disentangles geometry and appearance, and uses Gaussian Splatting for rendering and optimization. |
Achieves state-of-the-art performance on dynamic rendering benchmarks like DyCheck.
Enables global fusion of observations across the entire video, leading to more complete reconstructions.
Offers a COLMAP-free solution for camera pose estimation in dynamic scenes. |
Relies on the accuracy of 2D foundational models like trackers and depth estimators.
Limited to reconstructing visible areas, with future work exploring the use of diffusion models for hallucinating unseen regions. |
novel view synthesis, dynamic scene reconstruction, motion scaffolds, gaussian splatting, foundation models |
2405.17414
Report |
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control |
Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, Gordon Wetzstein |
Research on video generation has recently made tremendous progress, enabling
high-quality videos to be generated from text prompts or images. Adding control
to the video generation process is an important goal moving forward and recent
approaches that condition video generation models on camera trajectories make
strides towards it. Yet, it remains challenging to generate a video of the same
scene from multiple different camera trajectories. Solutions to this
multi-video generation problem could enable large-scale 3D scene generation
with editable camera trajectories, among other applications. We introduce
collaborative video diffusion (CVD) as an important step towards this vision.
The CVD framework includes a novel cross-video synchronization module that
promotes consistency between corresponding frames of the same video rendered
from different camera poses using an epipolar attention mechanism. Trained on
top of a state-of-the-art camera-control module for video generation, CVD
generates multiple videos rendered from different camera trajectories with
significantly better consistency than baselines, as shown in extensive
experiments. Project page: https://collaborativevideodiffusion.github.io/. |
This paper introduces Collaborative Video Diffusion (CVD), a novel method for generating multiple videos of the same scene from different camera trajectories while ensuring consistency in content and motion. |
Existing video generation models struggle to maintain consistency when generating multiple videos of the same scene from different viewpoints. CVD addresses this limitation, paving the way for applications like large-scale 3D scene generation with editable camera trajectories. |
CVD leverages a cross-video synchronization module with epipolar attention to align features across videos. It employs a hybrid training scheme using RealEstate10K (for static scenes and camera poses) and WebVid10M (for dynamic scenes) to overcome the lack of large-scale multi-view dynamic datasets. A collaborative inference algorithm extends the model to generate an arbitrary number of consistent videos. |
CVD outperforms baselines in generating videos with consistent geometry, as demonstrated by quantitative evaluations using SuperGlue for camera pose estimation.
It exhibits superior semantic consistency across videos, as evidenced by CLIP-based metrics for comparing frame content.
CVD maintains high fidelity in generated content, achieving competitive FID and KID scores compared to baselines. |
The performance of CVD is inherently dependent on the capabilities of its base video diffusion models (AnimateDiff and CameraCtrl).
Real-time video synthesis is currently not feasible due to the computational demands of diffusion models. |
video generation, diffusion models, camera control, multi-view consistency, epipolar geometry |
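The epipolar attention in the CVD record above relies on a standard piece of two-view geometry: a pixel in one view constrains its match in the other view to a line. The sketch below computes that line from a fundamental matrix and builds a boolean mask over candidate pixels. How CVD uses such a mask inside its attention layers is not specified here; the threshold, the function names, and the example matrix are assumptions for illustration.

```python
import math


def epipolar_line(F, x):
    """Return l = F x for a 3x3 fundamental matrix and a homogeneous pixel x."""
    return [sum(F[i][j] * x[j] for j in range(3)) for i in range(3)]


def distance_to_line(line, x):
    """Perpendicular distance from homogeneous pixel x to line (a, b, c)."""
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        return math.inf  # Degenerate line: treat every pixel as far away.
    return abs(a * x[0] + b * x[1] + c * x[2]) / norm


def epipolar_mask(F, query_px, key_pxs, threshold=1.0):
    """Mark key pixels lying within `threshold` pixels of the query's epipolar line."""
    line = epipolar_line(F, (query_px[0], query_px[1], 1.0))
    return [distance_to_line(line, (u, v, 1.0)) <= threshold for u, v in key_pxs]


# Fundamental matrix of a purely horizontal camera shift:
# epipolar lines are image rows, so matches share the query's v coordinate.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
mask = epipolar_mask(F, (5.0, 3.0), [(0.0, 3.0), (9.0, 3.5), (2.0, 7.0)])
```

In this example the first two candidates lie on or near row 3 and are kept, while the third is four pixels away from the line and is masked out.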
2405.17405
Report |
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer |
Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, Yebin Liu |
We present a novel approach for generating high-quality, spatio-temporally
coherent human videos from a single image under arbitrary viewpoints. Our
framework combines the strengths of U-Nets for accurate condition injection and
diffusion transformers for capturing global correlations across viewpoints and
time. The core is a cascaded 4D transformer architecture that factorizes
attention across views, time, and spatial dimensions, enabling efficient
modeling of the 4D space. Precise conditioning is achieved by injecting human
identity, camera parameters, and temporal signals into the respective
transformers. To train this model, we curate a multi-dimensional dataset
spanning images, videos, multi-view data and 3D/4D scans, along with a
multi-dimensional training strategy. Our approach overcomes the limitations of
previous methods based on GAN or UNet-based diffusion models, which struggle
with complex motions and viewpoint changes. Through extensive experiments, we
demonstrate our method's ability to synthesize realistic, coherent and
free-view human videos, paving the way for advanced multimedia applications in
areas such as virtual reality and animation. Our project website is
https://human4dit.github.io. |
This paper introduces Human4DiT, a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints using a 4D diffusion transformer. |
Generating realistic human videos is crucial for various multimedia applications, including virtual reality, animation, gaming, and movie production. Existing methods struggle with complex motions, viewpoint changes, and spatio-temporal consistency. |
The framework combines U-Nets for accurate condition injection and a cascaded 4D diffusion transformer for capturing global correlations across viewpoints and time. It utilizes a multi-dimensional dataset and training strategy, along with a spatio-temporally consistent diffusion sampling method during inference. |
Human4DiT outperforms state-of-the-art methods in generating monocular, multi-view, 3D static, and free-view human videos.
The 4D diffusion transformer effectively captures spatio-temporal correlations, resulting in more natural dynamic effects and fewer artifacts.
The method demonstrates the ability to generate coherent free-viewpoint videos with varying camera trajectories. |
Lack of an explicit 4D representation leads to subtle artifacts in free-view videos.
The current implementation struggles with generating small structures like fingers and accessories. |
human video generation, diffusion models, diffusion transformers, view synthesis, 4d content generation |
2405.17401
Report |
RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control |
Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu |
We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play
solution for training-free personalization of diffusion models. Existing
training-free approaches exhibit difficulties in (a) style extraction from
reference images in the absence of additional style or content text
descriptions, (b) unwanted content leakage from reference style images, and (c)
effective composition of style and content. RB-Modulation is built on a novel
stochastic optimal controller where a style descriptor encodes the desired
attributes through a terminal cost. The resulting drift not only overcomes the
difficulties above, but also ensures high fidelity to the reference style and
adheres to the given text prompt. We also introduce a cross-attention-based
feature aggregation scheme that allows RB-Modulation to decouple content and
style from the reference image. With theoretical justification and empirical
evidence, our framework demonstrates precise extraction and control of content
and style in a training-free manner. Further, our method allows a seamless
composition of content and style, which marks a departure from the dependency
on external adapters or ControlNets. |
This paper proposes Reference-Based Modulation (RB-Modulation), a plug-and-play method for training-free personalization of diffusion models, enabling stylization and content-style composition using a single reference image. |
Current training-free methods struggle with style extraction, content leakage from reference images, and effective composition. RB-Modulation addresses these limitations by modulating the drift field in diffusion models using a novel stochastic optimal control framework. |
The method leverages a stochastic optimal controller that incorporates a style descriptor in its terminal cost to guide the reverse diffusion process. It also introduces an Attention Feature Aggregation (AFA) module to disentangle content and style within cross-attention layers, ensuring prompt alignment and high fidelity to the reference image. |
RB-Modulation successfully performs stylization and content-style composition using only a single reference image, outperforming state-of-the-art training-free methods.
Human evaluation confirms superior performance in style alignment, prompt alignment, and overall quality compared to alternatives.
Theoretical analysis connects optimal control and reverse diffusion dynamics, providing insights into the method's effectiveness. |
The method's performance might be limited by the quality of the chosen style descriptor and the pre-trained diffusion model.
Future work can explore alternative style descriptors and apply the framework to various diffusion models with diverse datasets. |
diffusion models, image stylization, content-style composition, stochastic optimal control, training-free personalization |
2405.17398
Report |
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability |
Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, Hongyang Li |
World models can foresee the outcomes of different actions, which is of
paramount importance for autonomous driving. Nevertheless, existing driving
world models still have limitations in generalization to unseen environments,
prediction fidelity of critical details, and action controllability for
flexible application. In this paper, we present Vista, a generalizable driving
world model with high fidelity and versatile controllability. Based on a
systematic diagnosis of existing methods, we introduce several key ingredients
to address these limitations. To accurately predict real-world dynamics at high
resolution, we propose two novel losses to promote the learning of moving
instances and structural information. We also devise an effective latent
replacement approach to inject historical frames as priors for coherent
long-horizon rollouts. For action controllability, we incorporate a versatile
set of controls from high-level intentions (command, goal point) to low-level
maneuvers (trajectory, angle, and speed) through an efficient learning
strategy. After large-scale training, the capabilities of Vista can seamlessly
generalize to different scenarios. Extensive experiments on multiple datasets
show that Vista outperforms the most advanced general-purpose video generator
in over 70% of comparisons and surpasses the best-performing driving world
model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize
the capacity of Vista itself to establish a generalizable reward for real-world
action evaluation without accessing the ground truth actions. |
This paper presents Vista, a generalizable driving world model that predicts realistic and continuous futures at high spatiotemporal resolution, featuring versatile action controllability across unseen scenarios and serving as a reward function for action evaluation. |
Existing driving world models lack sufficient generalization to unseen environments, struggle to predict critical details at high fidelity, and often support limited action control modalities, hindering their practical application in autonomous driving. |
The model leverages a latent replacement approach to inject dynamic priors, promoting coherent future prediction. Two novel losses, a dynamics enhancement loss and a structure preservation loss, enhance prediction fidelity. Versatile action controllability is achieved through a unified conditioning interface and efficient learning strategy using both labeled and unlabeled driving datasets. |
Vista outperforms state-of-the-art driving world models on nuScenes by a significant margin in FID and FVD scores.
Human evaluation across diverse datasets confirms its superior visual quality and motion rationality compared to general-purpose video generators.
The model demonstrates potential as a generalizable reward function, effectively evaluating actions based on prediction uncertainty. |
The model's computational efficiency needs improvement for real-world deployment.
Further work is needed to maintain prediction quality in long-horizon rollouts and during drastic view shifts. |
world models, autonomous driving, video generation, action controllability, reward function |
2405.17393
Report |
EASI-Tex: Edge-Aware Mesh Texturing from Single Image |
Sai Raj Kishore Perla, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang |
We present a novel approach for single-image mesh texturing, which employs a
diffusion model with judicious conditioning to seamlessly transfer an object's
texture from a single RGB image to a given 3D mesh object. We do not assume
that the two objects belong to the same category, and even if they do, there
can be significant discrepancies in their geometry and part proportions. Our
method aims to rectify the discrepancies by conditioning a pre-trained Stable
Diffusion generator with edges describing the mesh through ControlNet, and
features extracted from the input image using IP-Adapter to generate textures
that respect the underlying geometry of the mesh and the input texture without
any optimization or training. We also introduce Image Inversion, a novel
technique to quickly personalize the diffusion model for a single concept using
a single image, for cases where the pre-trained IP-Adapter falls short in
capturing all the details from the input image faithfully. Experimental results
demonstrate the efficiency and effectiveness of our edge-aware single-image
mesh texturing approach, coined EASI-Tex, in preserving the details of the
input texture on diverse 3D objects, while respecting their geometry. |
EASI-Tex is a novel, efficient, optimization-free approach for transferring textures from a single RGB image to a 3D mesh, respecting both the input texture and the mesh's geometry. |
Existing methods struggle to accurately transfer textures from a single image while preserving the 3D model's geometric details and semantic identity. |
The method leverages a pre-trained Stable Diffusion model with ControlNet for edge conditioning from the mesh and IP-Adapter for conditioning on features extracted from the input texture image. It also introduces "Image Inversion" to personalize the diffusion model for complex textures using a single image. |
EASI-Tex demonstrates superior preservation of input texture details and better respects the 3D mesh's geometry compared to baselines.
It offers control over the degree of texture transfer using a tunable parameter.
The method is significantly faster than optimization-based alternatives and doesn't require per-texture fine-tuning like existing personalization-based methods. |
The input resolution of the CLIP image encoder in IP-Adapter limits the capture of fine texture details.
Texture seams may appear due to the iterative texture pasting strategy in the employed mesh texturing technique. |
3d mesh texturing, texture transfer, diffusion models, single image, edge-aware |
2405.17351
Report |
DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Refocusing, Defocus Rendering and Blur Removal |
Yujie Wang, Praneeth Chakravarthula, Baoquan Chen |
3D Gaussian Splatting-based techniques have recently advanced 3D scene
reconstruction and novel view synthesis, achieving high-quality real-time
rendering. However, these approaches are inherently limited by the underlying
pinhole camera assumption in modeling the images and hence only work for
All-in-Focus (AiF) sharp image inputs. This severely affects their
applicability in real-world scenarios where images often exhibit defocus blur
due to the limited depth-of-field (DOF) of imaging devices. Additionally,
existing 3D Gaussian Splatting (3DGS) methods also do not support rendering of
DOF effects.
To address these challenges, we introduce DOF-GS that allows for rendering
adjustable DOF effects, removing defocus blur as well as refocusing of 3D
scenes, all from multi-view images degraded by defocus blur. To this end, we
re-imagine the traditional Gaussian Splatting pipeline by employing a finite
aperture camera model coupled with explicit, differentiable defocus rendering
guided by the Circle-of-Confusion (CoC). The proposed framework provides for
dynamic adjustment of DOF effects by changing the aperture and focal distance
of the underlying camera model on-demand. It also enables rendering varying DOF
effects of 3D scenes post-optimization, and generating AiF images from
defocused training images. Furthermore, we devise a joint optimization strategy
to further enhance details in the reconstructed scenes by jointly optimizing
rendered defocused and AiF images. Our experimental results indicate that
DOF-GS produces high-quality sharp all-in-focus renderings conditioned on
inputs compromised by defocus blur, with the training process incurring only a
modest increase in GPU memory consumption. We further demonstrate the
applications of the proposed method for adjustable defocus rendering and
refocusing of the 3D scene from input images degraded by defocus blur. |
DOF-GS, a novel 3D Gaussian Splatting framework that handles defocus blur in input images and enables adjustable depth-of-field (DOF) effects in rendered images. |
Existing 3DGS methods are limited by the pinhole camera model and require all-in-focus inputs, hindering their applicability to real-world blurry images and DOF rendering. |
DOF-GS employs a finite aperture camera model, CoC-guided DOF rendering, learnable camera parameters (aperture, focal distance) per view, and a joint optimization strategy leveraging an In-Focus Localization Network (ILN). |
DOF-GS successfully reconstructs scenes from blurry multi-view images, outperforming existing methods in synthesizing high-quality novel views.
The method allows for adjustable DOF effects by manipulating aperture and focal distance parameters during rendering.
DOF-GS demonstrates superior GPU memory efficiency compared to methods relying on neural modules for blur simulation. |
Current implementation relies on pre-estimated camera poses, which can be inaccurate due to blur in inputs.
Future work will explore joint optimization of camera poses to further enhance reconstruction quality. |
3d gaussian splatting, depth-of-field, defocus blur, novel view synthesis, refocusing |
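As a rough illustration of the Circle-of-Confusion (CoC) quantity that guides the defocus rendering in the DOF-GS record above, the sketch below evaluates the standard thin-lens CoC diameter. It uses the textbook formula, which is not necessarily the exact parameterization used by DOF-GS; the function name and argument names are hypothetical.

```python
def coc_diameter(depth, focus_distance, aperture_diameter, focal_length):
    """Thin-lens circle-of-confusion diameter on the sensor.

    All arguments share one length unit (e.g. metres). This is the textbook
    formula, assumed here for illustration only.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if focus_distance <= focal_length:
        raise ValueError("focus_distance must exceed focal_length")
    return (
        aperture_diameter * focal_length * abs(depth - focus_distance)
        / (depth * (focus_distance - focal_length))
    )


# A point on the focal plane is perfectly sharp; a zero aperture (pinhole) never blurs.
in_focus = coc_diameter(2.0, 2.0, 0.01, 0.05)
pinhole = coc_diameter(4.0, 2.0, 0.0, 0.05)
defocused = coc_diameter(4.0, 2.0, 0.01, 0.05)
```

Changing `aperture_diameter` or `focus_distance` after optimization changes the blur size per point, which is the kind of adjustable DOF control the record describes.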
2405.17306
Report |
Controllable Longer Image Animation with Diffusion Models |
Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu |
Generating realistic animated videos from static images is an important area
of research in computer vision. Methods based on physical simulation and motion
prediction have achieved notable advances, but they are often limited to
specific object textures and motion trajectories, failing to exhibit highly
complex environments and physical dynamics. In this paper, we introduce an
open-domain controllable image animation method using motion priors with video
diffusion models. Our method achieves precise control over the direction and
speed of motion in the movable region by extracting the motion field
information from videos and learning moving trajectories and strengths. Current
pretrained video generation models are limited to producing very
short videos, typically fewer than 30 frames. In contrast, we propose an
efficient long-duration video generation method based on noise rescheduling
specifically tailored for image animation tasks, facilitating the creation of
videos over 100 frames in length while maintaining consistency in content
scenery and motion coordination. Specifically, we decompose the denoising process
into two distinct phases: the shaping of scene contours and the refining of
motion details. Then we reschedule the noise to control the generated frame
sequences maintaining long-distance noise correlation. We conducted extensive
experiments with 10 baselines, encompassing both commercial tools and academic
methodologies, which demonstrate the superiority of our method. Our project
page: https://wangqiang9.github.io/Controllable.github.io/ |
This paper proposes a novel method for generating controllable and longer image animations using diffusion models, leveraging motion priors derived from optical flow fields to guide the animation process. |
Existing image animation methods often struggle with precise motion control, especially in open-domain settings, and generating longer videos with consistent content and motion. |
The proposed method extracts motion fields from training videos and utilizes them as conditional constraints for diffusion models. It employs a refinement model to enhance user-provided sparse trajectories and incorporates global motion strength guidance. Additionally, it introduces a phased inference strategy and shared noise rescheduling for generating longer videos with better consistency. |
The method achieves superior quantitative results compared to several open-source methods and commercial tools, demonstrating its effectiveness in generating high-quality animations.
It allows precise control over the direction, speed, and strength of object motion, enabling realistic and user-intended animations.
The proposed longer video generation method effectively maintains temporal consistency and visual coherence, outperforming existing techniques. |
The current reliance on optical flow for motion description limits the capacity for content constraints.
Future work will explore more flexible multi-condition controls, such as incorporating sketch or depth information. |
image-to-video, diffusion models, controllable generation, image animation, motion priors |
2405.17258
Report |
Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning |
Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky |
Low-rank adapters (LoRA) and their variants are popular parameter-efficient
fine-tuning (PEFT) techniques that closely match full model fine-tune
performance while requiring only a small number of additional parameters. These
additional LoRA parameters are specific to the base model being adapted. When
the base model needs to be deprecated and replaced with a new one, all the
associated LoRA modules need to be re-trained. Such re-training requires access
to the data used to train the LoRA for the original base model. This is
especially problematic for commercial cloud applications where the LoRA modules
and the base models are hosted by service providers who may not be allowed to
host proprietary client task data. To address this challenge, we propose
Trans-LoRA, a novel method for lossless, nearly data-free transfer
of LoRAs across base models. Our approach relies on synthetic data to transfer
LoRA modules. Using large language models, we design a synthetic data generator
to approximate the data-generating process of the observed task data
subset. Training on the resulting synthetic dataset transfers LoRA modules to
new models. We show the effectiveness of our approach using both Llama and
Gemma model families. Our approach achieves lossless (mostly improved) LoRA
transfer between models within and across different base model families, and
even between different PEFT methods, on a wide variety of tasks. |
This paper proposes Trans-LoRA, a novel approach for lossless and data-efficient transfer of LoRA modules across different base language models, addressing the challenge of model deprecation in cloud applications. |
LoRA modules are tied to specific base models, requiring retraining when base models are updated. This is problematic in cloud settings where client data used for LoRA training is often confidential and inaccessible. |
Trans-LoRA uses a synthetic data generator (guided by a few seed examples) and a discriminator trained on real and synthetic data to create a distillation curriculum for transferring LoRA parameters to new base models. |
Lossless LoRA transfer is achieved, with transferred LoRAs matching or exceeding source LoRA performance on various tasks and across different LLM families (Llama, Gemma).
The method demonstrates positive transfer, often outperforming both the source LoRA and the target base model by combining knowledge from both.
Transfer is effective across different PEFT methods (LoRA, DoRA, Prompt Tuning) and remains robust in continuous transfer scenarios (simulating multiple model updates). |
The approach requires an initial synthetic data generation step, introducing additional computation.
In rare cases, insufficient task understanding by the synthesizer may lead to suboptimal transfer, requiring adjustments in seed sample size. |
lora, peft, transfer learning, knowledge distillation, synthetic data |
2405.17251
Report |
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping |
Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji |
Generating novel views from a single image remains a challenging task due to
the complexity of 3D scenes and the limited diversity in the existing
multi-view datasets to train a model on. Recent research combining large-scale
text-to-image (T2I) models with monocular depth estimation (MDE) has shown
promise in handling in-the-wild images. In these methods, an input view is
geometrically warped to novel views with estimated depth maps, then the warped
image is inpainted by T2I models. However, they struggle with noisy depth maps
and loss of semantic details when warping an input view to novel viewpoints. In
this paper, we propose a novel approach for single-shot novel view synthesis, a
semantic-preserving generative warping framework that enables T2I generative
models to learn where to warp and where to generate, through augmenting
cross-view attention with self-attention. Our approach addresses the
limitations of existing methods by conditioning the generative model on source
view images and incorporating geometric warping signals. Qualitative and
quantitative evaluations demonstrate that our model outperforms existing
methods in both in-domain and out-of-domain scenarios. Project page is
available at https://GenWarp-NVS.github.io/. |
This paper introduces GenWarp, a novel view synthesis framework that learns where to warp and where to generate in images, enabling the creation of high-quality novel views from single images. |
Existing methods for single-shot novel view synthesis struggle with noisy depth maps and loss of semantic details, particularly at large viewpoint changes. GenWarp addresses these limitations by leveraging the generative prior of text-to-image diffusion models and incorporating geometric warping signals. |
GenWarp uses a two-stream architecture consisting of a semantic preserver network and a diffusion model. It integrates monocular depth estimation (MDE) with warped coordinate embeddings and augments self-attention with cross-view attention to guide the generation process. |
GenWarp effectively handles noisy depth maps and preserves semantic details from the input view, outperforming existing methods in terms of FID and PSNR.
It demonstrates strong generalization capability, effectively synthesizing novel views for in-the-wild images including AI-generated images.
The model exhibits robustness to varying camera viewpoints and scene types. |
GenWarp may struggle with generating novel views from extremely distant viewpoints where depth-based correspondence is not effective.
The performance of the model is influenced by the quality of multi-view datasets used for fine-tuning. |
novel view synthesis, generative models, diffusion models, single-shot, semantic preservation |
2405.17187
Report |
Memorize What Matters: Emergent Scene Decomposition from Multitraverse |
Yiming Li, Zehong Wang, Yue Wang, Zhiding Yu, Zan Gojcic, Marco Pavone, Chen Feng, Jose M. Alvarez |
Humans naturally retain memories of permanent elements, while ephemeral
moments often slip through the cracks of memory. This selective retention is
crucial for robotic perception, localization, and mapping. To endow robots with
this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised,
camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM
converts multitraverse RGB videos from the same region into a Gaussian-based
environmental map while concurrently performing 2D ephemeral object
segmentation. Our key observation is that the environment remains consistent
across traversals, while objects frequently change. This allows us to exploit
self-supervision from repeated traversals to achieve environment-object
decomposition. More specifically, 3DGM formulates multitraverse environmental
mapping as a robust differentiable rendering problem, treating pixels of the
environment and objects as inliers and outliers, respectively. Using robust
feature distillation, feature residuals mining, and robust optimization, 3DGM
jointly performs 2D segmentation and 3D mapping without human intervention. We
build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets,
to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and
neural rendering. Extensive results verify the effectiveness and potential of
our method for self-driving and robotics. |
Presents 3D Gaussian Mapping (3DGM), a self-supervised and camera-only framework for simultaneous 3D environment mapping and 2D unsupervised object segmentation from multi-traversal driving data. |
Addresses limitations of existing 3D mapping methods that rely on pre-trained segmentation models or LiDAR by exploiting the consistency of environments and transience of objects across multiple traversals. |
Utilizes Structure from Motion for initialization and leverages a robust differentiable rendering pipeline with feature distillation and residuals mining to jointly optimize 3D environmental Gaussians and 2D ephemerality masks. |
Achieves comparable unsupervised 2D segmentation performance to supervised methods, outperforming state-of-the-art unsupervised techniques by a significant margin.
Demonstrates accurate 3D environment reconstruction from camera-only input, achieving a lower Chamfer Distance compared to a LiDAR-based baseline.
Shows promising results in novel view synthesis, effectively rendering environments while excluding transient objects and their shadows. |
Faces challenges in handling large environmental variations like nighttime and seasonal changes.
Segmentation can be affected by motion blur, appearance shifts, and difficulties in segmenting shadows and reflective surfaces. |
3d mapping, self-supervised learning, unsupervised segmentation, gaussian splatting, autonomous driving |
2405.17176
Report |
DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models |
Yuqing Zhang, Yuan Liu, Zhiyu Xie, Lei Yang, Zhongyuan Liu, Mengzhou Yang, Runze Zhang, Qilong Kou, Cheng Lin, Wenping Wang, Xiaogang Jin |
Existing methods create RGB textures for untextured meshes by distilling a
2D diffusion model, which often contains unwanted baked-in shading effects
and results in unrealistic rendering effects in the downstream applications.
Generating Physically Based Rendering (PBR) materials instead of just RGB
textures would be a promising solution. However, directly distilling the PBR
material parameters from 2D diffusion models still suffers from incorrect
material decomposition, such as baked-in shading effects in albedo. We
introduce DreamMat, an innovative approach to resolve the aforementioned
problem, to generate high-quality PBR materials from text descriptions. We find
that the main reason for the incorrect material distillation is that
large-scale 2D diffusion models are only trained to generate final shading
colors, resulting in insufficient constraints on material decomposition during
distillation. To tackle this problem, we first finetune a new light-aware 2D
diffusion model to condition on a given lighting environment and generate the
shading results on this specific lighting condition. Then, by applying the same
environment lights in the material distillation, DreamMat can generate
high-quality PBR materials that are not only consistent with the given geometry
but also free from any baked-in shading effects in albedo. Extensive
experiments demonstrate that the materials produced through our methods exhibit
greater visual appeal to users and achieve significantly superior rendering
quality compared to baseline methods, which are preferable for downstream tasks
such as game and film production. |
DreamMat: A novel method for generating high-quality, text-guided PBR materials on untextured 3D meshes. |
Existing text-to-3D appearance generation methods often produce unrealistic results due to baked-in shading effects in generated textures, limiting their use in rendering pipelines. |
DreamMat distills a geometry- and light-aware diffusion model, leveraging a hash-grid-based material representation and a classifier score distillation (CSD) loss. This approach ensures consistency with input geometry, text prompts, and lighting conditions. |
Generates high-quality albedo, roughness, and metallic maps disentangled from lighting.
Exhibits superior visual fidelity and text alignment compared to baseline methods.
Produces materials compatible with modern graphics engines, enabling realistic renderings under diverse lighting. |
Limited support for complex materials like transparent or highly reflective surfaces due to the simplified BRDF model.
Relatively long distillation time (around 20 minutes) hindering interactive applications. |
text-guided synthesis, 3d material generation, inverse rendering, diffusion models, pbr materials |
2405.17158
Report |
PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution |
Yong Liu, Hang Dong, Jinshan Pan, Qingji Dong, Kai Chen, Rongxiang Zhang, Xing Mei, Lean Fu, Fei Wang |
Diffusion models significantly improve the quality of super-resolved images
with their impressive content generation capabilities. However, the huge
computational costs limit the applications of these methods. Recent efforts have
explored reasonable inference acceleration to reduce the number of sampling
steps, but the computational cost remains high as each step is performed on the
entire image. This paper introduces PatchScaler, a patch-independent
diffusion-based single image super-resolution (SR) method, designed to enhance
the efficiency of the inference process. The proposed method is motivated by the
observation that not all the image patches within an image need the same
sampling steps for reconstructing high-resolution images. Based on this
observation, we thus develop a Patch-adaptive Group Sampling (PGS) to divide
feature patches into different groups according to the patch-level
reconstruction difficulty and dynamically assign an appropriate sampling
configuration for each group so that the inference speed can be better
accelerated. In addition, to improve the denoising ability at each step of the
sampling, we develop a texture prompt to guide the estimations of the diffusion
model by retrieving high-quality texture priors from a patch-independent
reference texture memory. Experiments show that our PatchScaler achieves
favorable performance in both quantitative and qualitative evaluations with
fast inference speed. Our code and model are available at
https://github.com/yongliuy/PatchScaler. |
This paper introduces PatchScaler, a patch-independent diffusion-based single image super-resolution method designed for efficient inference. It employs patch-adaptive group sampling to tailor sampling configurations to individual patches based on their reconstruction difficulty. |
Diffusion models excel at super-resolution but suffer from high computational costs due to numerous sampling steps applied uniformly to the entire image, even if some patches require fewer steps. |
The method uses a global restoration module to generate a coarse HR image and a confidence map. Patches are grouped by difficulty, and a patch-adaptive group sampling strategy determines an optimal starting point for reverse denoising, reducing steps. A texture prompt enhances detail reconstruction by retrieving similar texture priors. |
PatchScaler achieves faster inference speeds compared to other diffusion-based SR methods, particularly for high-resolution images.
It outperforms state-of-the-art SR methods on perceptual quality metrics like ManIQA, CLIPIQA, and MUSIQ.
The proposed texture prompt proves more effective than traditional text prompts for SISR due to better alignment with image content. |
The model's performance might be limited by training from scratch and the inherent degradation of diffusion models at lower resolutions.
Future work includes exploring the application of PatchScaler to other low-level vision tasks like video super-resolution, image deblurring, and HDR. |
super-resolution, diffusion models, patch-based processing, efficient inference, texture synthesis |
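The grouping idea in the PatchScaler record above can be sketched as follows: patches are bucketed by a per-patch confidence score, and each bucket receives its own number of sampling steps. This is only a minimal sketch under stated assumptions. The three group names, the thresholds, and the step counts are hypothetical values chosen for illustration and are not taken from the paper.

```python
def group_patches(confidences, thresholds=(0.8, 0.5)):
    """Bucket patch indices by confidence (higher confidence = easier patch).

    The thresholds are hypothetical; the record only says patches are grouped
    by reconstruction difficulty.
    """
    easy_t, medium_t = thresholds
    groups = {"easy": [], "medium": [], "hard": []}
    for idx, conf in enumerate(confidences):
        if conf >= easy_t:
            groups["easy"].append(idx)
        elif conf >= medium_t:
            groups["medium"].append(idx)
        else:
            groups["hard"].append(idx)
    return groups


def total_steps(groups, steps_per_group):
    """Total number of per-patch denoising steps under a group-wise schedule."""
    return sum(len(idxs) * steps_per_group[name] for name, idxs in groups.items())


confidences = [0.9, 0.6, 0.2, 0.85]                # hypothetical confidence map values
schedule = {"easy": 1, "medium": 5, "hard": 20}    # hypothetical step counts
groups = group_patches(confidences)
adaptive_cost = total_steps(groups, schedule)
uniform_cost = len(confidences) * schedule["hard"]
```

Comparing `adaptive_cost` with `uniform_cost` shows where the saving comes from: only the hard patches pay for the full schedule.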
2405.17083
Report |
F-3DGS: Factorized Coordinates and Representations for 3D Gaussian Splatting |
Xiangyu Sun, Joo Chan Lee, Daniel Rho, Jong Hwan Ko, Usman Ali, Eunbyung Park |
The neural radiance field (NeRF) has made significant strides in representing
3D scenes and synthesizing novel views. Despite its advancements, the high
computational costs of NeRF have posed challenges for its deployment in
resource-constrained environments and real-time applications. As an alternative
to NeRF-like neural rendering methods, 3D Gaussian Splatting (3DGS) offers
rapid rendering speeds while maintaining excellent image quality. However, as
it represents objects and scenes using a myriad of Gaussians, it requires
substantial storage to achieve high-quality representation. To mitigate the
storage overhead, we propose Factorized 3D Gaussian Splatting (F-3DGS), a novel
approach that drastically reduces storage requirements while preserving image
quality. Inspired by classical matrix and tensor factorization techniques, our
method represents and approximates dense clusters of Gaussians with
significantly fewer Gaussians through efficient factorization. We aim to
efficiently represent dense 3D Gaussians by approximating them with a limited
amount of information for each axis and their combinations. This method allows
us to encode a substantially large number of Gaussians along with their
essential attributes -- such as color, scale, and rotation -- necessary for
rendering using a relatively small number of elements. Extensive experimental
results demonstrate that F-3DGS achieves a significant reduction in storage
costs while maintaining comparable quality in rendered images. |
This paper proposes Factorized 3D Gaussian Splatting (F-3DGS), a novel approach that significantly reduces the storage requirements of 3D Gaussian Splatting (3DGS) while preserving comparable image quality. |
3DGS, while offering fast rendering speeds and excellent image quality for 3D scene representation, often necessitates a large number of Gaussians and their attributes, leading to high storage costs and hindering its practicality in resource-constrained environments. |
F-3DGS leverages matrix and tensor factorization techniques, inspired by classical and neural rendering factorization methods. It employs a factorized coordinate scheme and decomposes Gaussian attributes (color, scale, rotation, opacity) to efficiently compress the model size. |
F-3DGS achieves comparable image quality to 3DGS while drastically reducing storage costs, exceeding 90% reduction in some cases.
The method maintains fast rendering speeds, making it suitable for real-time applications.
Evaluations on synthetic-NeRF, Tanks & Temples, and Mip-NeRF 360 datasets demonstrate the effectiveness of F-3DGS. |
The current implementation primarily focuses on optimizing F-3DGS for smaller scenes; further research is needed to enhance its applicability to large, unbounded scenes.
The initialization scheme, while effective, relies on pre-trained 3DGS models; exploring alternative initialization strategies could be beneficial. |
3d gaussian splatting, 3d reconstruction, real-time rendering, tensor factorization, compression |
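To make the storage argument in the F-3DGS record above concrete, the sketch below expands per-axis coordinate lists into a dense grid of positions and reconstructs a scalar attribute from rank-R per-axis factors (a CP-style decomposition). It is an illustrative sketch only: the function names are hypothetical, and the paper's actual factorization scheme may differ in its details.

```python
from itertools import product


def expand_coordinates(xs, ys, zs):
    """Cartesian product of per-axis values: len(xs)*len(ys)*len(zs) positions
    are described by only len(xs)+len(ys)+len(zs) stored numbers."""
    return list(product(xs, ys, zs))


def cp_attribute(fx, fy, fz, i, j, k):
    """Scalar attribute at grid index (i, j, k) from rank-R per-axis factors.

    fx[r][i], fy[r][j], fz[r][k] hold the r-th factor along each axis.
    """
    return sum(fx[r][i] * fy[r][j] * fz[r][k] for r in range(len(fx)))


xs, ys, zs = [0.0, 1.0], [0.0, 1.0, 2.0], [5.0]
positions = expand_coordinates(xs, ys, zs)
stored_scalars = len(xs) + len(ys) + len(zs)   # factorized storage
dense_scalars = 3 * len(positions)             # one (x, y, z) triple per Gaussian

# Rank-2 factors for a toy scalar attribute such as opacity.
fx = [[1.0, 2.0], [0.5, 0.5]]
fy = [[1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
fz = [[3.0], [1.0]]
value = cp_attribute(fx, fy, fz, 1, 2, 0)
```

The gap between `stored_scalars` and `dense_scalars` grows quickly with grid size, which is the intuition behind the reported storage reduction.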
2405.17082
Report |
Ensembling Diffusion Models via Adaptive Feature Aggregation |
Cong Wang, Kuan Tian, Yonghang Guan, Jun Zhang, Zhiwei Jiang, Fei Shen, Xiao Han, Qing Gu, Wei Yang |
The success of the text-guided diffusion model has inspired the development
and release of numerous powerful diffusion models within the open-source
community. These models are typically fine-tuned on various expert datasets,
showcasing diverse denoising capabilities. Leveraging multiple high-quality
models to produce stronger generation ability is valuable, but has not been
extensively studied. Existing methods primarily adopt parameter merging
strategies to produce a new static model. However, they overlook the fact that
the divergent denoising capabilities of the models may dynamically change
across different states, such as when experiencing different prompts, initial
noises, denoising steps, and spatial locations. In this paper, we propose a
novel ensembling method, Adaptive Feature Aggregation (AFA), which dynamically
adjusts the contributions of multiple models at the feature level according to
various states (i.e., prompts, initial noises, denoising steps, and spatial
locations), thereby keeping the advantages of multiple diffusion models, while
suppressing their disadvantages. Specifically, we design a lightweight
Spatial-Aware Block-Wise (SABW) feature aggregator that adaptively aggregates the
block-wise intermediate features from multiple U-Net denoisers into a unified
one. The core idea lies in dynamically producing an individual attention map
for each model's features by comprehensively considering various states. It is
worth noting that only SABW is trainable with about 50 million parameters,
while other models are frozen. Both the quantitative and qualitative
experiments demonstrate the effectiveness of our proposed Adaptive Feature
Aggregation method. The code is available at https://github.com/tenvence/afa/. |
This paper presents Adaptive Feature Aggregation (AFA), a novel ensembling method for text-guided diffusion models that dynamically adjusts contributions from multiple models based on various factors like prompts, noises, and denoising steps. |
Leveraging the diverse strengths of numerous open-source diffusion models, fine-tuned on various datasets, is crucial for achieving better image generation quality and contextual alignment. |
AFA utilizes a lightweight Spatial-Aware Block-Wise (SABW) feature aggregator to dynamically combine intermediate features from multiple U-Net denoisers based on learned spatial attention maps, considering various states like prompts, noises, and denoising steps. |
AFA consistently outperforms individual base models and baseline methods in terms of image quality and context alignment.
AFA exhibits robust performance even with fewer inference steps, leading to comparable computational efficiency to single model inference.
Visualization of attention maps showcases AFA's capability to adaptively leverage different models based on context and timestep. |
AFA's single inference step can be computationally demanding due to running all base models.
Future work includes exploring more efficient aggregator designs and training strategies to further enhance efficiency. |
image generation, diffusion models, model ensembling, text-to-image synthesis, adaptive feature aggregation |
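The feature-level ensembling in the AFA record above can be pictured as a per-location weighted sum of the features from several denoisers. The sketch below assumes the per-model attention maps are normalized with a softmax across models at each spatial location. That normalization, the flat list layout, and the function name are assumptions made for illustration; in the described method the maps are produced by the learned SABW aggregator rather than supplied by hand.

```python
import math


def aggregate_features(features, logits):
    """Blend features[m][p] from model m at location p.

    logits[m][p] are unnormalized attention scores; a softmax over the model
    axis turns them into weights that sum to 1 at every location.
    """
    n_models = len(features)
    n_positions = len(features[0])
    fused = []
    for p in range(n_positions):
        peak = max(logits[m][p] for m in range(n_models))
        weights = [math.exp(logits[m][p] - peak) for m in range(n_models)]
        total = sum(weights)
        fused.append(
            sum(weights[m] / total * features[m][p] for m in range(n_models))
        )
    return fused


# Two models, two spatial locations: equal trust at location 0,
# strong preference for the second model at location 1.
features = [[1.0, 10.0], [3.0, 20.0]]
logits = [[0.0, 0.0], [0.0, 100.0]]
fused = aggregate_features(features, logits)
```

Because the weights vary by location, one model can dominate in some regions while another dominates elsewhere, which a static parameter merge cannot express.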
2405.17069
Report |
Training-free Editioning of Text-to-Image Models |
Jinqi Wang, Yunfei Fu, Zhangcan Ding, Bailin Deng, Yu-Kun Lai, Yipeng Qin |
Inspired by the software industry's practice of offering different editions
or versions of a product tailored to specific user groups or use cases, we
propose a novel task, namely, training-free editioning, for text-to-image
models. Specifically, we aim to create variations of a base text-to-image model
without retraining, enabling the model to cater to the diverse needs of
different user groups or to offer distinct features and functionalities. To
achieve this, we propose that different editions of a given text-to-image model
can be formulated as concept subspaces in the latent space of its text encoder
(e.g., CLIP). In such a concept subspace, all points satisfy a specific user
need (e.g., generating images of a cat lying on the grass/ground/falling
leaves). Technically, we apply Principal Component Analysis (PCA) to obtain the
desired concept subspaces from representative text embeddings that correspond to
a specific user need or requirement. Projecting the text embedding of a given
prompt into these low-dimensional subspaces enables efficient model editioning
without retraining. Intuitively, our proposed editioning paradigm enables a
service provider to customize the base model into its "cat edition" (or other
editions) that restricts image generation to cats, regardless of the user's
prompt (e.g., dogs, people, etc.). This introduces a new dimension for product
differentiation, targeted functionality, and pricing strategies, unlocking
novel business models for text-to-image generators. Extensive experimental
results demonstrate the validity of our approach and its potential to enable a
wide range of customized text-to-image model editions across various domains
and applications. |
This paper introduces "training-free editioning" for text-to-image models, enabling customization without retraining by projecting text embeddings into concept subspaces. |
This approach addresses the challenge of tailoring text-to-image models to specific needs and unlocks new business models for service providers. |
The method leverages PCA on representative text embeddings to create concept subspaces, each corresponding to a specific domain or attribute, and then projects input prompt embeddings into these subspaces. |
Concept subspace projection successfully restricts image generation to the desired concept (e.g., a "cat edition" only generates cat images).
The method maintains high image quality and diversity, comparable to the base model (Stable Diffusion).
Projected embeddings exhibit close proximity to their "replaced" counterparts, indicating successful projection. |
The current work focuses on a basic linguistic template and a limited word list.
Further exploration is needed for complex prompt structures and a wider range of concepts. |
text-to-image synthesis, model editioning, concept subspaces, clip embeddings, pca |
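The PCA-and-project procedure in the editioning record above can be sketched in a few lines of dependency-free Python. The example fits a low-dimensional subspace to a handful of toy "concept" embeddings and projects a new embedding into it. Everything here is an illustrative assumption: the 3-D vectors stand in for real text-encoder embeddings, the function names are hypothetical, the projection is taken around the mean of the concept embeddings, and power iteration with deflation is used only to avoid external dependencies.

```python
import math


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mu):
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        c = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / len(rows)
    return cov


def principal_axes(cov, k, iters=200, eps=1e-12):
    """Up to k leading eigenvectors via power iteration with deflation.

    Assumes the all-ones start vector is not orthogonal to the leading axis.
    """
    d = len(cov)
    cov = [r[:] for r in cov]  # work on a copy
    axes = []
    for _ in range(k):
        v = [1.0] * d
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < eps:      # no variance left to explain
                v = None
                break
            v = [x / norm for x in w]
        if v is None:
            break
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]  # remove the found component
        axes.append(v)
    return axes


def project(embedding, mu, axes):
    """Project an embedding onto the affine subspace mu + span(axes)."""
    c = [x - m for x, m in zip(embedding, mu)]
    out = list(mu)
    for u in axes:
        coeff = sum(ci * ui for ci, ui in zip(c, u))
        out = [o + coeff * ui for o, ui in zip(out, u)]
    return out


# Toy "concept" embeddings that vary along a single direction.
concept = [[-2, -2, 5], [-1, -1, 5], [0, 0, 5], [1, 1, 5], [2, 2, 5]]
mu = mean_vector(concept)
axes = principal_axes(covariance(concept, mu), k=1)
edited = project([3.0, 1.0, 7.0], mu, axes)
```

Any prompt embedding, whatever its content, ends up inside the concept subspace after `project`, which mirrors how an edition restricts generation to one concept without retraining.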
2405.17013
Report |
MotionLLM: Multimodal Motion-Language Learning with Large Language Models |
Qi Wu, Yubo Zhao, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang |
Recent advancements in Multimodal Large Language Models (MM-LLMs) have
demonstrated promising potential in terms of generalization and robustness when
applied to different modalities. While previous works have already achieved 3D
human motion generation using various approaches including language modeling,
they mostly use specialized architectures and
are restricted to single-human motion generation. Inspired by the success of
MM-LLMs, we propose MotionLLM, a simple and general framework that can achieve
single-human, multi-human motion generation, and motion captioning by
fine-tuning pre-trained LLMs. Specifically, we encode and quantize motions into
discrete LLM-understandable tokens, which results in a unified vocabulary
consisting of both motion and text tokens. With only 1-3% of the parameters of the
LLMs trained by using adapters, our single-human motion generation achieves
results comparable to diffusion models and other trained-from-scratch
transformer-based models. Additionally, we show that our approach is scalable
and flexible, allowing easy extension to multi-human motion generation through
autoregressive generation of single-human motions. Project page:
https://knoxzhao.github.io/MotionLLM |
Introduces MotionLLM, a simple and general framework for single/multi-human motion generation and motion captioning by fine-tuning pre-trained LLMs with motion-text unified vocabulary. |
Addresses limitations of previous methods in handling semantically complex text and adapting to different motion-language tasks. |
Encodes motions into discrete tokens using VQ-VAE or RVQ-VAE, combines motion tokens with text tokens to form a unified vocabulary for LLM fine-tuning using adapters. |
Achieves competitive single-human motion generation results compared to diffusion models and other trained-from-scratch models.
Outperforms state-of-the-art methods in motion captioning, generating semantically accurate and contextually appropriate descriptions.
Demonstrates flexibility by extending to multi-human motion generation through autoregressive generation of single-human motions. |
Long inference time due to the autoregressive nature of LLMs.
Limited performance in multi-human motion generation due to data scarcity and complexity of motion language descriptions. |
motion generation, motion captioning, multimodal learning, large language models, motion tokenization |
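The unified vocabulary described in the MotionLLM entry above can be illustrated with a minimal sketch. It assumes motion codes from a VQ codebook are appended after the text vocabulary by a fixed offset. The sizes `TEXT_VOCAB_SIZE` and `MOTION_CODEBOOK_SIZE` and all function names are hypothetical and not taken from the paper, which may organize its vocabulary differently.

```python
# Hypothetical sizes for illustration only; not the values used in the paper.
TEXT_VOCAB_SIZE = 32000
MOTION_CODEBOOK_SIZE = 512


def motion_to_unified(code):
    """Map a motion codebook index to its id in the unified vocabulary."""
    if not 0 <= code < MOTION_CODEBOOK_SIZE:
        raise ValueError(f"motion code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def unified_to_motion(token_id):
    """Return the motion codebook index, or None if the id is a text token."""
    if token_id < TEXT_VOCAB_SIZE:
        return None
    return token_id - TEXT_VOCAB_SIZE


def build_sequence(text_ids, motion_codes):
    """Concatenate text ids and offset motion codes into one token sequence."""
    return list(text_ids) + [motion_to_unified(c) for c in motion_codes]


sequence = build_sequence([1, 2], [0, 511])
```

With this layout a single embedding table of size `TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE` can serve both modalities, which is one plausible reading of "unified vocabulary".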
2405.16947
Report |
Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models |
Qian Wang, Abdelrahman Eldesokey, Mohit Mendiratta, Fangneng Zhan, Adam Kortylewski, Christian Theobalt, Peter Wonka |
We introduce the first zero-shot approach for Video Semantic Segmentation
(VSS) based on pre-trained diffusion models. A growing research direction
attempts to employ diffusion models to perform downstream vision tasks by
exploiting their deep understanding of image semantics. Yet, the majority of
these approaches have focused on image-related tasks like semantic
correspondence and segmentation, with less emphasis on video tasks such as VSS.
Ideally, diffusion-based image semantic segmentation approaches can be applied
to videos in a frame-by-frame manner. However, we find their performance on
videos to be subpar due to the absence of any modeling of temporal information
inherent in the video data. To this end, we tackle this problem and introduce a
framework tailored for VSS based on pre-trained image and video diffusion
models. We propose building a scene context model based on the diffusion
features, where the model is autoregressively updated to adapt to scene
changes. This context model predicts per-frame coarse segmentation maps that
are temporally consistent. To refine these maps further, we propose a
correspondence-based refinement strategy that aggregates predictions
temporally, resulting in more confident predictions. Finally, we introduce a
masked modulation approach to upsample the coarse maps to the full resolution
at a high quality. Experiments show that our proposed approach outperforms
existing zero-shot image semantic segmentation approaches significantly on
various VSS benchmarks without any training or fine-tuning. Moreover, it rivals
supervised VSS approaches on the VSPW dataset despite not being explicitly
trained for VSS. |
This paper introduces the first zero-shot approach for Video Semantic Segmentation (VSS) using pre-trained diffusion models, enhancing temporal consistency in video segmentation. |
Existing diffusion-based image segmentation methods, when applied frame-by-frame to videos, lack temporal consistency due to the absence of temporal information modeling. This work addresses this gap by introducing a framework specifically designed for VSS. |
The approach constructs a scene context model using diffusion features, which autoregressively updates to accommodate scene changes. It then employs a correspondence-based refinement strategy for temporal and spatial consistency. Finally, a masked modulation process generates full-resolution segmentation maps. |
The method significantly outperforms existing zero-shot image semantic segmentation approaches on VSS benchmarks like VSPW, Cityscapes, and CamVid.
It achieves comparable performance to supervised VSS approaches on the VSPW dataset despite not being explicitly trained for VSS.
The study finds that features from Stable Diffusion (SD) currently produce better results than Stable Video Diffusion (SVD), potentially due to the smaller training dataset size for SVD. |
The approach's performance is dependent on the quality of image inversion and VAE encoding, which can discard fine details.
The method is instance-agnostic, grouping objects of the same class into a single cluster. Future work could explore Video Instance or Panoptic Segmentation. |
video semantic segmentation, diffusion models, zero-shot learning, temporal consistency, scene context modeling |
2405.16923
Report |
SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain |
Butian Xiong, Xiaoyu Ye, Tze Ho Elden Tse, Kai Han, Shuguang Cui, Zhen Li |
With the emergence of Gaussian Splats, recent efforts have focused on
large-scale scene geometric reconstruction. However, most of these efforts
either concentrate on memory reduction or spatial space division, neglecting
information in the semantic space. In this paper, we propose a novel method,
named SA-GS, for fine-grained 3D geometry reconstruction using semantic-aware
3D Gaussian Splats. Specifically, we leverage prior information stored in large
vision models such as SAM and DINO to generate semantic masks. We then
introduce a geometric complexity measurement function to serve as soft
regularization, guiding the shape of each Gaussian Splat within specific
semantic areas. Additionally, we present a method that estimates the expected
number of Gaussian Splats in different semantic areas, effectively providing a
lower bound for Gaussian Splats in these areas. Subsequently, we extract the
point cloud using a novel probability density-based extraction method,
transforming Gaussian Splats into a point cloud crucial for downstream tasks.
Our method also offers the potential for detailed semantic inquiries while
maintaining high image-based reconstruction results. We provide extensive
experiments on publicly available large-scale scene reconstruction datasets
with highly accurate point clouds as ground truth and our novel dataset. Our
results demonstrate the superiority of our method over current state-of-the-art
Gaussian Splats reconstruction methods by a significant margin in terms of
geometric-based measurement metrics. Code and additional results will soon be
available on our project page. |
Introduces SA-GS, a novel method for fine-grained 3D geometry reconstruction using semantic-aware 3D Gaussian Splats. |
Addresses limitations of existing 3D Gaussian Splatting (3DGS) methods that struggle with unrealistic geometric reconstruction, particularly in scenes with complex lighting. |
Leverages semantic information from large vision models (e.g., SAM, DINO) to guide the shape and opacity of Gaussian Splats, effectively controlling geometric complexity and mitigating unrealistic surface generation. |
Significantly improves geometric reconstruction accuracy compared to state-of-the-art methods like SuGaR and 2D Gaussian Splats.
Effectively reduces memory consumption during training by dynamically adjusting the number of Gaussian Splats based on semantic and geometric complexity.
Provides a hierarchical probability density sampling strategy for extracting detailed point clouds while mitigating the 'fantasy surface' problem. |
Current implementation doesn't explicitly handle occlusion between Gaussian Splats during training.
Reliance on user-provided semantic information can be a limitation. |
3d reconstruction, gaussian splatting, semantic segmentation, point cloud extraction, large-scale scene reconstruction |
2405.16915
Report |
Multilingual Diversity Improves Vision-Language Representations |
Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna |
Massive web-crawled image-text datasets lay the foundation for recent
progress in multimodal learning. These datasets are designed with the goal of
training a model to do well on standard computer vision benchmarks, many of
which, however, have been shown to be English-centric (e.g., ImageNet).
Consequently, existing data curation techniques gravitate towards using
predominantly English image-text pairs and discard many potentially useful
non-English samples. Our work questions this practice. Multilingual data is
inherently enriching not only because it provides a gateway to learn about
culturally salient concepts, but also because it depicts common concepts
differently from monolingual data. We thus conduct a systematic study to
explore the performance benefits of using more samples of non-English origins
with respect to English vision tasks. By translating all multilingual
image-text pairs from a raw web crawl to English and re-filtering them, we
increase the prevalence of (translated) multilingual data in the resulting
training set. Pre-training on this dataset outperforms using English-only or
English-dominated datasets on ImageNet, ImageNet distribution shifts,
image-English-text retrieval and on average across 38 tasks from the DataComp
benchmark. On a geographically diverse task like GeoDE, we also observe
improvements across all regions, with the biggest gain coming from Africa. In
addition, we quantitatively show that English and non-English data are
significantly different in both image and (translated) text space. We hope that
our findings motivate future work to be more intentional about including
multicultural and multilingual data, not just when non-English or
geographically diverse tasks are involved, but to enhance model capabilities at
large. |
This paper investigates whether incorporating multilingual data during pre-training can improve the performance of vision-language models on English vision tasks. |
Existing vision-language datasets and models often exhibit a monolingual bias, limiting their ability to learn culturally diverse concepts and generalize to non-English tasks. This work explores the potential benefits of leveraging the diversity present in multilingual data to improve model capabilities on a broader range of tasks. |
The authors translate a large web-crawled image-text dataset (DataComp) to English, re-filter it based on image-text alignment, and train a CLIP model on this translated multilingual data. They compare the performance of this model to models trained on English-only or English-dominated datasets on a range of English vision tasks. |
Training on translated multilingual data outperforms training on English-only or English-dominated datasets on various English vision tasks, including ImageNet, ImageNet distribution shifts, and image-English-text retrieval.
On the geographically diverse GeoDE task, training on translated multilingual data significantly improves accuracy across all regions, particularly in Africa.
Analysis of the image and text distributions reveals significant differences between English and translated non-English data, indicating that they capture distinct and complementary information. |
The study primarily focuses on data filtering based on image-text cosine similarity, and it remains unclear whether the observed benefits hold for other filtering methods.
Translation may introduce artifacts and potentially reduce the richness of the original language. Future work can explore alternative approaches to effectively leverage multilingual data without relying solely on translation. |
multilingual vision-language models, data diversity, cross-lingual transfer learning, vision-language pre-training, data curation |
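The filtering step mentioned in the entry above (image-text cosine similarity) can be sketched as follows. This is a minimal illustration, assuming precomputed image and text embeddings and a keep-the-top-fraction rule. The function names, the tuple layout, and the choice of a fraction rather than a fixed threshold are assumptions, not details confirmed by the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_top_fraction(pairs, keep_fraction):
    """Keep the ids of the best-aligned pairs.

    pairs: list of (sample_id, image_embedding, text_embedding) tuples.
    keep_fraction: share of pairs to keep, ranked by cosine similarity.
    """
    ranked = sorted(
        pairs, key=lambda p: cosine_similarity(p[1], p[2]), reverse=True
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [sample_id for sample_id, _, _ in ranked[:n_keep]]


# Toy embeddings: "a" is perfectly aligned, "c" partially, "b" not at all.
toy_pairs = [
    ("a", [1.0, 0.0], [1.0, 0.0]),
    ("b", [1.0, 0.0], [0.0, 1.0]),
    ("c", [1.0, 0.0], [1.0, 1.0]),
]
kept = filter_top_fraction(toy_pairs, 0.67)
```

In the setting the entry describes, the text embedding would be computed from the translated caption, so translated non-English samples compete with English ones on the same score.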
2405.16895
Report |
Anonymization Prompt Learning for Facial Privacy-Preserving Text-to-Image Generation |
Liang Shi, Jie Zhang, Shiguang Shan |
Text-to-image diffusion models, such as Stable Diffusion, generate highly
realistic images from text descriptions. However, the generation of certain
content at such high quality raises concerns. A prominent issue is the accurate
depiction of identifiable facial images, which could lead to malicious deepfake
generation and privacy violations. In this paper, we propose Anonymization
Prompt Learning (APL) to address this problem. Specifically, we train a
learnable prompt prefix for text-to-image diffusion models, which forces the
model to generate anonymized facial identities, even when prompted to produce
images of specific individuals. Extensive quantitative and qualitative
experiments demonstrate the successful anonymization performance of APL, which
anonymizes any specific individuals without compromising the quality of
non-identity-specific image generation. Furthermore, we reveal the
plug-and-play property of the learned prompt prefix, enabling its effective
application across different pretrained text-to-image models for transferable
privacy and security protection against the risks of deepfakes. |
This paper introduces Anonymization Prompt Learning (APL), a method to prevent text-to-image diffusion models from generating identifiable facial images of specific individuals, thereby mitigating deepfake risks and privacy concerns. |
The ability of text-to-image models to create realistic images of identifiable faces raises serious ethical concerns about malicious deepfake generation and privacy violations. |
APL trains a learnable prompt prefix (Anonymization Prompt) that, when prepended to any input prompt, forces the model to generate anonymized facial images if the prompt specifies an identity, while maintaining image quality and text fidelity for other prompts. |
APL significantly reduces the accuracy of generated identities, effectively anonymizing faces even for individuals not seen during training.
The learned Anonymization Prompt exhibits transferability, demonstrating effectiveness across different pretrained text-to-image models.
APL preserves the overall quality of generated images and their alignment with text prompts, ensuring minimal impact on the model's general image generation capabilities. |
The reliance on ChatGPT for generating attribute descriptions may introduce inaccuracies in training data.
Further research can explore expanding APL to anonymize other sensitive attributes beyond facial features. |
text-to-image generation, diffusion models, deepfakes, privacy protection, prompt learning |
2405.16888
Report |
Part123: Part-aware 3D Reconstruction from a Single-view Image |
Anran Liu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Zhiyang Dou, Hao-Xiang Guo, Ping Luo, Wenping Wang |
Recently, the emergence of diffusion models has opened up new opportunities
for single-view reconstruction. However, all the existing methods represent the
target object as a closed mesh devoid of any structural information, thus
neglecting the part-based structure, which is crucial for many downstream
applications, of the reconstructed shape. Moreover, the generated meshes
usually suffer from large noises, unsmooth surfaces, and blurry textures,
making it challenging to obtain satisfactory part segments using 3D
segmentation techniques. In this paper, we present Part123, a novel framework
for part-aware 3D reconstruction from a single-view image. We first use
diffusion models to generate multiview-consistent images from a given image,
and then leverage Segment Anything Model (SAM), which demonstrates powerful
generalization ability on arbitrary objects, to generate multiview segmentation
masks. To effectively incorporate 2D part-based information into 3D
reconstruction and handle inconsistency, we introduce contrastive learning into
a neural rendering framework to learn a part-aware feature space based on the
multiview segmentation masks. A clustering-based algorithm is also developed to
automatically derive 3D part segmentation results from the reconstructed
models. Experiments show that our method can generate 3D models with
high-quality segmented parts on various objects. Compared to existing
unstructured reconstruction methods, the part-aware 3D models from our method
benefit some important applications, including feature-preserving
reconstruction, primitive fitting, and 3D shape editing. |
This paper presents Part123, a novel framework for reconstructing a part-aware 3D model from a single-view image. |
Part-based 3D models are crucial for many real-world applications, but existing single-view reconstruction methods neglect the part-based structure. |
Part123 first generates multiview images using diffusion models and predicts their 2D segmentation masks with SAM. Then it uses contrastive learning in a neural rendering framework to learn part-aware features based on multiview masks. Finally, an automatic clustering-based algorithm is used to extract 3D part segmentation results. |
Part123 can generate high-quality 3D models with meaningful part segments on various objects.
The part-aware models from Part123 benefit applications such as feature-preserving reconstruction, primitive fitting, and shape editing.
The method shows robustness to different numbers of multiview images and different generative models. |
The accuracy of part segmentation relies on the quality of multiview images and 2D segmentation.
The method currently only focuses on single objects without considering complex scenes. |
3d reconstruction, part segmentation, diffusion models, contrastive learning, neural rendering |
2405.16852
Report |
EM Distillation for One-step Diffusion Models |
Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao |
While diffusion models can learn complex distributions, sampling requires a
computationally expensive iterative process. Existing distillation methods
enable efficient sampling, but have notable limitations, such as performance
degradation with very few sampling steps, reliance on training data access, or
mode-seeking optimization that may fail to capture the full distribution. We
propose EM Distillation (EMD), a maximum likelihood-based approach that
distills a diffusion model to a one-step generator model with minimal loss of
perceptual quality. Our approach is derived through the lens of
Expectation-Maximization (EM), where the generator parameters are updated using
samples from the joint distribution of the diffusion teacher prior and inferred
generator latents. We develop a reparametrized sampling scheme and a noise
cancellation technique that together stabilize the distillation process. We
further reveal an interesting connection of our method with existing methods
that minimize mode-seeking KL. EMD outperforms existing one-step generative
methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares
favorably with prior work on distilling text-to-image diffusion models. |
This paper presents EM Distillation (EMD), a new method for distilling diffusion models into efficient one-step generators while maintaining high perceptual quality. |
Diffusion models excel at learning complex distributions but suffer from slow sampling speeds. EMD addresses this by enabling fast, one-step generation with minimal quality loss. |
EMD leverages an Expectation-Maximization (EM)-like framework. It introduces a novel reparametrized sampling scheme and a noise cancellation technique to stabilize and accelerate the distillation process. |
EMD achieves state-of-the-art FID scores on one-step image generation for ImageNet 64x64 and 128x128.
The method demonstrates the effectiveness of multi-step Langevin updates on both data and latent variables during distillation.
EMD shows promising results on computationally expensive text-to-image generation by effectively distilling Stable Diffusion models. |
EMD currently relies on initializing the student model from the teacher model for optimal performance.
The method's reliance on multi-step sampling introduces additional computational cost during training. |
diffusion models, generative models, knowledge distillation, image generation, text-to-image generation |
2405.16849
Report |
Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation |
Zhoujie Fu, Jiacheng Wei, Wenhao Shen, Chaoyue Song, Xiaofeng Yang, Fayao Liu, Xulei Yang, Guosheng Lin |
In this work, we introduce a novel approach for creating controllable
dynamics in 3D-generated Gaussians using casually captured reference videos.
Our method transfers the motion of objects from reference videos to a variety
of generated 3D Gaussians across different categories, ensuring precise and
customizable motion transfer. We achieve this by employing blend skinning-based
non-parametric shape reconstruction to extract the shape and motion of
reference objects. This process involves segmenting the reference objects into
motion-related parts based on skinning weights and establishing shape
correspondences with generated target shapes. To address shape and temporal
inconsistencies prevalent in existing methods, we integrate physical
simulation, driving the target shapes with matched motion. This integration is
optimized through a displacement loss to ensure reliable and genuine dynamics.
Our approach supports diverse reference inputs, including humans, quadrupeds,
and articulated objects, and can generate dynamics of arbitrary length,
providing enhanced fidelity and applicability. Unlike methods heavily reliant
on diffusion video generation models, our technique offers specific and
high-quality motion transfer, maintaining both shape integrity and temporal
consistency. |
This paper introduces Sync4D, a novel method for generating controllable dynamics in 3D-generated Gaussians by transferring motion from casually captured videos. |
Existing methods for dynamic 3D content generation often struggle with inaccurate motion representations, shape inconsistency, and lack of precise motion control. Sync4D addresses these limitations by leveraging real-world video guidance and physical simulation. |
The method involves shape reconstruction from the reference video, establishing shape correspondences between reference and target objects, and integrating physical simulation to drive the target shape with matched motion, optimized by a displacement loss. |
Sync4D successfully transfers motion from various sources (humans, animals, objects) to diverse 3D Gaussian objects, ensuring high fidelity and customization across categories.
The method maintains shape integrity and temporal consistency in generated dynamics, outperforming existing approaches relying on video diffusion models.
By integrating physical simulation and optimizing with a displacement loss, Sync4D ensures realistic and plausible motions while minimizing cumulative errors. |
Sync4D faces challenges transferring motion between objects with significantly different topologies.
The initial pose of the reference video and generated 3D object cannot be substantially different due to the method's focus on relative motion learning. |
4d generation, motion transfer, physical simulation, 3d gaussian, shape reconstruction |
2405.16847
Report |
TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction |
Yinda Chen, Haoyuan Shi, Xiaoyu Liu, Te Shi, Ruobing Zhang, Dong Liu, Zhiwei Xiong, Feng Wu |
Autoregressive next-token prediction is a standard pretraining method for
large-scale language models, but its application to vision tasks is hindered by
the non-sequential nature of image data, leading to cumulative errors. Most
vision models employ masked autoencoder (MAE) based pretraining, which faces
scalability issues. To address these challenges, we introduce
TokenUnify, a novel pretraining method that integrates random token
prediction, next-token prediction, and next-all token prediction. We provide
theoretical evidence demonstrating that TokenUnify mitigates cumulative errors
in visual autoregression. In conjunction with TokenUnify, we have assembled a
large-scale electron microscopy (EM) image dataset with ultra-high resolution,
ideal for creating spatially correlated long sequences. This dataset includes
over 120 million annotated voxels, making it the largest neuron segmentation
dataset to date and providing a unified benchmark for experimental validation.
Leveraging the Mamba network inherently suited for long-sequence modeling on
this dataset, TokenUnify not only reduces the computational complexity but also
leads to a significant 45% improvement in segmentation performance on
downstream EM neuron segmentation tasks compared to existing methods.
Furthermore, TokenUnify demonstrates superior scalability over MAE and
traditional autoregressive methods, effectively bridging the gap between
pretraining strategies for language and vision models. Code is available at
https://github.com/ydchen0806/TokenUnify. |
Introduces TokenUnify, a novel pretraining method for visual autoregression that integrates random token prediction, next-token prediction, and next-all token prediction. |
Addresses the limitations of existing vision pretraining methods like masked autoencoders (scalability) and traditional autoregression (cumulative errors). |
1. Proposes TokenUnify to mitigate cumulative errors in autoregression. 2. Uses the Mamba architecture for efficient long-sequence modeling. 3. Compiles a large-scale, ultra-high-resolution 3D electron microscopy (EM) dataset of mouse brain slices. |
TokenUnify led to a 45% improvement in performance on EM neuron segmentation tasks.
TokenUnify outperformed MAE by 21% in pretraining performance with fewer parameters.
TokenUnify demonstrated superior scaling properties compared to MAE and traditional autoregressive methods. |
Effectiveness on natural images and diverse downstream tasks needs further validation.
Future work includes exploring model lightweighting and efficient fine-tuning strategies. |
pretraining, vision models, autoregression, electron microscopy, segmentation |
2405.16829
Report |
PyGS: Large-scale Scene Representation with Pyramidal 3D Gaussian Splatting |
Zipeng Wang, Dan Xu |
Neural Radiance Fields (NeRFs) have demonstrated remarkable proficiency in
synthesizing photorealistic images of large-scale scenes. However, they are
often plagued by a loss of fine details and long rendering durations. 3D
Gaussian Splatting has recently been introduced as a potent alternative,
achieving both high-fidelity visual results and accelerated rendering
performance. Nonetheless, scaling 3D Gaussian Splatting is fraught with
challenges. Specifically, large-scale scenes grapple with the integration of
objects across multiple scales and disparate viewpoints, which often leads to
compromised efficacy as the Gaussians need to balance between detail levels.
Furthermore, the generation of initialization points via COLMAP from
large-scale datasets is both computationally demanding and prone to incomplete
reconstructions. To address these challenges, we present Pyramidal 3D Gaussian
Splatting (PyGS) with NeRF Initialization. Our approach represents the scene
with a hierarchical assembly of Gaussians arranged in a pyramidal fashion. The
top level of the pyramid is composed of a few large Gaussians, while each
subsequent layer accommodates a denser collection of smaller Gaussians. We
effectively initialize these pyramidal Gaussians through sampling a rapidly
trained grid-based NeRF at various frequencies. We group these pyramidal
Gaussians into clusters and use a compact weighting network to dynamically
determine the influence of each pyramid level of each cluster considering
camera viewpoint during rendering. Our method achieves a significant
performance leap across multiple large-scale datasets and attains a rendering
time that is over 400 times faster than current state-of-the-art approaches. |
This paper introduces PyGS, a novel multi-scale 3D Gaussian Splatting framework designed for efficient and detailed large-scale scene representation. |
Existing NeRF-based methods struggle with fine detail rendering and speed in large scenes, while 3D Gaussian Splatting faces challenges with multi-scale objects and slow initialization in such settings. |
PyGS utilizes a hierarchical structure of 3D Gaussians, organized into pyramid levels for multi-scale detail capture. It initializes these Gaussians efficiently using a coarsely trained grid-based NeRF and dynamically adjusts level weights during rendering via a compact weighting network informed by camera viewpoint and cluster embeddings. |
PyGS outperforms state-of-the-art NeRF-based methods and original 3DGS across various metrics on four large-scale datasets, achieving high-fidelity results with a significant speed boost.
NeRF-based initialization proves superior to random or COLMAP-based methods, yielding denser point clouds with better geometric details.
The adaptive weighting strategy significantly enhances rendering quality compared to simpler alternatives. |
Modeling even larger environments necessitates further exploration of parallel optimization techniques due to substantial memory and computational demands.
Future research can investigate the application of PyGS in related domains, such as 3D reconstruction, scene editing, and virtual reality. |
neural radiance fields, 3d gaussian splatting, large-scale scene representation, multi-scale modeling, novel view synthesis |
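The per-level weighting described in the PyGS entry above can be illustrated with a minimal sketch. It assumes the weighting network outputs one score per pyramid level and that the scores are normalized with a softmax. The softmax, the dot-product stand-in for the network, and all names are assumptions for illustration; the source does not specify them, and cluster embeddings are omitted.

```python
import math


def score_levels(view_dir, level_params):
    """Hypothetical stand-in for the weighting network: one dot product per level."""
    return [sum(v * p for v, p in zip(view_dir, params)) for params in level_params]


def level_weights(scores):
    """Numerically stable softmax: per-level scores -> weights that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three pyramid levels with all-zero parameters give uniform weights.
uniform = level_weights(score_levels([0.0, 0.0, 1.0], [[0.0] * 3] * 3))

# A much higher score for one level concentrates nearly all weight on it.
peaked = level_weights([0.0, 10.0])
```

Normalized weights of this kind let a renderer emphasize coarse levels for distant views and fine levels for close-ups, which matches the entry's description of viewpoint-dependent influence.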
2405.16823
Report |
Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection |
Gihyun Kwon, Jangho Park, Jong Chul Ye |
While text-to-image models have achieved impressive capabilities in image
generation and editing, their application across various modalities often
necessitates training separate models. Inspired by existing method of single
image editing with self attention injection and video editing with shared
attention, we propose a novel unified editing framework that combines the
strengths of both approaches by utilizing only a basic 2D image text-to-image
(T2I) diffusion model. Specifically, we design a sampling method that
facilitates editing consecutive images while maintaining semantic consistency
utilizing shared self-attention features during both reference and consecutive
image sampling processes. Experimental results confirm that our method enables
editing across diverse modalities including 3D scenes, videos, and panorama
images. |
This paper proposes a novel unified editing method that enables seamless editing across panorama images, videos, and 3D scenes using only a single 2D image text-to-image diffusion model. |
Existing text-to-image models often require separate models for different modalities (3D, video, panorama), leading to difficulty in attribute editing and higher resource consumption. This method aims to overcome these challenges by using a single 2D model for all. |
The method leverages the sequential nature of images in different modalities. It combines the strengths of single image editing (using self-attention injection) and sequential image editing (using shared attention) by employing two parallel paths: disentangled editing on a reference image and context transfer using shared self-attention features. |
Outperforms baseline methods in 3D scene editing, achieving superior semantic object editing and overall style transfer while preserving scene structure.
Successfully edits panorama images, demonstrating better text alignment and structural consistency compared to existing techniques.
Achieves impressive results in video editing, showing superior text-guided semantic changes and cross-frame consistency. |
Maintaining consistency can be challenging when the semantic distance between sequential frames is significantly large.
The ability to edit using inappropriate text prompts raises ethical concerns. |
text-to-image, diffusion models, image editing, 3d scene editing, video editing, panorama editing |
2405.16822
Report |
Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels |
Yikai Wang, Xinzhou Wang, Zilong Chen, Zhengyi Wang, Fuchun Sun, Jun Zhu |
Video generative models are receiving particular attention given their
ability to generate realistic and imaginative frames. Besides, these models are
also observed to exhibit strong 3D consistency, significantly enhancing their
potential to act as world simulators. In this work, we present Vidu4D, a novel
reconstruction model that excels in accurately reconstructing 4D (i.e.,
sequential 3D) representations from single generated videos, addressing
challenges associated with non-rigidity and frame distortion. This capability
is pivotal for creating high-fidelity virtual contents that maintain both
spatial and temporal coherence. At the core of Vidu4D is our proposed Dynamic
Gaussian Surfels (DGS) technique. DGS optimizes time-varying warping functions
to transform Gaussian surfels (surface elements) from a static state to a
dynamically warped state. This transformation enables a precise depiction of
motion and deformation over time. To preserve the structural integrity of
surface-aligned Gaussian surfels, we design the warped-state geometric
regularization based on continuous warping fields for estimating normals.
Additionally, we learn refinements on rotation and scaling parameters of
Gaussian surfels, which greatly alleviates texture flickering during the
warping process and enhances the capture of fine-grained appearance details.
Vidu4D also contains a novel initialization state that provides a proper start
for the warping fields in DGS. Equipping Vidu4D with an existing video
generative model, the overall framework demonstrates high-fidelity text-to-4D
generation in both appearance and geometry. |
Introduces Vidu4D, a novel reconstruction model that generates accurate 4D representations from single generated videos, addressing challenges like non-rigidity and frame distortion. |
Enables creation of high-fidelity virtual content with strong spatial and temporal coherence, crucial for VR, visualization, and AI. |
Utilizes Dynamic Gaussian Surfels (DGS), optimizing time-varying warping functions for transforming Gaussian surfels to depict motion and deformation. Incorporates warped-state normal regularization and refinement of Gaussian surfel parameters for accurate geometry and appearance. |
Achieves superior novel-view reconstruction compared to state-of-the-art methods in terms of detail preservation, texture quality, and geometric accuracy.
Quantitative evaluation shows significant improvements in PSNR, SSIM, and LPIPS metrics.
Ablation studies confirm the effectiveness of warped-state regularization and refinement strategies in DGS. |
Current limitations include dependence on video quality, scalability for large scenes, and computational demands for real-time applications.
Future work will address these limitations and explore applications in content creation and editing. |
4d reconstruction, video generation, dynamic gaussian surfels, non-rigid deformation, text-to-4d generation |
2405.16803
Report |
TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing |
Xinyu Zhang, Mengxue Kang, Fei Wei, Shuang Xu, Yuhe Liu, Lin Ma |
As the field of image generation rapidly advances, traditional diffusion
models and those integrated with multimodal large language models (LLMs) still
encounter limitations in interpreting complex prompts and preserving image
consistency pre- and post-editing. To tackle these challenges, we present an
innovative image editing framework that employs the robust Chain-of-Thought
(CoT) reasoning and localizing capabilities of multimodal LLMs to aid diffusion
models in generating more refined images. We first meticulously design a CoT
process comprising instruction decomposition, region localization, and detailed
description. Subsequently, we fine-tune the LISA model, a lightweight
multimodal LLM, using the CoT process of Multimodal LLMs and the mask of the
edited image. By providing the diffusion models with knowledge of the generated
prompt and image mask, our models generate images with a superior understanding
of instructions. Through extensive experiments, our model has demonstrated
superior performance in image generation, surpassing existing state-of-the-art
models. Notably, our model exhibits an enhanced ability to understand complex
prompts and generate corresponding images, while maintaining high fidelity and
consistency in images before and after generation. |
This paper proposes a novel image editing framework leveraging the reasoning and localizing capabilities of multimodal LLMs to enhance diffusion models for generating high-fidelity images from complex textual prompts. |
Current diffusion models and those integrated with LLMs face challenges in interpreting complex prompts and preserving image consistency pre- and post-editing. This work aims to address these limitations for more sophisticated and accurate image generation. |
The framework utilizes a Chain-of-Thought (CoT) process comprising instruction decomposition, region localization, and detailed description. It fine-tunes a lightweight multimodal LLM (LISA) with CoT data from GPT-4V and employs it to generate precise masks and inpainting prompts for a diffusion-based inpainting model. |
The model demonstrates superior performance in following complex instructions for image editing compared to existing state-of-the-art models.
It generates images with high fidelity, preserving the content of the original image while accurately modifying the specified regions.
The framework proves to be both effective and efficient, benefiting from the reasoning abilities of LLMs and the fine-tuned LISA model's performance and speed. |
The work is limited by the quantity and quality of the training dataset, which restricts the model's ability to generate precise, object-level masks.
The inpainting quality heavily relies on the prompt descriptions and the inherent randomness of diffusion models, affecting consistency. |
image editing, diffusion models, multimodal llms, chain-of-thought, high-fidelity generation |
2405.16788
Report |
3D Reconstruction with Fast Dipole Sums |
Hanyu Chen, Bailey Miller, Ioannis Gkioulekas |
We introduce a technique for the reconstruction of high-fidelity surfaces
from multi-view images. Our technique uses a new point-based representation,
the dipole sum, which generalizes the winding number to allow for interpolation
of arbitrary per-point attributes in point clouds with noisy or outlier points.
Using dipole sums allows us to represent implicit geometry and radiance fields
as per-point attributes of a point cloud, which we initialize directly from
structure from motion. We additionally derive Barnes-Hut fast summation schemes
for accelerated forward and reverse-mode dipole sum queries. These queries
facilitate the use of ray tracing to efficiently and differentiably render
images with our point-based representations, and thus update their point
attributes to optimize scene geometry and appearance. We evaluate this inverse
rendering framework against state-of-the-art alternatives, based on ray tracing
of neural representations or rasterization of Gaussian point-based
representations. Our technique significantly improves reconstruction quality at
equal runtimes, while also supporting more general rendering techniques such as
shadow rays for direct illumination. In the supplement, we provide interactive
visualizations of our results. |
This paper introduces "dipole sum," a novel point-based representation for reconstructing high-fidelity surfaces from multi-view images using an inverse rendering framework. |
Existing neural rendering techniques often struggle with high computational costs and difficulties leveraging 3D information from structure from motion. This paper addresses these limitations by enabling efficient and direct utilization of point clouds for high-quality surface reconstruction. |
The methodology involves generalizing the winding number concept to allow interpolation of attributes in noisy point clouds, using this to represent geometry and radiance fields, and leveraging Barnes-Hut fast summation for efficient computation and backpropagation during inverse rendering. |
The proposed technique significantly improves reconstruction quality at equal runtimes compared to state-of-the-art alternatives like neural and Gaussian representations.
It supports more general rendering techniques such as shadow rays, enhancing the accuracy of direct illumination.
The method directly leverages and refines point clouds from structure from motion, improving efficiency and detail in surface reconstruction. |
The paper acknowledges difficulties in accurately reconstructing surfaces with strong specular reflections, highlighting a need for improved handling of such appearances.
While the method demonstrates the potential for use with advanced rendering algorithms like path tracing, further investigation is needed to fully explore these capabilities. |
winding number, point-based modeling, inverse rendering, 3d reconstruction, ray tracing |
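The winding number that the dipole sum generalizes can be illustrated with a short numerical sketch. The code below is not the paper's dipole sum and uses no Barnes-Hut acceleration. It only evaluates the classical point-cloud winding number, w(q) = sum_i a_i (p_i - q) . n_i / (4 pi |p_i - q|^3), by brute force on a synthetic unit sphere. The helper names `fibonacci_sphere` and `winding_number`, the uniform area weights, and the test queries are assumptions made for illustration.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n roughly uniformly spread points on the unit sphere.

    On the unit sphere each point is also its own outward normal.
    """
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                      # equal-area bands in z
    r = np.sqrt(1.0 - z * z)
    phi = np.pi * (1.0 + 5.0 ** 0.5) * i       # golden-angle spiral
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def winding_number(query, points, normals, areas):
    """Brute-force winding number of an oriented point cloud at one query point."""
    d = points - query                          # vectors from query to samples, (N, 3)
    dist = np.linalg.norm(d, axis=1)
    dots = np.einsum("ij,ij->i", d, normals)    # (p_i - q) . n_i
    return float(np.sum(areas * dots / (4.0 * np.pi * dist ** 3)))

n = 2000
points = fibonacci_sphere(n)
normals = points.copy()                         # outward normals of the unit sphere
areas = np.full(n, 4.0 * np.pi / n)             # assumed uniform area weight per point

w_inside = winding_number(np.array([0.1, 0.0, 0.2]), points, normals, areas)
w_outside = winding_number(np.array([2.0, 0.0, 0.0]), points, normals, areas)
print(w_inside, w_outside)
```

For a closed surface the value should come out close to 1 for the interior query and close to 0 for the exterior one, which is what makes the winding number usable as an inside/outside indicator on point clouds.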
2405.16785
Report |
PromptFix: You Prompt and We Fix the Photo |
Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, Jiebo Luo |
Diffusion models equipped with language models demonstrate excellent
controllability in image generation tasks, allowing image processing to adhere
to human instructions. However, the lack of diverse instruction-following data
hampers the development of models that effectively recognize and execute
user-customized instructions, particularly in low-level tasks. Moreover, the
stochastic nature of the diffusion process leads to deficiencies in image
generation or editing tasks that require the detailed preservation of the
generated images. To address these limitations, we propose PromptFix, a
comprehensive framework that enables diffusion models to follow human
instructions to perform a wide variety of image-processing tasks. First, we
construct a large-scale instruction-following dataset that covers comprehensive
image-processing tasks, including low-level tasks, image editing, and object
creation. Next, we propose a high-frequency guidance sampling method to
explicitly control the denoising process and preserve high-frequency details in
unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing
Vision-Language Models (VLMs) to enhance text prompts and improve the model's
task generalization. Experimental results show that PromptFix outperforms
previous methods in various image-processing tasks. Our proposed model also
achieves comparable inference efficiency with these baseline models and
exhibits superior zero-shot capabilities in blind restoration and combination
tasks. The dataset and code will be available at
https://github.com/yeates/PromptFix. |
This paper proposes PromptFix, a novel diffusion-based model with an accompanying large-scale visual-instruction training dataset, aimed at improving instruction-guided low-level image processing. |
Existing instruction-following datasets lack diversity and struggle with low-level tasks, hindering the development of effective models for detailed image processing. |
PromptFix leverages High-frequency Guidance Sampling to preserve spatial details and a VLM-based Auxiliary Prompt Module to enhance semantic understanding and adapt to severe image degradation. |
PromptFix demonstrates superior performance in instruction-based image processing tasks, surpassing existing methods in colorization, watermark removal, and object removal.
The model exhibits strong zero-shot capabilities, effectively handling blind restoration for low-light enhancement, desnowing, and dehazing.
PromptFix excels in multi-task processing, demonstrating the ability to address multiple low-level tasks within a single image. |
Blind restoration using PromptFix occasionally leads to out-of-conditioned image control, highlighting the need for user-specified instructions when possible.
While High-frequency Guidance Sampling enhances detail preservation, it can slightly reduce overall image quality. |
image processing, diffusion models, vision-language models, image restoration, instruction following |
2405.16645
Report |
Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models |
Hanwen Liang, Yuyang Yin, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei |
The availability of large-scale multimodal datasets and advancements in
diffusion models have significantly accelerated progress in 4D content
generation. Most prior approaches rely on multiple image or video diffusion
models, utilizing score distillation sampling for optimization or generating
pseudo novel views for direct supervision. However, these methods are hindered
by slow optimization speeds and multi-view inconsistency issues. Spatial and
temporal consistency in 4D geometry has been extensively explored respectively
in 3D-aware diffusion models and traditional monocular video diffusion models.
Building on this foundation, we propose a strategy to migrate the temporal
consistency in video diffusion models to the spatial-temporal consistency
required for 4D generation. Specifically, we present a novel framework,
Diffusion4D, for efficient and scalable 4D content generation.
Leveraging a meticulously curated dynamic 3D dataset, we develop a 4D-aware
video diffusion model capable of synthesizing orbital views of dynamic 3D
assets. To control the dynamic strength of these assets, we introduce a
3D-to-4D motion magnitude metric as guidance. Additionally, we propose a novel
motion magnitude reconstruction loss and 3D-aware classifier-free guidance to
refine the learning and generation of motion dynamics. After obtaining orbital
views of the 4D asset, we perform explicit 4D construction with Gaussian
splatting in a coarse-to-fine manner. The synthesized multi-view consistent 4D
image set enables us to swiftly generate high-fidelity and diverse 4D assets
within just several minutes. Extensive experiments demonstrate that our method
surpasses prior state-of-the-art techniques in terms of generation efficiency
and 4D geometry consistency across various prompt modalities. |
Presents Diffusion4D, a novel framework for efficient and consistent 4D content generation leveraging 4D-aware video diffusion models and explicit 4D construction. |
Addresses the limitations of existing 4D generation methods, such as slow optimization speed and multi-view inconsistency, aiming for efficient and consistent generation of dynamic 3D content. |
1. Curates a large-scale, high-quality 4D dataset from existing 3D datasets. 2. Develops a 4D-aware video diffusion model to synthesize orbital views of dynamic 3D assets, incorporating a 3D-to-4D motion magnitude metric and guidance. 3. Performs explicit 4D construction using Gaussian splatting with a coarse-to-fine strategy. |
Achieves state-of-the-art performance in text-to-4D and image-to-4D generation, outperforming baselines in terms of generation efficiency and 4D geometry consistency.
Successfully generates dynamic 3D assets from static 3D content, demonstrating the versatility of the framework.
Shows significant improvement in quantitative metrics (CLIP, LPIPS, PSNR, SSIM, FVD) and qualitative evaluations (user study) compared to existing methods. |
Current implementation uses a limited video resolution and temporal sequence length.
Dataset diversity and quality can be further improved. |
4d content generation, video diffusion models, 3d-to-4d motion magnitude, gaussian splatting, spatial-temporal consistency |
2405.16605
Report |
Demystify Mamba in Vision: A Linear Attention Perspective |
Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang |
Mamba is an effective state space model with linear computation complexity.
It has recently shown impressive efficiency in dealing with high-resolution
inputs across various vision tasks. In this paper, we reveal that the powerful
Mamba model shares surprising similarities with linear attention Transformer,
which typically underperforms conventional Transformer in practice. By exploring
the similarities and disparities between the effective Mamba and subpar linear
attention Transformer, we provide comprehensive analyses to demystify the key
factors behind Mamba's success. Specifically, we reformulate the selective
state space model and linear attention within a unified formulation, rephrasing
Mamba as a variant of linear attention Transformer with six major distinctions:
input gate, forget gate, shortcut, no attention normalization, single-head, and
modified block design. For each design, we meticulously analyze its pros and
cons, and empirically evaluate its impact on model performance in vision tasks.
Interestingly, the results highlight the forget gate and block design as the
core contributors to Mamba's success, while the other four designs are less
crucial. Based on these findings, we propose a Mamba-Like Linear Attention
(MLLA) model by incorporating the merits of these two key designs into linear
attention. The resulting model outperforms various vision Mamba models in both
image classification and high-resolution dense prediction tasks, while enjoying
parallelizable computation and fast inference speed. Code is available at
https://github.com/LeapLabTHU/MLLA. |
This paper reveals the close relationship between the efficient Mamba model and the linear attention Transformer, analyzing their similarities and disparities to understand the key factors behind Mamba's effectiveness. |
Mamba has shown impressive performance in various vision tasks with linear computation complexity, but it surprisingly shares similarities with the less effective linear attention Transformer, demanding an investigation into the reasons behind this difference. |
The paper reformulates selective state space model (Mamba) and linear attention within a unified framework, identifying six distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. The impact of each distinction on model performance is then empirically evaluated through ablations on vision tasks. |
The forget gate and block design are identified as the core contributors to Mamba's superior performance.
The forget gate, while effective, necessitates recurrent computation that might not be ideal for vision models and can be replaced by suitable positional encoding.
A novel Mamba-Like Linear Attention (MLLA) model, incorporating the merits of Mamba's design into linear attention, outperforms various vision Mamba models in image classification and dense prediction tasks, while enabling parallelizable computation. |
The analysis might not cover all subtle implementation differences between Mamba and linear attention Transformer.
Future work can investigate alternative parallelizable mechanisms to replace the forget gate for improved performance. |
mamba, linear attention, transformer, vision transformer, state space model |
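The relationship between linear attention and a forget gate can be made concrete with a small sketch. It assumes a simplified setting that is not taken from the paper's code: unnormalized causal linear attention, a single head, and one scalar forget value per time step. The function names `gated_linear_attention` and `parallel_linear_attention` are hypothetical. With the gate fixed to 1 the recurrence reduces to plain causal linear attention, which can also be computed in parallel with a masked matrix product.

```python
import numpy as np

def gated_linear_attention(q, k, v, forget):
    """Recurrent, unnormalized causal linear attention with a scalar forget gate.

    q, k: (T, d); v: (T, d_v); forget: (T,) with values in (0, 1].
    State update: S_t = forget_t * S_{t-1} + k_t^T v_t, output y_t = q_t S_t.
    """
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        state = forget[t] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

def parallel_linear_attention(q, k, v):
    """Same computation without a forget gate, as one masked matrix product."""
    scores = np.tril(q @ k.T)   # causal mask: token t only sees tokens s <= t
    return scores @ v

rng = np.random.default_rng(0)
T, d = 6, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

no_forget = gated_linear_attention(q, k, v, np.ones(T))       # gate = 1: plain linear attention
decayed = gated_linear_attention(q, k, v, np.full(T, 0.5))    # gate = 0.5: older tokens fade
print(np.allclose(no_forget, parallel_linear_attention(q, k, v)))
```

A constant gate of 0.5 weights the contribution of token s to token t by 0.5^(t-s), so the gate acts as a learned, position-dependent decay. The explicit loop over t is the recurrent computation that the summary describes as less suited to vision models.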
2405.16596
Report |
Protect-Your-IP: Scalable Source-Tracing and Attribution against Personalized Generation |
Runyi Li, Xuanyu Zhang, Zhipei Xu, Yongbing Zhang, Jian Zhang |
With the advent of personalized generation models, users can more readily
create images resembling existing content, heightening the risk of violating
portrait rights and intellectual property (IP). Traditional post-hoc detection
and source-tracing methods for AI-generated content (AIGC) employ proactive
watermark approaches; however, these are less effective against personalized
generation models. Moreover, attribution techniques for AIGC rely on passive
detection but often struggle to differentiate AIGC from authentic images,
presenting a substantial challenge. Integrating these two processes into a
cohesive framework not only meets the practical demands for protection and
forensics but also improves the effectiveness of attribution tasks. Inspired by
this insight, we propose a unified approach for image copyright source-tracing
and attribution, introducing an innovative watermarking-attribution method that
blends proactive and passive strategies. We embed copyright watermarks into
protected images and train a watermark decoder to retrieve copyright
information from the outputs of personalized models, using this watermark as an
initial step for confirming if an image is AIGC-generated. To pinpoint specific
generation techniques, we utilize powerful visual backbone networks for
classification. Additionally, we implement an incremental learning strategy to
adeptly attribute new personalized models without losing prior knowledge,
thereby enhancing the model's adaptability to novel generation methods. We have
conducted experiments using various celebrity portrait series sourced online,
and the results affirm the efficacy of our method in source-tracing and
attribution tasks, as well as its robustness against knowledge forgetting. |
This paper proposes a novel framework for source-tracing and attribution of personalized generated images, employing a combination of proactive watermarking and passive detection mechanisms. |
The rise of personalized AI image generation models poses significant threats to portrait rights and intellectual property (IP) by enabling easy creation of images resembling existing content. |
This work embeds copyright watermarks into protected images using a box-free watermarking technique. These watermarks are detectable even after images are processed by personalized generation models, allowing for source-tracing. For attribution, a hierarchical approach is proposed, first detecting the presence of the watermark and then classifying the specific generation method using a visual backbone network. An incremental learning strategy is also incorporated for adaptable attribution of newly emerging generation methods. |
The proposed watermarking method effectively embeds copyright information while preserving image quality, outperforming the compared method.
The combined proactive and passive attribution approach achieves high accuracy in both detecting AI-generated content and identifying the specific generation method.
The implemented incremental learning strategy effectively updates the attribution model for new generation methods while mitigating catastrophic forgetting. |
The dataset used for training and validation could be more extensive.
The attribution approach currently requires an extra training process for new generation methods, and developing a more flexible and self-adaptive approach would be beneficial. |
ai-generated content, copyright protection, source-tracing, attribution, watermarking |
2405.16570
Report |
ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling |
Francesca Babiloni, Alexandros Lattas, Jiankang Deng, Stefanos Zafeiriou |
We propose ID-to-3D, a method to generate identity- and text-guided 3D human
heads with disentangled expressions, starting from even a single casually
captured in-the-wild image of a subject. The foundation of our approach is
anchored in compositionality, alongside the use of task-specific 2D diffusion
models as priors for optimization. First, we extend a foundational model with a
lightweight expression-aware and ID-aware architecture, and create 2D priors
for geometry and texture generation, via fine-tuning only 0.2% of its available
training parameters. Then, we jointly leverage a neural parametric
representation for the expressions of each subject and a multi-stage generation
of highly detailed geometry and albedo texture. This combination of strong face
identity embeddings and our neural representation enables accurate
reconstruction of not only facial features but also accessories and hair and
can be meshed to provide render-ready assets for gaming and telepresence. Our
results achieve an unprecedented level of identity-consistent and high-quality
texture and geometry generation, generalizing to a "world" of unseen 3D
identities, without relying on large 3D captured datasets of human assets. |
ID-to-3D: a method for generating identity- and text-guided 3D human heads with disentangled expressions from a single in-the-wild image. |
Existing methods struggle to generate high-quality 3D head avatars with personalized identity and expressions due to limitations in 3D data and disentangling geometry, texture, and lighting. |
The method leverages compositionality and task-specific 2D diffusion models as priors. It uses ArcFace embeddings for identity, a neural parametric representation for expressions, and a two-stage Score Distillation Sampling pipeline for generating geometry and albedo texture. |
Outperforms text-based and image-based SDS baselines in generating 3D heads with superior geometric details and texture quality.
Generates a wide variety of ID-consistent expressions, captured by latent codes.
Allows for ID-consistent editing of geometry and appearance using text prompts. |
Generalization capacity is limited by the face embedding network and diffusion model used, potentially introducing biases.
Lack of specific optimization for physically bounded textures and geometries might occasionally produce unnatural facial characteristics. |
3d head generation, score distillation sampling, identity-consistent, expressive avatars, diffusion models |
2405.16567
Report |
Automatic Jailbreaking of the Text-to-Image Generative AI Systems |
Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang |
Recent AI systems have shown extremely powerful performance, even surpassing
human performance, on various tasks such as information retrieval, language
generation, and image generation based on large language models (LLMs). At the
same time, there are diverse safety risks that can cause the generation of
malicious contents by circumventing the alignment in LLMs, which are often
referred to as jailbreaking. However, most of the previous works only focused
on the text-based jailbreaking in LLMs, and the jailbreaking of the
text-to-image (T2I) generation system has been relatively overlooked. In this
paper, we first evaluate the safety of the commercial T2I generation systems,
such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive
prompts. From this empirical study, we find that Copilot and Gemini block only
12% and 17% of the attacks with naive prompts, respectively, while ChatGPT
blocks 84% of them. Then, we further propose a stronger automated jailbreaking
pipeline for T2I generation systems, which produces prompts that bypass their
safety guards. Our automated jailbreaking framework leverages an LLM optimizer
to generate prompts that maximize the degree of violation in the generated images
without any weight updates or gradient computation. Surprisingly, our simple
yet effective approach successfully jailbreaks ChatGPT with an 11.0% block
rate, making it generate copyrighted content 76% of the time. Finally, we
explore various defense strategies, such as post-generation filtering and
machine unlearning techniques, but found that they were inadequate, which
suggests the necessity of stronger defense mechanisms. |
This paper proposes an Automated Prompt Generation Pipeline (APGP) to evaluate and expose the risk of copyright infringement in commercial text-to-image (T2I) generation systems. |
Despite the advancement of AI systems and their integration into commercial T2I platforms, the risk of copyright infringement remains a significant concern, and current systems lack robust evaluation mechanisms. |
The APGP leverages large language models (LLMs) to generate high-risk prompts from target images by optimizing a self-generated QA score and incorporating keyword penalties to bypass safety guards. |
The study reveals that most commercial T2I systems, including Midjourney, Gemini, and Copilot, exhibit a high likelihood of copyright violation even with simple prompts.
ChatGPT, while initially appearing more secure, is also vulnerable to copyright infringement when tested with APGP-generated prompts, achieving a 76% violation rate.
Simple defense mechanisms, such as copyright detection filtering and concept unlearning models, prove inadequate in mitigating the risks highlighted by the APGP. |
The violation rate can fluctuate due to the inherent randomness of commercial T2I systems.
The paper's focus on copyright infringement analysis is primarily technical, lacking a comprehensive legal perspective on the observed violations. |
copyright infringement, text-to-image generation, jailbreaking, ai safety, large language models |
2405.16555
Report |
vHeat: Building Vision Models upon Heat Conduction |
Zhaozhi Wang, Yue Liu, Yunfan Liu, Hongtian Yu, Yaowei Wang, Qixiang Ye, Yunjie Tian |
A fundamental problem in learning robust and expressive visual
representations lies in efficiently estimating the spatial relationships of
visual semantics throughout the entire image. In this study, we propose vHeat,
a novel vision backbone model that simultaneously achieves both high
computational efficiency and global receptive field. The essential idea,
inspired by the physical principle of heat conduction, is to conceptualize
image patches as heat sources and model the calculation of their correlations
as the diffusion of thermal energy. This mechanism is incorporated into deep
models through the newly proposed module, the Heat Conduction Operator (HCO),
which is physically plausible and can be efficiently implemented using DCT and
IDCT operations with a complexity of $\mathcal{O}(N^{1.5})$. Extensive
experiments demonstrate that vHeat surpasses Vision Transformers (ViTs) across
various vision tasks, while also providing higher inference speeds, reduced
FLOPs, and lower GPU memory usage for high-resolution images. The code will be
released at https://github.com/MzeroMiko/vHeat. |
This paper introduces vHeat, a novel vision backbone model inspired by the physical principle of heat conduction, achieving both high computational efficiency and global receptive field. |
Existing vision models, including CNNs, ViTs, and SSMs, struggle to balance computational complexity with the ability to capture long-range dependencies in images. vHeat addresses this challenge by modeling the propagation of visual semantics as heat diffusion. |
vHeat leverages the Heat Conduction Operator (HCO), which simulates visual heat conduction using 2D DCT and IDCT operations. This approach offers an interpretable mechanism for global information propagation with a complexity of O(N^1.5). |
vHeat outperforms benchmark models like ConvNeXt and Swin Transformers in image classification, object detection, and semantic segmentation tasks.
vHeat demonstrates superior computational efficiency, exhibiting higher inference speeds, reduced FLOPs, and lower GPU memory usage, particularly for high-resolution images.
Visualization analysis confirms vHeat's ability to establish global receptive fields and adapt its visual heat conduction based on image content. |
The training process of vHeat can be challenging when long-range information conduction is required, demanding extensive training for effective long-range dependency learning.
A dedicated self-supervised learning method tailored for vHeat, similar to masked image modeling for ViTs, is yet to be developed. |
vision backbone, heat conduction, global receptive field, computational efficiency, image classification |
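The DCT-based heat conduction step described above can be sketched for a single-channel feature map. This is a minimal illustration under stated assumptions, not the released vHeat code. It uses a fixed scalar diffusivity `k` and time `t`, whereas the model makes conduction adapt to image content. It also builds explicit orthonormal DCT-II matrices instead of calling an optimized transform. The names `dct_matrix` and `heat_conduction` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of size (n, n), so that C @ C.T = I."""
    freq = np.arange(n)[:, None]
    pos = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * pos + 1) * freq / (2 * n))
    c[0] /= np.sqrt(2.0)   # rescale the constant (DC) row for orthonormality
    return c

def heat_conduction(u0, k=1.0, t=1.0):
    """Diffuse a 2D map: DCT, decay each frequency by exp(-k * w^2 * t), inverse DCT."""
    h, w = u0.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    wy = np.pi * np.arange(h)[:, None] / h       # vertical frequencies
    wx = np.pi * np.arange(w)[None, :] / w       # horizontal frequencies
    decay = np.exp(-k * (wx ** 2 + wy ** 2) * t)
    spectrum = ch @ u0 @ cw.T                    # 2D DCT
    return ch.T @ (spectrum * decay) @ cw        # 2D inverse DCT

u0 = np.zeros((8, 8))
u0[3, 4] = 1.0                                   # a single "heat source" patch
u1 = heat_conduction(u0, k=0.5, t=1.0)
print(u1.round(3))
```

The zero-frequency term is left untouched, so the total "heat" of the map is conserved while the peak spreads to every position, which gives each output location a global receptive field. With dense matrices the two products cost on the order of H·W·(H+W) operations, which is N^1.5 for a square map with N = H·W entries and is consistent with the complexity quoted above.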
2405.16537
Report |
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models |
Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, Xingang Pan |
The remarkable generative capabilities of diffusion models have motivated
extensive research in both image and video editing. Compared to video editing
which faces additional challenges in the time dimension, image editing has
witnessed the development of more diverse, high-quality approaches and more
capable software like Photoshop. In light of this gap, we introduce a novel and
generic solution that extends the applicability of image editing tools to
videos by propagating edits from a single frame to the entire video using a
pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively
preserves the visual and motion integrity of the source video depending on the
extent of the edits, effectively handling global edits, local edits, and
moderate shape changes, which existing methods cannot fully achieve. At the
core of our method are two main processes: Coarse Motion Extraction to align
basic motion patterns with the original video, and Appearance Refinement for
precise adjustments using fine-grained attention matching. We also incorporate
a skip-interval strategy to mitigate quality degradation from auto-regressive
generation across multiple video clips. Experimental results demonstrate our
framework's superior performance in fine-grained video editing, proving its
capability to produce high-quality, temporally consistent outputs. |
Presents I2VEdit, a framework for fine-grained video editing that propagates user-made edits from the first frame to the whole video using a pre-trained image-to-video model. |
Bridges the gap between advanced image editing tools and the limited capabilities of current video editing methods by leveraging the strength of image editing tools for video editing. |
Employs a two-stage pipeline: 1) Coarse Motion Extraction: learns motion patterns from the source video using LoRA and skip-interval cross-attention. 2) Appearance Refinement: fine-tunes appearance and motion using attention matching, enhanced by smooth area random perturbation (SARP) during latent inversion. |
Outperforms text-guided video editing and traditional image-guided methods in terms of editing quality, motion preservation, and appearance consistency.
Demonstrates strong performance on various tasks, including local editing, global style transfer, and identity manipulation.
Smooth area random perturbation (SARP) effectively addresses issues related to smooth regions during latent inversion, resulting in significant quality improvement. |
May produce minor color and texture inconsistencies in unedited areas.
Editing quality may degrade for videos with significant content change across clips. |
video editing, diffusion models, image-to-video generation, attention mechanism, low-rank adaptation |
2405.16534
Report |
Pruning for Robust Concept Erasing in Diffusion Models |
Tianyun Yang, Juan Cao, Chang Xu |
Despite the impressive capabilities of generating images, text-to-image
diffusion models are susceptible to producing undesirable outputs such as NSFW
content and copyrighted artworks. To address this issue, recent studies have
focused on fine-tuning model parameters to erase problematic concepts. However,
existing methods exhibit a major flaw in robustness, as fine-tuned models often
reproduce the undesirable outputs when faced with cleverly crafted prompts.
This reveals a fundamental limitation in the current approaches and may raise
risks for the deployment of diffusion models in the open world. To address this
gap, we locate the concept-correlated neurons and find that these neurons show
high sensitivity to adversarial prompts: they can be deactivated during erasing
yet reactivated under attack. To improve robustness, we introduce a
new pruning-based strategy for concept erasing. Our method selectively prunes
critical parameters associated with the concepts targeted for removal, thereby
reducing the sensitivity of concept-related neurons. Our method can be easily
integrated with existing concept-erasing techniques, offering a robust
improvement against adversarial inputs. Experimental results show a significant
enhancement in our model's ability to resist adversarial inputs, achieving
nearly a 40% improvement in erasing the NSFW content and a 30% improvement in
erasing artwork style. |
This paper introduces a novel pruning-based strategy for concept erasing in text-to-image diffusion models, which enhances robustness against adversarial prompts. |
Existing concept erasing methods are vulnerable to adversarial prompts that can regenerate supposedly erased content, posing risks for real-world deployment of diffusion models. |
The method identifies concept-correlated neurons sensitive to adversarial prompts and uses a differentiable pruning strategy guided by the concept erasing objective to selectively prune parameters, reducing neuron sensitivity. |
The approach significantly improves robustness against adversarial attacks in erasing nudity, art styles, and objects.
Pruning with erasing is found to be more effective than pruning before or after erasing.
The method maintains good image generation quality for non-erased concepts. |
The concept neuron identification relies on a numerical criterion that may be sensitive to the erased model selection.
Future work includes exploring more accurate concept neuron identification and investigating the potential for developing more sophisticated attack strategies. |
diffusion models, concept erasing, pruning, robustness, adversarial prompts |
2405.16517
Report |
Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors |
Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen |
We aim to tackle sparse-view reconstruction of a 360 3D scene using priors
from latent diffusion models (LDM). The sparse-view setting is ill-posed and
underconstrained, especially for scenes where the camera rotates 360 degrees
around a point, as no visual information is available beyond some frontal views
focused on the central object(s) of interest. In this work, we show that
pretrained 2D diffusion models can strongly improve the reconstruction of a
scene with low-cost fine-tuning. Specifically, we present SparseSplat360
(Sp2360), a method that employs a cascade of in-painting and artifact removal
models to fill in missing details and clean novel views. Due to superior
training and rendering speeds, we use an explicit scene representation in the
form of 3D Gaussians over NeRF-based implicit representations. We propose an
iterative update strategy to fuse generated pseudo novel views with existing 3D
Gaussians fitted to the initial sparse inputs. As a result, we obtain a
multi-view consistent scene representation with details coherent with the
observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows
that our proposed 2D to 3D distillation algorithm considerably improves the
performance of a regularized version of 3DGS adapted to a sparse-view setting
and outperforms existing sparse-view reconstruction methods in 360 scene
reconstruction. Qualitatively, our method generates entire 360 scenes from as
few as 9 input views, with a high degree of foreground and background detail. |
Introduces SparseSplat360, a method for reconstructing 360° 3D scenes from sparse views using latent diffusion models to generate pseudo novel views. |
Sparse-view 3D reconstruction is challenging because little information is available, and traditional methods struggle with artifacts and missing details. |
SparseSplat360 employs a two-step process using 2D diffusion models for in-painting missing regions and removing artifacts in rendered novel views. These improved views iteratively refine a 3D Gaussian representation of the scene. |
Outperforms existing sparse-view reconstruction methods in 360° scene reconstruction.
Generates entire 360° scenes from as few as 9 input views with high detail.
Significantly faster and more data-efficient than methods relying on large-scale 3D datasets. |
Limited by the accuracy of the initial sparse point cloud from SfM.
Future work includes incorporating stronger geometry cues from 3D vision foundation models. |
3d reconstruction, sparse view synthesis, diffusion models, generative priors, 3d gaussian splatting |
2405.16504
Report |
A Unified Implicit Attention Formulation for Gated-Linear Recurrent Sequence Models |
Itamar Zimerman, Ameen Ali, Lior Wolf |
Recent advances in efficient sequence modeling have led to attention-free
layers, such as Mamba, RWKV, and various gated RNNs, all featuring
sub-quadratic complexity in sequence length and excellent scaling properties,
enabling the construction of a new type of foundation models. In this paper, we
present a unified view of these models, formulating such layers as implicit
causal self-attention layers. The formulation includes most of their
sub-components and is not limited to a specific part of the architecture. The
framework compares the underlying mechanisms on similar grounds for different
layers and provides a direct means for applying explainability methods. Our
experiments show that our attention matrices and attribution method outperform
an alternative and a more limited formulation that was recently proposed for
Mamba. For the other architectures for which our method is the first to provide
such a view, our method is effective and competitive in the relevant metrics
compared to the results obtained by state-of-the-art transformer explainability
methods. Our code is publicly available. |
This paper presents a unified view of attention-free sequence models like Mamba, RWKV, and Griffin as implicit causal self-attention layers, enabling explainability methods for these architectures. |
This unified view facilitates comparisons between transformer and non-transformer architectures and enables the development of new explainability and interpretability techniques for non-transformer models, crucial for understanding aspects like robustness, bias, and fairness. |
The authors mathematically formulate the layers of these models (Mamba, RWKV, Griffin) as data-controlled linear operators, effectively representing them as implicit attention mechanisms. This approach involves analyzing the token mixing components, incorporating elements like gate branches and convolutional layers. |
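To make the "data-controlled linear operator" view concrete, the sketch below unrolls a scalar gated linear recurrence into an explicit causal attention matrix and applies it to the input. It is a minimal sketch under simplifying assumptions: a single scalar channel, with per-step gates `a`, `b`, and `c` given as plain lists. These names are hypothetical, and the paper's formulation additionally covers components such as gate branches and convolutional layers.

```python
def gated_recurrence(x, a, b, c):
    """Run h_t = a_t * h_{t-1} + b_t * x_t and y_t = c_t * h_t over a sequence."""
    h, ys = 0.0, []
    for xt, at, bt, ct in zip(x, a, b, c):
        h = at * h + bt * xt
        ys.append(ct * h)
    return ys

def implicit_attention(a, b, c):
    """Build alpha with alpha[i][j] = c_i * (a_{j+1} * ... * a_i) * b_j for j <= i."""
    n = len(a)
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        decay = 1.0  # running product of the gates between positions j and i
        for j in range(i, -1, -1):
            alpha[i][j] = c[i] * decay * b[j]
            decay *= a[j]
    return alpha

def apply_attention(alpha, x):
    """Compute y = alpha @ x for a scalar sequence x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in alpha]

# Usage: both routes give the same outputs, and alpha is lower-triangular (causal).
x = [1.0, 2.0, 3.0, 4.0]
a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, 0.5, 2.0, 1.0]
c = [1.0, 1.0, 0.5, 2.0]
y_recurrent = gated_recurrence(x, a, b, c)
y_attention = apply_attention(implicit_attention(a, b, c), x)
```

Once the matrix is explicit, attribution methods built for transformer attention maps can be applied to it directly, which is the use the summary describes.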
The implicit attention matrices derived from Mamba, Griffin, and RWKV exhibit patterns similar to traditional transformers, particularly in capturing long-range dependencies.
The proposed attention representation leads to more accurate and interpretable attention maps compared to previous formulations, as demonstrated by visualization and superior performance in segmentation tests.
Ablation studies confirm the importance of incorporating all architectural components (e.g., gate branches, convolutional layers) in the unified attention representation for optimal performance. |
The paper primarily focuses on Mamba, RWKV, and Griffin, with potential to extend the framework to other architectures like Hyena and HGRN2.
Future work could explore how differences in these architectures are reflected in their self-attention matrices to reveal more about their inductive biases. |
self-attention, explainable ai (xai), sequence modeling, non-transformer architectures, mamba, rwkv, griffin |
2405.16501
Report |
User-Friendly Customized Generation with Multi-Modal Prompts |
Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, Liqing Zhang |
Text-to-image generation models have seen considerable advancement, catering
to the increasing interest in personalized image creation. Current
customization techniques often necessitate users to provide multiple images
(typically 3-5) for each customized object, along with the classification of
these objects and descriptive textual prompts for scenes. This paper questions
whether the process can be made more user-friendly and the customization more
intricate. We propose a method in which users need only provide images along with
text for each customization topic, and which requires only a single image per
visual concept. We introduce the concept of a "multi-modal prompt", a novel
integration of text and images tailored to each customization concept, which
simplifies user interaction and facilitates precise customization of both
objects and scenes. Our proposed paradigm for customized text-to-image
generation surpasses existing finetune-based methods in user-friendliness and
the ability to customize complex objects with user-friendly inputs. Our code is
available at https://github.com/zhongzero/Multi-Modal-Prompt. |
This paper proposes a user-friendly paradigm for customized text-to-image generation that simplifies user interaction by requiring only a single image per visual concept and accompanying text. |
Existing methods often need multiple images per concept and struggle to capture intricate details of complex objects. This paradigm addresses these limitations by enhancing user-friendliness and customization granularity. |
The method leverages a two-stage process: 1) extracting descriptions of main objects from user-provided images using BLIP for image captioning and ChatGPT for semantic analysis, and 2) finetuning a diffusion model with these descriptions to enable customized image generation based on user prompts. |
The proposed paradigm outperforms existing methods in detailed customization of complex objects, as evidenced by qualitative comparisons.
Quantitative evaluations using DINO score, CLIP-I score, and CLIP-T score demonstrate the superior performance of the paradigm in both image and text alignment.
Human preference studies confirm that users prefer the proposed method over traditional approaches for both image and text alignment. |
The current implementation shows limitations in handling multi-image scenarios due to constraints of existing stable diffusion models.
The current definition of multi-modal prompts is restricted to customizing main objects, limiting broader semantic understanding and customization. |
text-to-image generation, image customization, multi-modal prompts, diffusion models, user-friendly interface |
2405.16470
Report |
Image Deraining with Frequency-Enhanced State Space Model |
Shugo Yamashita, Masaaki Ikehara |
Removing rain artifacts in images is recognized as a significant issue. In
this field, deep learning-based approaches, such as convolutional neural
networks (CNNs) and Transformers, have succeeded. Recently, State Space Models
(SSMs) have exhibited superior performance across various tasks in both natural
language processing and image processing due to their ability to model
long-range dependencies. This study introduces SSM to rain removal and proposes
a Deraining Frequency-Enhanced State Space Model (DFSSM). To effectively remove
rain streaks, which produce high-intensity frequency components in specific
directions, we employ frequency domain processing concurrently with SSM.
Additionally, we develop a novel mixed-scale gated-convolutional block, which
uses convolutions with multiple kernel sizes to capture various scale
degradations effectively and integrates a gating mechanism to manage the flow
of information. Finally, experiments on synthetic and real-world rainy image
datasets show that our method surpasses state-of-the-art methods. |
This paper proposes DFSSM, a novel deraining model based on State Space Models (SSMs) that effectively removes rain artifacts from images by incorporating frequency domain processing. |
Rain artifacts in images can severely degrade the performance of vision-based systems. Removing these artifacts is crucial for improving the quality and reliability of such systems. |
The DFSSM leverages SSMs to capture long-range dependencies and employs a Frequency-Enhanced State Space Block (FSSB) for efficient rain streak removal. It also introduces a Mixed-Scale Gated-Convolutional Block (MGCB) to handle various scales of rain degradations and manage the flow of information within the network. The model is trained with L1 loss and Frequency Reconstruction loss. |
DFSSM outperforms state-of-the-art deraining methods on both synthetic (Rain200H, Rain200L) and real-world (SPA-Data) datasets.
Frequency domain processing through FFTM and the use of MGCB are shown to be effective for rain removal.
Ablation studies demonstrate the contribution of each component in DFSSM to the overall performance gain. |
The inference time of DFSSM is currently slower than some compared Transformer-based methods, potentially due to the lack of optimized implementation for SSMs.
Future work could focus on further improving the model efficiency and exploring the application of DFSSM in video deraining tasks. |
image deraining, state space models, frequency domain processing, deep learning, computer vision |
2405.16401
Report |
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning |
Neha Kalibhat, Priyatham Kattakinda, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi |
Vision transformers have established a precedent of patchifying images into
uniformly-sized chunks before processing. We hypothesize that this design
choice may limit models in learning comprehensive and compositional
representations from visual data. This paper explores the notion of providing
semantically-meaningful visual tokens to transformer encoders within a
vision-language pre-training framework. Leveraging off-the-shelf segmentation
and scene-graph models, we extract representations of instance segmentation
masks (referred to as tangible tokens) and relationships and actions (referred
to as intangible tokens). Subsequently, we pre-train a vision-side transformer
by incorporating these newly extracted tokens and aligning the resultant
embeddings with caption embeddings from a text-side encoder. To capture the
structural and semantic relationships among visual tokens, we introduce
additive attention weights, which are used to compute self-attention scores.
Our experiments on COCO demonstrate notable improvements over ViTs in learned
representation quality across text-to-image (+47%) and image-to-text retrieval
(+44%) tasks. Furthermore, we showcase the advantages on compositionality
benchmarks such as ARO (+18%) and Winoground (+10%). |
This paper proposes using semantically meaningful visual tokens, extracted from off-the-shelf segmentation and scene-graph models, to improve representation learning in vision transformers. |
The authors hypothesize that the standard practice of patchifying images into uniformly-sized chunks limits the model's ability to learn comprehensive and compositional representations. |
The authors extract tangible tokens (instance segmentation masks) and intangible tokens (relationships and actions) using SEEM and RAM. They pre-train a vision transformer by incorporating these tokens and aligning the resulting embeddings with caption embeddings from a text-side encoder. Additive attention weights, based on structural and semantic relationships, are introduced to enhance representation learning. |
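One common way to realize additive attention weights is to add a pairwise bias to the attention logits before the softmax. The sketch below shows that variant only; it is an assumption about the mechanism rather than the authors' exact formulation, and the function names and the `bias` matrix (standing in for structural or semantic relationships between tokens) are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_additive_bias(q, k, bias):
    """Return attention weights softmax(q @ k^T / sqrt(d) + bias), row by row."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        logits = [sum(u * v for u, v in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
                  for j, kj in enumerate(k)]
        weights.append(softmax(logits))
    return weights

# Usage: a positive bias on the pair (0, 1) raises token 0's attention to token 1.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [[0.0, 0.0], [0.0, 0.0]]
related = [[0.0, 2.0], [0.0, 0.0]]
plain = attention_with_additive_bias(q, k, no_bias)
biased = attention_with_additive_bias(q, k, related)
```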
The proposed method achieves a 47% improvement in text-to-image retrieval accuracy over a standard ViT and 9% over a fine-tuned CLIP model on COCO.
The learned representations show improved compositional reasoning capabilities, outperforming a ViT by 18% on the ARO benchmark and 10% on the Winoground benchmark.
Using additive attention based on semantic relationships and relative positions further enhances performance on compositionality benchmarks. |
Pre-processing images to extract tokens introduces computational and memory overhead.
The scalability of the approach to larger datasets and more complex scenes needs further investigation. |
vision transformers, tokenization, semantic segmentation, scene graphs, compositional reasoning |
2405.16393
Report |
Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation |
Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui |
Recent advancements in human video synthesis have enabled the generation of
high-quality videos through the application of stable diffusion models.
However, existing methods predominantly concentrate on animating solely the
human element (the foreground) guided by pose information, while leaving the
background entirely static. Contrary to this, in authentic, high-quality
videos, backgrounds often dynamically adjust in harmony with foreground
movements, eschewing stagnancy. We introduce a technique that concurrently
learns both foreground and background dynamics by segregating their movements
using distinct motion representations. Human figures are animated leveraging
pose-based motion, capturing intricate actions. Conversely, for backgrounds, we
employ sparse tracking points to model motion, thereby reflecting the natural
interaction between foreground activity and environmental changes. Training on
real-world videos enhanced with this innovative motion depiction approach, our
model generates videos exhibiting coherent movement in both foreground subjects
and their surrounding contexts. To further extend video generation to longer
sequences without accumulating errors, we adopt a clip-by-clip generation
strategy, introducing global features at each step. To ensure seamless
continuity across these segments, we ingeniously link the final frame of a
produced clip with input noise to spawn the succeeding one, maintaining
narrative flow. Throughout the sequential generation process, we infuse the
feature representation of the initial reference image into the network,
effectively curtailing any cumulative color inconsistencies that may otherwise
arise. Empirical evaluations attest to the superiority of our method in
producing videos that exhibit harmonious interplay between foreground actions
and responsive background dynamics, surpassing prior methodologies in this
regard. |
This paper proposes a novel video generation method that decouples foreground and background motion representation, enabling the generation of videos with dynamic backgrounds, unlike previous methods that mainly focused on animating foreground figures against static backgrounds. |
Most existing human video synthesis methods generate videos with static backgrounds, which contradicts real-world scenarios where backgrounds are often dynamic. This limits the realism of generated videos. |
The proposed method utilizes pose estimation to capture foreground (human) motion and sparse tracking points to model background motion. It employs a clip-by-clip generation strategy with condition concatenation and global feature extraction to generate longer videos without accumulating errors. |
The method successfully generates realistic human videos with natural foreground motion and believable background dynamics, outperforming previous state-of-the-art methods on benchmark datasets.
Qualitative and quantitative evaluations demonstrate the superior performance of the proposed method in terms of visual quality, motion fidelity, and temporal coherence.
Ablation studies confirm the effectiveness of each proposed component, including foreground and background motion representation, condition concatenation, and global feature extraction. |
The method's performance depends on the accuracy of the pose estimation and tracking point extraction techniques used.
The use of sparse tracking points may not capture the full complexity of background motion. Increasing the number of tracking points could improve this but at a computational cost. |
video generation, diffusion models, motion representation, dynamic backgrounds, long video synthesis |
2405.16341
Report |
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model |
Changhoon Kim, Kyle Min, Yezhou Yang |
In the evolving landscape of text-to-image (T2I) diffusion models, the
remarkable capability to generate high-quality images from textual descriptions
faces challenges with the potential misuse of reproducing sensitive content. To
address this critical issue, we introduce Robust Adversarial Concept Erase
(RACE), a novel approach designed to mitigate these risks by enhancing the
robustness of concept erasure methods for T2I models. RACE utilizes a
sophisticated adversarial training framework to identify and mitigate
adversarial text embeddings, significantly reducing the Attack Success Rate
(ASR). Impressively, RACE achieves a 30 percentage point reduction in ASR for
the "nudity" concept against the leading white-box attack method. Our
extensive evaluations demonstrate RACE's effectiveness in defending against
both white-box and black-box attacks, marking a significant advancement in
protecting T2I diffusion models from generating inappropriate or misleading
imagery. This work underlines the essential need for proactive defense measures
in adapting to the rapidly advancing field of adversarial challenges. |
The paper introduces RACE (Robust Adversarial Concept Erase), a novel method to enhance the robustness of concept erasure in text-to-image diffusion models against adversarial attacks aiming to regenerate erased content. |
Existing concept erasure techniques, while effective in removing sensitive content, are vulnerable to red-teaming attacks that can reconstruct the erased concepts using cleverly designed prompts. This poses risks of misuse and necessitates more robust erasure methods. |
RACE leverages an adversarial training framework that identifies adversarial text embeddings capable of reconstructing erased concepts. It efficiently uncovers these embeddings within a single timestep of the diffusion process and integrates them into the concept erasure workflow, enhancing the model's resilience against attacks. |
RACE significantly reduces the Attack Success Rate (ASR) against both white-box and black-box attacks targeting various concepts, including artistic styles, explicit content, and objects.
For instance, RACE achieves a 30 percentage point reduction in ASR for the 'nudity' concept against the leading white-box attack method.
RACE exhibits disentanglement capabilities, effectively erasing target concepts while minimizing the impact on the generation of other unrelated concepts. |
There's a trade-off observed between enhancing robustness and maintaining image quality, particularly noticeable when erasing concepts beyond artistic styles.
The selection of representative keywords for concept erasure significantly influences the effectiveness of the method, as highlighted by the challenges in erasing 'violence' and 'illegal act' content. |
text-to-image synthesis, concept erasure, adversarial training, diffusion models, robustness |
2405.16287
Report |
LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters |
Xinyu Zhou, Boris Knyazev, Alexia Jolicoeur-Martineau, Jie Fu |
A good initialization of deep learning models is essential since it can help
them converge better and faster. However, pretraining large models is
unaffordable for many researchers, which makes good prediction of initial
parameters increasingly necessary. Graph HyperNetworks (GHNs), one approach to
predicting model parameters, have recently shown strong performance in
initializing large vision models. Unfortunately, predicting parameters of very
wide networks relies on copying small chunks of parameters multiple times and
requires an extremely large number of parameters to support full prediction,
which greatly hinders its adoption in practice. To address this limitation, we
propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter
decoder that expands to significantly wider networks without requiring as
excessive an increase in parameters as previous attempts. LoGAH allows us to
predict the parameters of large neural networks with 774 million parameters in a
memory-efficient manner. We show that vision and language models (i.e., ViT and
GPT-2) initialized with LoGAH achieve better performance than those initialized
randomly or using existing hypernetworks. Furthermore, we show promising
transfer learning results w.r.t. training LoGAH on small datasets and using the
predicted parameters to initialize for larger tasks. Our code is available at
https://github.com/Blackzxy/LoGAH . |
This paper proposes LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring as excessive an increase in parameters as previous attempts. |
A good initialization of deep learning models is essential, but pretraining large models is unaffordable for many researchers. Existing GHNs have limitations in predicting parameters of very wide networks. |
The paper introduces LoGAH, a novel low-rank parameter decoder that reduces the number of parameters required for prediction. It also creates new datasets, ViTs-1K and GPTs-1K, containing diverse ViT-style and GPT-2-style computational graphs, respectively. |
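The saving from a low-rank decoder can be seen with simple arithmetic: predicting two thin factors and multiplying them needs far fewer outputs than predicting every entry of a wide weight matrix. The sketch below is illustrative only. The function names, the width of 1280, and the rank of 32 are assumptions, and the counts refer to the number of predicted values, not to the size of LoGAH itself.

```python
def full_output_count(d_out, d_in):
    """Values a decoder must emit to predict a d_out x d_in matrix directly."""
    return d_out * d_in

def low_rank_output_count(d_out, d_in, rank):
    """Values needed for factors A (d_out x rank) and B (rank x d_in)."""
    return rank * (d_out + d_in)

def compose_weight(a, b):
    """Rebuild the full matrix W = A @ B from the two predicted factors."""
    return [[sum(a[i][r] * b[r][j] for r in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Usage with hypothetical sizes: a 1280 x 1280 layer and rank 32.
full = full_output_count(1280, 1280)             # 1638400 values
low_rank = low_rank_output_count(1280, 1280, 32) # 81920 values
saving = full / low_rank                         # 20x fewer predicted values

# A tiny 3 x 2 example of recomposing a weight from rank-1 factors.
w = compose_weight([[1.0], [2.0], [3.0]], [[4.0, 5.0]])
```

The trade-off is that the recomposed matrix has rank at most `rank`, so the choice of rank bounds how expressive the predicted initialization can be.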
LoGAH outperforms GHN-3 and random initialization in initializing ViT and GPT-2 models, achieving better performance on CIFAR, ImageNet, and WikiText datasets.
Increasing the meta-batch size during training can improve LoGAH performance significantly.
LoGAH demonstrates promising transfer learning ability, showing good performance when trained on a smaller dataset and used for initializing larger tasks. |
The GPT-2 experiments are limited to the WikiText dataset and smaller LoGAH models due to time and resource constraints.
Training on larger datasets and exploring LoGAH's capability on modern LLMs is left for future work. |
graph hypernetworks, parameter prediction, model initialization, vision transformers, gpt-2 |
2405.16260
Report |
Enhancing Consistency-Based Image Generation via Adversarialy-Trained Classification and Energy-Based Discrimination |
Shelly Golan, Roy Ganz, Michael Elad |
The recently introduced Consistency models pose an efficient alternative to
diffusion algorithms, enabling rapid and good quality image synthesis. These
methods overcome the slowness of diffusion models by directly mapping noise to
data, while maintaining a (relatively) simpler training. Consistency models
enable a fast one- or few-step generation, but they typically fall somewhat
short in sample quality when compared to their diffusion origins. In this work
we propose a novel and highly effective technique for post-processing
Consistency-based generated images, enhancing their perceptual quality. Our
approach utilizes a joint classifier-discriminator model, in which both
portions are trained adversarially. While the classifier aims to grade an image
based on its assignment to a designated class, the discriminator portion of the
very same network leverages the softmax values to assess the proximity of the
input image to the targeted data manifold, thereby serving as an Energy-based
Model. By employing example-specific projected gradient iterations under the
guidance of this joint machine, we refine synthesized images and achieve
improved FID scores on the ImageNet 64x64 dataset for both Consistency-Training
and Consistency-Distillation techniques. |
This paper introduces a novel post-processing technique to enhance the perceptual quality of images generated by Consistency models using a joint classifier-discriminator network. |
Consistency models offer fast image synthesis but often lack the quality of diffusion models. This method bridges this quality gap without extensive retraining. |
A joint classifier-discriminator is adversarially trained on both real and synthetic images. This model then guides the refinement of generated images using projected gradient iterations, aiming to align them with both a target class and the real data manifold. |
The method significantly improves FID scores on ImageNet 64x64 for both Consistency-Training (27.48% boost) and Consistency-Distillation (20.96% boost).
The joint classifier-discriminator proves more effective than using a robust classifier alone (BIGROC), showing an additional 11.2% FID improvement.
Preliminary results suggest the method's generalizability to other generative models beyond Consistency models. |
The study is limited by the capabilities of the chosen RN50 architecture for the joint model.
Training relies solely on Consistency-generated images, limiting its generalization potential. Future work could explore diverse datasets with different generative models. |
image synthesis, consistency models, perceptual quality, adversarial training, energy-based models |
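The projected gradient refinement described in the entry above can be illustrated with a minimal sketch. It assumes an L-infinity ball around the original sample as the projection set and uses a toy quadratic score in place of the joint classifier-discriminator; the names `project_linf`, `refine`, `step`, and `eps` are hypothetical and not taken from the paper.

```python
def project_linf(x, x0, eps):
    """Clip x so every coordinate stays within eps of the original sample x0."""
    return [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]


def refine(x0, grad_fn, step=0.1, eps=0.5, iters=20):
    """Gradient ascent on a score, projected back near x0 after every step.

    grad_fn stands in for the gradient of the joint model's score with
    respect to the image (an assumption made for this sketch).
    """
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        x = [xi + step * gi for xi, gi in zip(x, g)]
        x = project_linf(x, x0, eps)
    return x


# Toy score -0.5 * ||x - target||^2, whose gradient is (target - x).
target = [1.0, -2.0]
toy_grad = lambda x: [t - xi for t, xi in zip(target, x)]
refined = refine([0.0, 0.0], toy_grad)
```

The projection keeps the refined sample close to the generated one, so the update changes perceptual detail without replacing the image content.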
2405.16098
Report |
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion |
Zizhao Hu, Mohammad Rostami |
The Transformer architecture has dominated machine learning in a wide range
of tasks. The specific characteristic of this architecture is an expensive
scaled dot-product attention mechanism that models the inter-token
interactions, which is known to be the reason behind its success. However, such
a mechanism does not have a direct parallel in the human brain, which raises the
question of whether scaled dot-product attention is necessary for intelligence with strong
expressive power. Inspired by the lateralization of the human brain, we propose
a new simple but effective architecture called the Lateralization MLP (L-MLP).
Stacking L-MLP blocks can generate complex architectures. Each L-MLP block is
based on a multi-layer perceptron (MLP) that permutes data dimensions,
processes each dimension in parallel, merges them, and finally passes through a
joint MLP. We discover that this specific design outperforms other MLP variants
and performs comparably to a transformer-based architecture in the challenging
diffusion task while being highly efficient. We conduct experiments using
text-to-image generation tasks to demonstrate the effectiveness and efficiency
of L-MLP. Further, we look into the model behavior and discover a connection to
the function of the human brain. Our code is publicly available:
https://github.com/zizhao-hu/L-MLP |
This paper proposes L-MLP, a novel MLP-based architecture for vision tasks inspired by the functional lateralization of the human brain. |
The dominant Transformer architecture, while effective, lacks a direct parallel in the human brain and relies on computationally expensive attention mechanisms. L-MLP offers a simpler, more brain-inspired, and computationally efficient alternative. |
L-MLP leverages a two-stage processing approach with dimension permutation, separate normalization and transformations for different dimensions, merging of processed features, and residual connections. The authors demonstrate the architecture's effectiveness on a challenging text-to-image diffusion task. |
L-MLP achieves comparable image generation quality to Transformer-based models on the MS-COCO dataset, achieving an FID score of 8.62.
The architecture demonstrates superior computational efficiency, with faster training and inference speeds compared to Transformers.
Analysis of L-MLP reveals functional lateralization within the network during training, mimicking the behavior of the human brain. |
L-MLP still exhibits an expressive gap compared to Transformer-based models, potentially due to the absence of higher-order interactions present in attention mechanisms.
The current design scales quadratically with sequence length, which limits its application to natural language processing tasks that involve long sequences. |
mlp, vision transformer, diffusion models, text-to-image generation, brain-inspired ai |
2405.16034
Report |
DiffuBox: Refining 3D Object Detection with Point Diffusion |
Xiangyu Chen, Zhenzhen Liu, Katie Z Luo, Siddhartha Datta, Adhitya Polavaram, Yan Wang, Yurong You, Boyi Li, Marco Pavone, Wei-Lun Chao, Mark Campbell, Bharath Hariharan, Kilian Q. Weinberger |
Ensuring robust 3D object detection and localization is crucial for many
applications in robotics and autonomous driving. Recent models, however, face
difficulties in maintaining high performance when applied to domains with
differing sensor setups or geographic locations, often resulting in poor
localization accuracy due to domain shift. To overcome this challenge, we
introduce a novel diffusion-based box refinement approach. This method employs
a domain-agnostic diffusion model, conditioned on the LiDAR points surrounding
a coarse bounding box, to simultaneously refine the box's location, size, and
orientation. We evaluate this approach under various domain adaptation
settings, and our results reveal significant improvements across different
datasets, object classes and detectors. |
This paper introduces a novel diffusion-based box refinement approach for domain adaptation in 3D object detection, which refines bounding box location, size, and orientation using a domain-agnostic diffusion model conditioned on LiDAR points. |
Robust 3D object detection is crucial for robotics and autonomous driving, but existing models struggle with domain shift. This method addresses the challenge of maintaining high performance across domains with different sensor setups or geographic locations. |
The method leverages a point cloud diffusion model trained on a normalized box view (NBV) to learn the scale-invariant distribution of points relative to object bounding boxes. This allows for refining noisy bounding box proposals from object detectors without retraining. |
The method significantly improves mAP performance (up to 24 mAP) across different datasets, object classes, and detectors.
It shows particularly strong improvements in near-range box refinement where point density is higher.
The approach complements existing domain adaptation methods and further improves their performance. |
The method currently doesn't address false negatives in object detection.
Future work could explore incorporating exploration strategies or distilling detectors to handle false negatives. |
3d object detection, domain adaptation, diffusion models, lidar point clouds, autonomous driving |
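The normalized box view mentioned in the entry above can be sketched as a coordinate transform. This is a minimal illustration that assumes the box is mapped to the cube [-1, 1]^3 with rotation only about the vertical axis; the function name and argument layout are hypothetical.

```python
import math


def to_normalized_box_view(points, center, size, yaw):
    """Express LiDAR points in a box-relative frame where the box spans [-1, 1]^3.

    points: iterable of (x, y, z); center: (cx, cy, cz);
    size: (length, width, height); yaw: heading angle in radians.
    """
    cx, cy, cz = center
    length, width, height = size
    c, s = math.cos(-yaw), math.sin(-yaw)  # rotate by -yaw to undo the heading
    normalized = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        rx = c * dx - s * dy
        ry = s * dx + c * dy
        normalized.append((2 * rx / length, 2 * ry / width, 2 * dz / height))
    return normalized
```

Because each axis is divided by the box extent, a small pedestrian box and a large truck box produce point distributions on the same scale, which is the scale-invariance property the entry refers to.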
2405.16009
Report |
Streaming Long Video Understanding with Large Language Models |
Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang |
This paper presents VideoStreaming, an advanced vision-language large model
(VLLM) for video understanding, that capably understands arbitrary-length video
with a constant number of video tokens streamingly encoded and adaptively
selected. The challenge of video understanding in the vision language area
mainly lies in the significant computational burden caused by the great number
of tokens extracted from long videos. Previous works rely on sparse sampling or
frame compression to reduce tokens. However, such approaches either disregard
temporal information in a long time span or sacrifice spatial details,
resulting in flawed compression. To address these limitations, our
VideoStreaming has two core designs: Memory-Propagated Streaming Encoding and
Adaptive Memory Selection. The Memory-Propagated Streaming Encoding
architecture segments long videos into short clips and sequentially encodes
each clip with a propagated memory. In each iteration, we utilize the encoded
results of the preceding clip as historical memory, which is integrated with
the current clip to distill a condensed representation that encapsulates the
video content up to the current timestamp. After the encoding process, the
Adaptive Memory Selection strategy selects a constant number of
question-related memories from all the historical memories and feeds them into
the LLM to generate informative responses. The question-related selection
reduces redundancy within the memories, enabling efficient and precise video
understanding. Meanwhile, the disentangled video extraction and reasoning
design allows the LLM to answer different questions about a video by directly
selecting corresponding memories, without the need to encode the whole video
for each question. Our model achieves superior performance and higher
efficiency on long video benchmarks, showcasing precise temporal comprehension
for detailed question answering. |
This paper proposes VideoStreaming, a novel Vision-Language Large Model (VLLM) that understands arbitrarily long videos efficiently using a fixed number of video tokens by streamingly encoding and adaptively selecting memories. |
Understanding long videos is a challenge for VLLMs due to the high computational cost and potential information loss from large token sequences extracted from videos. Existing methods based on sparse sampling or frame compression fail to fully capture temporal dynamics or require recomputation for different queries. |
VideoStreaming introduces two core designs: (1) Memory-Propagated Streaming Encoding divides a long video into short clips and encodes each clip sequentially using a small language model, incorporating historical information from the previous clip. (2) Adaptive Memory Selection uses a question-related indicator to select a fixed number of most relevant memories from all encoded memories, reducing redundancy and enabling precise video understanding. |
VideoStreaming outperforms existing methods on various long video benchmarks, including VideoChatGPT, EgoSchema, Next-QA, Next-GQA, MovieChat-1K, and MovieNet-QA.
The model exhibits superior temporal understanding, evidenced by its high performance on tasks requiring precise temporal grounding.
VideoStreaming achieves high inference efficiency by significantly reducing the number of tokens fed into the LLM compared to existing methods. |
The current uniform frame sampling strategy could be improved by considering the information density of different video segments.
Exploration of adaptive segmentation techniques that dynamically adjust clip lengths based on video content complexity is a promising direction for future work. |
vision-language model, long video understanding, video question answering, temporal grounding, memory-propagated streaming encoding, adaptive memory selection |
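The Adaptive Memory Selection step in the entry above can be sketched as a top-k lookup. This is a simplified illustration that assumes dot-product similarity between a question embedding and one summary vector per clip memory; the paper's actual indicator and selection rule may differ, and `select_memories` is a hypothetical name.

```python
def select_memories(question_vec, memories, k):
    """Return the indices of the k memories most similar to the question.

    Indices are returned in temporal order so the LLM sees the selected
    clips in the order they occurred.
    """
    scores = [sum(q * m for q, m in zip(question_vec, mem)) for mem in memories]
    top = sorted(range(len(memories)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Four clip memories; the question aligns with the first feature dimension.
memories = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
selected = select_memories([1.0, 0.0], memories, k=2)
```

Since the memories are computed once per video, answering a new question only repeats this cheap selection rather than re-encoding the video.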
2405.16005
Report |
PTQ4DiT: Post-training Quantization for Diffusion Transformers |
Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan |
The recent introduction of Diffusion Transformers (DiTs) has demonstrated
exceptional capabilities in image generation by using a different backbone
architecture, departing from traditional U-Nets and embracing the scalable
nature of transformers. Despite their advanced capabilities, the wide
deployment of DiTs, particularly for real-time applications, is currently
hampered by considerable computational demands at the inference stage.
Post-training Quantization (PTQ) has emerged as a fast and data-efficient
solution that can significantly reduce computation and memory footprint by
using low-bit weights and activations. However, its applicability to DiTs has
not yet been explored and faces non-trivial difficulties due to the unique
design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ
method for DiTs. We discover two primary quantization challenges inherent in
DiTs, notably the presence of salient channels with extreme magnitudes and the
temporal variability in distributions of salient activation over multiple
timesteps. To tackle these challenges, we propose Channel-wise Salience
Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB
leverages the complementarity property of channel magnitudes to redistribute
the extremes, alleviating quantization errors for both activations and weights.
SSC extends this approach by dynamically adjusting the balanced salience to
capture the temporal variations in activation. Additionally, to eliminate extra
computational costs caused by PTQ4DiT during inference, we design an offline
re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT
successfully quantizes DiTs to 8-bit precision (W8A8) while preserving
comparable generation ability and further enables effective quantization to
4-bit weight precision (W4A8) for the first time. |
This paper proposes PTQ4DiT, a novel post-training quantization method specifically designed for Diffusion Transformers (DiTs) that effectively reduces their computational complexity while maintaining high-quality image generation. |
Diffusion Transformers (DiTs) have shown exceptional image generation capabilities but their high computational cost at inference hinders their deployment in real-time applications. PTQ4DiT addresses this by enabling efficient inference through quantization without the need for costly retraining. |
PTQ4DiT tackles the challenges of salient channels and temporal variation in DiTs by introducing: (1) Channel-wise Salience Balancing (CSB) to redistribute extreme magnitudes in activation and weight channels, and (2) Spearman's rho-guided Salience Calibration (SSC) to dynamically adjust salience evaluations across different timesteps. Additionally, a re-parameterization scheme ensures efficient inference by pre-integrating balancing matrices. |
PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving generation quality comparable to the full-precision models.
The method enables effective quantization to 4-bit weight precision (W4A8) for the first time, achieving significantly better performance than existing PTQ methods.
Ablation studies validate the effectiveness of the proposed CSB and SSC components in improving the quantization performance. |
The research currently focuses on visual generation.
The ethical considerations of potential misuse of generative models are acknowledged but not fully addressed in the scope of this work. |
diffusion models, transformers, model quantization, image generation, real-time applications |
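The idea behind Channel-wise Salience Balancing in the entry above can be sketched with a diagonal rescaling that leaves the layer output unchanged. The sketch assumes salience is the per-channel maximum absolute value and that the balanced salience is the geometric mean of the activation and weight saliences; both choices, and the helper names, are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def balance(x, w):
    """Rescale each input channel of x and the matching row of w.

    x: tokens x channels, w: channels x outputs. The two scale factors
    multiply to 1, so x_bal @ w_bal equals x @ w. Assumes every channel
    has a nonzero maximum.
    """
    n_ch = len(w)
    s_x = [max(abs(row[j]) for row in x) for j in range(n_ch)]
    s_w = [max(abs(v) for v in w[j]) for j in range(n_ch)]
    s_bal = [math.sqrt(a * b) for a, b in zip(s_x, s_w)]
    x_bal = [[row[j] * s_bal[j] / s_x[j] for j in range(n_ch)] for row in x]
    w_bal = [[v * s_bal[j] / s_w[j] for v in w[j]] for j in range(n_ch)]
    return x_bal, w_bal


# Channel 0 is extreme in the activations, channel 1 in the weights.
x = [[8.0, 0.5], [-4.0, 0.25]]
w = [[0.5, -0.25], [2.0, 4.0]]
x_bal, w_bal = balance(x, w)
```

After balancing, the largest magnitude in each activation channel matches the largest magnitude in the corresponding weight row, so neither side carries an outlier that dominates the quantization range.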
2405.15914
Report |
ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching |
Yumin Zhang, Xingyu Miao, Haoran Duan, Bo Wei, Tejal Shah, Yang Long, Rajiv Ranjan |
Text-to-3D content creation is a rapidly evolving research area. Given the
scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion
models for 3D synthesis. Among these approaches, Score Distillation Sampling
(SDS) has been widely adopted. However, the issue of over-smoothing poses a
significant limitation on the high-fidelity generation of 3D models. To address
this challenge, LucidDreamer replaces the Denoising Diffusion Probabilistic
Model (DDPM) in SDS with the Denoising Diffusion Implicit Model (DDIM) to
construct Interval Score Matching (ISM). However, ISM inevitably inherits
inconsistencies from DDIM, causing reconstruction errors during the DDIM
inversion process. This results in poor performance in the detailed generation
of 3D objects and loss of content. To alleviate these problems, we propose a
novel method named Exact Score Matching (ESM). Specifically, ESM leverages
auxiliary variables to mathematically guarantee exact recovery in the DDIM
reverse process. Furthermore, to effectively capture the dynamic changes of the
original and auxiliary variables, the LoRA of a pre-trained diffusion model
implements these exact paths. Extensive experiments demonstrate the
effectiveness of ESM in text-to-3D generation, particularly highlighting its
superiority in detailed generation. |
This paper proposes Exact Score Matching (ESM), a novel text-to-3D generation method that improves consistency and detail by addressing limitations of the DDIM inversion process in Interval Score Matching (ISM). |
Over-smoothing in existing text-to-3D methods hinders the generation of detailed, high-fidelity 3D models. This paper addresses this by mitigating inconsistencies in the DDIM inversion process used in ISM. |
ESM introduces auxiliary noise variables to construct an exact recovery path during DDIM inversion. It leverages LoRA to adapt a pre-trained 2D diffusion model, effectively capturing the dynamic changes of original and auxiliary noise variables. |
ESM generates high-fidelity 3D models consistent with given text prompts.
Qualitative comparisons show ESM surpasses existing methods in detail generation, particularly in complex geometries and textures.
Experiments demonstrate the impact of hyperparameters like mixture ratio and step sizes on generation quality. |
The method can exhibit unstable generation in some cases.
Generation quality is sensitive to hyperparameter tuning. |
text-to-3d generation, diffusion models, score distillation sampling, denoising diffusion implicit models, exact score matching |
2405.15891
Report |
Score Distillation via Reparametrized DDIM |
Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Campagnolo Guizilini, Timur Bagautdinov, Vincent Sitzmann, Justin Solomon |
While 2D diffusion models generate realistic, high-detail images, 3D shape
generation methods like Score Distillation Sampling (SDS) built on these 2D
diffusion models produce cartoon-like, over-smoothed shapes. To help explain
this discrepancy, we show that the image guidance used in Score Distillation
can be understood as the velocity field of a 2D denoising generative process,
up to the choice of a noise term. In particular, after a change of variables,
SDS resembles a high-variance version of Denoising Diffusion Implicit Models
(DDIM) with a differently-sampled noise term: SDS introduces noise i.i.d.
randomly at each step, while DDIM infers it from the previous noise
predictions. This excessive variance can lead to over-smoothing and unrealistic
outputs. We show that a better noise approximation can be recovered by
inverting DDIM in each SDS update step. This modification makes SDS's
generative process for 2D images almost identical to DDIM. In 3D, it removes
over-smoothing, preserves higher-frequency detail, and brings the generation
quality closer to that of 2D samplers. Experimentally, our method achieves
better or similar 3D generation quality compared to other state-of-the-art
Score Distillation methods, all without training additional neural networks or
multi-view supervision, and provides useful insights into the relationship between
2D and 3D asset generation with diffusion models. |
This paper proposes Score Distillation via Inversion (SDI), a method for 3D shape generation that improves upon Score Distillation Sampling (SDS) by addressing the discrepancy in quality between 2D and 3D generation with diffusion models. |
While 2D diffusion models excel at generating realistic images, 3D shape generation methods like SDS often produce over-smoothed and less detailed results. This paper aims to bridge this quality gap. |
The paper analyzes the SDS algorithm and reveals its connection to DDIM. It then proposes replacing the random noise sampling in SDS with DDIM inversion to improve noise estimation and generation quality. |
SDI generates 3D objects with significantly higher fidelity and detail compared to SDS, closing the quality gap to 2D diffusion models.
The paper provides theoretical insights into the relationship between SDS and DDIM, showing that SDS can be interpreted as a high-variance version of DDIM.
Through experiments and ablations, the authors demonstrate the effectiveness of DDIM inversion for noise estimation in SDS and show that SDI achieves results comparable to or better than those of state-of-the-art methods without additional training or complex pipelines. |
The paper identifies limitations related to 3D consistency and content drift between views, suggesting future work on incorporating depth or normal estimation and stronger view conditioning.
Another limitation stems from the algorithm inheriting biases and limitations present in the underlying 2D diffusion model, such as generating unrealistic features or skewed distributions. |
3d shape generation, diffusion models, score distillation, ddim inversion, nerf |
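The DDIM inversion referred to in the entry above reuses the deterministic DDIM update with the timestep order reversed. The scalar sketch below assumes a constant noise prediction in place of a trained network, which makes the round trip exact; with a real model the inversion is only approximate. `ddim_step` and its arguments are hypothetical names, and `alpha` denotes the cumulative noise-schedule product.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update from noise level alpha_from to alpha_to.

    Moving to a larger alpha denoises; moving to a smaller alpha
    (inversion) adds back the predicted noise.
    """
    x0_pred = (x - math.sqrt(1 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1 - alpha_to) * eps


eps_pred = 0.3  # stand-in for the network's noise prediction (assumption)
x_clean = 1.0
x_noisy = ddim_step(x_clean, eps_pred, alpha_from=0.9, alpha_to=0.5)  # invert
x_back = ddim_step(x_noisy, eps_pred, alpha_from=0.5, alpha_to=0.9)   # denoise
```

In this picture, SDS would draw a fresh random `eps` at every step, while the inversion-based variant infers a noise value that is consistent with the current sample.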
2405.15885
Report |
Diffusion Bridge Implicit Models |
Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu |
Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion
models for interpolating between two arbitrary paired distributions given as
endpoints. Despite their promising performance in tasks like image translation,
DDBMs require a computationally intensive sampling process that involves the
simulation of a (stochastic) differential equation through hundreds of network
evaluations. In this work, we present diffusion bridge implicit models (DBIMs)
for accelerated sampling of diffusion bridges without extra training. We
generalize DDBMs via a class of non-Markovian diffusion bridges defined on the
discretized timesteps concerning sampling, which share the same training
objective as DDBMs. These generalized diffusion bridges give rise to generative
processes ranging from stochastic to deterministic (i.e., an implicit
probabilistic model) while being up to 25$\times$ faster than the vanilla
sampler of DDBMs. Moreover, the deterministic sampling procedure yielded by
DBIMs enables faithful encoding and reconstruction by a booting noise used in
the initial sampling step, and allows us to perform semantically meaningful
interpolation in image translation tasks by regarding the booting noise as the
latent variable. |
This paper proposes Diffusion Bridge Implicit Models (DBIMs) for accelerated sampling of Denoising Diffusion Bridge Models (DDBMs). |
DDBMs are powerful for interpolating between paired distributions but suffer from slow sampling; DBIMs aim to address this limitation. |
The paper generalizes DDBMs to non-Markovian diffusion bridges on discretized timesteps, enabling deterministic sampling akin to implicit probabilistic models. |
DBIMs achieve up to 25x faster sampling than DDBMs without extra training.
DBIMs achieve state-of-the-art FID scores in image translation and restoration tasks with 100 sampling steps.
Deterministic DBIMs enable faithful encoding and semantically meaningful interpolation. |
DBIMs, while faster than DDBMs, are still slower than GAN-based methods for one-step generation.
Future work includes developing dedicated ODE solvers for DDBMs and exploring bridge distillation methods. |
diffusion bridge models, implicit models, accelerated sampling, image translation, image restoration |
2405.15881
Report |
Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation |
Shentong Mo, Yapeng Tian |
In recent developments, the Mamba architecture, known for its selective state
space approach, has shown potential in the efficient modeling of long
sequences. However, its application in image generation remains underexplored.
Traditional diffusion transformers (DiT), which utilize self-attention blocks,
are effective but their computational complexity scales quadratically with the
input length, limiting their use for high-resolution images. To address this
challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM),
which foregoes traditional attention mechanisms in favor of a scalable
alternative. By harnessing the inherent efficiency of the Mamba architecture,
DiM achieves rapid inference times and reduced computational load, maintaining
linear complexity with respect to sequence length. Our architecture not only
scales effectively but also outperforms existing diffusion transformers in both
image and video generation tasks. The results affirm the scalability and
efficiency of DiM, establishing a new benchmark for image and video generation
techniques. This work advances the field of generative models and paves the way
for further applications of scalable architectures. |
This paper proposes Diffusion Mamba (DiM), a novel diffusion model architecture for image and video generation that leverages the efficiency of the Mamba architecture, replacing traditional attention mechanisms with state space models to reduce computational complexity. |
Existing diffusion models, particularly diffusion transformers (DiT), face scalability limitations due to the quadratic complexity of attention mechanisms, hindering their application to high-resolution image and video generation tasks. DiM addresses this challenge by integrating the Mamba architecture's linear complexity for efficient sequence processing. |
DiM adapts the Mamba architecture to handle 2D image data by transforming latent image representations into sequences of patches processed by DiM blocks. Each DiM block employs bidirectional state space models to capture spatial dependencies within and across frames, ensuring temporal coherence in video generation. |
DiM consistently outperforms DiT across various model sizes and training steps on image generation benchmarks like ImageNet, demonstrating faster convergence and better FID-50K scores.
The architecture exhibits significant computational efficiency, achieving comparable or better image generation quality with notably lower Gflops compared to DiT, particularly at higher resolutions.
DiM effectively extends to video generation, achieving competitive Frechet Video Distance (FVD) scores on the UCF-101 dataset, demonstrating its ability to generate high-fidelity and temporally coherent video clips. |
The performance of DiM in video generation, specifically for scenarios involving highly dynamic content or low visibility, requires further investigation.
The current DiM implementation's capacity to capture long-term dependencies in extended video sequences, essential for long-form video generation, needs further exploration and potential enhancements. |
diffusion models, image generation, video generation, state space models, mamba architecture |
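The bidirectional state space processing in the entry above can be sketched with a scalar linear recurrence. This is a deliberately simplified illustration: Mamba uses input-dependent (selective) parameters and vector-valued states, and combining the two directions by summation is an assumption made here.

```python
def ssm_scan(xs, a, b, c):
    """Linear state space recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence, so the cost is linear in its length.
    """
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_ssm(xs, a, b, c):
    """Sum a forward scan and a backward scan so each position sees both sides."""
    fwd = ssm_scan(xs, a, b, c)
    bwd = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


# An impulse at the first patch influences later patches through the forward scan.
out = bidirectional_ssm([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
```

A causal scan alone would let a patch see only the patches before it in the flattened order; the backward pass removes that asymmetry, which matters for images where no patch order is privileged.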
2405.15769
Report |
FastDrag: Manipulate Anything in One Step |
Xuanjia Zhao, Jian Guan, Congyi Fan, Dongli Xu, Youtian Lin, Haiwei Pan, Pengming Feng |
Drag-based image editing using generative models provides precise control
over image contents, enabling users to manipulate anything in an image with a
few clicks. However, prevailing methods typically adopt $n$-step iterations for
latent semantic optimization to achieve drag-based image editing, which is
time-consuming and limits practical applications. In this paper, we introduce a
novel one-step drag-based image editing method, i.e., FastDrag, to accelerate
the editing process. Central to our approach is a latent warpage function
(LWF), which simulates the behavior of a stretched material to adjust the
location of individual pixels within the latent space. This innovation achieves
one-step latent semantic optimization and hence significantly promotes editing
speeds. Meanwhile, null regions emerging after applying LWF are addressed by
our proposed bilateral nearest neighbor interpolation (BNNI) strategy. This
strategy interpolates these regions using similar features from neighboring
areas, thus enhancing semantic integrity. Additionally, a
consistency-preserving strategy is introduced to maintain the consistency
between the edited and original images by adopting semantic information from
the original image, saved as key and value pairs in the self-attention module
during diffusion inversion, to guide the diffusion sampling. Our FastDrag is
validated on the DragBench dataset, demonstrating substantial improvements in
processing time over existing methods, while achieving enhanced editing
performance. |
This paper introduces FastDrag, a novel one-step drag-based image editing method that significantly accelerates the editing process while maintaining high quality. |
Existing drag-based image editing methods rely on time-consuming n-step iterative optimizations, limiting their practicality. FastDrag addresses this limitation by enabling one-step optimization. |
FastDrag employs a latent warpage function (LWF) to simulate the behavior of stretched materials, enabling one-step adjustment of pixel locations in the latent space. It also utilizes bilateral nearest neighbor interpolation (BNNI) to fill null regions and a consistency-preserving strategy to maintain image coherence. |
FastDrag is significantly faster than state-of-the-art methods, achieving up to 700% speed improvement.
It achieves comparable, if not better, editing performance compared to existing techniques.
FastDrag maintains high image quality even in complex textures and multi-point dragging scenarios. |
The paper focuses on drag-based editing and may not generalize to other editing paradigms.
Future work could explore extending FastDrag to handle more complex editing tasks. |
image editing, drag-based editing, diffusion models, latent space manipulation, one-step optimization |
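The null-region filling described in the entry above can be illustrated in one dimension. This sketch is a simplification and not the paper's exact BNNI formulation: it assumes each empty position is filled from its nearest valid neighbor on each side, weighted by inverse distance. `fill_nulls_1d` is a hypothetical name, and `None` marks a null position.

```python
def fill_nulls_1d(values):
    """Fill None entries from the nearest valid neighbors on the left and right."""
    n = len(values)
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Each neighbor is stored as (distance, value); None if that side has no valid entry.
        left = next(((i - j, values[j]) for j in range(i - 1, -1, -1)
                     if values[j] is not None), None)
        right = next(((j - i, values[j]) for j in range(i + 1, n)
                      if values[j] is not None), None)
        neighbors = [p for p in (left, right) if p is not None]
        if not neighbors:
            continue  # nothing to interpolate from
        weights = [1.0 / d for d, _ in neighbors]
        out[i] = sum(w * val for w, (_, val) in zip(weights, neighbors)) / sum(weights)
    return out


filled = fill_nulls_1d([1.0, None, None, 4.0])
```

In a 2D latent the same idea would extend to the vertical direction as well, so that each null position borrows features from valid positions around it.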
2405.15758
Report |
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation |
Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian |
Recent talking avatar generation models have made strides in achieving
realistic and accurate lip synchronization with the audio, but often fall short
in controlling and conveying detailed expressions and emotions of the avatar,
making the generated video less vivid and controllable. In this paper, we
propose a novel text-guided approach for generating emotionally expressive 2D
avatars, offering fine-grained control, improved interactivity, and
generalizability to the resulting video. Our framework, named InstructAvatar,
leverages a natural language interface to control the emotion as well as the
facial motion of avatars. Technically, we design an automatic annotation
pipeline to construct an instruction-video paired training dataset, equipped
with a novel two-branch diffusion-based generator to predict avatars with audio
and text instructions at the same time. Experimental results demonstrate that
InstructAvatar produces results that align well with both conditions, and
outperforms existing methods in fine-grained emotion control, lip-sync quality,
and naturalness. Our project page is
https://wangyuchi369.github.io/InstructAvatar/. |
Introduces InstructAvatar, a text-guided diffusion-based model for generating expressive 2D talking avatars with fine-grained control over emotions and facial motions. |
Existing talking avatar generation models struggle with conveying and controlling detailed expressions and motions, resulting in less vivid and controllable videos. |
Leverages a natural language interface to control avatar expressions and motions. Employs an automatic annotation pipeline with GPT-4V to generate fine-grained text instructions from videos. Uses a two-branch diffusion model with cross-attention to incorporate emotion and motion instructions during video generation. |
InstructAvatar significantly improves emotion control, lip-sync quality, and naturalness compared to existing methods.
Enables control over a wider range of instructions due to its natural language interface.
Successfully animates avatars directly from text instructions without relying on audio cues. |
Limited ability to control disentangled single action units due to training on combined action units.
Modest training dataset size may hinder robustness in handling out-of-domain instructions or appearances.
Inability to simultaneously control emotion and motion due to training data limitations. |
talking avatar generation, emotion control, facial motion control, text-guided generation, diffusion models |
2405.15757
Report |
Looking Backward: Streaming Video-to-Video Translation with Feature Banks |
Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu |
This paper introduces StreamV2V, a diffusion model that achieves real-time
streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V
methods using batches to process limited frames, we opt to process frames in a
streaming fashion, to support unlimited frames. At the heart of StreamV2V lies
a backward-looking principle that relates the present to the past. This is
realized by maintaining a feature bank, which archives information from past
frames. For incoming frames, StreamV2V extends self-attention to include banked
keys and values and directly fuses similar past features into the output. The
feature bank is continually updated by merging stored and new features, making
it compact but informative. StreamV2V stands out for its adaptability and
efficiency, seamlessly integrating with image diffusion models without
fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x
faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative
metrics and user studies confirm StreamV2V's exceptional ability to maintain
temporal consistency. |
StreamV2V is a novel diffusion model that performs real-time video-to-video translation on streaming video inputs, unlike previous batch-based methods limited to short clips. |
Existing video-to-video translation methods are constrained by batch processing, limiting their ability to handle long or streaming videos in real-time applications. |
StreamV2V processes frames sequentially, maintaining a compact feature bank of past frames. It leverages extended self-attention and direct feature fusion to ensure temporal consistency during generation, building upon the StreamDiffusion framework. |
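A minimal NumPy sketch of the two ideas named above: self-attention extended with banked keys and values, and a bank that stays compact by merging similar features. The function names, the cosine-similarity pair-averaging merge rule, and the `max_size` parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extended_self_attention(q, k, v, bank_k, bank_v):
    """Attend from current-frame queries to current plus banked keys/values."""
    k_ext = np.concatenate([k, bank_k], axis=0)
    v_ext = np.concatenate([v, bank_v], axis=0)
    scores = q @ k_ext.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_ext

def merge_into_bank(bank, new, max_size):
    """Hypothetical merge rule: average the most similar pair until the bank fits."""
    merged = np.concatenate([bank, new], axis=0)
    while merged.shape[0] > max_size:
        normed = merged / (np.linalg.norm(merged, axis=1, keepdims=True) + 1e-8)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        fused = 0.5 * (merged[i] + merged[j])
        keep = [r for r in range(merged.shape[0]) if r not in (i, j)]
        merged = np.concatenate([merged[keep], fused[None]], axis=0)
    return merged
```

With an empty bank, `extended_self_attention` reduces to ordinary self-attention, which is one way to see why the extension can be added to an image diffusion model without fine-tuning.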
StreamV2V achieves real-time performance (20 FPS on a single A100 GPU), significantly outpacing prior V2V methods.
A dynamic merging strategy for the feature bank balances compactness with informativeness, enabling efficient processing without sacrificing consistency.
Quantitative metrics and user studies demonstrate StreamV2V's effectiveness in maintaining temporal consistency while enabling real-time video editing. |
StreamV2V's editing capability is currently limited by the underlying image editing method (SDEdit).
While generally consistent, the model can produce artifacts, especially in videos with rapid camera or object movement, leaving room for further improvement. |
video-to-video translation, diffusion models, real-time processing, streaming video, feature banks |
2405.15738
Report |
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models |
Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng |
High-resolution Large Multimodal Models (LMMs) encounter the challenges of
excessive visual tokens and quadratic visual complexity. Current
high-resolution LMMs address the quadratic complexity while still generating
excessive visual tokens. However, the redundancy in visual tokens is the key
problem as it leads to more substantial compute. To mitigate this issue, we
propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the
visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses
high-resolution images into information-rich visual features, effectively
preventing the generation of excessive visual tokens. To enhance the
capabilities of ConvLLaVA, we propose two critical optimizations. Since the
low-resolution pretrained ConvNeXt underperforms when directly applied on high
resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original
compression ratio is inadequate for much higher resolution inputs, we train a
successive stage to further compress the visual tokens, thereby reducing
redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536
resolution generating only 576 visual tokens, capable of handling images of
arbitrary aspect ratios. Experimental results demonstrate that our method
achieves competitive performance with state-of-the-art models on mainstream
benchmarks. The ConvLLaVA model series is publicly available at
https://github.com/alibaba/conv-llava. |
This paper introduces ConvLLaVA, a large multimodal model that utilizes a five-stage ConvNeXt as its visual encoder to address the challenges of excessive visual tokens and quadratic visual complexity in high-resolution images. |
Existing high-resolution large multimodal models (LMMs) often generate excessive visual tokens, leading to high computational costs and hindering efficient visual information extraction. ConvLLaVA tackles this issue by effectively compressing high-resolution images into information-rich visual features. |
The authors propose ConvLLaVA, which replaces the traditional Vision Transformer (ViT) with a hierarchical ConvNeXt backbone as the visual encoder. They introduce two key optimizations: (1) updating the pretrained ConvNeXt for better performance on high-resolution images and (2) adding a fifth stage to ConvNeXt to further compress visual tokens, reducing redundancy. |
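The token counts behind this design can be checked with a little arithmetic. The sketch below assumes that a standard four-stage ConvNeXt downsamples by 32x overall and that the added fifth stage halves the resolution once more (64x in total). That second assumption is consistent with the 576 tokens reported for 1536x1536 inputs. The 336-pixel, patch-14 ViT setting is included only as a familiar point of comparison and is not taken from the record above.

```python
def visual_token_count(resolution: int, total_stride: int) -> int:
    """Number of visual tokens for a square input and a given overall stride."""
    if resolution % total_stride != 0:
        raise ValueError("resolution must be divisible by the total stride")
    side = resolution // total_stride
    return side * side

FOUR_STAGE_STRIDE = 32   # standard ConvNeXt: 4x stem, then three 2x downsamples
FIVE_STAGE_STRIDE = 64   # assumption: the extra stage adds one more 2x downsample
VIT_PATCH_SIZE = 14      # assumption: a ViT-L/14-style encoder for comparison

tokens_five_stage = visual_token_count(1536, FIVE_STAGE_STRIDE)
tokens_four_stage = visual_token_count(1536, FOUR_STAGE_STRIDE)
tokens_vit_336 = visual_token_count(336, VIT_PATCH_SIZE)
```

Under these assumptions, the five-stage encoder at 1536x1536 emits as many tokens as a patch-14 ViT at 336x336, while the four-stage encoder would emit four times as many.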
ConvLLaVA with a five-stage ConvNeXt successfully compresses visual information, generating fewer visual tokens than ViT-based models at the same resolution.
Updating the pretrained ConvNeXt for high-resolution inputs is crucial for achieving competitive performance on general capability benchmarks.
Higher-resolution ConvLLaVA models consistently outperform lower-resolution counterparts on fine-grained tasks, indicating the effectiveness of compressing high-resolution images into information-rich visual tokens. |
The relatively small kernel size of the current ConvNeXt architecture, optimized for low-resolution images, may limit capacity at extremely high resolutions.
The optimal balance between visual information compression and retrieval capabilities for high-resolution understanding requires further investigation. |
large multimodal models, convnext, visual token compression, high-resolution image understanding, vision-language models |
2405.15734
Report |
LM4LV: A Frozen Large Language Model for Low-level Vision Tasks |
Boyang Zheng, Jinjin Gu, Shijun Li, Chao Dong |
The success of large language models (LLMs) has fostered a new research trend
of multi-modality large language models (MLLMs), which changes the paradigm of
various fields in computer vision. Though MLLMs have shown promising results in
numerous high-level vision and vision-language tasks such as VQA and
text-to-image, no works have demonstrated how low-level vision tasks can
benefit from MLLMs. We find that most current MLLMs are blind to low-level
features due to the design of their vision modules, and are thus inherently
incapable of solving low-level vision tasks. In this work, we propose LM4LV,
a framework that enables a FROZEN LLM to solve a range of low-level vision
tasks without any multi-modal data or prior. This showcases the LLM's strong
potential in low-level vision and bridges the gap between MLLMs and low-level
vision tasks. We hope this work can inspire new perspectives on LLMs and deeper
understanding of their mechanisms. |
This paper investigates the capability of a frozen Large Language Model (LLM) to process and generate low-level visual features, demonstrating its potential in solving low-level vision tasks like image denoising and deraining without multi-modal data or prior. |
Bridging the gap between MLLMs, which excel in high-level vision tasks, and low-level vision tasks is crucial for leveraging LLMs' reasoning and text generation abilities for better user interaction and interpretability in low-level vision. |
The paper proposes LM4LV, a framework that integrates a fine-tuned Masked Autoencoder (MAE) with a frozen LLM. This framework uses linear layers to adapt between visual and text features. The LLM is trained to autoregressively generate visual features conditioned on degraded images and task tokens. |
LM4LV successfully performs various low-level vision tasks like denoising, deblurring, and deraining, showcasing the LLM's ability to process low-level features.
The choice of the vision module is crucial, with MAE outperforming VQGAN and BEiT due to its superior image reconstruction ability and potential alignment with the LLM's representation space.
Auto-regressive generation is essential for LM4LV's success, as a more straightforward ViT-LLM generation scheme fails to produce high-quality results. |
LM4LV struggles to restore high-frequency details due to the lack of image prior.
There is a performance gap between LM4LV and a one-layer Transformer in denoising, indicating room for improvement. |
large language models, low-level vision, multi-modality, image restoration, auto-regressive generation |
2405.15688
Report |
UNION: Unsupervised 3D Object Detection using Object Appearance-based Pseudo-Classes |
Ted Lentsch, Holger Caesar, Dariu M. Gavrila |
Unsupervised 3D object detection methods have emerged to leverage vast
amounts of data efficiently without requiring manual labels for training.
Recent approaches rely on dynamic objects for learning to detect objects but
penalize the detections of static instances during training. Multiple rounds of
(self) training are used in which detected static instances are added to the
set of training targets; this procedure to improve performance is
computationally expensive. To address this, we propose the method UNION. We use
spatial clustering and self-supervised scene flow to obtain a set of static and
dynamic object proposals from LiDAR. Subsequently, object proposals' visual
appearances are encoded to distinguish static objects in the foreground and
background by selecting static instances that are visually similar to dynamic
objects. As a result, static and dynamic foreground objects are obtained
together, and existing detectors can be trained in a single training round. In
addition, we extend 3D object discovery to detection by using object
appearance-based cluster labels as pseudo-class labels for training object
classification. We conduct extensive experiments on the nuScenes dataset and
increase the state-of-the-art performance for unsupervised object discovery,
i.e. UNION more than doubles the average precision to 33.9. The code will be
made publicly available. |
Introduces UNION, a novel framework for unsupervised 3D object detection that leverages LiDAR, camera, and temporal information jointly to generate pseudo-labels for training existing object detectors without manual annotations. |
Unsupervised 3D object detection reduces the dependency on expensive manual labeling, making it important for leveraging large-scale datasets. Existing methods suffer from limitations like iterative self-training and difficulty in detecting static foreground objects. |
UNION first generates object proposals by clustering LiDAR points and estimating their motion using scene flow. Then, it encodes the visual appearance of these proposals using a pre-trained vision foundation model and clusters them based on appearance similarity. Finally, it identifies and leverages the clusters containing dynamic objects to discover both static and dynamic mobile objects, generating pseudo-labels for training. |
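The final selection step can be sketched as follows, assuming each proposal already carries an appearance-cluster id and a dynamic/static flag from scene flow. The function name and the `min_dynamic_fraction` threshold are hypothetical. The record above does not state the exact rule for deciding which clusters count as containing dynamic objects.

```python
import numpy as np

def select_mobile_proposals(cluster_ids, is_dynamic, min_dynamic_fraction=0.05):
    """Keep every proposal whose appearance cluster contains enough dynamic objects.

    Returns the ids of the selected clusters and a boolean mask over proposals.
    Static proposals in a selected cluster are kept as static foreground objects.
    """
    cluster_ids = np.asarray(cluster_ids)
    is_dynamic = np.asarray(is_dynamic, dtype=bool)
    mobile_clusters = []
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        # Fraction of this cluster's proposals that scene flow marked as moving.
        if is_dynamic[members].mean() >= min_dynamic_fraction:
            mobile_clusters.append(int(c))
    keep = np.isin(cluster_ids, mobile_clusters)
    return mobile_clusters, keep
```

The selected cluster ids could then double as pseudo-class labels for the kept proposals, in line with the appearance-based pseudo-classes described in the abstract.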
UNION significantly outperforms existing unsupervised 3D object discovery methods on the nuScenes dataset, achieving more than double the average precision of the best baseline.
Appearance-based clustering is identified as the key component driving UNION's performance improvement.
UNION demonstrates the feasibility of unsupervised multi-class 3D object detection by generating pseudo-class labels from appearance clusters. |
The performance of UNION on rare object classes may be limited due to assumptions made about object frequency during appearance clustering.
Future work includes extending UNION to handle rare classes more effectively and incorporating radar data for improved motion estimation. |
unsupervised learning, 3d object detection, lidar, camera, scene flow |
2405.15622
Report |
LAM3D: Large Image-Point-Cloud Alignment Model for 3D Reconstruction from Single Image |
Ruikai Cui, Xibin Song, Weixuan Sun, Senbo Wang, Weizhe Liu, Shenzhou Chen, Taizhang Shang, Yang Li, Nick Barnes, Hongdong Li, Pan Ji |
Large Reconstruction Models have made significant strides in the realm of
automated 3D content generation from single or multiple input images. Despite
their success, these models often produce 3D meshes with geometric
inaccuracies, stemming from the inherent challenges of deducing 3D shapes
solely from image data. In this work, we introduce a novel framework, the Large
Image and Point Cloud Alignment Model (LAM3D), which utilizes 3D point cloud
data to enhance the fidelity of generated 3D meshes. Our methodology begins
with the development of a point-cloud-based network that effectively generates
precise and meaningful latent tri-planes, laying the groundwork for accurate 3D
mesh reconstruction. Building upon this, our Image-Point-Cloud Feature
Alignment technique processes a single input image, aligning to the latent
tri-planes to imbue image features with robust 3D information. This process not
only enriches the image features but also facilitates the production of
high-fidelity 3D meshes without the need for multi-view input, significantly
reducing geometric distortions. Our approach achieves state-of-the-art
high-fidelity 3D mesh reconstruction from a single image in just 6 seconds, and
experiments on various datasets demonstrate its effectiveness. |
The paper introduces LAM3D, a Large Image and Point Cloud Alignment Model for enhancing the fidelity of 3D meshes generated from single images by utilizing 3D point cloud data as priors. |
Existing large reconstruction models often generate inaccurate 3D meshes from single or few-shot images due to the difficulty of deducing 3D shapes solely from 2D data. |
The method involves two stages: 1) compressing point clouds into latent tri-plane representations, and 2) aligning single image features to these tri-planes using a diffusion-based approach. |
LAM3D achieves state-of-the-art high-fidelity 3D mesh reconstruction from single images in 6 seconds.
The use of point cloud priors significantly reduces geometric distortions compared to models relying solely on images.
Independent diffusion processes for each tri-plane (XY, XZ, YZ) improve preservation of 3D structural information. |
The current model focuses on geometry reconstruction and lacks texture reconstruction capabilities.
Future work will explore extending LAM3D for geometric and texture reconstruction. |
3d reconstruction, point cloud, diffusion models, feature alignment, single image |
2405.15619
Report |
DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation |
Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, Dongyan Guo |
Monocular camera calibration is a key precondition for numerous 3D vision
applications. Despite considerable advancements, existing methods often hinge
on specific assumptions and struggle to generalize across varied real-world
scenarios, and the performance is limited by insufficient training data.
Recently, diffusion models trained on expansive datasets have been confirmed to
maintain the capability to generate diverse, high-quality images. This success
suggests a strong potential of the models to effectively understand varied
visual information. In this work, we leverage the comprehensive visual
knowledge embedded in pre-trained diffusion models to enable more robust and
accurate monocular camera intrinsic estimation. Specifically, we reformulate
the problem of estimating the four degrees of freedom (4-DoF) of camera
intrinsic parameters as a dense incident map generation task. The map details
the angle of incidence for each pixel in the RGB image, and its format aligns
well with the paradigm of diffusion models. The camera intrinsics can then be
derived from the incident map with a simple non-learning RANSAC algorithm
during inference. Moreover, to further enhance the performance, we jointly
estimate a depth map to provide extra geometric information for the incident
map estimation. Extensive experiments on multiple testing datasets demonstrate
that our model achieves state-of-the-art performance, gaining up to a 40%
reduction in prediction errors. Besides, the experiments also show that the
precise camera intrinsic and depth maps estimated by our pipeline can greatly
benefit practical applications such as 3D reconstruction from a single
in-the-wild image. |
Presents DiffCalib, a novel diffusion-based approach for robust and accurate monocular camera calibration from single in-the-wild images by reformulating intrinsic parameter estimation as a dense incident map generation task. |
Existing methods struggle to generalize across diverse real-world scenarios due to reliance on specific geometric assumptions or objects. DiffCalib leverages the rich visual knowledge of pre-trained diffusion models to overcome this limitation, improving calibration accuracy and robustness. |
Utilizes a pre-trained Stable Diffusion model with a frozen VAE encoder/decoder. Fine-tunes the U-Net to generate incident maps, representing the angle of incidence for each pixel. Optionally jointly estimates depth maps to enhance performance. Employs a non-learning RANSAC algorithm to derive intrinsic parameters from the generated incident map. |
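To illustrate how intrinsics can be read off an incident map without learning, the sketch below assumes a pinhole camera whose incident direction at pixel (u, v) has x and y components (u - cx) / fx and (v - cy) / fy. Each axis then reduces to a robust line fit, done here with a two-point RANSAC followed by a least-squares refit on the inliers. The function names, tolerance, iteration count, and synthetic data are assumptions for illustration. The paper's exact map parameterization and RANSAC variant may differ.

```python
import numpy as np

def ransac_focal_and_center(coords, rays, iters=200, tol=1e-3, seed=0):
    """Fit rays ~ (coords - c) / f robustly and return (f, c)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        i, j = rng.choice(coords.size, size=2, replace=False)
        dr = rays[i] - rays[j]
        if abs(dr) < 1e-12:
            continue
        f = (coords[i] - coords[j]) / dr
        if f <= 0:  # reject degenerate or flipped hypotheses
            continue
        c = coords[i] - f * rays[i]
        inliers = np.abs((coords - c) / f - rays) < tol
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None or best.sum() < 2:
        raise ValueError("RANSAC found no valid model")
    # Refit on inliers: rays = slope * coords + intercept, with slope = 1 / f.
    slope, intercept = np.polyfit(coords[best], rays[best], 1)
    return 1.0 / slope, -intercept / slope

def synthetic_incident_map(h, w, fx, fy, cx, cy, outlier_ratio=0.2, seed=0):
    """Hypothetical test data: a noisy incident map with corrupted pixels."""
    rng = np.random.default_rng(seed)
    v, u = np.mgrid[0:h, 0:w].astype(float)
    ray_x = (u - cx) / fx + rng.normal(scale=1e-4, size=u.shape)
    ray_y = (v - cy) / fy + rng.normal(scale=1e-4, size=v.shape)
    bad = rng.random(u.shape) < outlier_ratio
    ray_x[bad] = rng.uniform(-1, 1, size=bad.sum())
    ray_y[bad] = rng.uniform(-1, 1, size=bad.sum())
    return u.ravel(), v.ravel(), ray_x.ravel(), ray_y.ravel()

u, v, ray_x, ray_y = synthetic_incident_map(48, 64, fx=80.0, fy=75.0, cx=31.0, cy=22.0)
fx_est, cx_est = ransac_focal_and_center(u, ray_x)
fy_est, cy_est = ransac_focal_and_center(v, ray_y)
```

Fitting the two axes independently yields the four intrinsic parameters (fx, fy, cx, cy), matching the 4-DoF formulation in the abstract.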
Achieves state-of-the-art performance on both seen and unseen datasets, demonstrating superior generalization ability.
Jointly estimating incident and depth maps mutually benefits both tasks, resulting in more accurate predictions.
Enables high-quality 3D reconstruction from single in-the-wild images, surpassing existing methods in detail preservation and geometric accuracy. |
Limited exploration of the impact of different diffusion model architectures and training strategies on calibration performance.
Reliance on a pre-trained Stable Diffusion model necessitates significant computational resources for training and inference. |
monocular camera calibration, diffusion models, incident map generation, depth estimation, 3d reconstruction |
2405.15580
Report |
Open-Vocabulary SAM3D: Understand Any 3D Scene |
Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Yabiao Wang, Yong Liu |
Open-vocabulary 3D scene understanding presents a significant challenge in
the field. Recent advancements have sought to transfer knowledge embedded in
vision language models from the 2D domain to 3D domain. However, these
approaches often require learning prior knowledge from specific 3D scene
datasets, which limits their applicability in open-world scenarios. The Segment
Anything Model (SAM) has demonstrated remarkable zero-shot segmentation
capabilities, prompting us to investigate its potential for comprehending 3D
scenes without the need for training. In this paper, we introduce OV-SAM3D, a
universal framework for open-vocabulary 3D scene understanding. This framework
is designed to perform understanding tasks for any 3D scene without requiring
prior knowledge of the scene. Specifically, our method is composed of two key
sub-modules: First, we initiate the process by generating superpoints as the
initial 3D prompts and refine these prompts using segment masks derived from
SAM. Second, we integrate a specially designed overlapping score table
with open tags from the Recognize Anything Model (RAM) to produce final 3D
instances with open-world labels. Empirical evaluations conducted on the
ScanNet200 and nuScenes datasets demonstrate that our approach surpasses
existing open-vocabulary methods in unknown open-world environments. |
Presents OV-SAM3D, a universal open-vocabulary 3D scene understanding framework capable of interpreting any 3D scene without prior knowledge. |
Addresses the challenge of open-vocabulary 3D scene understanding where models must locate and recognize objects in 3D scenes from text guidance, even for unseen objects, without relying on specific 3D scene dataset knowledge. |
Leverages superpoints as initial 3D prompts, refines them using SAM-derived segmentation masks, and employs an overlapping score table with RAM-recognized open tags to produce final 3D instances with open-world labels. |
Surpasses existing open-vocabulary methods in unknown open-world environments on ScanNet200 and nuScenes datasets.
Demonstrates the effectiveness of combining multiple foundation models (SAM, RAM, CLIP) for open-vocabulary 3D scene understanding.
Highlights the potential of transferring knowledge from 2D foundation models to the 3D domain for zero-shot learning. |
Current limitations in vision foundation models' ability to handle complex scenes with zero-shot learning.
Dependence on the performance of underlying foundation models (SAM, RAM, CLIP). |
open-vocabulary learning, 3d scene understanding, zero-shot learning, foundation models, segment anything model (sam) |
2405.15574
Report |
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models |
Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro |
The rapid development of large language and vision models (LLVMs) has been
driven by advances in visual instruction tuning. Recently, open-source LLVMs
have curated high-quality visual instruction tuning datasets and utilized
additional vision encoders or multiple computer vision models in order to
narrow the performance gap with powerful closed-source LLVMs. These
advancements are attributed to multifaceted information required for diverse
capabilities, including fundamental image understanding, real-world knowledge
about common-sense and non-object concepts (e.g., charts, diagrams, symbols,
signs, and math problems), and step-by-step procedures for solving complex
questions. Drawing from the multifaceted information, we present a new
efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages
multifaceted rationale to enhance understanding and answering capabilities. To
embed lengthy rationales containing abundant information, we employ the Mamba
architecture, capable of processing sequential data with linear time
complexity. We introduce a new concept of traversal of rationale that
facilitates efficient embedding of rationale. Subsequently, the backbone
multimodal language model (MLM) is trained to generate answers with the aid of
rationale. Through these steps, Meteor achieves significant improvements in
vision-language performance across multiple evaluation benchmarks requiring
diverse capabilities, without scaling up the model size or employing additional
vision encoders and computer vision models. |
Introduces Meteor, an efficient large language and vision model (LLVM) that leverages the Mamba architecture and a novel "traversal of rationale" concept to embed and utilize multifaceted rationales for enhanced understanding and answering capabilities in vision-language tasks. |
Addresses the need for efficient LLVMs that can implicitly embed multifaceted information (image understanding, common-sense knowledge, non-object concept comprehension, etc.) without relying on model scaling or additional vision encoders/models during inference. |
Combines a Mamba architecture for embedding lengthy rationales with a pretrained multimodal language model (MLM) trained on a curated dataset of 1.1M question-rationale-answer triples. Introduces "traversal of rationale" using special tokens to effectively convey rationale information to the MLM without explicit rationale access during inference. |
Meteor significantly outperforms existing open- and closed-source LLVMs on various benchmarks requiring diverse capabilities, including MME, MMB, and MM-Vet.
Ablation studies confirm the effectiveness of the Mamba architecture, rationale embedding, traversal of rationale, and the curated dataset in achieving superior performance.
Analysis of Meteor-Mamba reveals its ability to effectively embed rationales, enabling the model to leverage multifaceted information even without explicit rationale access during inference. |
Model size, while smaller than many large LLVMs, could still be prohibitive for users without high-end GPU resources.
Future work includes exploring layer-analyzing approaches like mixture of depths to further reduce model size while maintaining performance. |
large language and vision models, rationale-guided prediction, multifaceted rationale, mamba architecture, traversal of rationale |
2405.15491
Report |
GSDeformer: Direct Cage-based Deformation for 3D Gaussian Splatting |
Jiajun Huang, Hongchuan Yu |
We present GSDeformer, a method that achieves free-form deformation on 3D
Gaussian Splatting (3DGS) without requiring any architectural changes. Our
method extends cage-based deformation, a traditional mesh deformation method,
to 3DGS. This is done by converting 3DGS into a novel proxy point cloud
representation, where its deformation can be used to infer the transformations
to apply to the 3D Gaussians making up 3DGS. We also propose an automatic cage
construction algorithm for 3DGS to minimize manual work. Our method does not
modify the underlying architecture of 3DGS. Therefore, any existing trained
vanilla 3DGS can be easily edited by our method. We compare the deformation
capability of our method against other existing methods, demonstrating the ease
of use and comparable quality of our method, despite being more direct and thus
easier to integrate with other concurrent developments on 3DGS. |
GSDeformer: the first method for free-form deformation of 3D Gaussian Splatting scenes without modifying the underlying architecture. |
Existing 3DGS deformation methods require architecture changes, limiting their use for editing pre-trained scenes or integration with other 3DGS techniques. |
1. Convert 3DGS to a proxy point cloud representation. 2. Deform the point cloud using cage-based deformation with user-defined cages. 3. Infer transformations from deformed points and apply them to original 3D Gaussians. |
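The three steps can be sketched in NumPy under strong simplifying assumptions. The cage is a single tetrahedron, and barycentric coordinates stand in for the generalized cage coordinates (such as mean value coordinates) that practical cage-based deformation uses on arbitrary polyhedral cages. The proxy points are the Gaussian's mean plus three small axis offsets, and the local transformation is a finite-difference Jacobian applied to the covariance. All function names and the proxy layout are hypothetical and are not the paper's construction.

```python
import numpy as np

def cage_weights(points, cage):
    """Barycentric weights of points (N, 3) w.r.t. a tetrahedral cage (4, 3)."""
    a = np.vstack([cage.T, np.ones(4)])                 # (4, 4)
    b = np.vstack([points.T, np.ones(len(points))])     # (4, N)
    return np.linalg.solve(a, b).T                      # (N, 4), rows sum to 1

def deform_points(points, cage, deformed_cage):
    """Step 2: move points by re-evaluating their weights on the deformed cage."""
    return cage_weights(points, cage) @ deformed_cage

def deform_gaussian(mean, cov, cage, deformed_cage, eps=1e-3):
    """Steps 1 and 3: build proxy points, deform them, infer a local transform."""
    proxies = np.vstack([mean, mean + eps * np.eye(3)])  # mean plus 3 axis offsets
    moved = deform_points(proxies, cage, deformed_cage)
    new_mean = moved[0]
    # Row k of the difference is eps * J[:, k], so the transpose recovers J.
    jac = ((moved[1:] - new_mean) / eps).T
    return new_mean, jac @ cov @ jac.T

cage = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
deformed_cage = 2.0 * cage + np.array([1.0, 0.0, 0.0])   # scale by 2, shift in x
mean = np.array([0.2, 0.2, 0.2])
cov = np.diag([0.01, 0.02, 0.03])
new_mean, new_cov = deform_gaussian(mean, cov, cage, deformed_cage)
```

Because only the means and covariances of existing Gaussians are rewritten, a sketch like this leaves the 3DGS representation itself unchanged, which is the property the method emphasizes.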
Achieves high-quality deformation on synthetic and real-world 3DGS captures.
Produces comparable deformation quality to state-of-the-art methods like DeformingNeRF, SuGaR, and GaMeS.
Offers advantages by directly editing 3DGS without architecture modification, enabling easier integration and application to pre-trained models. |
Current implementation lacks real-time performance.
Future work includes exploring faster deformation schemes and incorporating color parameter transformations. |
3d gaussian splatting, deformation, cage-based deformation, scene manipulation, 3d scene editing |
2405.15475
Report |
Efficient Degradation-aware Any Image Restoration |
Eduard Zamfir, Zongwei Wu, Nancy Mehta, Danda Dani Paudel, Yulun Zhang, Radu Timofte |
Reconstructing missing details from degraded low-quality inputs poses a
significant challenge. Recent progress in image restoration has demonstrated
the efficacy of learning large models capable of addressing various
degradations simultaneously. Nonetheless, these approaches introduce
considerable computational overhead and complex learning paradigms, limiting
their practical utility. In response, we propose DaAIR, an efficient
All-in-One image restorer employing a Degradation-aware Learner (DaLe) in the
low-rank regime to collaboratively mine shared aspects and subtle nuances
across diverse degradations, generating a degradation-aware embedding. By
dynamically allocating model capacity to input degradations, we realize an
efficient restorer integrating holistic and specific learning within a unified
model. Furthermore, DaAIR introduces a cost-efficient parameter update
mechanism that enhances degradation awareness while maintaining computational
efficiency. Extensive comparisons across five image degradations demonstrate
that our DaAIR outperforms both state-of-the-art All-in-One models and
degradation-specific counterparts, affirming our efficacy and practicality. The
source code will be made publicly available at
https://eduardzamfir.github.io/daair/ |
This paper proposes DaAIR, an efficient and accurate All-in-One image restoration model leveraging a novel Degradation-aware Learner (DaLe) to dynamically route model capacity to specific degradation experts while concurrently modeling shared information across degradation types within a low-rank framework. |
Existing image restoration methods often lack practicality due to their specialization in addressing a single degradation at a time, while recent All-in-One models suffer from high computational costs and complex learning paradigms. |
DaAIR utilizes a U-shaped architecture with DaLe integrated into each encoder block. DaLe comprises degradation-specific and agnostic experts, employing a routing mechanism to associate input features with their corresponding degradation experts. A self-learnable control mechanism, leveraging encoder knowledge, guides parameter updates in the decoder, enhancing restoration quality. |
DaAIR outperforms state-of-the-art All-in-One models on three degradation types (dehazing, deraining, and denoising), achieving an average improvement of 0.45 dB PSNR while being significantly more efficient.
The method also excels in a five degradation setting, including deblurring and low-light enhancement, surpassing previous approaches in both performance and efficiency.
Ablation studies confirm the efficacy of individual components, highlighting the importance of expert specialization, routing strategy, and self-learnable control for achieving superior restoration quality. |
The model currently relies on synthetically degraded images, which might limit its performance on realistic degradation scenarios.
Incorporating external inductive biases, like edge information or frequency constraints, could further enhance the model's ability to handle multiple degradation types simultaneously. |
image restoration, all-in-one restoration, degradation-aware learning, low-rank representation, self-learnable control |
2405.15463
Report |
PoinTramba: A Hybrid Transformer-Mamba Framework for Point Cloud Analysis |
Zicheng Wang, Zhenghao Chen, Yiming Wu, Zhen Zhao, Luping Zhou, Dong Xu |
Point cloud analysis has seen substantial advancements due to deep learning.
Although previous Transformer-based methods excel at modeling long-range
dependencies on this task, their computational demands are substantial.
Conversely, the Mamba offers greater efficiency but shows limited potential
compared with Transformer-based methods. In this study, we introduce
PoinTramba, a pioneering hybrid framework that synergizes the analytical power
of Transformer with the remarkable computational efficiency of Mamba for
enhanced point cloud analysis. Specifically, our approach first segments point
clouds into groups, where the Transformer meticulously captures intricate
intra-group dependencies and produces group embeddings, whose inter-group
relationships will be simultaneously and adeptly captured by efficient Mamba
architecture, ensuring comprehensive analysis. Unlike previous Mamba
approaches, we introduce a bi-directional importance-aware ordering (BIO)
strategy to tackle the challenges of random ordering effects. This innovative
strategy intelligently reorders group embeddings based on their calculated
importance scores, significantly enhancing Mamba's performance and optimizing
the overall analytical process. Our framework achieves a superior balance
between computational efficiency and analytical performance by seamlessly
integrating these advanced techniques, marking a substantial leap forward in
point cloud analysis. Extensive experiments on datasets such as ScanObjectNN,
ModelNet40, and ShapeNetPart demonstrate the effectiveness of our approach,
establishing a new state-of-the-art analysis benchmark on point cloud
recognition. For the first time, this paradigm leverages the combined strengths
of both Transformer and Mamba architectures, facilitating a new standard in the
field. The code is available at https://github.com/xiaoyao3302/PoinTramba. |
This paper presents PoinTramba, a novel hybrid framework for point cloud analysis that combines the strengths of Transformer and Mamba architectures. |
Existing Transformer-based methods, while effective for point cloud analysis, are computationally demanding. Conversely, Mamba offers efficiency but lags in performance. PoinTramba addresses these limitations by leveraging the strengths of both architectures. |
PoinTramba segments point clouds into groups and uses Transformer to capture intra-group dependencies, generating group embeddings. Then, a bi-directional importance-aware ordering (BIO) strategy is introduced to reorder group embeddings, followed by a Mamba encoder to capture inter-group relationships efficiently. Finally, importance-aware pooling extracts global features for analysis. |
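A rough sketch of the reordering step, assuming that "bi-directional" means the group embeddings are sorted by importance score in descending order and then appended again in ascending order, so the sequence model sees both directions. That reading, the function name, and the toy inputs are assumptions. The record above does not spell out how the scores are computed or how the two orderings are combined.

```python
import numpy as np

def bio_reorder(group_embeddings, importance):
    """Order groups by importance in both directions and concatenate.

    group_embeddings: (G, D) array, one embedding per point group.
    importance: (G,) array of per-group scores (assumed to be given).
    Returns a (2G, D) sequence: most-to-least important, then the reverse.
    """
    order = np.argsort(-importance, kind="stable")
    forward = group_embeddings[order]
    backward = forward[::-1]
    return np.concatenate([forward, backward], axis=0)

# Toy example: three groups whose embedding values equal their original index.
embeddings = np.arange(3, dtype=float)[:, None] * np.ones((1, 2))
scores = np.array([0.2, 0.9, 0.5])
sequence = bio_reorder(embeddings, scores)
```

A deterministic, score-based order like this replaces the arbitrary order of point groups, which matters because a sequence model such as Mamba is sensitive to input order.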
PoinTramba achieves state-of-the-art performance on real-world object classification (ScanObjectNN) and synthetic object classification (ModelNet40).
The BIO strategy proves crucial for improving Mamba's performance on unordered point cloud data.
Ablation studies validate the effectiveness of each component in PoinTramba, including the hybrid architecture, BIO, and importance-aware pooling. |
The study primarily focuses on importance-aware ordering, leaving room to explore alternative sorting algorithms to further optimize Mamba's potential.
Further evaluation on a wider range of point cloud tasks is needed to comprehensively assess PoinTramba's capabilities. |
point cloud analysis, transformer, mamba, hybrid architecture, importance-aware ordering |
2405.15425
Report |
Volumetric Primitives for Modeling and Rendering Scattering and Emissive Media |
Jorge Condor, Sebastien Speierer, Lukas Bode, Aljaz Bozic, Simon Green, Piotr Didyk, Adrian Jarabo |
We propose a volumetric representation based on primitives to model
scattering and emissive media. Accurate scene representations enabling
efficient rendering are essential for many computer graphics applications.
General and unified representations that can handle surface and volume-based
representations simultaneously, allowing for physically accurate modeling,
remain a research challenge. Inspired by recent methods for scene
reconstruction that leverage mixtures of 3D Gaussians to model radiance fields,
we formalize and generalize the modeling of scattering and emissive media using
mixtures of simple kernel-based volumetric primitives. We introduce closed-form
solutions for transmittance and free-flight distance sampling for 3D Gaussian
kernels, and propose several optimizations to use our method efficiently within
any off-the-shelf volumetric path tracer by leveraging ray tracing for
efficiently querying the medium. We demonstrate our method as an alternative to
other forms of volume modeling (e.g. voxel grid-based representations) for
forward and inverse rendering of scattering media. Furthermore, we adapt our
method to the problem of radiance field optimization and rendering, and
demonstrate comparable performance to the state of the art, while providing
additional flexibility in terms of performance and usability. |
This paper introduces a novel volumetric representation for scattering and emissive media based on mixtures of kernel-based volumetric primitives, enabling efficient rendering and optimization within the radiative transfer framework. |
Current methods for representing volumetric media, like voxel grids, struggle with memory scalability and efficient light transport calculations. This new approach offers a compact representation and enables closed-form solutions for transmittance and emission, leading to faster rendering and easier integration into physics-based renderers. |
The authors leverage Gaussian kernels as their primitives, deriving closed-form expressions for transmittance, emission, and distance sampling. They utilize ray tracing to efficiently query these primitives and integrate their contributions along a ray, implementing their approach within a physics-based renderer. They also develop the adjoint of their method for inverse rendering applications. |
The Gaussian primitive-based representation significantly reduces memory consumption compared to voxel grids, while enabling efficient transmittance computations and achieving comparable rendering quality.
The method is applicable to inverse rendering, demonstrated through the reconstruction of a scattering smoke plume with significantly less memory than the reference grid.
For radiance field rendering, the approach achieves comparable quality to existing methods while allowing for greater control over rendering speed by leveraging techniques like early ray termination. |
The current implementation requires a GPU with hardware-accelerated ray tracing for optimal performance.
Future work includes exploring more general media types (e.g., anisotropic media), refining the optimization pipeline for complex scenes, and improving the color model for highly anisotropic kernels. |
volumetric rendering, radiance fields, gaussian primitives, inverse rendering, light transport |
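As a rough illustration of the closed-form transmittance mentioned in the report above, the sketch below integrates an isotropic 3D Gaussian density along a ray segment using the error function. The isotropic kernel, the parameter names, and the `(mean, std, peak_density)` tuple layout are assumptions made for this example; the paper itself treats more general Gaussian kernels, and this is not the authors' implementation.

```python
import math


def gaussian_optical_depth(origin, direction, mean, std, peak_density, t_min, t_max):
    """Optical depth of one isotropic 3D Gaussian along a ray segment.

    Assumes `direction` has unit length and the density is
    peak_density * exp(-|x - mean|^2 / (2 * std^2)).
    """
    to_mean = [m - o for m, o in zip(mean, origin)]
    # Ray parameter of the point closest to the Gaussian center.
    t0 = sum(d * v for d, v in zip(direction, to_mean))
    # Squared perpendicular distance from the center to the ray.
    perp_sq = max(sum(v * v for v in to_mean) - t0 * t0, 0.0)
    scale = std * math.sqrt(2.0)
    # Closed-form integral of exp(-(t - t0)^2 / (2 std^2)) over [t_min, t_max].
    line_integral = std * math.sqrt(math.pi / 2.0) * (
        math.erf((t_max - t0) / scale) - math.erf((t_min - t0) / scale)
    )
    return peak_density * math.exp(-perp_sq / (2.0 * std * std)) * line_integral


def transmittance(origin, direction, primitives, t_min, t_max):
    """Transmittance through a mixture of Gaussians: exp(-sum of optical depths)."""
    tau = sum(
        gaussian_optical_depth(origin, direction, mean, std, peak, t_min, t_max)
        for mean, std, peak in primitives
    )
    return math.exp(-tau)
```

Because the optical depths of the primitives add, no ray marching is needed for the mixture; each primitive contributes one `erf` difference.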
2405.15364
Report |
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer |
Meng You, Zhiyu Zhu, Hui Liu, Junhui Hou |
By harnessing the potent generative capabilities of pre-trained large video
diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS)
paradigm that operates without the need for training. NVS-Solver
adaptively modulates the diffusion sampling process with the given views to
enable the creation of remarkable visual experiences from single or multiple
views of static scenes or monocular videos of dynamic scenes. Specifically,
built upon our theoretical modeling, we iteratively modulate the score function
with the given scene priors represented with warped input views to control the
video diffusion process. Moreover, by theoretically exploring the boundary of
the estimation error, we achieve the modulation in an adaptive fashion
according to the view pose and the number of diffusion steps. Extensive
evaluations on both static and dynamic scenes substantiate the significant
superiority of our NVS-Solver over state-of-the-art methods both quantitatively
and qualitatively. Source code is available at
https://github.com/ZHU-Zhiyu/NVS_Solver. |
This paper introduces NVS-Solver, a novel training-free approach for novel view synthesis leveraging pre-trained large video diffusion models. |
NVS-Solver addresses limitations of existing methods in handling complex scene dynamics and generalizing to new scenes, while offering high-quality results without the need for extensive training. |
NVS-Solver modulates the reverse diffusion sampling process using prior information from given views. This modulation is performed adaptively based on an analysis of diffusion estimation error and intensity truncation error, ensuring accurate and visually pleasing novel view generation. |
NVS-Solver outperforms state-of-the-art NVS methods both qualitatively and quantitatively in single-view, multi-view, and dynamic scene scenarios.
The adaptive modulation of the score function is crucial for accurate view synthesis, correcting warping errors and non-Lambertian reflections effectively.
Sufficient diffusion reverse steps are essential for accurate view pose estimation in the synthesized novel views. |
The current implementation of NVS-Solver requires longer processing time compared to existing methods.
Future work will focus on improving the processing speed and exploring the potential of NVS-Solver for pose controllable video diffusion model distillation. |
novel view synthesis, video diffusion models, training-free, adaptive modulation, score-based diffusion |
2405.15330
Report |
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model |
Mingyang Yi, Aoxue Li, Yi Xin, Zhenguo Li |
Recently, the strong latent Diffusion Probabilistic Model (DPM) has been
applied to high-quality Text-to-Image (T2I) generation (e.g., Stable
Diffusion), by injecting the encoded target text prompt into the gradually
denoised diffusion image generator. Despite the success of DPM in practice, the
mechanism behind it remains to be explored. To fill this blank, we begin by
examining the intermediate statuses during the gradual denoising generation
process in DPM. The empirical observations indicate that the shape of the image
is reconstructed after the first few denoising steps, and then the image is
filled with details (e.g., texture). This happens because the low-frequency
(shape-relevant) signal of the noisy image is not corrupted until the final
stage of the forward noising process in DPM, which corresponds to the initial
stage of generation. Inspired by the observations, we proceed to explore the
influence of each
token in the text prompt during the two stages. After a series of T2I
generation experiments conditioned on a set of text prompts, we conclude that in
the earlier generation stage, the image is mostly decided by the special token
[EOS] in the text prompt, and the information in the text prompt is
already conveyed in this stage. After that, the diffusion model completes the
details of the generated images using information from the images themselves.
Finally, we propose to apply this observation to accelerate T2I generation by
properly removing text guidance, which accelerates sampling by up to
25%+. |
This paper investigates the working mechanism of text-to-image diffusion models, particularly focusing on how text prompts influence the image generation process. |
Understanding this mechanism can lead to improvements in text-to-image generation techniques, such as accelerating the sampling process without sacrificing image quality. |
The authors analyze the image reconstruction process in stable diffusion models, examining the role of frequency signals and the influence of different text prompt components (semantic tokens and the special token [EOS]). They conduct experiments by switching [EOS] tokens between prompts and varying the strength of text guidance during different stages of the denoising process. |
The image generation process exhibits a 'first overall shape then details' pattern, where the overall shape is reconstructed in the early stages of denoising and details are filled in later.
The special token [EOS] in the text prompt plays a dominant role in determining the overall shape of the generated image, conveying more information than semantic tokens.
The information from the text prompt is primarily conveyed in the early shape reconstruction stage of the denoising process. Later stages mainly refine details based on the established shape. |
The study primarily focuses on a single text-to-image model, Stable Diffusion, and further investigation is needed to generalize the findings to other diffusion-based models.
The impact of varying the number of [EOS] tokens and their positions within the prompt requires further exploration to fully understand their role. |
text-to-image generation, diffusion models, stable diffusion, text prompt engineering, frequency analysis |
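To make the acceleration idea in the report above concrete, here is a toy sketch of classifier-free guidance in which the text-conditioned branch is dropped after an early cutoff step. It is one plausible reading of "removing text guidance", not the authors' procedure: the `CountingDenoiser` class, the `text_steps` parameter, and the placeholder update rule are all hypothetical, and only the count of network evaluations is meaningful.

```python
class CountingDenoiser:
    """Toy stand-in for a noise-prediction network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, step, prompt):
        self.calls += 1
        # Hypothetical behavior: a prompt shifts the prediction slightly.
        return 0.1 * x + (0.05 if prompt is not None else 0.0)


def guided_noise(denoiser, x, step, prompt, guidance_scale, use_text):
    """Classifier-free guidance; skips the conditional pass when use_text is False."""
    eps_uncond = denoiser(x, step, None)
    if not use_text:
        return eps_uncond
    eps_cond = denoiser(x, step, prompt)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def sample(denoiser, x, num_steps, prompt, guidance_scale, text_steps):
    """Run num_steps updates, using the text prompt only for the first text_steps."""
    for step in range(num_steps):
        eps = guided_noise(denoiser, x, step, prompt, guidance_scale, step < text_steps)
        x = x - eps / num_steps  # placeholder update, not a real diffusion sampler
    return x


full = CountingDenoiser()
sample(full, 1.0, num_steps=10, prompt="a cat", guidance_scale=7.5, text_steps=10)
early = CountingDenoiser()
sample(early, 1.0, num_steps=10, prompt="a cat", guidance_scale=7.5, text_steps=4)
print(full.calls, early.calls)  # 20 14
```

In this toy setting each guided step costs two forward passes and each unguided step costs one, which is where the saving comes from; the actual speedup in a real pipeline depends on the sampler and the chosen cutoff.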
2405.15321
Report |
SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance |
Guibao Shen, Luozhou Wang, Jiantao Lin, Wenhang Ge, Chaozhe Zhang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Guangyong Chen, Yijun Li, Ying-Cong Chen |
Recent advancements in text-to-image generation have been propelled by the
development of diffusion models and multi-modality learning. However, since
text is typically represented sequentially in these models, it often falls
short in providing accurate contextualization and structural control. As a
result, the generated images do not consistently align with human expectations,
especially in complex scenarios involving multiple objects and relationships. In
this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the
structured representation of scene graphs to rectify inaccuracies in the
original text embeddings. The SG-Adapter's explicit and non-fully connected
graph representation greatly improves on the fully connected, transformer-based
text representations. This enhancement is particularly notable in maintaining
precise correspondence in scenarios involving multiple relationships. To
address the challenges posed by low-quality annotated datasets like Visual
Genome, we have manually curated a highly clean, multi-relational scene
graph-image paired dataset MultiRels. Furthermore, we design three metrics
derived from GPT-4V to effectively and thoroughly measure the correspondence
between images and scene graphs. Both qualitative and quantitative results
validate the efficacy of our approach in controlling the correspondence in
multiple relationships. |
This paper introduces the Scene Graph Adapter (SG-Adapter), which leverages scene graph knowledge to improve the contextual understanding and accuracy of text-to-image generation models, addressing the limitations of sequential text embeddings. |
Existing text-to-image generation models often misinterpret relationships between entities due to the sequential nature of text processing. This work aims to enhance the control and accuracy of these models by incorporating structured scene graph information. |
The SG-Adapter, designed as a transformer module, refines text embeddings using a novel triplet-token attention mechanism. This allows for precise mapping between textual elements and their visual representations in generated images. |
The SG-Adapter outperforms existing text-to-image generation methods in accurately depicting complex relationships between multiple entities, as demonstrated by both qualitative and quantitative evaluations.
The research contributes a new dataset, MultiRels, featuring multiple relations and high-quality annotations, crucial for training and evaluating multi-relational learning models.
The paper introduces three novel metrics derived from GPT-4V to effectively measure the correspondence between generated images and scene graphs, enabling a more accurate assessment of relation generation. |
The anonymization of human faces in the MultiRels dataset, while necessary for privacy, might introduce artifacts that could potentially impact image quality.
Future work could focus on exploring more sophisticated anonymization techniques and expanding the MultiRels dataset to encompass a wider range of relations and scenarios. |
text-to-image generation, scene graph, attention mechanism, relation correspondence, multi-relational learning |
2405.15313
Report |
Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion |
Aoxue Li, Mingyang Yi, Zhenguo Li |
Recently, text-to-image (T2I) editing has been greatly pushed forward by
applying diffusion models. Despite the visual promise of the generated images,
inconsistencies with the expected textual prompt remain prevalent. This paper
aims to systematically improve the text-guided image editing techniques based
on diffusion models, by addressing their limitations. Notably, the common
approach in diffusion-based editing first reconstructs the source image via
inversion techniques, e.g., DDIM Inversion, and then applies a fusion process
that carefully integrates the source intermediate (hidden) states (obtained by
inversion) with those of the target image. Unfortunately, such a standard
pipeline fails in many cases due to interference between texture retention and
the creation of new characters in some regions. To mitigate this, we incorporate
human annotation as external knowledge to confine editing within a
"Mask-informed" region.
Then we carefully Fuse the edited image with the source image and a constructed
intermediate image within the model's Self-Attention module. Extensive
empirical results demonstrate that the proposed "MaSaFusion" significantly
improves the existing T2I editing techniques. |
This paper proposes MaSaFusion, a novel training-free method for enhancing text-to-image editing using diffusion models by incorporating human annotations to improve feature preservation and new feature generation. |
Existing diffusion-based text-to-image editing methods struggle with inconsistencies between the generated image and the expected textual prompt, especially when object shapes vary. |
MaSaFusion leverages human annotations (sketch and editing region) to construct an intermediate image with desired shape via T2I Adapter and fuses its self-attention maps with source and target images during the generation process. |
MaSaFusion outperforms existing training-free methods on the MagicBrush dataset in terms of image-text and image-image alignment.
Using external knowledge like sketch maps and editing regions significantly improves editing quality.
Direct Inversion, though slightly less accurate, offers a significant speedup over Null-text Inversion for practical applications. |
The performance of MaSaFusion depends on the accuracy of human annotations (sketch & editing region).
MaSaFusion inherits limitations of Stable Diffusion, such as generating inconsistent facial features. |
text-to-image editing, diffusion models, human annotation, self-attention, t2i adapter |
2405.15305
Report |
Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering |
Yibo Zhang, Lihong Wang, Changqing Zou, Tieru Wu, Rui Ma |
3D sketches are widely used for visually representing the 3D shape and
structure of objects or scenes. However, creating 3D sketches often
requires users to possess professional artistic skills. Existing research
efforts primarily focus on enhancing the ability of interactive sketch
generation in 3D virtual systems. In this work, we propose Diff3DS, a novel
differentiable rendering framework for generating view-consistent 3D sketch by
optimizing 3D parametric curves under various supervisions. Specifically, we
perform perspective projection to render the 3D rational Bézier curves into
2D curves, which are subsequently converted to a 2D raster image via our
customized differentiable rasterizer. Our framework bridges the domains of 3D
sketch and raster image, achieving end-to-end optimization of 3D sketch through
gradients computed in the 2D image domain. Our Diff3DS can enable a series of
novel 3D sketch generation tasks, including text-to-3D sketch and image-to-3D
sketch, supported by the popular distillation-based supervision, such as Score
Distillation Sampling (SDS). Extensive experiments have yielded promising
results and demonstrated the potential of our framework. |
This paper presents Diff3DS, a novel differentiable rendering framework for generating view-consistent 3D sketches from diverse inputs like text or single images. |
Existing methods for 3D sketch creation are primarily interactive and require professional skills, limiting their accessibility. This work introduces a user-friendly approach for generating view-consistent 3D sketches from commonly available inputs. |
The framework represents 3D sketches as a set of 3D rational Bézier curves. It uses perspective projection to obtain 2D curves, then utilizes a customized differentiable rasterizer to render these curves while preserving depth order. It employs Score Distillation Sampling (SDS) to leverage pre-trained 2D image generation models, enabling text or single image guided 3D sketch generation. |
Diff3DS is the first to achieve text-to-3D sketch generation, outperforming existing text-to-3D methods adapted for this task.
The framework successfully generates 3D sketches from single images, surpassing the performance of a multiview reconstruction-based baseline.
Ablation studies validate the contribution of key components like Time Annealing Schedule and Dynamic Noise Deletion. |
The framework inherits the sparse gradient issue from DiffVG, limiting its ability to optimize non-continuous parameters.
The current approach doesn't differentiate between view-independent and view-dependent curves, potentially limiting the expressiveness of generated 3D shapes. Future work could incorporate diverse curve representations. |
3d sketch generation, differentiable rendering, rational bézier curves, score distillation sampling, text-to-3d, image-to-3d |
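The projection step described in the Diff3DS report above relies on a standard property of rational Bézier curves: under a pinhole projection, a 3D rational Bézier curve maps exactly to a 2D rational Bézier curve whose control points are the projected control points and whose weights are scaled by depth. The sketch below checks that property numerically. The function names and the simple pinhole model with focal length `focal` are assumptions for this example, and whether Diff3DS implements the projection this way is not stated in the source.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t) ** (n - i)


def rational_bezier(points, weights, t):
    """Evaluate a rational Bezier curve of any dimension at parameter t."""
    n = len(points) - 1
    basis = [bernstein(n, i, t) * w for i, w in enumerate(weights)]
    denom = sum(basis)
    dim = len(points[0])
    return tuple(sum(b * p[k] for b, p in zip(basis, points)) / denom for k in range(dim))


def project_point(point, focal=1.0):
    """Pinhole projection of a camera-space point with positive depth z."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def project_curve(points, weights, focal=1.0):
    """Project control points and scale each weight by its control point's depth."""
    points_2d = [project_point(p, focal) for p in points]
    weights_2d = [w * p[2] for w, p in zip(weights, points)]
    return points_2d, weights_2d
```

Evaluating the 3D curve and projecting the result gives the same 2D point as evaluating the projected curve directly, provided every control point lies in front of the camera (positive depth).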
2405.15304
Report |
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient |
Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang |
Current text-to-image diffusion models have achieved groundbreaking results
in image generation tasks. However, the unavoidable inclusion of sensitive
information during pre-training introduces significant risks such as copyright
infringement and privacy violations in the generated images. Machine Unlearning
(MU), which provides an effective way to remove the sensitive concepts captured
by the model, has been shown to be a promising approach to addressing these issues.
Nonetheless, existing MU methods for concept erasure encounter two primary
bottlenecks: 1) generalization issues, where concept erasure is effective only
for the data within the unlearn set, and prompts outside the unlearn set often
still result in the generation of sensitive concepts; and 2) utility drop,
where erasing target concepts significantly degrades the model's performance.
To this end, this paper first proposes a concept domain correction framework
for unlearning concepts in diffusion models. By aligning the output domains of
sensitive concepts and anchor concepts through adversarial training, we enhance
the generalizability of the unlearning results. Secondly, we devise a
concept-preserving scheme based on gradient surgery. This approach alleviates
the parts of the unlearning gradient that contradict the relearning gradient,
ensuring that the process of unlearning minimally disrupts the model's
performance. Finally, extensive experiments validate the effectiveness of our
model, demonstrating our method's capability to address the challenges of
concept unlearning in diffusion models while preserving model utility. |
This paper introduces a novel approach for unlearning concepts in text-to-image diffusion models, tackling the limitations of existing methods in terms of generalizability and utility drop. |
Unlearning concepts in diffusion models is crucial for addressing copyright infringement, privacy violations, and the generation of inappropriate content, which are significant concerns associated with pre-trained models. |
The proposed method utilizes a concept domain correction framework with adversarial training to align the output domains of target and anchor concepts, enhancing generalizability. It also employs a concept-preserving gradient strategy based on gradient surgery to minimize the impact of unlearning on the model's performance on other concepts. |
The method effectively unlearns specific instances, styles, and inappropriate content while preserving the integrity of other elements and concepts in the generated images.
Quantitative evaluations using CLIP Score, CLIP Accuracy, and FID demonstrate superior performance compared to existing methods, striking a balance between unlearning and retaining non-target concepts.
Experiments with multiple instance removal and the I2P benchmark showcase the method's capability to handle complex unlearning scenarios and effectively reduce the generation of inappropriate content. |
The method still relies on an anchor-based approach, demanding considerable computational overhead for data preparation.
Future work could explore the integration of the Latent Anchor method to optimize or bypass the data preparation process. |
machine unlearning, diffusion models, text-to-image synthesis, concept erasure, adversarial training, gradient surgery |
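The concept-preserving gradient in the report above is described only as being "based on gradient surgery". A minimal sketch of one common form of gradient surgery (a PCGrad-style projection) is shown below: when the unlearning gradient conflicts with the relearning gradient, its conflicting component is projected out. The function names and the flat-list gradient representation are assumptions for this example, and the paper's exact update rule may differ.

```python
def dot(a, b):
    """Inner product of two equally sized gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def preserve_projection(unlearn_grad, relearn_grad, eps=1e-12):
    """Remove the part of unlearn_grad that opposes relearn_grad.

    If the two gradients do not conflict (non-negative inner product),
    the unlearning gradient is returned unchanged.
    """
    conflict = dot(unlearn_grad, relearn_grad)
    norm_sq = dot(relearn_grad, relearn_grad)
    if conflict >= 0.0 or norm_sq < eps:
        return list(unlearn_grad)
    coeff = conflict / norm_sq
    # Subtract the projection onto relearn_grad; the result is orthogonal to it.
    return [u - coeff * r for u, r in zip(unlearn_grad, relearn_grad)]


print(preserve_projection([1.0, -1.0], [0.0, 1.0]))  # [1.0, 0.0]
```

After the projection, a step along the adjusted unlearning gradient no longer opposes the relearning gradient to first order, which matches the stated goal of erasing a concept while minimally disrupting the model's remaining performance.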
2405.15287
Report |
StyleMaster: Towards Flexible Stylized Image Generation with Diffusion Models |
Chengming Xu, Kai Hu, Donghao Luo, Jiangning Zhang, Wei Li, Yanhao Ge, Chengjie Wang |
Stylized Text-to-Image Generation (STIG) aims to generate images based on
text prompts and style reference images. In this paper, we propose a novel
framework dubbed StyleMaster for this task by leveraging pretrained Stable
Diffusion (SD), which aims to solve previous problems such as insufficient
style and inconsistent semantics. The enhancement lies in two novel modules,
namely a multi-source style embedder and a dynamic attention adapter. In order
to provide SD with better style embeddings, we propose a multi-source style
embedder that considers both global- and local-level visual information along
with textual information, which provides complementary style-related and
semantic-related knowledge. Additionally, aiming for a better balance between the
adapter capacity and semantic control, the proposed dynamic attention adapter
is applied to the diffusion UNet in which adaptation weights are dynamically
calculated based on the style embeddings. Two objective functions are
introduced to optimize the model together with denoising loss, which can
further enhance semantic and style consistency. Extensive experiments
demonstrate the superiority of StyleMaster over existing methods, rendering
images with variable target styles while successfully maintaining the semantic
information from the text prompts. |
This paper proposes StyleMaster, a novel framework for Stylized Text-to-Image Generation (STIG) that addresses limitations of existing methods in achieving sufficient style and semantic consistency. |
STIG is crucial for applications like art creation and movie editing, offering greater flexibility and applicability than traditional style transfer methods. |
StyleMaster leverages a multi-source style embedder to capture comprehensive style information from reference images while mitigating semantic leakage. It also employs a dynamic attention adapter to effectively integrate style embeddings into the diffusion process without compromising semantic fidelity. |
StyleMaster significantly outperforms existing STIG methods in both one-shot and multi-shot settings, demonstrating superior style similarity and semantic consistency.
The multi-source style embedder effectively captures diverse style patterns while minimizing semantic leakage from reference images.
The dynamic attention adapter successfully balances style influence and semantic fidelity, ensuring that generated images adhere to both style and text prompts. |
The patch-level transformer used in the multi-source style embedder limits its efficiency in handling a large number of reference images.
The current method solely focuses on image-based style conditions, neglecting other potential modalities like text, videos, or 3D data. |
stylized text-to-image generation, stable diffusion, style embedding, dynamic attention adaptation, semantic consistency |
2405.15234
Report |
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models |
Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu |
Diffusion models (DMs) have achieved remarkable success in text-to-image
generation, but they also pose safety risks, such as the potential generation
of harmful content and copyright violations. The techniques of machine
unlearning, also known as concept erasing, have been developed to address these
risks. However, these techniques remain vulnerable to adversarial prompt
attacks, which can prompt DMs post-unlearning to regenerate undesired images
containing concepts (such as nudity) meant to be erased. This work aims to
enhance the robustness of concept erasing by integrating the principle of
adversarial training (AT) into machine unlearning, resulting in the robust
unlearning framework referred to as AdvUnlearn. However, achieving this
effectively and efficiently is highly nontrivial. First, we find that a
straightforward implementation of AT compromises DMs' image generation quality
post-unlearning. To address this, we develop a utility-retaining regularization
on an additional retain set, optimizing the trade-off between concept erasure
robustness and model utility in AdvUnlearn. Moreover, we identify the text
encoder as a more suitable module for robustification compared to UNet,
ensuring unlearning effectiveness. The resulting text encoder can also serve as a
plug-and-play robust unlearner for various DM types. Empirically, we perform
extensive experiments to demonstrate the robustness advantage of AdvUnlearn
across various DM unlearning scenarios, including the erasure of nudity,
objects, and style concepts. In addition to robustness, AdvUnlearn also
achieves a balanced tradeoff with model utility. To our knowledge, this is the
first work to systematically explore robust DM unlearning through AT, setting
it apart from existing methods that overlook robustness in concept erasing.
Codes are available at: https://github.com/OPTML-Group/AdvUnlearn |
This paper presents AdvUnlearn, a novel framework integrating adversarial training (AT) into diffusion model (DM) unlearning, enhancing the robustness of concept erasure against adversarial prompt attacks while preserving image generation quality. |
Existing concept erasure techniques in DMs are vulnerable to adversarial attacks, which can prompt the regeneration of undesired content. This underscores the need for more robust unlearning methods to ensure the safe and ethical deployment of DMs. |
The paper proposes a bi-level optimization approach for AdvUnlearn, addressing effectiveness and efficiency challenges. It introduces a utility-retaining regularization using an external retain prompt set to balance robustness and utility. It also identifies the text encoder as a more suitable module for robustification than UNet, allowing for plug-and-play robust unlearning across different DM types. |
AdvUnlearn significantly improves the robustness of concept-erased DMs against adversarial attacks, evidenced by reduced attack success rates across various unlearning scenarios (nudity, objects, style).
The framework effectively balances robustness with image generation quality, demonstrated by comparable FID and CLIP scores to the original DM, unlike some baselines that sacrifice utility for robustness.
AdvUnlearn's text encoder, finetuned for unlearning on one DM, demonstrates promising plug-and-play capability, transferring robustness to other DM types without additional finetuning. |
The computational cost of AdvUnlearn is high due to adversarial training and utility-retaining regularization, requiring further research into optimization for practical deployment.
While AdvUnlearn exhibits effectiveness in the studied scenarios, exploring its generalization to a wider range of concepts and attack strategies is crucial for future work. |
diffusion models, machine unlearning, concept erasing, adversarial prompt attacks, robustness |
2405.15232
Report |
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception |
Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui |
The development of large language models (LLMs) has significantly advanced
the emergence of large multimodal models (LMMs). While LMMs have achieved
tremendous success by promoting the synergy between multimodal comprehension
and creation, they often face challenges when confronted with
out-of-distribution data. This is primarily due to their reliance on image
encoders trained to encode images into task-relevant features, which may lead
them to disregard irrelevant details. Delving into the modeling capabilities of
diffusion models for images naturally prompts the question: Can diffusion
models serve as the eyes of large language models for image perception? In this
paper, we propose DEEM, a simple and effective approach that utilizes the
generative feedback of diffusion models to align the semantic distributions of
the image encoder. This addresses the drawbacks of previous methods that solely
relied on image encoders like ViT, thereby enhancing the model's resilience
against out-of-distribution samples and reducing visual hallucinations.
Importantly, this is achieved without requiring additional training modules and
with fewer training parameters. We extensively evaluated DEEM on both our newly
constructed RobustVQA benchmark and another well-known benchmark, POPE, for
object hallucination. Compared to the state-of-the-art interleaved content
generation models, DEEM exhibits enhanced robustness and a superior capacity to
alleviate model hallucinations while utilizing fewer trainable parameters, less
pre-training data (10%), and a smaller base model size. |
This paper presents DEEM, a novel method that leverages diffusion models to improve the robustness and hallucination recognition abilities of large multimodal models (LMMs) by aligning the semantic distributions of image encoders. |
Existing LMMs often struggle with out-of-distribution data due to their reliance on image encoders that disregard irrelevant details. DEEM addresses this limitation by using diffusion models as an additional "eye" for LMMs to correct potential semantic bias in image encoding. |
DEEM uses a three-stage training process: image-text alignment pre-training, image-text instruction fine-tuning, and mask-text instruction fine-tuning. It leverages a VFM-based image encoder, an LLM-based multimodal decoder, and a DM-based image decoder. A consistency semantic regularization term ensures the alignment between the image encoder's semantic information and the diffusion model's generative feedback. |
DEEM demonstrates enhanced robustness and a superior capacity to alleviate model hallucinations compared to state-of-the-art interleaved image-text models.
DEEM achieves these improvements while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
After supervised fine-tuning, DEEM achieves competitive performance on various multimodal tasks, including visual question-answering, region-level image captioning, and text-to-image generation. |
While DEEM improves robustness, it cannot completely prevent the forgetting of robustness knowledge caused by subsequent fine-tuning.
Updating larger image encoders with DEEM can increase the training overhead, potentially limiting its applicability in certain scenarios. |
large multimodal models, diffusion models, robustness, hallucination recognition, semantic alignment |
2405.15223
Report |
iVideoGPT: Interactive VideoGPTs are Scalable World Models |
Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long |
World models empower model-based agents to interactively explore, reason, and
plan within imagined environments for real-world decision-making. However, the
high demand for interactivity poses challenges in harnessing recent
advancements in video generative models for developing world models at scale.
This work introduces Interactive VideoGPT (iVideoGPT), a scalable
autoregressive transformer framework that integrates multimodal signals--visual
observations, actions, and rewards--into a sequence of tokens, facilitating an
interactive experience of agents via next-token prediction. iVideoGPT features
a novel compressive tokenization technique that efficiently discretizes
high-dimensional visual observations. Leveraging its scalable architecture, we
are able to pre-train iVideoGPT on millions of human and robotic manipulation
trajectories, establishing a versatile foundation that is adaptable to serve as
interactive world models for a wide range of downstream tasks. These include
action-conditioned video prediction, visual planning, and model-based
reinforcement learning, where iVideoGPT achieves competitive performance
compared with state-of-the-art methods. Our work advances the development of
interactive general world models, bridging the gap between generative video
models and practical model-based reinforcement learning applications. |
Introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer architecture for building interactive world models by integrating visual observations, actions, and rewards into a token sequence, featuring a novel compressive tokenization technique for efficiency. |
Addresses the limitations of existing world models that struggle to balance interactivity and scalability, bridging the gap between generative video models and practical model-based reinforcement learning. |
Presents a two-phase approach: 1) pre-training iVideoGPT on a massive dataset of human and robotic manipulation trajectories for action-free video prediction, and 2) adapting the pre-trained model to downstream tasks like action-conditioned video prediction, visual planning, and model-based RL. |
Achieves competitive performance in video prediction on BAIR and RoboNet datasets compared to state-of-the-art methods.
Demonstrates effective visual planning capabilities, outperforming baselines in certain RoboDesk tasks and showing comparable performance to top models in the VP$^2$ benchmark.
Shows significant improvement in sample efficiency for visual model-based RL on Meta-World tasks, matching or surpassing DreamerV3 performance and showcasing the potential of decoupling model and policy learning with powerful world models. |
Limited diversity in current pre-training data, particularly in publicly available robotic datasets, calls for incorporating more extensive and diverse data sources.
Compressive tokenization's assumption that initial frames provide sufficient context for future predictions may not hold for scenarios with long videos and significant camera movements, suggesting further exploration of keyframe extraction techniques. |
world models, video prediction, reinforcement learning, transformers, computer vision |
2405.15217
Report |
NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation |
Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanxuan Zhao, Evangelos Kalogerakis, Michal Lukac |
The success of denoising diffusion models in representing rich data
distributions over 2D raster images has prompted research on extending them to
other data representations, such as vector graphics. Unfortunately, due to their
variable structure and scarcity of vector training data, directly applying
diffusion models on this domain remains a challenging problem. Using
workarounds like optimization via Score Distillation Sampling (SDS) is also
fraught with difficulty, as vector representations are non-trivial to directly
optimize and tend to result in implausible geometries such as redundant or
self-intersecting shapes. NIVeL addresses these challenges by reinterpreting
the problem on an alternative, intermediate domain which preserves the
desirable properties of vector graphics -- mainly sparsity of representation
and resolution-independence. This alternative domain is based on neural
implicit fields expressed in a set of decomposable, editable layers. Based on
our experiments, NIVeL produces text-to-vector graphics results of
significantly better quality than the state-of-the-art. |
This paper introduces NIVeL, a novel method for text-to-vector graphics generation that uses neural implicit vector layers. |
Existing methods struggle to generate high-quality vector graphics from text due to the variable structure of vector representations and the lack of large-scale training data. NIVeL addresses these challenges by using an intermediate, vector-like representation based on neural implicit fields. |
NIVeL represents shapes as 2D continuous implicit functions, organized in a layered structure, and leverages score distillation sampling (SDS) from a pre-trained image-based diffusion model to optimize the parameters of these implicit functions. A key innovation is the use of a low-frequency implicit RGB image generator for initialization, leading to more semantically meaningful layers and improved final results. |
NIVeL outperforms state-of-the-art methods like VectorFusion in terms of CLIP-based metrics and perceptual quality, as demonstrated by user studies.
The method effectively generates clean, editable, and semantically meaningful vector graphics from text prompts, even with a low parameter count.
Ablation studies highlight the importance of the proposed initialization strategy for achieving high-quality results and avoiding common failure modes. |
The representation is currently limited by a fixed upper bound on the number of layers.
Future work could explore a differentiable implicit-to-vector module for converting the implicit field to parametric curves. |
vector graphics, text-to-image synthesis, diffusion models, neural implicit fields, score distillation sampling |
2405.15176
Report |
MonoDETRNext: Next-generation Accurate and Efficient Monocular 3D Object Detection Method |
Pan Liao, Feng Yang, Di Wu, Liu Bo |
Monocular vision-based 3D object detection is crucial in various sectors, yet
existing methods face significant challenges in terms of accuracy and
computational efficiency. Building on the successful strategies in 2D detection
and depth estimation, we propose MonoDETRNext, which seeks to optimally balance
precision and processing speed. Our methodology includes the development of an
efficient hybrid visual encoder, enhancement of depth prediction mechanisms,
and introduction of an innovative query generation strategy, augmented by an
advanced depth predictor. Building on MonoDETR, MonoDETRNext introduces two
variants: MonoDETRNext-F, which emphasizes speed, and MonoDETRNext-A, which
focuses on precision. We posit that MonoDETRNext establishes a new benchmark in
monocular 3D object detection and opens avenues for future research. We
conducted an exhaustive evaluation demonstrating the model's superior
performance against existing solutions. Notably, MonoDETRNext-A demonstrated a
4.60% improvement in the AP3D metric on the KITTI test benchmark over MonoDETR,
while MonoDETRNext-F showed a 2.21% increase. Additionally, the computational
efficiency of MonoDETRNext-F slightly exceeds that of its predecessor. |
Proposes MonoDETRNext, a monocular 3D object detection model with two variants: MonoDETRNext-F (speed-focused) and MonoDETRNext-A (accuracy-focused), improving upon MonoDETR. |
Monocular 3D object detection is crucial for applications with limited resources, but existing methods struggle with accuracy and efficiency. |
Develops an efficient hybrid visual encoder, enhances depth prediction mechanisms, and introduces a novel query generation strategy augmented by an advanced depth predictor. |
MonoDETRNext-A shows 4.60% improvement in AP3D on KITTI over MonoDETR.
MonoDETRNext-F shows 2.21% improvement in AP3D on KITTI over MonoDETR.
MonoDETRNext-F slightly surpasses MonoDETR in computational efficiency. |
Accuracy gap persists compared to multi-view or sensor fusion methods.
Limited dataset availability for evaluation and comparison with other monocular methods. |
3d object detection, monocular vision, depth prediction, efficient encoder, query generation |
2405.15125
Report |
HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting |
Yuanhao Cai, Zihao Xiao, Yixun Liang, Yulun Zhang, Xiaokang Yang, Yaoyao Liu, Alan Yuille |
High dynamic range (HDR) novel view synthesis (NVS) aims to create
photorealistic images from novel viewpoints using HDR imaging techniques. The
rendered HDR images capture a wider range of brightness levels containing more
details of the scene than normal low dynamic range (LDR) images. Existing HDR
NVS methods are mainly based on NeRF. They suffer from long training time and
slow inference speed. In this paper, we propose a new framework, High Dynamic
Range Gaussian Splatting (HDR-GS), which can efficiently render novel HDR views
and reconstruct LDR images with a user input exposure time. Specifically, we
design a Dual Dynamic Range (DDR) Gaussian point cloud model that uses
spherical harmonics to fit HDR color and employs an MLP-based tone-mapper to
render LDR color. The HDR and LDR colors are then fed into two Parallel
Differentiable Rasterization (PDR) processes to reconstruct HDR and LDR views.
To establish the data foundation for the research of 3D Gaussian
splatting-based methods in HDR NVS, we recalibrate the camera parameters and
compute the initial positions for Gaussian point clouds. Experiments
demonstrate that our HDR-GS surpasses the state-of-the-art NeRF-based method by
3.84 and 1.91 dB on LDR and HDR NVS while enjoying 1000x inference speed and
only requiring 6.3% training time. |
This paper introduces HDR-GS, the first Gaussian Splatting-based framework for efficient high dynamic range (HDR) novel view synthesis. |
Existing HDR novel view synthesis methods, primarily based on NeRF, suffer from long training times and slow inference speeds, limiting their practical applications. |
HDR-GS utilizes a Dual Dynamic Range (DDR) Gaussian point cloud model to jointly represent HDR and LDR colors. It employs spherical harmonics for HDR color and an MLP-based tone-mapper for LDR color rendering. Two parallel differentiable rasterization processes then generate HDR and LDR views from these colors. Additionally, the paper recalibrates camera parameters and utilizes SfM points for initializing 3D Gaussians, addressing limitations of previous datasets. |
HDR-GS outperforms state-of-the-art NeRF-based methods by 1.91 dB on HDR novel view synthesis.
HDR-GS achieves a 1000x faster inference speed compared to NeRF-based counterparts.
HDR-GS significantly reduces training time, requiring only 6.3% of the time needed for SOTA methods. |
The paper mainly focuses on static scenes.
Future work could explore the application of HDR-GS in dynamic scene modeling. |
novel view synthesis, high dynamic range imaging, gaussian splatting, 3d reconstruction, computer vision |
2405.15118
Report |
GS-Hider: Hiding Messages into 3D Gaussian Splatting |
Xuanyu Zhang, Jiarui Meng, Runyi Li, Zhipei Xu, Yongbing Zhang, Jian Zhang |
3D Gaussian Splatting (3DGS) has already become an emerging research focus
in the fields of 3D scene reconstruction and novel view synthesis. Given that
training a 3DGS requires a significant amount of time and computational cost,
it is crucial to protect the copyright, integrity, and privacy of such 3D
assets. Steganography, as a crucial technique for encrypted transmission and
copyright protection, has been extensively studied. However, it still lacks
profound exploration targeted at 3DGS. Unlike its predecessor NeRF, 3DGS
possesses two distinct features: 1) explicit 3D representation; and 2)
real-time rendering speeds. These characteristics result in the 3DGS point
cloud files being public and transparent, with each Gaussian point having a
clear physical significance. Therefore, ensuring the security and fidelity of
the original 3D scene while embedding information into the 3DGS point cloud
files is an extremely challenging task. To solve the above-mentioned issue, we
first propose a steganography framework for 3DGS, dubbed GS-Hider, which can
embed 3D scenes and images into original GS point clouds in an invisible manner
and accurately extract the hidden messages. Specifically, we design a coupled
secured feature attribute to replace the original 3DGS's spherical harmonics
coefficients and then use a scene decoder and a message decoder to disentangle
the original RGB scene and the hidden message. Extensive experiments
demonstrated that the proposed GS-Hider can effectively conceal multimodal
messages without compromising rendering quality and possesses exceptional
security, robustness, capacity, and flexibility. Our project is available at:
https://xuanyuzhang21.github.io/project/gshider. |
This paper introduces GS-Hider, a novel steganography framework for 3D Gaussian Splatting (3DGS) capable of concealing 3D scenes or images within other 3D scenes. |
Protecting the copyright and privacy of 3D assets is crucial due to the high time and computational cost of training a 3DGS. This method offers a solution for secure communication and copyright protection of 3D scenes. |
GS-Hider replaces the original 3DGS spherical harmonics coefficients with a coupled secured feature attribute. It then utilizes a scene decoder and a private message decoder to disentangle the original and hidden content from the coupled features. |
GS-Hider achieves high fidelity, with minimal degradation in rendering quality compared to the original 3DGS.
The method ensures robust security, making it difficult for unauthorized users to extract the hidden content.
GS-Hider demonstrates large capacity and versatility by enabling the hiding of single images or even multiple 3D scenes within a single 3D scene. |
The current approach does not consider view dependency, potentially impacting rendering quality.
Rendering speed is slightly reduced due to high-dimensional feature rasterization and multi-layer convolutional decoding, though it remains within real-time requirements. |
3d gaussian splatting, steganography, copyright protection, secure communication, 3d scene reconstruction |
2405.15056
Report |
ElastoGen: 4D Generative Elastodynamics |
Yutao Feng, Yintong Shang, Xiang Feng, Lei Lan, Shandian Zhe, Tianjia Shao, Hongzhi Wu, Kun Zhou, Hao Su, Chenfanfu Jiang, Yin Yang |
We present ElastoGen, a knowledge-driven model that generates physically
accurate and coherent 4D elastodynamics. Instead of relying on petabyte-scale
data-driven learning, ElastoGen leverages the principles of physics-in-the-loop
and learns from established physical knowledge, such as partial differential
equations and their numerical solutions. The core idea of ElastoGen is
converting the global differential operator, corresponding to the nonlinear
elastodynamic equations, into iterative local convolution-like operations,
which naturally fit modern neural networks. Each network module is specifically
designed to support this goal rather than functioning as a black box. As a
result, ElastoGen is exceptionally lightweight in terms of both training
requirements and network scale. Additionally, due to its alignment with
physical procedures, ElastoGen efficiently generates accurate dynamics for a
wide range of hyperelastic materials and can be easily integrated with upstream
and downstream deep modules to enable end-to-end 4D generation. |
ElastoGen is a knowledge-driven model for generating physically accurate and coherent 4D elastodynamics, leveraging physical laws and principles instead of petabyte-scale data. |
Learning physical dynamics from observable data is challenging due to noise and agnostic underlying coherence. Existing deep models struggle with temporal consistency and require vast data. ElastoGen addresses these issues by incorporating established physical knowledge. |
ElastoGen converts the global differential operator of elastodynamic equations into iterative local convolution-like operations. It utilizes a neural metric with diffusion-based parameterization and a general subspace method for efficient matrix-free computation. |
ElastoGen generates accurate elastodynamics for various shapes and hyperelastic materials (Neo-Hookean, StVK, etc.) with minimal parameterization and lightweight training.
The model is compatible with different geometric representations, including voxels, implicit NeRFs, and complex explicit meshes.
Experiments demonstrate that ElastoGen produces results comparable to traditional FEM simulations while offering greater efficiency. |
ElastoGen currently lacks support for collisions, limiting its applicability in scenarios involving interacting objects.
The computational efficiency can be further improved, especially for large, sparse models where convolutions over empty voxels are computationally expensive. |
physics-based simulation, 4d generation, elastodynamics, diffusion models, neural networks |
2405.15020
Report |
AdjointDEIS: Efficient Gradients for Diffusion Models |
Zander W. Blasingame, Chen Liu |
The optimization of the latents and parameters of diffusion models with
respect to some differentiable metric defined on the output of the model is a
challenging and complex problem. The sampling for diffusion models is done by
solving either the probability flow ODE or diffusion SDE wherein a neural
network approximates the score function or related quantity, allowing a
numerical ODE/SDE solver to be used. However, naïve backpropagation
techniques are memory intensive, requiring the storage of all intermediate
states, and face additional complexity in handling the injected noise from the
diffusion term of the diffusion SDE. We propose a novel method based on the
stochastic adjoint sensitivity method to calculate the gradient with respect to
the initial noise, conditional information, and model parameters by solving an
additional SDE whose solution is the gradient of the diffusion SDE. We exploit
the unique construction of diffusion SDEs to further simplify the formulation
of the adjoint diffusion SDE and use a change-of-variables to simplify the
solution to an exponentially weighted integral. Using this formulation we
derive a custom solver for the adjoint SDE as well as the simpler adjoint ODE.
The proposed adjoint diffusion solvers can efficiently compute the gradients
for both the probability flow ODE and diffusion SDE for latents and parameters
of the model. Lastly, we demonstrate the effectiveness of the adjoint diffusion
solvers on the face morphing problem. |
The paper introduces AdjointDEIS, a novel method for calculating gradients of diffusion models with respect to latents and parameters by solving an adjoint diffusion SDE, enabling efficient optimization of diffusion models. |
Optimizing diffusion models for specific tasks is challenging due to memory-intensive backpropagation and complexities in handling diffusion noise. AdjointDEIS addresses these challenges, enabling guided generation and adaptation of pre-trained models. |
The authors leverage the stochastic adjoint sensitivity method to derive an adjoint probability flow ODE and its simplified formulation using exponential integrators. They propose custom first and second-order solvers for both ODE and SDE settings. |
AdjointDEIS is the first general backpropagation technique for diffusion models using SDE solvers, providing gradients for network weights, conditional information, and noisy states.
Custom solvers for the adjoint ODE/SDE demonstrate efficient computation of gradients.
AdjointDEIS shows effectiveness in guided generation, specifically for face morphing attacks, outperforming existing methods in visual quality and attack efficacy. |
Further analysis is needed to evaluate AdjointDEIS on diverse guided generation tasks.
Theoretical convergence rates for the proposed solvers are not yet established. |
diffusion models, adjoint sensitivity method, guided generation, face morphing attack, exponential integrators |
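As a rough illustration of the adjoint sensitivity idea that the AdjointDEIS entry above builds on, the sketch below computes a gradient with respect to the initial state of a toy scalar ODE by integrating an adjoint ODE backward in time. It is not the AdjointDEIS solver: the linear dynamics `dx/dt = -k * x`, the loss `L = 0.5 * x(T)**2`, the explicit Euler steps, and the function names `forward_euler` and `adjoint_gradient` are all assumptions chosen for brevity.

```python
import math


def forward_euler(x0, k, T, n_steps):
    """Integrate the toy ODE dx/dt = -k * x from t = 0 to t = T (explicit Euler)."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + dt * (-k * x)
    return x


def adjoint_gradient(x0, k, T, n_steps):
    """Return dL/dx0 for L = 0.5 * x(T)**2 by solving the adjoint ODE backward.

    The adjoint a(t) = dL/dx(t) satisfies da/dt = -a * (df/dx) = k * a with
    terminal condition a(T) = x(T). For this linear toy problem df/dx does not
    depend on x, so no intermediate states need to be stored; a nonlinear model
    would have to re-solve or checkpoint x(t) during the backward pass.
    """
    dt = T / n_steps
    a = forward_euler(x0, k, T, n_steps)  # a(T) = dL/dx(T) = x(T)
    for _ in range(n_steps):
        a = a - dt * (k * a)  # Euler step from t to t - dt
    return a


if __name__ == "__main__":
    x0, k, T = 1.5, 2.0, 1.0
    grad = adjoint_gradient(x0, k, T, n_steps=1000)
    exact = x0 * math.exp(-2.0 * k * T)  # closed-form gradient of the continuous problem
    print(grad, exact)
```

The point of the sketch is the memory pattern: the backward pass carries only the adjoint state, instead of backpropagating through every stored solver step.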
2405.14979
Report |
CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner |
Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, Xiaoxiao Long |
We present a novel generative 3D modeling system, coined CraftsMan, which can
generate high-fidelity 3D geometries with highly varied shapes, regular mesh
topologies, and detailed surfaces, and, notably, allows for refining the
geometry in an interactive manner. Despite the significant advancements in 3D
generation, existing methods still struggle with lengthy optimization
processes, irregular mesh topologies, noisy surfaces, and difficulties in
accommodating user edits, consequently impeding their widespread adoption and
implementation in 3D modeling software. Our work is inspired by the craftsman,
who usually roughs out the holistic figure of the work first and elaborates the
surface details subsequently. Specifically, we employ a 3D native diffusion
model, which operates on latent space learned from latent set-based 3D
representations, to generate coarse geometries with regular mesh topology in
seconds. In particular, this process takes as input a text prompt or a
reference image and leverages a powerful multi-view (MV) diffusion model to
generate multiple views of the coarse geometry, which are fed into our
MV-conditioned 3D diffusion model for generating the 3D geometry, significantly
improving robustness and generalizability. Following that, a normal-based
geometry refiner is used to significantly enhance the surface details. This
refinement can be performed automatically, or interactively with user-supplied
edits. Extensive experiments demonstrate that our method achieves high efficacy
in producing superior-quality 3D assets compared to existing methods. HomePage:
https://craftsman3d.github.io/, Code: https://github.com/wyysf-98/CraftsMan |
CraftsMan is a novel generative 3D modeling system that generates high-fidelity 3D geometries from a single image or text prompt, featuring regular mesh topologies, detailed surfaces, and interactive refinement capabilities. |
Existing 3D generation methods struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, limiting their practical use. |
The system uses a two-stage process: (1) A 3D native diffusion model, conditioned on multi-view images from a multi-view diffusion model, generates coarse 3D geometries. (2) A normal-based geometry refiner, leveraging ControlNet-tile and surface normal map diffusion, enhances surface details either automatically or interactively based on user edits. |
Generates high-fidelity 3D geometries with regular mesh topologies and detailed surfaces in 30 seconds.
Exhibits superior quality and detail richness compared to existing 3D generative and reconstruction models, as demonstrated by qualitative and quantitative evaluations.
Offers interactive refinement tools, such as the Magic Normal Brush, allowing users to efficiently edit specific areas of the generated mesh. |
Limited controllability of the Latent Set Diffusion model.
Future work includes exploring texture generation for 3D meshes. |
3d generation, diffusion models, mesh refinement, interactive modeling, generative ai |
2405.14874
Report |
Investigating Robustness of Open-Vocabulary Foundation Object Detectors under Distribution Shifts |
Prakash Chandra Chhipa, Kanjar De, Meenakshi Subhash Chippa, Rajkumar Saini, Marcus Liwicki |
The challenge of Out-Of-Distribution (OOD) robustness remains a critical
hurdle towards deploying deep vision models. Open-vocabulary object detection
extends the capabilities of traditional object detection frameworks to
recognize and classify objects beyond predefined categories. Investigating OOD
robustness in open-vocabulary object detection is essential to increase the
trustworthiness of these models. This study presents a comprehensive robustness
comparison of zero-shot capabilities of three recent open-vocabulary foundation
object detection models, namely OWL-ViT, YOLO World, and Grounding DINO.
Experiments carried out on the COCO-O and COCO-C benchmarks encompassing
distribution shifts highlight the challenges of the models' robustness. Source
code shall be made available to the research community on GitHub. |
This paper presents a comparative robustness analysis of three state-of-the-art open-vocabulary object detection models (OWL-ViT, YOLO World, and Grounding DINO) under out-of-distribution (OOD) conditions. |
Investigating OOD robustness in open-vocabulary object detection is crucial for increasing the trustworthiness and reliability of these models in real-world applications where they might encounter unseen data. |
The authors evaluate the zero-shot performance of the models on the COCO-O and COCO-C benchmarks, which introduce distribution shifts through various image degradations and corruptions. |
All three models exhibit significant performance drops on OOD data, indicating a need for improved robustness.
Grounding DINO demonstrates the highest robustness, maintaining performance closer to its original COCO results compared to the other models.
The study highlights the increasing susceptibility of the models to performance degradation as the severity of corruptions increases. |
The study primarily focuses on zero-shot evaluation and could be extended to include few-shot learning scenarios.
Future work could explore techniques to enhance the robustness of open-vocabulary object detectors, such as prompt engineering or incorporating robustness-enhancing training strategies. |
open-vocabulary object detection, out-of-distribution robustness, zero-shot learning, distribution shift, computer vision |
2405.14871
Report |
NeRF-Casting: Improved View-Dependent Appearance with Consistent Reflections |
Dor Verbin, Pratul P. Srinivasan, Peter Hedman, Ben Mildenhall, Benjamin Attal, Richard Szeliski, Jonathan T. Barron |
Neural Radiance Fields (NeRFs) typically struggle to reconstruct and render
highly specular objects, whose appearance varies quickly with changes in
viewpoint. Recent works have improved NeRF's ability to render detailed
specular appearance of distant environment illumination, but are unable to
synthesize consistent reflections of closer content. Moreover, these techniques
rely on large computationally-expensive neural networks to model outgoing
radiance, which severely limits optimization and rendering speed. We address
these issues with an approach based on ray tracing: instead of querying an
expensive neural network for the outgoing view-dependent radiance at points
along each camera ray, our model casts reflection rays from these points and
traces them through the NeRF representation to render feature vectors which are
decoded into color using a small inexpensive network. We demonstrate that our
model outperforms prior methods for view synthesis of scenes containing shiny
objects, and that it is the only existing NeRF method that can synthesize
photorealistic specular appearance and reflections in real-world scenes, while
requiring comparable optimization time to current state-of-the-art view
synthesis models. |
The paper introduces NeRF-Casting, a novel NeRF-based method that uses ray tracing to improve the rendering of specular reflections in 3D scenes. |
Existing NeRF models struggle to efficiently and accurately reconstruct and render scenes containing highly specular, glossy objects. |
Instead of relying on large, expensive MLPs, NeRF-Casting casts reflection rays from points along camera rays, tracing them through the learned NeRF representation. This allows for the synthesis of consistent reflections from both near-field and distant scene content. To enhance efficiency and prevent aliasing, the method employs directional sampling and feature downweighting techniques. |
Outperforms prior methods in view synthesis of scenes with shiny objects, particularly excelling in synthesizing high-quality reflections of nearby content.
Demonstrates a qualitative improvement over existing techniques in achieving realistic and consistent motion of reflections as the camera moves through the scene.
Achieves comparable optimization time to state-of-the-art view synthesis models while requiring less compute during inference. |
Struggles to render semi-transparent surfaces because it reflects from a single expected termination point per ray.
Future work could address the visibility of the camera in reflections, which the model does not currently account for. |
neural radiance fields, view synthesis, reflections, ray tracing, specular appearance |
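The reflection rays mentioned in the NeRF-Casting entry above start from the standard mirror-reflection formula, r = d - 2 (d · n) n, where d is the unit view direction and n the unit surface normal. The sketch below shows only that formula; the helper names `normalize` and `reflect` are hypothetical, and the paper's actual sampling and feature-decoding steps are not reproduced here.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def reflect(view_dir, normal):
    """Mirror-reflect the incoming direction view_dir about the surface normal.

    Both inputs are 3-tuples and are normalized internally, so the returned
    direction r = d - 2 * (d . n) * n also has unit length.
    """
    d = normalize(view_dir)
    n = normalize(normal)
    d_dot_n = sum(a * b for a, b in zip(d, n))
    return tuple(a - 2.0 * d_dot_n * b for a, b in zip(d, n))


if __name__ == "__main__":
    # A ray travelling down and to the right hits a floor whose normal points up.
    print(reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0)))
```

In a renderer, a ray with this direction would be traced from the surface point back through the scene representation to gather the reflected content.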
2405.14868
Report |
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis |
Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick |
Accurate reconstruction of complex dynamic scenes from just a single
viewpoint continues to be a challenging task in computer vision. Current
dynamic novel view synthesis methods typically require videos from many
different camera viewpoints, necessitating careful recording setups, and
significantly restricting their utility in the wild as well as in terms of
embodied AI applications. In this paper, we propose GCD, a
controllable monocular dynamic view synthesis pipeline that leverages
large-scale diffusion priors to, given a video of any scene, generate a
synchronous video from any other chosen perspective, conditioned on a set of
relative camera pose parameters. Our model does not require depth as input, and
does not explicitly model 3D scene geometry, instead performing end-to-end
video-to-video translation in order to achieve its goal efficiently. Despite
being trained on synthetic multi-view video data only, zero-shot real-world
generalization experiments show promising results in multiple domains,
including robotics, object permanence, and driving environments. We believe our
framework can potentially unlock powerful applications in rich dynamic scene
understanding, perception for robotics, and interactive 3D video viewing
experiences for virtual reality. |
This paper presents Generative Camera Dolly (GCD), a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to generate synchronous videos from arbitrary perspectives given a single video of a scene and relative camera pose parameters. |
Accurate reconstruction of complex dynamic scenes from a single viewpoint is crucial for applications in robotics, autonomous driving, and immersive VR experiences. Existing methods often require multi-view videos or are limited to small viewpoint changes, restricting their practical utility. |
GCD leverages a pre-trained video diffusion model (Stable Video Diffusion) and fine-tunes it on paired videos from simulations. The model utilizes a novel micro-conditioning mechanism to control camera parameters and learns to generate videos from novel viewpoints by gradually interpolating between source and target camera poses. |
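The gradual interpolation between source and target camera poses can be illustrated with a minimal sketch. It assumes the relative pose is a tuple of (azimuth, elevation, radius) offsets and that the ramp is linear over frames; both the parameterization and the function name `interpolate_camera_path` are hypothetical and are not taken from the GCD code.

```python
def interpolate_camera_path(target_offset, num_frames):
    """Ramp linearly from the source view (zero offset) to the target relative pose.

    target_offset: tuple of relative pose parameters, e.g. (azimuth_deg,
                   elevation_deg, radius_m). This layout is an assumption.
    num_frames:    number of video frames; one pose is produced per frame.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to interpolate")
    path = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        path.append(tuple(alpha * v for v in target_offset))
    return path


if __name__ == "__main__":
    for pose in interpolate_camera_path((90.0, 10.0, 1.0), 5):
        print(pose)
```

Under these assumptions the first frame keeps the input viewpoint and the last frame reaches the requested target pose, so each output frame is conditioned on a camera that moves a little further from the source.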
GCD achieves state-of-the-art results on monocular dynamic view synthesis, outperforming baselines by a large margin on Kubric-4D and ParallelDomain-4D datasets.
The model demonstrates strong generalization capabilities, producing plausible novel views for various real-world videos, including driving, indoor, and robotic manipulation scenes.
GCD effectively handles large camera viewpoint changes, revealing unseen portions of the scene and reconstructing occluded objects. |
GCD may struggle with out-of-distribution real-world videos, particularly those involving highly deformable objects or complex human motion.
The model's performance can be sensitive to the choice of camera trajectory and interpolation method. |
dynamic view synthesis, video diffusion models, monocular depth estimation, camera pose control, scene understanding |
2405.14866
Report |
Tele-Aloha: A Low-budget and High-authenticity Telepresence System Using Sparse RGB Cameras |
Hanzhang Tu, Ruizhi Shao, Xue Dong, Shunyuan Zheng, Hao Zhang, Lili Chen, Meili Wang, Wenyu Li, Siyan Ma, Shengping Zhang, Boyao Zhou, Yebin Liu |
In this paper, we present a low-budget and high-authenticity bidirectional
telepresence system, Tele-Aloha, targeting peer-to-peer communication
scenarios. Compared to previous systems, Tele-Aloha utilizes only four sparse
RGB cameras, one consumer-grade GPU, and one autostereoscopic screen to achieve
high-resolution (2048x2048), real-time (30 fps), low-latency (less than 150ms)
and robust distant communication. As the core of Tele-Aloha, we propose an
efficient novel view synthesis algorithm for the upper body. First, we design a
cascaded disparity estimator for obtaining a robust geometry cue. Additionally,
a neural rasterizer via Gaussian Splatting is introduced to project latent
features onto the target view and to decode them at a reduced resolution.
Further, given the high-quality captured data, we leverage a weighted blending
mechanism to refine the decoded image to the final resolution of 2K.
Exploiting a world-leading autostereoscopic display and low-latency iris
tracking, users are able to experience a strong three-dimensional sense even
without any wearable head-mounted display device. Altogether, our telepresence
system demonstrates the sense of co-presence in real-life experiments,
inspiring the next generation of communication. |
This paper presents Tele-Aloha, a low-budget and high-authenticity bidirectional telepresence system using sparse RGB cameras for peer-to-peer communication. |
Existing telepresence systems are often expensive, require complex hardware setups, and rely on depth sensors that can be sensitive to environmental factors. |
The system uses four sparse RGB cameras, a consumer-grade GPU, and an autostereoscopic screen. It introduces a novel view synthesis algorithm with a cascaded disparity estimator for robust geometry cues and a neural rasterizer based on 3D Gaussian Splatting for high-quality rendering. |
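A disparity estimate becomes a geometry cue through the standard rectified-stereo relation depth = focal length x baseline / disparity. The sketch below only illustrates that relation; the function name and the camera values are hypothetical and do not describe Tele-Aloha's calibration or its cascaded estimator.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Convert a disparity (pixels) into metric depth for a rectified stereo pair.

    disparity_px: horizontal pixel shift of a point between the two views
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical camera: 1000 px focal length, 0.5 m baseline.
    print(depth_from_disparity(50.0, 1000.0, 0.5))
```

Because depth is inversely proportional to disparity, small disparity errors on distant or specular surfaces translate into large depth errors, which is consistent with the failure cases listed for specular objects.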
Achieves high-resolution (2048x2048), real-time (30 fps), low-latency (less than 150ms) performance.
Produces competitive depth maps compared to TOF sensors using only RGB cameras.
Outperforms other efficient novel view synthesis algorithms in terms of rendering quality on a synthetic dataset. |
System may fail on specular objects due to challenges in disparity estimation.
Potential issues with inaccurate background segmentation can lead to artifacts. |
telepresence, videoconferencing, novel view synthesis, 3d gaussian splatting, rgb-only |
2405.14858
Report |
Mamba-R: Vision Mamba ALSO Needs Registers |
Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie |
Similar to Vision Transformers, this paper identifies artifacts also present
within the feature maps of Vision Mamba. These artifacts, corresponding to
high-norm tokens emerging in low-information background areas of images, appear
much more severe in Vision Mamba -- they exist prevalently even with the
tiny-sized model and activate extensively across background regions. To
mitigate this issue, we follow the prior solution of introducing register
tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional
inference paradigm, two key modifications are introduced: 1) evenly inserting
registers throughout the input token sequence, and 2) recycling registers for
final decision predictions. We term this new architecture Mamba-R. Qualitative
observations suggest, compared to vanilla Vision Mamba, Mamba-R's feature maps
appear cleaner and more focused on semantically meaningful regions.
Quantitatively, Mamba-R attains stronger performance and scales better. For
example, on the ImageNet benchmark, our base-size Mamba-R attains 82.9%
accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide
the first successful scaling to the large model size (i.e., with 341M
parameters), attaining a competitive accuracy of 83.2% (84.5% if finetuned with
384x384 inputs). Additional validation on the downstream semantic segmentation
task also supports Mamba-R's efficacy. |
This paper identifies severe feature artifacts in Vision Mamba models, similar to but worse than those in ViTs, and proposes Mamba®—a novel architecture that incorporates register tokens to mitigate this issue. |
Addressing the artifact issue in Vision Mamba is crucial as these artifacts hinder feature extraction, limit scalability, and negatively impact performance. |
The paper introduces Mamba®, which builds upon Vision Mamba by incorporating two key modifications: 1) evenly inserting register tokens throughout the input token sequence and 2) recycling registers for final decision predictions. |
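The two modifications can be sketched on plain Python lists. This is a minimal illustration, assuming one register is placed in front of each of `r` nearly equal segments of patch tokens; the exact placement rule and the helper names are assumptions, not the authors' implementation.

```python
def insert_registers_evenly(patch_tokens, register_tokens):
    """Interleave registers so they are spread evenly through the sequence.

    Returns the new sequence and the index of every register inside it.
    """
    n, r = len(patch_tokens), len(register_tokens)
    if r == 0:
        return list(patch_tokens), []
    sequence, register_positions = [], []
    for i, reg in enumerate(register_tokens):
        start, end = i * n // r, (i + 1) * n // r  # bounds of segment i
        register_positions.append(len(sequence))
        sequence.append(reg)
        sequence.extend(patch_tokens[start:end])
    return sequence, register_positions


def recycle_registers(outputs, register_positions):
    """Collect the outputs at the register positions for the final prediction."""
    return [outputs[p] for p in register_positions]


if __name__ == "__main__":
    patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
    registers = ["R0", "R1", "R2"]
    seq, pos = insert_registers_evenly(patches, registers)
    print(seq)   # registers interleaved with patch tokens
    print(pos)   # where the registers ended up
    print(recycle_registers(seq, pos))
```

In a real model the gathered register outputs would be concatenated or projected before the classification head; here they are simply returned as a list.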
Mamba® effectively suppresses artifacts, resulting in cleaner feature maps that focus on semantically meaningful image regions.
Quantitatively, Mamba® significantly outperforms vanilla Vision Mamba on ImageNet, achieving 82.9% accuracy for the Base model.
Mamba® exhibits superior scalability compared to previous Vision Mamba models, effectively scaling to a Large size with 341M parameters and achieving 83.2% accuracy on ImageNet. |
The paper primarily focuses on image classification, leaving exploration of Mamba® in other vision tasks for future work.
Further investigation into the interpretability of register tokens, particularly their potential for multi-head-like behavior, is warranted. |
vision mamba, state space models, feature artifacts, register tokens, image classification, semantic segmentation |
2405.14857
Report |
Semantica: An Adaptable Image-Conditioned Diffusion Model |
Manoj Kumar, Neil Houlsby, Emiel Hoogeboom |
We investigate the task of adapting image generative models to different
datasets without finetuning. To this end, we introduce Semantica, an
image-conditioned diffusion model capable of generating images based on the
semantics of a conditioning image. Semantica is trained exclusively on
web-scale image pairs, that is, it receives a random image from a webpage as
conditional input and models another random image from the same webpage. Our
experiments highlight the expressivity of pretrained image encoders and the
necessity of semantic-based data filtering in achieving high-quality image
generation. Once trained, it can adaptively generate new images from a dataset
by simply using images from that dataset as input. We study the transfer
properties of Semantica on ImageNet, LSUN Churches, LSUN Bedroom and SUN397. |
This paper introduces Semantica, an image-conditioned diffusion model that adapts to different datasets without finetuning by leveraging semantic information from a conditioning image. |
Adapting generative models to new datasets usually requires finetuning, which is impractical for large models and datasets. This paper explores an alternative based on in-context learning. |
Semantica consists of a pretrained image encoder (DINOv2) and a diffusion model. It is trained on image pairs from the same webpage, learning to generate images that share semantic content. Data filtering based on semantic similarity is used to improve performance. |
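Filtering pairs by semantic similarity can be sketched as a cosine-similarity threshold on encoder embeddings. The sketch assumes each image is already embedded as a list of floats; the threshold value and the function names are hypothetical, and the actual filtering criterion used for Semantica may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, min_similarity=0.5):
    """Keep only (embedding_a, embedding_b) pairs that are similar enough.

    min_similarity is an illustrative value, not one reported in the paper.
    """
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= min_similarity]


if __name__ == "__main__":
    pairs = [
        ([1.0, 0.0], [1.0, 0.0]),  # same direction, kept
        ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, dropped
    ]
    print(len(filter_pairs(pairs)))
```

The motivation is that two random images from the same webpage are not guaranteed to be related, so dropping dissimilar pairs removes training examples in which the conditioning image carries no usable signal.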
Token-level conditioning from the image encoder outperforms global feature conditioning.
Semantic data filtering significantly improves generation quality.
Semantica generalizes well to unseen datasets, outperforming a label-conditioned baseline on out-of-distribution datasets. |
Training Semantica requires significant computational resources.
The model relies on a frozen encoder, which could limit performance. |
generative models, diffusion models, image generation, transfer learning, in-context learning |
2405.14855
Report |
Synergistic Global-space Camera and Human Reconstruction from Videos |
Yizhou Zhao, Tuanfeng Y. Wang, Bhiksha Raj, Min Xu, Jimei Yang, Chun-Hao Paul Huang |
Remarkable strides have been made in reconstructing static scenes or human
bodies from monocular videos. Yet, the two problems have largely been
approached independently, without much synergy. Most visual SLAM methods can
only reconstruct camera trajectories and scene structures up to scale, while
most HMR methods reconstruct human meshes in metric scale but fall short in
reasoning with cameras and scenes. This work introduces Synergistic Camera and
Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically,
we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and
scene point clouds using camera-frame HMR as a strong prior, addressing depth,
scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we
further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by
incorporating spatio-temporal coherency and dynamic scene constraints.
Together, they lead to consistent reconstructions of camera trajectories, human
meshes, and dense scene point clouds in a common world frame. Project page:
https://paulchhuang.github.io/synchmr |
This paper introduces SynCHMR, a novel pipeline that reconstructs metric-scale camera trajectories, human meshes, and dense scene point clouds from monocular videos by jointly optimizing human mesh recovery and SLAM. |
Existing methods for reconstructing humans and scenes from videos often treat these problems independently, leading to inconsistencies and ambiguities in scale, depth, and dynamic movements. |
The pipeline uses camera-frame human mesh estimates as a prior to disambiguate SLAM and calibrate depth. Subsequently, a Scene-aware SMPL Denoiser refines human mesh poses in the world frame by leveraging the reconstructed dynamic scene. |
SynCHMR outperforms state-of-the-art methods in global human motion estimation on EgoBody dataset.
Human-aware Metric SLAM effectively calibrates monocular depth and improves camera pose estimation.
Scene-aware SMPL Denoiser effectively leverages scene information to improve human mesh denoising. |
The method currently uses an approximated focal length, potentially limiting accuracy in cases with significant perspective distortion.
Handling subjects with body shapes not well-represented in the SMPL model remains an open challenge. |
human mesh recovery, slam, 3d human reconstruction, scene reconstruction, monocular video |
2405.14854
Report |
TerDiT: Ternary Diffusion Models with Transformers |
Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Yafei Wen, Shuai Ren, Peng Gao, Junchi Yan, Hongsheng Li |
Recent developments in large-scale pre-trained text-to-image diffusion models
have significantly improved the generation of high-fidelity images,
particularly with the emergence of diffusion models based on transformer
architecture (DiTs). Among these diffusion models, diffusion transformers have
demonstrated superior image generation capabilities, boasting lower FID scores
and higher scalability. However, deploying large-scale DiT models can be
expensive due to their extensive parameter numbers. Although existing research
has explored efficient deployment techniques for diffusion models such as model
quantization, there is still little work concerning DiT-based models. To tackle
this research gap, in this paper, we propose TerDiT, a quantization-aware
training (QAT) and efficient deployment scheme for ternary diffusion models
with transformers. We focus on the ternarization of DiT networks and scale
model sizes from 600M to 4.2B. Our work contributes to the exploration of
efficient deployment strategies for large-scale DiT models, demonstrating the
feasibility of training extremely low-bit diffusion transformer models from
scratch while maintaining competitive image generation capacities compared to
full-precision models. Code will be available at
https://github.com/Lucky-Lance/TerDiT. |
This paper presents TerDiT, a novel quantization-aware training and efficient deployment scheme for ternary diffusion transformer models, significantly reducing model size and memory consumption. |
Large-scale DiT models, while powerful, are expensive to deploy due to their extensive parameter count. Quantization offers a solution, but existing methods are either limited to U-Net models or rely on less effective post-training techniques. |
Leveraging quantization-aware training (QAT), the method ternarizes DiT network weights and introduces RMS normalization within the adaLN module to enhance training stability and performance. Deployment is achieved using a 2-bit implementation for practical efficiency. |
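Weight ternarization and 2-bit storage can be illustrated with a small sketch. It assumes an absmean scaling rule (divide by the mean absolute weight, round, clip to {-1, 0, 1}) and a packing of four ternary values per byte; whether TerDiT uses exactly this rule or this bit layout is an assumption, and all function names are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared float scale."""
    if not weights:
        raise ValueError("weights must be non-empty")
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def pack_2bit(quantized):
    """Store four ternary values per byte, encoding each as value + 1 (0, 1 or 2)."""
    packed = bytearray()
    for i in range(0, len(quantized), 4):
        byte = 0
        for j, value in enumerate(quantized[i:i + 4]):
            byte |= (value + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, count):
    """Recover the first `count` ternary values from the packed bytes."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]


if __name__ == "__main__":
    q, s = ternarize([0.9, -0.05, -1.2, 0.4, 0.0])
    blob = pack_2bit(q)
    print(q, len(blob), unpack_2bit(blob, len(q)) == q)
```

Storing 2 bits instead of 32 per weight gives at most a 16x reduction before accounting for scales and any layers kept in higher precision, which is in line with the reported reduction of more than tenfold in checkpoint size.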
TerDiT-4.2B achieves comparable image generation quality to full-precision DiT-XL/2 on ImageNet 256x256 benchmark, even with fewer training images.
Deployment efficiency is significantly improved, with over tenfold reduction in checkpoint size and about sixfold reduction in inference memory consumption compared to full-precision counterparts.
The introduced RMS normalized adaLN module is shown to accelerate convergence and enhance performance compared to directly ternarizing the adaLN module. |
Training ternary DiT models remains less stable and more time-consuming than full-precision networks, demanding further research to improve training efficiency.
Experiments are limited to ImageNet 256x256 resolution and label-conditioned generation due to computational resource constraints, leaving exploration of higher resolutions and text-to-image generation for future work. |
diffusion models, quantization, ternary networks, diffusion transformer (dit), efficient deployment |
2405.14832
Report |
Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer |
Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao |
Generating high-quality 3D assets from text and images has long been
challenging, primarily due to the absence of scalable 3D representations
capable of capturing intricate geometry distributions. In this work, we
introduce Direct3D, a native 3D generative model scalable to in-the-wild input
images, without requiring a multiview diffusion model or SDS optimization. Our
approach comprises two primary components: a Direct 3D Variational Auto-Encoder
(D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently
encodes high-resolution 3D shapes into a compact and continuous latent triplane
space. Notably, our method directly supervises the decoded geometry using a
semi-continuous surface sampling strategy, diverging from previous methods
relying on rendered images as supervision signals. D3D-DiT models the
distribution of encoded 3D latents and is specifically designed to fuse
positional information from the three feature maps of the triplane latent,
enabling a native 3D generative model scalable to large-scale 3D datasets.
Additionally, we introduce an innovative image-to-3D generation pipeline
incorporating semantic and pixel-level image conditions, allowing the model to
produce 3D shapes consistent with the provided conditional image input.
Extensive experiments demonstrate the superiority of our large-scale
pre-trained Direct3D over previous image-to-3D approaches, achieving
significantly better generation quality and generalization ability, thus
establishing a new state-of-the-art for 3D content creation. Project page:
https://nju-3dv.github.io/projects/Direct3D/. |
This paper introduces Direct3D, a novel image-to-3D generation method leveraging a native 3D diffusion model directly trained on a large-scale 3D dataset, bypassing the need for multi-view diffusion or SDS optimization. |
Existing 3D generation methods struggle to achieve both high fidelity and generalizability due to limitations in 3D representations and reliance on indirect generation from multi-view images. Direct3D addresses these limitations, enabling high-quality 3D asset creation from in-the-wild images. |
Direct3D employs a two-stage approach: 1) D3D-VAE encodes 3D shapes into a compact triplane latent space with direct geometry supervision, and 2) D3D-DiT, a 3D diffusion transformer, generates 3D shapes from this latent space conditioned on input images, incorporating both pixel-level and semantic-level image information. |
Direct3D outperforms existing image-to-3D approaches in terms of generation quality and generalization ability on the GSO dataset.
The method generalizes well to text-to-3D generation by utilizing text-to-image models, producing high-quality meshes consistent with the text prompts.
Ablation studies demonstrate the effectiveness of the explicit triplane latent representation, semi-continuous surface sampling strategy, and the D3D-DiT architecture for achieving superior performance. |
Direct3D is currently limited to generating individual or multiple objects and cannot handle large-scale scene generation.
Future research will explore extending the method to enable large-scale scene generation. |
3d generation, image-to-3d, text-to-3d, diffusion models, variational autoencoder |
2405.14828
Report |
Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models |
Katherine Xu, Lingzhi Zhang, Jianbo Shi |
Recent advances in text-to-image (T2I) diffusion models have facilitated
creative and photorealistic image synthesis. By varying the random seeds, we
can generate various images for a fixed text prompt. Technically, the seed
controls the initial noise and, in multi-step diffusion inference, the noise
used for reparameterization at intermediate timesteps in the reverse diffusion
process. However, the specific impact of the random seed on the generated
images remains relatively unexplored. In this work, we conduct a large-scale
scientific study into the impact of random seeds during diffusion inference.
Remarkably, we reveal that the best 'golden' seed achieved an impressive FID of
21.60, compared to the worst 'inferior' seed's FID of 31.97. Additionally, a
classifier can predict the seed number used to generate an image with over
99.9% accuracy in just a few epochs, establishing that seeds are highly
distinguishable based on generated images. Encouraged by these findings, we
examined the influence of seeds on interpretable visual dimensions. We find
that certain seeds consistently produce grayscale images, prominent sky
regions, or image borders. Seeds also affect image composition, including
object location, size, and depth. Moreover, by leveraging these 'golden' seeds,
we demonstrate improved image generation such as high-fidelity inference and
diversified sampling. Our investigation extends to inpainting tasks, where we
uncover some seeds that tend to insert unwanted text artifacts. Overall, our
extensive analyses highlight the importance of selecting good seeds and offer
practical utility for image generation. |
This paper presents the first large-scale analysis of random seeds in text-to-image diffusion models, revealing their significant impact on image quality, style, and composition. |
Understanding the role of seeds enables the development of simple yet effective techniques to enhance image generation during inference without requiring model retraining or fine-tuning. |
The authors generated over 46 million images using two diffusion models and various text prompts, then analyzed the impact of seeds on image features, style representations, and object compositions. |
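The role of the seed can be shown with a minimal sketch: the seed fully determines the initial Gaussian noise, so the same seed always yields the same starting point regardless of the prompt. The example uses Python's standard `random` module rather than any diffusion library, and `initial_noise` is a hypothetical helper.

```python
import random


def initial_noise(seed, size):
    """Draw `size` standard-normal samples from a generator fixed by `seed`."""
    rng = random.Random(seed)  # independent generator, so global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


if __name__ == "__main__":
    a = initial_noise(seed=7, size=4)
    b = initial_noise(seed=7, size=4)
    c = initial_noise(seed=8, size=4)
    print(a == b)  # identical seeds reproduce the same noise
    print(a == c)  # a different seed gives different noise
```

Because the starting noise is shared across prompts for a fixed seed, any systematic tendency it induces, such as a grayscale look or a border, can recur across many generated images, which is what makes seeds distinguishable by a classifier.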
Seeds are highly discriminative, with a classifier achieving over 99.9% accuracy in predicting the seed from generated images.
Specific seeds consistently produce stylistic patterns (e.g., grayscale, sky regions) and compositional elements (e.g., object location, size).
Leveraging ‘golden’ seeds significantly improves image quality and human preference scores compared to random sampling. |
The study primarily focuses on 1,024 seeds due to budget constraints, potentially limiting the generalizability of findings.
The research relies on pretrained models trained on large-scale web data, which may contain biases and errors that could influence the results. |
text-to-image synthesis, diffusion models, random seeds, image quality, image composition |
2405.14793
Report |
SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow |
Yihan Wang, Lahav Lipson, Jia Deng |
We introduce SEA-RAFT, a simpler, more efficient, and more accurate RAFT for
optical flow. Compared with RAFT, SEA-RAFT is trained with a new loss (mixture
of Laplace). It directly regresses an initial flow for faster convergence in
iterative refinements and introduces rigid-motion pre-training to improve
generalization. SEA-RAFT achieves state-of-the-art accuracy on the Spring
benchmark with a 3.69 endpoint-error (EPE) and a 0.36 1-pixel outlier rate
(1px), representing 22.9% and 17.8% error reduction from best published
results. In addition, SEA-RAFT obtains the best cross-dataset generalization on
KITTI and Spring. With its high efficiency, SEA-RAFT operates at least 2.3x
faster than existing methods while maintaining competitive performance. The
code is publicly available at https://github.com/princeton-vl/SEA-RAFT. |
This paper introduces SEA-RAFT, a simpler, more efficient, and more accurate variant of RAFT for optical flow estimation. |
Achieves state-of-the-art accuracy and speed, making it useful for real-world high-resolution optical flow. |
Introduces a mixture of Laplace loss to handle ambiguous cases, directly regresses initial flow for faster convergence, employs rigid-flow pre-training on TartanAir for better generalization, and simplifies RAFT's architecture. |
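A mixture of Laplace loss can be sketched as the negative log-likelihood of a two-component Laplace mixture centered on the predicted flow. The scalar, per-pixel form below is an illustration under assumed parameter names (`alpha`, `b1`, `b2`); the actual SEA-RAFT parameterization and its numerically stable implementation may differ.

```python
import math


def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def mixture_of_laplace_nll(target, pred, alpha, b1, b2):
    """Negative log-likelihood of `target` under a two-component Laplace mixture.

    alpha: weight of the first (narrow) component, in [0, 1]
    b1:    scale of the first component, for ordinary pixels
    b2:    scale of the second component, for ambiguous pixels
    """
    if not 0.0 <= alpha <= 1.0 or b1 <= 0.0 or b2 <= 0.0:
        raise ValueError("alpha must be in [0, 1] and scales must be positive")
    density = alpha * laplace_pdf(target, pred, b1) \
        + (1.0 - alpha) * laplace_pdf(target, pred, b2)
    return -math.log(density)


if __name__ == "__main__":
    # With alpha = 1 and b1 = 1 the loss is the L1 error plus log(2).
    print(mixture_of_laplace_nll(2.0, 0.5, 1.0, 1.0, 5.0))
    # A large error is penalized less when a wide component carries weight.
    print(mixture_of_laplace_nll(20.0, 0.0, 0.5, 1.0, 10.0))
```

The wide component lets the model express uncertainty on ambiguous pixels such as occlusions, so a few very large errors do not dominate training the way they would under a plain L1 loss. A production version would compute this in log space with log-sum-exp to avoid underflow.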
Achieves state-of-the-art accuracy on Spring benchmark, outperforming the next best method by a large margin (18% error reduction on 1px outlier rate and 24% on endpoint error).
Obtains the best cross-dataset generalization on KITTI and Spring.
Operates at least 2.3x faster than existing methods while maintaining competitive performance. |
Zero-shot performance on Sintel's final pass is not as competitive.
Future work should explore the reasons behind the performance gap on Sintel's final pass and investigate further improvements. |
optical flow, raft, deep learning, computer vision, mixture of laplace loss |
2405.14785
Report |
EditWorld: Simulating World Dynamics for Instruction-Following Image Editing |
Ling Yang, Bohan Zeng, Jiaming Liu, Hong Li, Minghao Xu, Wentao Zhang, Shuicheng Yan |
Diffusion models have significantly improved the performance of image
editing. Existing methods take various approaches to achieve high-quality
image editing, including but not limited to text control, dragging operation,
and mask-and-inpainting. Among these, instruction-based editing stands out for
its convenience and effectiveness in following human instructions across
diverse scenarios. However, it still focuses on simple editing operations like
adding, replacing, or deleting, and falls short of understanding aspects of
world dynamics that convey the realistic dynamic nature in the physical world.
Therefore, this work, EditWorld, introduces a new editing task, namely
world-instructed image editing, which defines and categorizes the instructions
grounded by various world scenarios. We curate a new image editing dataset with
world instructions using a set of large pretrained models (e.g., GPT-3.5,
Video-LLava and SDXL). To enable sufficient simulation of world dynamics for
image editing, EditWorld trains a model on the curated dataset and improves
instruction-following ability with a designed post-edit strategy. Extensive
experiments demonstrate that our method significantly outperforms existing editing
methods in this new task. Our dataset and code will be available at
https://github.com/YangLing0818/EditWorld |
This paper proposes "world-instructed image editing," a new image editing task focusing on real-world and virtual-world dynamics beyond simple object manipulation. |
Existing instruction-based image editing methods struggle to simulate realistic physical dynamics, limiting their ability to handle complex editing scenarios grounded in real-world logic. |
The authors curate a new dataset with world instructions, utilizing GPT-3.5, SDXL, ControlNet, Video-LLava, and human re-checking. They finetune an InstructPix2Pix model and propose a "post-edit" strategy to refine results and preserve non-edited areas. |
The proposed method significantly outperforms existing methods in CLIP score and a newly introduced "MLLM score" across various instruction categories.
Qualitative analysis demonstrates the method's ability to handle complex edits grounded in world dynamics, surpassing baselines in visual quality and instruction following.
Ablation study confirms the effectiveness of "post-edit" in preserving non-edited areas while maintaining editing quality. |
The current dataset, while diverse, is limited in size and lacks precise editing examples for complex scenarios.
Accurately evaluating subtle differences in world-instructed edits remains challenging, requiring further research in multimodal difference recognition. |
image editing, diffusion models, world dynamics, instruction following, multimodal learning |
2405.14739
Report |
FLoRA: Low-Rank Core Space for N-dimension |
Chongjie Si, Xuehui Wang, Xue Yang, Zhengqin Xu, Qingyun Li, Jifeng Dai, Yu Qiao, Xiaokang Yang, Wei Shen |
Adapting pre-trained foundation models for various downstream tasks has been
prevalent in artificial intelligence. Due to the vast number of tasks and high
costs, adjusting all parameters becomes unfeasible. To mitigate this, several
fine-tuning techniques have been developed to update the pre-trained model
weights in a more resource-efficient manner, such as through low-rank
adjustments. Yet, almost all of these methods focus on linear weights,
neglecting the intricacies of parameter spaces in higher dimensions like 4D.
Alternatively, some methods can be adapted for high-dimensional parameter space
by compressing changes in the original space into two dimensions and then
employing low-rank matrix decomposition. However, these approaches destroy
the structural integrity of the involved high-dimensional spaces. To tackle the
diversity of dimensional spaces across different foundation models and provide
a more precise representation of the changes within these spaces, this paper
introduces a generalized parameter-efficient fine-tuning framework, FLoRA,
designed for parameter spaces of various dimensions. Specifically, utilizing
Tucker decomposition, FLoRA asserts that changes in each dimensional parameter
space are based on a low-rank core space that maintains a topological
structure consistent with the original space. It then models the changes
through this core space alongside corresponding weights to reconstruct
alterations in the original space. FLoRA effectively preserves the structural
integrity of the change in the original N-dimensional parameter space while
decomposing it via low-rank tensor decomposition. Extensive experiments on
computer vision, natural language processing and multi-modal tasks validate
FLoRA's effectiveness. Code is available at
https://github.com/SJTU-DeepVisionLab/FLoRA. |
This paper proposes FLoRA, a novel parameter-efficient fine-tuning framework that utilizes Tucker decomposition to adapt pre-trained foundation models for diverse downstream tasks while preserving the structural integrity of parameter spaces in various dimensions. |
Adapting large pre-trained models to various downstream tasks is computationally expensive. Existing fine-tuning techniques often focus on linear weights and neglect the structural intricacies of higher-dimensional parameter spaces, leading to suboptimal performance. |
FLoRA leverages Tucker decomposition to represent changes in parameter spaces using a low-rank core tensor and corresponding weights, effectively preserving the topological structure of the original parameter space. This approach is applied to both linear and convolutional layers, demonstrating its versatility across different model architectures. |
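The core-plus-weights idea can be sketched for the simplest case, a 2D linear layer, where a Tucker-style update reduces to delta_W = A G B^T with a small core G. The sketch also counts parameters for a general N-dimensional update. The shapes, ranks, and function names are illustrative assumptions, not values from the paper.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def tucker_delta_2d(a, core, b):
    """Reconstruct delta_W = A @ G @ B^T.

    a:    (d_out x r1) factor, core: (r1 x r2) core, b: (d_in x r2) factor.
    """
    return matmul(matmul(a, core), transpose(b))


def tucker_param_count(dims, ranks):
    """Trainable parameters: the core tensor plus one factor matrix per mode."""
    core = 1
    for r in ranks:
        core *= r
    return core + sum(d * r for d, r in zip(dims, ranks))


if __name__ == "__main__":
    a = [[1, 0], [0, 1], [1, 1]]      # 3 x 2
    core = [[2, 0], [0, 3]]           # 2 x 2
    b = [[1, 0], [0, 1]]              # 2 x 2
    print(tucker_delta_2d(a, core, b))
    # Hypothetical 4D convolution weight (64, 64, 3, 3) with ranks (8, 8, 3, 3).
    print(tucker_param_count((64, 64, 3, 3), (8, 8, 3, 3)), 64 * 64 * 3 * 3)
```

For the 4D case each mode keeps its own factor matrix, so the kernel's spatial axes are never flattened into a matrix, which is the structural property the method aims to preserve.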
FLoRA consistently outperforms state-of-the-art parameter-efficient fine-tuning methods, including LoRA and DoRA, across computer vision, natural language processing, and multi-modal tasks.
Empirical analysis reveals that FLoRA's low-rank representation captures task-specific information more effectively than competing methods, leading to improved performance.
FLoRA demonstrates comparable training efficiency to existing methods while achieving superior results, highlighting its practicality for real-world applications. |
The scaling factor in FLoRA currently needs to be tuned for different model backbones.
Future work could explore a unified scaling strategy across diverse architectures. |
parameter-efficient fine-tuning, foundation models, tucker decomposition, low-rank tensor decomposition, structural integrity preservation |
2405.14705
Report |
Learning Multi-dimensional Human Preference for Text-to-Image Generation |
Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang |
Current metrics for text-to-image models typically rely on statistical
metrics which inadequately represent the real preference of humans. Although
recent works attempt to learn these preferences via human-annotated images,
they reduce the rich tapestry of human preference to a single overall score.
However, the preference results vary when humans evaluate images with different
aspects. Therefore, to learn the multi-dimensional human preferences, we
propose the Multi-dimensional Preference Score (MPS), the first
multi-dimensional preference scoring model for the evaluation of text-to-image
models. The MPS introduces the preference condition module upon CLIP model to
learn these diverse preferences. It is trained based on our Multi-dimensional
Human Preference (MHP) Dataset, which comprises 918,315 human preference
choices across four dimensions (i.e., aesthetics, semantic alignment, detail
quality and overall assessment) on 607,541 images. The images are generated by
a wide range of latest text-to-image models. The MPS outperforms existing
scoring methods across 3 datasets in 4 dimensions, making it a promising
metric for evaluating and improving text-to-image generation. |
This paper introduces the Multi-dimensional Preference Score (MPS), a novel model for evaluating text-to-image models by considering multi-dimensional human preferences. |
Existing evaluation metrics for text-to-image models often rely on statistical measures that do not fully capture the diverse preferences of humans. |
The authors create the Multi-dimensional Human Preference (MHP) dataset, containing images annotated with preferences across aesthetics, detail quality, semantic alignment, and overall assessment. They then develop MPS, which leverages a condition mask to focus on prompt elements relevant to specific preference dimensions when predicting scores. |
MPS outperforms existing scoring methods on three datasets in predicting overall preference.
MPS demonstrates superior performance in evaluating multi-dimensional preferences compared to methods primarily focused on overall scores.
Visualization reveals that MPS attends to different image and prompt regions based on the given preference condition, highlighting its ability to capture diverse preferences. |
The current preference condition setting in MPS relies on predefined word sets, which might not encompass the full spectrum of human preferences.
Future work can explore personalized preference learning, enabling MPS to adapt to individual user preferences. |
text-to-image generation, evaluation metrics, human preferences, multi-dimensional preference score, vision-language models |
2405.14701
Report |
High Fidelity Scene Text Synthesis |
Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin |
Scene text synthesis involves rendering specified texts onto arbitrary
images. Current methods typically formulate this task in an end-to-end manner
but lack effective character-level guidance during training. Besides, their
text encoders, pre-trained on a single font type, struggle to adapt to the
diverse font styles encountered in practical applications. Consequently, these
methods suffer from character distortion, repetition, and absence, particularly
in polystylistic scenarios. To this end, this paper proposes DreamText for
high-fidelity scene text synthesis. Our key idea is to reconstruct the
diffusion training process, introducing more refined guidance tailored to this
task, to expose and rectify the model's attention at the character level and
strengthen its learning of text regions. This transformation poses a hybrid
optimization challenge, involving both discrete and continuous variables. To
effectively tackle this challenge, we employ a heuristic alternate optimization
strategy. Meanwhile, we jointly train the text encoder and generator to
comprehensively learn and utilize the diverse fonts present in the training
dataset. This joint training is seamlessly integrated into the alternate
optimization process, fostering a synergistic relationship between learning
character embedding and re-estimating character attention. Specifically, in
each step, we first encode potential character-generated position information
from cross-attention maps into latent character masks. These masks are then
utilized to update the representation of specific characters in the current
step, which, in turn, enables the generator to correct the character's
attention in the subsequent steps. Both qualitative and quantitative results
demonstrate the superiority of our method over the state of the art. |
This paper proposes DreamText, a novel diffusion-based model for high-fidelity scene text synthesis that addresses the limitations of existing methods in accurately rendering text within complex scenes. |
Existing methods struggle with character distortion, repetition, and absence due to insufficient character-level guidance during training and a limited ability to adapt to diverse font styles. |
DreamText reconstructs the diffusion training process by introducing refined guidance through latent character masks. It employs a heuristic alternate optimization strategy to address the hybrid optimization problem and jointly trains the text encoder and generator to learn diverse font styles. |
DreamText effectively alleviates character repetition, absence, and distortion issues.
The heuristic alternate optimization strategy fosters a synergistic relationship between learning character representation and re-estimating character attention.
A balanced supervision strategy strikes a balance between constraining the model and allowing flexibility in estimating optimal character positions. |
DreamText currently lacks the capability to modify multiple regions within an image simultaneously.
The generation of realistic text raises privacy concerns, demanding robust safeguards and ethical guidelines. |
scene text synthesis, diffusion models, character attention, font diversity, heuristic optimization |
2405.14677
Report |
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance |
Zhicheng Sun, Zhenhao Yang, Yang Jin, Haozhe Chi, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Di Zhang, Yang Song, Kun Gai, Yadong Mu |
Customizing diffusion models to generate identity-preserving images from
user-provided reference images is an intriguing new problem. The prevalent
approaches typically require training on extensive domain-specific images to
achieve identity preservation, which lacks flexibility across different use
cases. To address this issue, we exploit classifier guidance, a training-free
technique that steers diffusion models using an existing classifier, for
personalized image generation. Our study shows that based on a recent rectified
flow framework, the major limitation of vanilla classifier guidance in
requiring a special classifier can be resolved with a simple fixed-point
solution, allowing flexible personalization with off-the-shelf image
discriminators. Moreover, its solving procedure proves to be stable when
anchored to a reference flow trajectory, with a convergence guarantee. The
derived method is implemented on rectified flow with different off-the-shelf
image discriminators, delivering advantageous personalization results for human
faces, live subjects, and certain objects. Code is available at
https://github.com/feifeiobama/RectifID. |
The paper proposes a training-free personalized image generation method called anchored classifier guidance, which customizes rectified flow using off-the-shelf image discriminators. |
The method addresses limitations of existing personalized image generation techniques that require extensive domain-specific training data or fine-tuning, enabling greater flexibility and identity consistency. |
The method approximates rectified flow as ideally straight, reformulating classifier guidance as a fixed-point problem solved iteratively. It anchors the flow trajectory to a reference trajectory for improved stability and convergence. |
The training-free method achieves state-of-the-art performance in face-centric personalization benchmarks, surpassing training-based methods in identity preservation.
The approach demonstrates flexibility by effectively personalizing images with various subjects beyond human faces, including animals and regularly shaped objects.
The method successfully extends to multi-subject personalization, composing multiple subjects into an image while maintaining identity and visual quality. |
Theoretical guarantees are limited to ideal rectified flow and may not generalize to complex flow-based models.
The method's effectiveness is currently limited for objects with large structural variations, and its inference time is not yet as fast as some training-based methods. |
personalized image generation, rectified flow, classifier guidance, diffusion models, training-free |
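The fixed-point view of classifier guidance described in this record can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the flow is treated as perfectly straight (endpoint `z0 + v`), the "classifier" is a hypothetical quadratic log-probability with a closed-form gradient, and the names `guided_endpoint`, `target`, and `s` are invented. It is not the authors' implementation, which uses off-the-shelf image discriminators and a reference trajectory.

```python
def guided_endpoint(z0, v, target, s=0.5, iters=50):
    """Iterate z1 = (z0 + v) + s * grad log p(c | z1) until it settles.

    Toy assumption: log p(c | z) = -0.5 * ||z - target||^2, so the
    gradient with respect to z is simply (target - z).
    """
    # Endpoint of an ideally straight flow, used as a fixed anchor.
    anchor = [a + b for a, b in zip(z0, v)]
    z1 = list(anchor)
    for _ in range(iters):
        grad = [t - z for t, z in zip(target, z1)]
        # Each update restarts from the anchor, so the iteration is a
        # contraction whenever 0 <= s < 1 for this toy classifier.
        z1 = [a + s * g for a, g in zip(anchor, grad)]
    return z1


z_star = guided_endpoint(z0=[0.0, 0.0], v=[1.0, 2.0], target=[4.0, 2.0])
```

For this quadratic toy case the fixed point has the closed form `(anchor + s * target) / (1 + s)`, which gives a quick way to check that the iteration has converged.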
2405.14633
Report |
Flatten Anything: Unsupervised Neural Surface Parameterization |
Qijian Zhang, Junhui Hou, Wenping Wang, Ying He |
Surface parameterization plays an essential role in numerous computer
graphics and geometry processing applications. Traditional parameterization
approaches are designed for high-quality meshes laboriously created by
specialized 3D modelers and are thus unable to meet the processing demand for the
current explosion of ordinary 3D data. Moreover, their working mechanisms are
typically restricted to certain simple topologies, thus relying on cumbersome
manual efforts (e.g., surface cutting, part segmentation) for pre-processing.
In this paper, we introduce the Flatten Anything Model (FAM), an unsupervised
neural architecture to achieve global free-boundary surface parameterization
via learning point-wise mappings between 3D points on the target geometric
surface and adaptively-deformed UV coordinates within the 2D parameter domain.
To mimic the actual physical procedures, we ingeniously construct
geometrically-interpretable sub-networks with specific functionalities of
surface cutting, UV deforming, unwrapping, and wrapping, which are assembled
into a bi-directional cycle mapping framework. Compared with previous methods,
our FAM directly operates on discrete surface points without utilizing
connectivity information, thus significantly reducing the strict requirements
for mesh quality and even applicable to unstructured point cloud data. More
importantly, our FAM is fully-automated without the need for pre-cutting and
can deal with highly-complex topologies, since its learning process adaptively
finds reasonable cutting seams and UV boundaries. Extensive experiments
demonstrate the universality, superiority, and inspiring potential of our
proposed neural surface parameterization paradigm. The code will be publicly
available. |
Introduces FAM, an unsupervised neural architecture for global free-boundary surface parameterization, learning point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. |
Addresses limitations of traditional parameterization methods that require high-quality meshes, manual pre-processing, and struggle with complex topologies. |
Utilizes a bi-directional cycle mapping framework with sub-networks mimicking surface cutting, UV deforming, unwrapping, and wrapping, trained by minimizing various loss functions and enforcing differential geometric constraints. |
Outperforms SLIM qualitatively and quantitatively in UV unwrapping and texture mapping on open surface models.
Demonstrates universality and robustness in parameterizing surfaces with varying geometric and topological complexities.
Shows the effectiveness of the bi-directional cycle mapping framework through ablation studies. |
Current per-model overfitting limits generalization ability from existing UV unwrapping data.
Future work includes incorporating advanced properties like shape symmetry, cutting seam visibility, and seamless parameterization. |
surface parameterization, uv unwrapping, deep learning, cycle mapping, unsupervised learning |
2405.14582
Report |
PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Poses |
Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, Chongxuan Li |
In this paper, we introduce PoseCrafter, a one-shot method for personalized
video generation following the control of flexible poses. Built upon Stable
Diffusion and ControlNet, we carefully design an inference process to produce
high-quality videos without the corresponding ground-truth frames. First, we
select an appropriate reference frame from the training video and invert it to
initialize all latent variables for generation. Then, we insert the
corresponding training pose into the target pose sequences to enhance
faithfulness through a trained temporal attention module. Furthermore, to
alleviate the face and hand degradation resulting from discrepancies between
poses of training videos and inference poses, we implement simple latent
editing through an affine transformation matrix involving facial and hand
landmarks. Extensive experiments on several datasets demonstrate that
PoseCrafter achieves superior results to baselines pre-trained on a vast
collection of videos under 8 commonly used metrics. Besides, PoseCrafter can
follow poses from different individuals or artificial edits and simultaneously
retain the human identity in an open-domain training video. |
PoseCrafter is a one-shot method for generating personalized videos that follow flexible pose control, requiring only fine-tuning on a single video. |
Existing methods struggle with data requirements, computational costs, and reliance on real video frames corresponding to target poses. This work offers a more efficient and flexible solution for personalized video generation. |
The method uses Stable Diffusion and ControlNet, enhanced by a novel inference process involving reference-frame selection and insertion for faithfulness, and latent editing for refining face and hand details. |
PoseCrafter outperforms baselines pre-trained on large video datasets, including MagicAnimate and Disco, on 8 common metrics.
It can effectively follow flexible pose control, including poses from the same or different individuals, and artificially designed poses.
The method exhibits strong performance in preserving identity and details from the training video, even with limited training data. |
Video quality is limited by the capabilities of the underlying ControlNet and diffusion models, particularly with complex poses.
Large differences between training and inference poses can lead to degradation in generated video quality. Further research on constructing pseudo reference videos is needed. |
personalized video generation, pose guidance, one-shot learning, diffusion models, latent editing |
2405.14580
Report |
LDM: Large Tensorial SDF Model for Textured Mesh Generation |
Rengan Xie, Wenting Zheng, Kai Huang, Yizheng Chen, Qi Wang, Qi Ye, Wei Chen, Yuchi Huo |
Previous efforts have managed to generate production-ready 3D assets from
text or images. However, these methods primarily employ NeRF or 3D Gaussian
representations, which are not adept at producing smooth, high-quality
geometries required by modern rendering pipelines. In this paper, we propose
LDM, a novel feed-forward framework capable of generating high-fidelity,
illumination-decoupled textured mesh from a single image or text prompts. We
first utilize a multi-view diffusion model to generate sparse multi-view
inputs from single images or text prompts, and then a transformer-based model
is trained to predict a tensorial SDF field from these sparse multi-view image
inputs. Finally, we employ a gradient-based mesh optimization layer to refine
this model, enabling it to produce an SDF field from which high-quality
textured meshes can be extracted. Extensive experiments demonstrate that our
method can generate diverse, high-quality 3D mesh assets with corresponding
decomposed RGB textures within seconds. |
LDM, a novel feed-forward framework that generates high-fidelity, illumination-decoupled textured mesh from a single image or text prompts within seconds. |
Existing methods for generating 3D assets from text or images often produce low-quality geometries or lack illumination-decoupled textures, limiting their use in applications requiring high-quality assets. |
A multi-view diffusion model generates sparse multi-view images, a transformer-based model predicts a tensorial SDF field, and a gradient-based mesh optimization layer refines the SDF field for high-quality mesh extraction. |
LDM generates high-quality 3D mesh assets with decomposed RGB textures in seconds.
Tensorial SDF representation enhances object surface quality and convergence speed.
Two-stage training strategy (volume rendering followed by gradient-based mesh optimization) improves geometric details and texture clarity. |
Limited resolution of tensorial SDF tokens constrains final 3D asset resolution.
Illumination decoupling module not designed for complex materials like translucent surfaces. |
3d generation, text-to-3d, image-to-3d, tensorial sdf, illumination decoupling |
2405.14554
Report |
UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge |
Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang |
Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of
up-to-date knowledge because they cannot be updated frequently due to the large
amount of resources required, and they therefore fail in many cases. For
example, an LVLM released in January 2024 would not know the detailed plot of
the new movie Dune 2, which wasn't released until February 2024. To solve the
problem, a promising solution is to provide LVLMs with
up-to-date knowledge via internet search during inference, i.e.,
internet-augmented generation (IAG), which is already integrated in some
closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics
underpinning them remain a mystery. In this paper, we propose a plug-and-play
framework, dubbed UDKAG, for augmenting existing LVLMs in handling visual
question answering (VQA) about up-to-date knowledge. A hierarchical filtering model
is trained to effectively and efficiently find the most helpful content from
the websites returned by a search engine to prompt LVLMs with up-to-date
knowledge. To train the model and evaluate our framework's performance, we
propose a pipeline to automatically generate news-related VQA samples to
construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is
introduced to label the usefulness of website/content for VQA samples to
construct the training set. Experimental results demonstrate the effectiveness
of our framework, outperforming GPT-4V by about 25% in accuracy. |
This paper introduces UDKAG, an open-source framework to augment Large Vision-Language Models (LVLMs) with up-to-date knowledge for Visual Question Answering (VQA) tasks. |
Existing LVLMs are often outdated due to infrequent updates, limiting their ability to answer questions about recent events or information. UDKAG aims to address this limitation by integrating internet search into the VQA process. |
UDKAG employs a hierarchical filtering model: 1) A website filter scores and filters websites based on titles and snippets, 2) A content filter selects helpful content segments from the filtered websites. This content is then used to prompt LVLMs for more accurate answers. |
UDKAG significantly improves the accuracy of various LVLMs on the UDK-VQA dataset, specifically designed for evaluating VQA performance on up-to-date knowledge.
The hierarchical filtering model effectively identifies and extracts relevant information from websites, outperforming simpler IAG methods.
Diversity selection within the framework ensures that LVLMs receive varied content, preventing bias and improving answer accuracy. |
The hierarchical filtering model is trained separately from the LVLMs, potentially limiting performance. Future work could explore end-to-end training for better integration.
The current implementation focuses on VQA tasks. Expanding the framework to other vision-language tasks would further enhance its applicability. |
vision-language models, visual question answering, internet-augmented generation, up-to-date knowledge, hierarchical filtering |
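The two-stage filtering described in this record (websites first, then content segments) can be sketched as follows. This is a minimal illustration under stated assumptions: the trained filtering model is replaced by a hypothetical word-overlap score, and the names `overlap_score`, `hierarchical_filter`, and the dictionary keys `title`, `snippet`, and `segments` are invented for the example.

```python
def overlap_score(query, text):
    """Hypothetical stand-in for a learned relevance model:
    fraction of query words that also occur in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hierarchical_filter(query, websites, top_sites=2, top_segments=2,
                        score=overlap_score):
    # Stage 1: rank websites using only their title and snippet.
    ranked = sorted(
        websites,
        key=lambda w: score(query, w["title"] + " " + w["snippet"]),
        reverse=True,
    )[:top_sites]
    # Stage 2: rank the content segments of the surviving websites.
    segments = [seg for w in ranked for seg in w["segments"]]
    return sorted(segments, key=lambda s: score(query, s),
                  reverse=True)[:top_segments]


websites = [
    {"title": "City festival guide", "snippet": "opening date and schedule",
     "segments": ["The festival opening date is announced by the organizers",
                  "Parking is limited nearby"]},
    {"title": "Pasta recipes", "snippet": "quick dinners",
     "segments": ["Boil the water first"]},
    {"title": "Festival tickets", "snippet": "prices and seating",
     "segments": ["Festival tickets go on sale soon",
                  "Subscribe to our newsletter"]},
]
helpful = hierarchical_filter("festival opening date", websites)
```

Scoring titles and snippets first keeps the expensive step (reading full page content) limited to a few candidate websites, which is the efficiency argument the record makes for a hierarchical design.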
2405.14480
Report |
Scalable Visual State Space Model with Fractal Scanning |
Lv Tang, HaoKe Xiao, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, Bo Li |
Foundational models have significantly advanced in natural language
processing (NLP) and computer vision (CV), with the Transformer architecture
becoming a standard backbone. However, the Transformer's quadratic complexity
poses challenges for handling longer sequences and higher resolution images. To
address this challenge, State Space Models (SSMs) like Mamba have emerged as
efficient alternatives, initially matching Transformer performance in NLP tasks
and later surpassing Vision Transformers (ViTs) in various CV tasks. To improve
the performance of SSMs, one crucial aspect is effective serialization of image
patches. Existing methods, relying on linear scanning curves, often fail to
capture complex spatial relationships and produce repetitive patterns, leading
to biases. To address these limitations, we propose using fractal scanning
curves for patch serialization. Fractal curves maintain high spatial proximity
and adapt to different image resolutions, avoiding redundancy and enhancing
SSMs' ability to model complex patterns accurately. We validate our method in
image classification, detection, and segmentation tasks, and the superior
performance validates its effectiveness. |
This paper introduces a novel approach to enhance State Space Models (SSMs) for image processing by employing fractal scanning curves for image patch serialization, which surpasses the limitations of traditional linear scanning methods. |
Effective serialization of image patches is crucial for SSMs in computer vision, as it directly impacts their ability to capture and model intricate spatial relationships within images. Existing linear scanning methods often fail to adequately preserve these relationships. |
The study leverages the Hilbert curve, a type of fractal curve, for its inherent ability to maintain spatial and structural consistency across varying scales. A novel shifting operation is further implemented to refine the fractal curve, enhancing local adjacency and continuity during pixel serialization. |
FractalMamba, the proposed model, outperforms several benchmark models, including those based on CNNs and ViTs, in image classification, object detection, and semantic segmentation tasks.
FractalMamba exhibits superior scalability and efficiency when processing images of increasing resolutions, maintaining consistent performance with a near-linear increase in computational complexity.
The implementation of a shifting operation on the fractal curves further improves the model's performance by mitigating the loss of local proximity information. |
The generalizability of fractal scanning mechanisms across diverse visual data and tasks requires further investigation, as their performance may vary depending on dataset characteristics.
Future research can explore additional fractal scanning methods and their combinations to potentially uncover even more effective serialization strategies for enhancing SSM performance. |
state space models, fractal scanning curves, image serialization, computer vision, hilbert curve |
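To make the contrast between linear and fractal scanning concrete, the sketch below serializes an n x n patch grid along a Hilbert curve using the standard index-to-coordinate conversion and compares it with raster order. The function names are illustrative, `n` is assumed to be a power of two, and the paper's shifting operation is not implemented here.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid
    (n must be a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Patch coordinates in Hilbert-curve order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def raster_order(n):
    """Patch coordinates in row-by-row (linear) scanning order."""
    return [(r, c) for r in range(n) for c in range(n)]


def max_jump(order):
    """Largest Manhattan distance between consecutive patches."""
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for a, b in zip(order, order[1:]))
```

On an 8 x 8 grid, consecutive patches in Hilbert order are always grid neighbours, while raster order jumps across the whole row at each line end. This is the spatial-proximity property the record attributes to fractal curves.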
2405.14475
Report |
MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes |
Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu |
While controllable generative models for images and videos have achieved
remarkable success, high-quality models for 3D scenes, particularly in
unbounded scenarios like autonomous driving, remain underdeveloped due to high
data acquisition costs. In this paper, we introduce MagicDrive3D, a novel
pipeline for controllable 3D street scene generation that supports
multi-condition control, including BEV maps, 3D objects, and text descriptions.
Unlike previous methods that reconstruct before training the generative models,
MagicDrive3D first trains a video generation model and then reconstructs from
the generated data. This innovative approach enables easily controllable
generation and static scene acquisition, resulting in high-quality scene
reconstruction. To address the minor errors in generated content, we propose
deformable Gaussian splatting with monocular depth initialization and
appearance modeling to manage exposure discrepancies across viewpoints.
Validated on the nuScenes dataset, MagicDrive3D generates diverse, high-quality
3D driving scenes that support any-view rendering and enhance downstream tasks
like BEV segmentation. Our results demonstrate the framework's superior
performance, showcasing its transformative potential for autonomous driving
simulation and beyond. |
MagicDrive3D is a novel pipeline for controllable 3D street scene generation that supports multi-condition control, including BEV maps, 3D objects, and text descriptions. |
High-quality controllable generative models for 3D scenes, particularly in unbounded scenarios like autonomous driving, remain underdeveloped due to high data acquisition costs. |
MagicDrive3D first trains a video generation model with enhanced inter-frame consistency using relative pose embedding. Then, it reconstructs the 3D scene using an enhanced deformable Gaussian Splatting technique, accounting for local dynamics and exposure discrepancies in the generated views. |
MagicDrive3D generates realistic 3D street scenes with multi-dimensional controllability, as demonstrated by qualitative results and FID scores.
The framework enhances the quality of video and scene generation, surpassing baseline methods like MagicDrive and 3DGS in metrics like FVD and FID.
Generated scenes from MagicDrive3D can be used to improve the viewpoint robustness of perception tasks like BEV segmentation. |
The model may struggle to generate complex objects or scenes with high texture detail due to limitations in the reconstruction method.
Future work could focus on addressing these limitations and improving the quality and robustness of generated scenes. |
3d scene generation, autonomous driving, controllable generation, gaussian splatting, view synthesis |
2405.14458
Report |
YOLOv10: Real-Time End-to-End Object Detection |
Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding |
Over the past years, YOLOs have emerged as the predominant paradigm in the
field of real-time object detection owing to their effective balance between
computational cost and detection performance. Researchers have explored the
architectural designs, optimization objectives, data augmentation strategies,
and others for YOLOs, achieving notable progress. However, the reliance on the
non-maximum suppression (NMS) for post-processing hampers the end-to-end
deployment of YOLOs and adversely impacts the inference latency. Besides, the
design of various components in YOLOs lacks comprehensive and thorough
inspection, resulting in noticeable computational redundancy and limiting the
model's capability. This leads to suboptimal efficiency and leaves
considerable potential for performance improvements. In this work, we aim to
further advance the performance-efficiency boundary of YOLOs from both the
post-processing and model architecture. To this end, we first present the
consistent dual assignments for NMS-free training of YOLOs, which brings
competitive performance and low inference latency simultaneously. Moreover, we
introduce the holistic efficiency-accuracy driven model design strategy for
YOLOs. We comprehensively optimize various components of YOLOs from both
efficiency and accuracy perspectives, which greatly reduces the computational
overhead and enhances the capability. The outcome of our effort is a new
generation of YOLO series for real-time end-to-end object detection, dubbed
YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art
performance and efficiency across various model scales. For example, our
YOLOv10-S is 1.8x faster than RT-DETR-R18 under similar AP on COCO,
while having 2.8x fewer parameters and FLOPs. Compared
with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for
the same performance. |
This paper presents YOLOv10, a new generation of YOLO series for real-time end-to-end object detection, achieving state-of-the-art performance and efficiency. |
Existing YOLO models suffer from suboptimal efficiency and accuracy due to the reliance on non-maximum suppression (NMS) for post-processing and computational redundancy in model architecture. |
The authors propose consistent dual assignments for NMS-free training and a holistic efficiency-accuracy driven model design strategy. This includes a lightweight classification head, spatial-channel decoupled downsampling, rank-guided block design, large-kernel convolution, and a partial self-attention module. |
YOLOv10 significantly outperforms previous state-of-the-art models in terms of computation-accuracy trade-offs.
YOLOv10-S is 1.8x faster than RT-DETR-R18 under similar AP on COCO.
Compared to YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance. |
The paper does not investigate the pretraining of YOLOv10 on large-scale datasets.
There is still a performance gap between NMS-free training and the original one-to-many training using NMS, especially in small models. |
object detection, yolo, real-time, end-to-end, nms-free |
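For context on the post-processing step that NMS-free training removes, here is a minimal sketch of classic greedy non-maximum suppression. It is a generic textbook version, not code from YOLOv10. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and the 0.5 threshold are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap an already kept box too much, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Because each candidate is compared against every kept box, the cost of this step grows with the number of overlapping predictions. An end-to-end detector that emits one prediction per object can skip it entirely.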
2405.14455
Report |
TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing |
Teng Xu, Jiamin Chen, Peng Chen, Youjia Zhang, Junqing Yu, Wei Yang |
Editing objects within a scene is a critical functionality required across a
broad spectrum of applications in computer vision and graphics. As 3D Gaussian
Splatting (3DGS) emerges as a frontier in scene representation, the effective
modification of 3D Gaussian scenes has become increasingly vital. This process
entails accurately retrieving the target objects and subsequently performing
modifications based on instructions. Though available in pieces, existing
techniques mainly embed sparse semantics into Gaussians for retrieval, and rely
on an iterative dataset update paradigm for editing, leading to over-smoothing
or inconsistency issues. To this end, this paper proposes a systematic
approach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and
editing. In contrast to the top-down language grounding approach for 3D
Gaussians, we adopt a bottom-up language aggregation strategy to generate
denser language-embedded 3D Gaussians that support open-vocabulary retrieval.
To overcome the over-smoothing and inconsistency issues in editing, we propose
a Coherent Score Distillation (CSD) that aggregates a 2D image editing
diffusion model and a multi-view diffusion model for score distillation,
producing multi-view consistent editing with much finer details. In various
experiments, we demonstrate that our TIGER is able to accomplish more
consistent and realistic edits than prior work. |
TIGER, a novel framework for text-instructed retrieval and editing of 3D Gaussian scenes. |
Editing objects in 3D scenes is crucial for various applications, and 3D Gaussian Splatting (3DGS) is gaining prominence as a scene representation method. Existing editing methods for 3D Gaussian scenes have limitations such as over-smoothing and inconsistency. |
TIGER uses a bottom-up language aggregation strategy with MaskCLIP and FeatUp for open-vocabulary 3D Gaussian retrieval. For editing, it proposes Coherent Score Distillation (CSD) that integrates SDS losses from an image editing diffusion model (InstructPix2Pix) and a multi-view diffusion model (MVDream). |
TIGER enables accurate open-vocabulary retrieval of 3D Gaussian objects.
CSD facilitates multi-view consistent editing of 3D Gaussian scenes.
TIGER produces more realistic and detailed edits compared to prior art. |
The language understanding is limited by the “bag-of-words” problem inherited from MaskCLIP.
The editing process depends on pre-trained 2D diffusion models and can take up to 30 minutes for complex edits. |
3d gaussian splatting, 3d scene editing, text-guided editing, score distillation sampling, multi-view consistency |
2405.14452
Report |
JointRF: End-to-End Joint Optimization for Dynamic Neural Radiance Field Representation and Compression |
Zihan Zheng, Houqiang Zhong, Qiang Hu, Xiaoyun Zhang, Li Song, Ya Zhang, Yanfeng Wang |
Neural Radiance Field (NeRF) excels in photo-realistic rendering of static scenes,
inspiring numerous efforts to facilitate volumetric videos. However, rendering
dynamic and long-sequence radiance fields remains challenging due to the
significant data required to represent volumetric videos. In this paper, we
propose a novel end-to-end joint optimization scheme of dynamic NeRF
representation and compression, called JointRF, thus achieving significantly
improved quality and compression efficiency against the previous methods.
Specifically, JointRF employs a compact residual feature grid and a coefficient
feature grid to represent the dynamic NeRF. This representation handles large
motions without compromising quality while concurrently diminishing temporal
redundancy. We also introduce a sequential feature compression subnetwork to
further reduce spatial-temporal redundancy. Finally, the representation and
compression subnetworks are jointly trained end-to-end within JointRF.
Extensive experiments demonstrate that JointRF can achieve superior compression
performance across various datasets. |
Presents JointRF, a novel end-to-end learning scheme for jointly optimizing dynamic NeRF representation and compression, achieving better quality and higher compression efficiency for volumetric videos. |
Rendering dynamic and long-sequence radiance fields is challenging due to significant data requirements. JointRF addresses this by efficiently representing and compressing dynamic NeRF, improving quality and efficiency. |
JointRF uses a compact residual feature grid and a coefficient feature grid to represent dynamic NeRF, minimizing temporal redundancy. It introduces a sequential feature compression subnetwork and employs end-to-end training with simulated quantization and entropy model-based bitrate estimation. |
JointRF achieves superior compression performance across various datasets, as demonstrated by quantitative comparisons.
It outperforms state-of-the-art methods in rate-distortion performance, with significant BDBR reductions compared to ReRF.
Ablation studies confirm the effectiveness of the dynamic residual representation, compression module, and joint optimization strategy. |
The current implementation primarily focuses on optimizing storage size and might benefit from exploring techniques like keyframe selection and adaptive group size to further enhance streaming efficiency.
Investigating the application of JointRF in related domains like immersive video streaming and 3D telepresence presents promising avenues for future work. |
volumetric videos, dynamic nerf, compression, end-to-end joint optimization, rate-distortion performance |
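The JointRF method summary mentions end-to-end training with simulated quantization and entropy-based bitrate estimation. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes additive uniform noise as the quantization stand-in (a common choice in learned compression) and swaps the learned entropy model for a plain empirical entropy; the function names are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_quantization(values, rng):
    """Training-time stand-in for rounding: add uniform noise in [-0.5, 0.5]."""
    return [v + rng.uniform(-0.5, 0.5) for v in values]

def empirical_bits(values):
    """Estimate total bits as N times the entropy of the rounded symbols.

    Assumption: a real codec would use a learned entropy model here.
    """
    symbols = [round(v) for v in values]
    counts = Counter(symbols)
    total = len(symbols)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy * total

# Toy feature values standing in for entries of a feature grid.
features = [0.1, 0.9, 1.1, -0.2, 0.0, 1.0, 0.2, 0.8]
noisy = simulate_quantization(features, random.Random(0))
bits = empirical_bits(features)
```

Noise keeps the operation differentiable in a real training loop, while the bit estimate gives the rate term of a rate-distortion loss.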
2405.14430
Report |
PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models |
Jiannan Wang, Jiarui Fang, Aoyu Li, PengCheng Yang |
This paper introduces PipeFusion, a novel approach that harnesses multi-GPU
parallelism to address the high computational and latency challenges of
generating high-resolution images with diffusion transformer (DiT) models.
PipeFusion splits images into patches and distributes the network layers across
multiple devices. It orchestrates communication and computation in a
pipeline-parallel manner. By leveraging the high similarity between the
input from adjacent diffusion steps, PipeFusion eliminates the waiting time in
the pipeline by reusing the one-step stale feature maps to provide context for
the current step. Our experiments demonstrate that it can generate
higher-resolution images where existing DiT parallel approaches run out of memory (OOM). PipeFusion
significantly reduces the required communication bandwidth, enabling DiT
inference to be hosted on GPUs connected via PCIe rather than the more costly
NVLink infrastructure, which substantially lowers the overall operational
expenses for serving DiT models. Our code is publicly available at
https://github.com/PipeFusion/PipeFusion. |
This paper proposes PipeFusion, a novel pipelined parallel approach for Diffusion Transformers (DiT) inference that reduces communication bandwidth and memory demands by leveraging input similarities across diffusion steps. |
High-resolution image and long video generation with DiT models face high computational and latency challenges, demanding multi-GPU parallelism. Existing approaches are limited by high communication costs and memory requirements, often necessitating costly NVLink interconnects. |
PipeFusion splits images into patches, distributes DiT layers across multiple devices, and orchestrates computation and communication in a pipeline. It reuses one-step stale feature maps to provide context for the current step, eliminating pipeline waiting time. |
PipeFusion significantly reduces communication bandwidth, enabling DiT inference on PCIe-connected GPUs, thus reducing operational costs.
Experiments show PipeFusion achieves similar or better latency compared to other parallel approaches, especially on higher resolutions.
PipeFusion maintains high image quality, comparable to the original serial implementation. |
The effectiveness of PipeFusion diminishes on systems with high bandwidth interconnects like NVLink where communication cost is less of a bottleneck.
Uneven partitioning of DiT layers across devices can lead to additional overhead, requiring further optimization. |
diffusion models, parallel computing, image generation, diffusion transformers, pipeline parallelism |
2405.14343
Report |
Efficient Visual State Space Model for Image Deblurring |
Lingshun Kong, Jiangxin Dong, Ming-Hsuan Yang, Jinshan Pan |
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have
achieved excellent performance in image restoration. ViTs typically yield
superior results in image restoration compared to CNNs due to their ability to
capture long-range dependencies and input-dependent characteristics. However,
the computational complexity of Transformer-based models grows quadratically
with the image resolution, limiting their practical appeal in high-resolution
image restoration tasks. In this paper, we propose a simple yet effective
visual state space model (EVSSM) for image deblurring, leveraging the benefits
of state space models (SSMs) for visual data. In contrast to existing methods
that employ several fixed-direction scans for feature extraction, which
significantly increases the computational cost, we develop an efficient visual
scan block that applies various geometric transformations before each SSM-based
module, capturing useful non-local information and maintaining high efficiency.
Extensive experimental results show that the proposed EVSSM performs favorably
against state-of-the-art image deblurring methods on benchmark datasets and
real-captured images. |
This paper introduces EVSSM, an efficient visual state space model for image deblurring, featuring an efficient visual scan block that employs geometric transformations to enhance non-local information exploration with minimal computational overhead. |
Existing image deblurring methods, especially Transformer-based ones, often struggle to balance computational efficiency with capturing long-range dependencies crucial for high-quality restoration, especially at high resolutions. |
The EVSSM employs a hierarchical encoder-decoder framework with efficient visual scan blocks. These blocks utilize geometric transformations (flip, transpose) before each scan, allowing for diverse contextual information capture from different directions without the computational burden of multi-directional scanning. |
EVSSM outperforms state-of-the-art image deblurring methods on benchmark datasets (GoPro, HIDE, RealBlur), achieving higher PSNR and SSIM values.
The proposed efficient visual scan block effectively captures non-local information, leading to better restoration of image structures and details compared to methods with limited global context modeling.
EVSSM demonstrates lower computational complexity and faster runtime than competing methods while maintaining superior deblurring performance. |
The current implementation explores limited geometric transformations (flip, transpose).
Future work will investigate more sophisticated transformations like polar coordinate transformations to further enhance spatial information characterization using SSMs. |
image deblurring, state space models, visual scan block, geometric transformations, deep learning |
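The EVSSM summary describes applying a flip or transpose before each scan so that a single scan direction per block is enough. The sketch below only illustrates, on a tiny grid, how such transforms change the order in which one fixed row-major scan visits pixels. The function names are hypothetical and the SSM itself is left out.

```python
def transpose(grid):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*grid)]

def flip(grid):
    """Reverse the grid along both axes, so a row-major scan runs backwards."""
    return [row[::-1] for row in grid[::-1]]

def row_major_scan(grid):
    """The single fixed scan: flatten left-to-right, top-to-bottom."""
    return [value for row in grid for value in row]

grid = [[0, 1, 2],
        [3, 4, 5]]

# One cheap transform per block changes what the same scan visits first.
orders = {
    "identity": row_major_scan(grid),
    "transpose": row_major_scan(transpose(grid)),
    "flip": row_major_scan(flip(grid)),
}
```

Alternating the transform from block to block exposes the model to several traversal orders while each block still runs only one scan.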
2405.14338
Report |
MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models |
Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, Hesheng Wang |
Point cloud videos effectively capture real-world spatial geometries and
temporal dynamics, which are essential for enabling intelligent agents to
understand the dynamically changing 3D world we live in. Although static 3D
point cloud processing has witnessed significant advancements, designing an
effective 4D point cloud video backbone remains challenging, mainly due to the
irregular and unordered distribution of points and temporal inconsistencies
across frames. Moreover, recent state-of-the-art 4D backbones predominantly
rely on transformer-based architectures, which commonly suffer from large
computational costs due to their quadratic complexity, particularly when
processing long video sequences. To address these challenges, we propose a
novel 4D point cloud video understanding backbone based on the recently
advanced State Space Models (SSMs). Specifically, our backbone begins by
disentangling space and time in raw 4D sequences, and then establishes
spatio-temporal correlations using our newly developed Intra-frame Spatial
Mamba and Inter-frame Temporal Mamba blocks. The Intra-frame Spatial Mamba
module is designed to encode locally similar or related geometric structures
within a certain temporal searching stride, which can effectively capture
short-term dynamics. Subsequently, these locally correlated tokens are
delivered to the Inter-frame Temporal Mamba module, which globally integrates
point features across the entire video with linear complexity, further
establishing long-range motion dependencies. Experimental results on human
action recognition and 4D semantic segmentation tasks demonstrate the
superiority of our proposed method. In particular, for long video sequences, our
proposed Mamba-based method has an 87.5% GPU memory reduction, 5.36 times
speed-up, and much higher accuracy (up to +10.4%) compared with
transformer-based counterparts on MSR-Action3D dataset. |
This paper introduces MAMBA4D, a novel 4D point cloud video understanding backbone based entirely on state space models (SSMs), addressing the limitations of traditional CNN and transformer-based methods in terms of efficiency and long-range dependency modeling. |
Developing effective learning backbones for dynamic 4D point cloud sequences is crucial for enabling intelligent agents to understand the dynamically changing 3D world, impacting applications like robotics, AR/VR, and SLAM systems. Existing methods struggle with efficiency due to the irregular nature of point clouds and limitations in capturing long-range temporal dependencies. |
MAMBA4D disentangles space and time in 4D sequences and employs two novel modules: Intra-frame Spatial Mamba and Inter-frame Temporal Mamba. The former captures short-term local structures within temporally grouped point tubes, while the latter integrates features globally across the entire video with linear complexity to establish long-range dependencies. The authors also investigate different spatio-temporal scanning strategies within the Inter-frame Temporal Mamba. |
MAMBA4D outperforms CNN- and transformer-based methods in 3D action recognition on the MSR-Action3D dataset, achieving higher accuracy and exhibiting superior efficiency (87.5% GPU memory reduction, 5.36x faster) compared to the transformer baseline.
MAMBA4D demonstrates competitive performance in 4D semantic segmentation on the Synthia 4D dataset, surpassing existing methods on most sub-sequences and achieving a 0.19 mIoU improvement over the baseline.
Ablation studies validate the contribution of individual components in MAMBA4D, including the spatial and temporal modeling modules, the number of blocks, and the choice of spatio-temporal scanning strategies. |
While excelling in long video processing, MAMBA4D's accuracy for short video inputs is slightly lower compared to transformer-based methods.
Future work will explore the application of MAMBA4D to other 4D tasks such as point-based object tracking, 4D point cloud prediction, and multi-frame scene flow. |
4d point cloud video understanding, state space models, spatio-temporal modeling, action recognition, semantic segmentation |
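The linear complexity attributed to the Inter-frame Temporal Mamba in MAMBA4D comes from the recurrent form of a state space model, which touches each token once. The following toy scan uses a fixed scalar state, which is an assumption made for brevity; Mamba itself uses input-dependent parameters and vector-valued states.

```python
def ssm_scan(inputs, a=0.5, b=1.0, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t in one pass over the sequence."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x  # carry information forward in time
        outputs.append(c * state)
    return outputs

# An impulse at the first token decays geometrically through later tokens.
outputs = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

The loop does a constant amount of work per token, so cost grows linearly with sequence length, unlike pairwise attention.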
2405.14294
Report |
Tuning-free Universally-Supervised Semantic Segmentation |
Xiaobo Yang, Xiaojin Gong |
This work presents a tuning-free semantic segmentation framework based on
classifying SAM masks by CLIP, which is universally applicable to various types
of supervision. Initially, we utilize CLIP's zero-shot classification ability
to generate pseudo-labels or perform open-vocabulary segmentation. However, the
misalignment between mask and CLIP text embeddings leads to suboptimal results.
To address this issue, we propose discrimination-bias aligned CLIP (DBA-CLIP) to closely
align mask and text embeddings, offering an overhead-free performance gain. We
then construct a global-local consistent classifier to classify SAM masks,
which reveals the intrinsic structure of high-quality embeddings produced by
DBA-CLIP and demonstrates robustness against noisy pseudo-labels. Extensive
experiments validate the efficiency and effectiveness of our method, and we
achieve state-of-the-art (SOTA) or competitive performance across various
datasets and supervision types. |
This paper introduces a novel tuning-free semantic segmentation framework that classifies Segment Anything Model (SAM) masks using Contrastive Language-Image Pretraining (CLIP) for diverse supervision levels (fully, semi, weakly, open-vocabulary). |
This approach aims to leverage the zero-shot capabilities of CLIP and the efficiency of tuning-free methods to achieve accurate and adaptable semantic segmentation across different levels of supervision. |
The framework utilizes a discrimination-bias aligned CLIP (DBA-CLIP) to generate text-aligned mask embeddings and a global-local consistent classifier (GLCC) for robust classification, particularly under sparse supervision. |
The method achieves state-of-the-art or competitive results on PASCAL VOC 2012, COCO-Obj, MS COCO 2014, COCO-Stuff, and Cityscapes datasets.
DBA-CLIP significantly improves CLIP's zero-shot classification accuracy by aligning text and mask embeddings.
GLCC effectively mitigates noise in pseudo-labels, leading to more accurate segmentation, especially in weakly and semi-supervised settings. |
The method's performance is limited by the capabilities of the underlying foundational models (SAM and CLIP).
The high inference cost of foundational models may hinder deployment in resource-constrained environments. |
semantic segmentation, tuning-free, clip, sam, weakly supervised learning |
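The starting point of this framework, classifying each SAM mask with CLIP in a zero-shot manner, can be sketched as a nearest-text-embedding lookup by cosine similarity. The vectors below are made-up stand-ins for real CLIP embeddings, and the function names are hypothetical; DBA-CLIP and GLCC are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify_masks(mask_embeddings, text_embeddings):
    """Assign each mask the class whose text embedding is most similar."""
    labels = []
    for mask_vec in mask_embeddings:
        scores = {name: cosine(mask_vec, text_vec)
                  for name, text_vec in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# Made-up 3-D embeddings; real ones would come from CLIP's encoders.
text_embeddings = {"cat": [1.0, 0.0, 0.2], "road": [0.0, 1.0, 0.1]}
mask_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.3]]
labels = classify_masks(mask_embeddings, text_embeddings)
```

Because the lookup needs no trained parameters, the same routine can produce pseudo-labels or open-vocabulary predictions by changing only the class names.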
2405.14276
Report |
D-MiSo: Editing Dynamic 3D Scenes using Multi-Gaussians Soup |
Joanna Waczyńska, Piotr Borycki, Joanna Kaleta, Sławomir Tadeja, Przemysław Spurek |
Over the past years, we have observed an abundance of approaches for modeling
dynamic 3D scenes using Gaussian Splatting (GS). Such solutions use GS to
represent the scene's structure and a neural network to model dynamics. Such
approaches allow fast rendering and extracting each element of such a dynamic
scene. However, modifying such objects over time is challenging. SC-GS (Sparse
Controlled Gaussian Splatting) enhanced with Deformed Control Points partially
solves this issue. However, this approach necessitates selecting elements that
need to be kept fixed, as well as centroids that should be adjusted throughout
editing. Moreover, this task poses additional difficulties regarding the
reproducibility of such editing. To address this, we propose Dynamic
Multi-Gaussian Soup (D-MiSo), which allows us to model the mesh-inspired
representation of dynamic GS. Additionally, we propose a strategy of linking
parameterized Gaussian splats, forming a Triangle Soup with the estimated mesh.
Consequently, we can separately construct new trajectories for the 3D objects
composing the scene. Thus, we can make the scene's dynamics editable over time
or while maintaining partial dynamics. |
This paper introduces D-MiSo, a novel mesh-inspired Gaussian Splatting method for modeling and editing dynamic 3D scenes. |
Existing methods struggle with efficiently editing dynamic 3D scenes represented by Gaussian Splats. D-MiSo allows for intuitive and scalable object modifications over time. |
D-MiSo utilizes Multi-Gaussian components: larger Core-Gaussians (for global transformations) encompassing smaller Sub-Gaussians (for rendering). These are parameterized by two Triangle Soups, enabling mesh-like control. Two deformation networks handle object movement and detailed changes over time. |
D-MiSo achieves comparable or superior reconstruction quality to state-of-the-art methods on D-NeRF, NeRF-DS, and PanopticSports datasets.
The method enables three editing approaches: modifying the estimated mesh, directly editing the Sub-Triangle Soup, and transforming the object's space.
D-MiSo allows for intuitive object manipulation, including moving, scaling, rotating, duplicating, removing, and applying dynamic effects. |
Editing areas poorly represented in the training set remains challenging due to limitations of Triangle Soup.
Future work could explore more sophisticated meshing strategies or alternative representations for detailed editing. |
gaussian splatting, dynamic 3d scenes, scene editing, mesh-based representation, multi-gaussian components |
2405.14241
Report |
NeuroGauss4D-PCI: 4D Neural Fields and Gaussian Deformation Fields for Point Cloud Interpolation |
Chaokang Jiang, Dalong Du, Jiuming Liu, Siting Zhu, Zhenqiang Liu, Zhuang Ma, Zhujin Liang, Jie Zhou |
Point Cloud Interpolation confronts challenges from point sparsity, complex
spatiotemporal dynamics, and the difficulty of deriving complete 3D point
clouds from sparse temporal information. This paper presents NeuroGauss4D-PCI,
which excels at modeling complex non-rigid deformations across varied dynamic
scenes. The method begins with an iterative Gaussian cloud soft clustering
module, offering structured temporal point cloud representations. The proposed
temporal radial basis function Gaussian residual utilizes Gaussian parameter
interpolation over time, enabling smooth parameter transitions and capturing
temporal residuals of Gaussian distributions. Additionally, a 4D Gaussian
deformation field tracks the evolution of these parameters, creating continuous
spatiotemporal deformation fields. A 4D neural field transforms low-dimensional
spatiotemporal coordinates ($x,y,z,t$) into a high-dimensional latent space.
Finally, we adaptively and efficiently fuse the latent features from neural
fields and the geometric features from Gaussian deformation fields.
NeuroGauss4D-PCI outperforms existing methods in point cloud frame
interpolation, delivering leading performance on both object-level (DHB) and
large-scale autonomous driving datasets (NL-Drive), with scalability to
auto-labeling and point cloud densification tasks. The source code is released
at https://github.com/jiangchaokang/NeuroGauss4D-PCI. |
This paper presents NeuroGauss4D-PCI, a novel 4D spatio-temporal modeling method for point cloud frame interpolation that excels at modeling complex non-rigid deformations across varied dynamic scenes by adaptively fusing a 4D neural field and a 4D Gaussian deformation field. |
Point cloud frame interpolation (PCI) is crucial for various applications, including autonomous driving and virtual reality, but faces challenges due to the inherent sparsity of point cloud data, the complexity of modeling spatiotemporal dynamics, and the difficulty of generalizing from sparse temporal samples. |
NeuroGauss4D-PCI represents point clouds through iterative Gaussian soft clustering and a 4D neural field. A temporal radial basis function Gaussian residual module captures temporal dynamics of Gaussian parameters, while a 4D Gaussian deformation field models their spatiotemporal variations. Finally, a fast latent-geometric fusion module combines features from the 4D neural field and the 4D Gaussian deformation field. |
NeuroGauss4D-PCI achieves state-of-the-art performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive).
The method effectively handles challenges like non-rigid deformations, large-scale motions, occlusions, and non-uniform data distributions.
NeuroGauss4D-PCI demonstrates scalability to tasks like LiDAR-camera temporal synchronization, point cloud densification, and 4D automatic annotation. |
The model's interpretability is limited due to the integration of various features and the use of deep neural networks.
The runtime optimization process, similar to NeRF, is computationally demanding, accounting for nearly 90% of the total processing time. |
point cloud interpolation, 4d spatio-temporal modeling, gaussian deformation field, neural field, autonomous driving |
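The NeuroGauss4D-PCI summary refers to interpolating Gaussian parameters over time with temporal radial basis functions. As a rough illustration only, the sketch below interpolates one scalar parameter (for example, one coordinate of a Gaussian mean) with normalized Gaussian RBF weights. The weighting scheme, the bandwidth, and the function name are assumptions, not details from the paper.

```python
import math

def rbf_interpolate(times, values, query_time, bandwidth=0.5):
    """Blend keyframe values with normalized Gaussian RBF weights in time."""
    weights = [math.exp(-((query_time - t) / bandwidth) ** 2) for t in times]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# One coordinate of a Gaussian mean observed at two keyframes.
times = [0.0, 1.0]
mean_x = [2.0, 4.0]
midpoint = rbf_interpolate(times, mean_x, 0.5)
```

The weights vary smoothly with the query time, which is what gives smooth parameter transitions between frames.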
2405.14224
Report |
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis |
Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu |
Diffusion models have achieved great success in image generation, with the
backbone evolving from U-Net to Vision Transformers. However, the computational
cost of Transformers is quadratic to the number of tokens, leading to
significant challenges when dealing with high-resolution images. In this work,
we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a
sequence model based on State Space Models (SSM), with the expressive power of
diffusion models for efficient high-resolution image synthesis. To address the
challenge that Mamba cannot generalize to 2D signals, we make several
architecture designs including multi-directional scans, learnable padding
tokens at the end of each row and column, and lightweight local feature
enhancement. Our DiM architecture achieves inference-time efficiency for
high-resolution images. In addition, to further improve training efficiency for
high-resolution image generation with DiM, we investigate a "weak-to-strong"
training strategy that pretrains DiM on low-resolution images ($256\times 256$)
and then fine-tunes it on high-resolution images ($512 \times 512$). We further
explore training-free upsampling strategies to enable the model to generate
higher-resolution images (e.g., $1024\times 1024$ and $1536\times 1536$)
without further fine-tuning. Experiments demonstrate the effectiveness and
efficiency of our DiM. |
Proposes Diffusion Mamba (DiM), a Mamba-based diffusion model for efficient high-resolution image synthesis, by combining the efficiency of Mamba with the expressive power of diffusion models. |
Addresses the computational challenges of Transformer-based diffusion models in high-resolution image generation due to their quadratic complexity. |
Introduces architectural designs like multi-directional scans, learnable padding tokens, and lightweight local feature enhancement to adapt Mamba for 2D image data. Employs a 'weak-to-strong' training strategy, pretraining on low-resolution images and fine-tuning on high-resolution images. |
DiM achieves inference-time efficiency for high-resolution images, outperforming Transformers at resolutions above 1280x1280.
Pretraining on low-resolution images and fine-tuning on high-resolution images significantly reduces training time and computational cost.
Training-free upsampling techniques enable DiM to generate even higher resolution images (e.g., 1024x1024, 1536x1536) without further fine-tuning. |
DiM, while faster at very high resolutions, is slightly less efficient than Transformers at resolutions below 1024x1024.
The model still faces challenges in generating images with complex details, particularly for human subjects and in avoiding repeating patterns during upsampling.
Future work could focus on optimizing DiM's efficiency at lower resolutions and improving its ability to handle complex details. |
image generation, diffusion models, state space models, mamba, high-resolution |
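DiM's padding tokens at the end of each row and column mark where a 2D line ends once the image is flattened into a 1D sequence. The sketch below shows the resulting token order for a row scan and a column scan; a string stands in for the learnable embedding, and the function names are hypothetical.

```python
PAD = "<pad>"  # stands in for a learnable padding embedding

def row_scan_with_padding(grid):
    """Flatten row by row, appending a padding token after each row."""
    sequence = []
    for row in grid:
        sequence.extend(row)
        sequence.append(PAD)
    return sequence

def column_scan_with_padding(grid):
    """Flatten column by column, appending a padding token after each column."""
    columns = [list(col) for col in zip(*grid)]
    return row_scan_with_padding(columns)

grid = [["a", "b"],
        ["c", "d"]]
row_sequence = row_scan_with_padding(grid)
column_sequence = column_scan_with_padding(grid)
```

Without the markers, the last token of one row and the first token of the next would look adjacent to a sequence model even though they sit at opposite edges of the image.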
2405.14206
Report |
LG-VQ: Language-Guided Codebook Learning |
Guotao Liang, Baoquan Zhang, Yaowei Wang, Xutao Li, Yunming Ye, Huaibin Wang, Chuyao Luo, Kola Ye, Linfeng Luo |
Vector quantization (VQ) is a key technique in high-resolution and
high-fidelity image synthesis, which aims to learn a codebook to encode an
image with a sequence of discrete codes and then generate an image in an
autoregressive manner. Although existing methods have shown superior
performance, most methods prefer to learn a single-modal codebook (e.g.,
image), resulting in suboptimal performance when the codebook is applied to
multi-modal downstream tasks (e.g., text-to-image, image captioning) due
to the existence of modal gaps. In this paper, we propose a novel
language-guided codebook learning framework, called LG-VQ, which aims to learn
a codebook that can be aligned with the text to improve the performance of
multi-modal downstream tasks. Specifically, we first introduce pre-trained text
semantics as prior knowledge, then design two novel alignment modules
(i.e., Semantic Alignment Module and Relationship Alignment Module) to
transfer such prior knowledge into codes for achieving codebook-text alignment.
In particular, our LG-VQ method is model-agnostic, which can be easily
integrated into existing VQ models. Experimental results show that our method
achieves superior performance on reconstruction and various multi-modal
downstream tasks. |
This paper proposes LG-VQ, a language-guided codebook learning framework for VQ models that aligns codebooks with text semantics to enhance performance in multi-modal downstream tasks. |
Existing VQ methods primarily focus on single-modal codebooks, leading to suboptimal performance in multi-modal tasks due to modal gaps and a lack of high-level semantics. |
LG-VQ leverages pre-trained text semantics from CLIP and introduces two novel alignment modules: a Semantic Alignment Module (global semantic alignment and masked text prediction) and a Relationship Alignment Module (transfers semantic relationships between words to codes). |
LG-VQ outperforms baseline VQ models in image reconstruction across multiple datasets, as evidenced by FID and PSNR scores.
The method demonstrates strong performance in multi-modal downstream tasks like text-to-image synthesis, image captioning, and VQA, highlighting the effectiveness of the text-aligned codebook.
Ablation studies confirm the individual contributions of the semantic and relationship alignment modules to the improved performance. |
The current approach assumes a one-to-one mapping between words and codes, potentially overlooking more complex relationships.
While LG-VQ significantly enhances VQ performance in visual text reasoning, there is still a performance gap compared to dedicated image captioning or VQA models. |
vector quantization, multi-modal learning, codebook learning, vision-language representation learning, image generation |
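The vector quantization step that LG-VQ builds on maps each continuous feature vector to the index of its nearest codebook entry. This is a minimal sketch of that lookup with made-up 2-D vectors; the function name is hypothetical, and the language-guided alignment modules are not modeled.

```python
def quantize(features, codebook):
    """Return, for each feature vector, the index of the nearest code."""
    indices = []
    for vec in features:
        distances = [sum((a - b) ** 2 for a, b in zip(vec, code))
                     for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices

# Made-up codebook and encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.8], [0.2, 0.9]]
indices = quantize(features, codebook)
```

The resulting index sequence is what an autoregressive generator would model, and the codebook vectors are what LG-VQ aims to align with text semantics.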
2405.14201
Report |
FreeTuner: Any Subject in Any Style with Training-free Diffusion |
Youcan Xu, Zhen Wang, Jun Xiao, Wei Liu, Long Chen |
With the advance of diffusion models, various personalized image generation
methods have been proposed. However, almost all existing work only focuses on
either subject-driven or style-driven personalization. Meanwhile,
state-of-the-art methods face several challenges in realizing compositional
personalization, i.e., composing different subject and style concepts. These
challenges include concept disentanglement, a unified reconstruction paradigm, and
insufficient training data. To address these issues, we introduce FreeTuner, a flexible and
training-free method for compositional personalization that can generate any
user-provided subject in any user-provided style (see Figure 1). Our approach
employs a disentanglement strategy that separates the generation process into
two stages to effectively mitigate concept entanglement. FreeTuner leverages
the intermediate features within the diffusion model for subject concept
representation and introduces style guidance to align the synthesized images
with the style concept, ensuring the preservation of both the subject's
structure and the style's aesthetic features. Extensive experiments have
demonstrated the generation ability of FreeTuner across various personalization
settings. |
Introduces FreeTuner, a training-free method for compositional personalization in image generation, enabling the synthesis of user-provided subjects in user-provided styles using diffusion models. |
Addresses limitations of existing personalization methods that focus on either subject-driven or style-driven generation, failing to effectively compose both aspects. |
Employs a two-stage disentanglement strategy: 1) Content generation stage leverages intermediate features from diffusion models for subject representation. 2) Style generation stage introduces style guidance based on pre-trained encoders (e.g., VGG-19) to align the output with desired style aesthetics. |
Achieves high-quality compositional personalization, preserving both subject structure and style aesthetics.
Outperforms existing methods like B-LoRA in composing subjects and styles, showing superior visual fidelity.
Demonstrates generalizability by integrating seamlessly with other diffusion-based methods like ControlNet and BoxDiff. |
Reliance on null-text inversion for accurate feature extraction increases computational time compared to standard inversion methods.
Style transfer capability is limited by the visual encoder employed. |
image generation, compositional personalization, diffusion models, style transfer, training-free methods |
2405.14174
Report |
Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model |
Yuheng Shi, Minjing Dong, Chang Xu |
Despite the significant achievements of Vision Transformers (ViTs) in various
vision tasks, they are constrained by the quadratic complexity. Recently, State
Space Models (SSMs) have garnered widespread attention due to their global
receptive field and linear complexity with respect to the input length,
demonstrating substantial potential across fields including natural language
processing and computer vision. To improve the performance of SSMs in vision
tasks, a multi-scan strategy is widely adopted, which leads to significant
redundancy of SSMs. For a better trade-off between efficiency and performance,
we analyze the underlying reasons behind the success of the multi-scan
strategy, where long-range dependency plays an important role. Based on the
analysis, we introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the
superiority of SSMs in vision tasks with limited parameters. It employs a
multi-scale 2D scanning technique on both original and downsampled feature
maps, which not only benefits long-range dependency learning but also reduces
computational costs. Additionally, we integrate a Convolutional Feed-Forward
Network (ConvFFN) to address the lack of channel mixing. Our experiments
demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model
achieving 82.8% top-1 accuracy on ImageNet, 46.9% box mAP, and 42.2% instance
mAP with the Mask R-CNN framework, 1x training schedule on COCO, and 47.6% mIoU
with single-scale testing on ADE20K. Code is available at
https://github.com/YuHengsss/MSVMamba. |
This paper proposes Multi-Scale Vision Mamba (MSVMamba), an efficient and scalable State Space Model (SSM) for vision tasks, addressing the quadratic complexity issue of Vision Transformers (ViTs) and redundancy in multi-scan SSMs. |
This work is important because it improves the efficiency and performance of SSMs in vision tasks by addressing the long-range dependency limitations of existing methods. |
The paper introduces a Multi-Scale 2D (MS2D) scanning strategy and incorporates a Convolutional Feed-Forward Network (ConvFFN) to enhance channel mixing within the MSVMamba architecture. |
MSVMamba achieves competitive results on ImageNet-1K, outperforming similar-sized models while using fewer computational resources.
In object detection and instance segmentation tasks on COCO, MSVMamba surpasses Swin Transformer and other SSM-based models in terms of accuracy.
For semantic segmentation on ADE20K, MSVMamba demonstrates superior performance compared to competing models like Swin, ConvNeXt, and VMamba. |
The scalability of the multi-scale design needs further exploration, especially for larger model sizes where the improvement might be marginal.
Future work could investigate the application of MSVMamba in other vision tasks beyond classification, detection, and segmentation. |
state space models, vision transformers, computer vision, multi-scale modeling, long-range dependencies |
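The cost argument behind MSVMamba's multi-scale scanning can be checked with simple arithmetic: scans over a downsampled map process far fewer tokens than scans over the original map. The split used below (one full-resolution scan plus three scans at stride 2, compared with four full-resolution scans) and the 56 x 56 map size are assumptions chosen for illustration, not figures taken from the record above.

```python
def scanned_tokens(height, width, full_res_scans, downsampled_scans, stride=2):
    """Count tokens processed across all scans of one feature map."""
    full = full_res_scans * height * width
    small = downsampled_scans * (height // stride) * (width // stride)
    return full + small

# Assumed setup: a 56 x 56 feature map and four scan directions in total.
baseline = scanned_tokens(56, 56, full_res_scans=4, downsampled_scans=0)
multi_scale = scanned_tokens(56, 56, full_res_scans=1, downsampled_scans=3)
ratio = multi_scale / baseline
```

Under these assumptions the multi-scale variant scans fewer than half as many tokens, since each stride-2 map holds a quarter of the positions.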
2405.14129
Report |
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability |
Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai |
Multimodal Large Language Models (MLLMs) are widely regarded as crucial in
the exploration of Artificial General Intelligence (AGI). The core of MLLMs
lies in their capability to achieve cross-modal alignment. To attain this goal,
current MLLMs typically follow a two-phase training paradigm: the pre-training
phase and the instruction-tuning phase. Despite their success, there are
shortcomings in the modeling of alignment capabilities within these models.
Firstly, during the pre-training phase, the model usually assumes that all
image-text pairs are uniformly aligned, but in fact the degree of alignment
between different image-text pairs is inconsistent. Secondly, the instructions
currently used for finetuning incorporate a variety of tasks, and different tasks'
instructions usually require different levels of alignment capabilities, but
previous MLLMs overlook these differentiated alignment needs. To tackle these
issues, we propose a new multimodal large language model AlignGPT. In the
pre-training stage, instead of treating all image-text pairs equally, we assign
different levels of alignment capabilities to different image-text pairs. Then,
in the instruction-tuning phase, we adaptively combine these different levels
of alignment capabilities to meet the dynamic alignment needs of different
instructions. Extensive experimental results show that our model achieves
competitive performance on 12 benchmarks. |
This paper presents AlignGPT, a new multimodal large language model that enhances the alignment capabilities of existing MLLMs by considering varying degrees of alignment in image-text pairs. |
Existing MLLMs often assume uniform alignment between image-text pairs during pre-training and overlook the different alignment requirements of various tasks during instruction-tuning. |
AlignGPT introduces controllable alignment levels during pre-training based on CLIP scores and adaptively combines global and local alignment capabilities during instruction-tuning according to the specific needs of each instruction. |
AlignGPT achieves competitive performance on 12 benchmarks, outperforming several state-of-the-art MLLMs.
Higher image resolutions lead to improved model performance in most multimodal tasks.
The choice of large language model significantly impacts AlignGPT's performance, with larger models and those fine-tuned on instructional data generally performing better. |
AlignGPT might not excel in text-centric scenarios that demand a strong focus on text understanding.
Future work can explore the impact of different visual backbones and more sophisticated gate network architectures. |
multimodal large language model, cross-modal alignment, visual question answering, instruction tuning, clip score |
2405.14119
Report |
PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking |
Chongwei Liu, Haojie Li, Zhihui Wang, Rui Xu |
Recent advances in Multi-Object Tracking (MOT) have achieved remarkable
success in short-term association within the decoupled tracking-by-detection
online paradigm. However, long-term tracking still remains a challenging task.
Although graph-based approaches can address this issue by modeling trajectories
as a graph in the decoupled manner, their non-online nature poses obstacles for
real-time applications. In this paper, we demonstrate that the trajectory graph
is a directed acyclic graph, which can be represented by an object sequence
arranged by frame and a binary adjacency matrix. It is a coincidence that the
binary matrix matches the attention mask in the Transformer, and the object
sequence serves exactly as a natural input sequence. Intuitively, we propose
that a pure Transformer can naturally unify short- and long-term associations
in a decoupled and online manner. Our experiments show that a classic
Transformer architecture naturally suits the association problem and achieves a
strong baseline compared to existing foundational methods across four datasets:
DanceTrack, SportsMOT, MOT17, and MOT20, as well as superior generalizability
in domain shift. Moreover, the decoupled property also enables efficient
training and inference. This work pioneers a promising Transformer-based
approach for the MOT task, and provides code to facilitate further research.
https://github.com/chongweiliu/PuTR |
This paper proposes PuTR, a pure Transformer architecture for online multi-object tracking (MOT) association, unifying short- and long-term association in a decoupled manner. |
Existing online MOT methods struggle with long-term tracking, while offline graph-based methods lack real-time applicability. This work explores the potential of Transformers to address both short- and long-term association in a unified and efficient framework. |
The authors leverage the natural alignment between the trajectory graph and the Transformer's attention mechanism. Objects are tokenized and fed into a Transformer with a modified attention mask and positional encodings to handle temporal and spatial relationships. A relative affinity matrix is used for association, eliminating the need for a fixed ID dictionary. |
PuTR achieves strong baseline performance compared to existing foundational methods on DanceTrack, SportsMOT, MOT17, and MOT20 datasets.
It exhibits superior generalization ability in domain shift scenarios, maintaining consistent performance across datasets without fine-tuning.
The decoupled nature allows for efficient training (under 1 hour on a single GPU) and inference (up to 90 FPS). |
The context length of the Transformer limits the model's ability to handle long sequences, requiring further exploration of long context modeling.
The current model primarily relies on appearance cues, and incorporating motion information could enhance performance, particularly for small objects. |
multi-object tracking, transformer, association, online tracking, long-term tracking |
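For the PuTR entry above, the correspondence between the trajectory graph's binary adjacency matrix and a Transformer attention mask can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that each object may attend to itself and to objects from strictly earlier frames, and the function name `build_association_mask` is hypothetical.

```python
def build_association_mask(frame_ids):
    """Build a boolean attention mask for an object sequence sorted by frame.

    mask[i][j] is True when object i may attend to object j.
    Assumption (not confirmed by the source): an object sees itself and
    every object from a strictly earlier frame, never same-frame peers.
    """
    n = len(frame_ids)
    return [
        [i == j or frame_ids[j] < frame_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Toy sequence: two objects in frame 0, three in frame 1, one in frame 2.
frame_ids = [0, 0, 1, 1, 1, 2]
mask = build_association_mask(frame_ids)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Because the mask depends only on frame indices, it can be rebuilt cheaply as new detections arrive, which is in keeping with the online setting the entry describes.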
2405.14101
Report |
Enhancing Image Layout Control with Loss-Guided Diffusion Models |
Zakaria Patel, Kirill Serkh |
Diffusion models are a powerful class of generative models capable of
producing high-quality images from pure noise. In particular, conditional
diffusion models allow one to specify the contents of the desired image using a
simple text prompt. Conditioning on a text prompt alone, however, does not
allow for fine-grained control over the composition and layout of the final
image, which instead depends closely on the initial noise distribution. While
most methods which introduce spatial constraints (e.g., bounding boxes) require
fine-tuning, a smaller and more recent subset of these methods are
training-free. They are applicable whenever the prompt influences the model
through an attention mechanism, and generally fall into one of two categories.
The first entails modifying the cross-attention maps of specific tokens
directly to enhance the signal in certain regions of the image. The second
works by defining a loss function over the cross-attention maps, and using the
gradient of this loss to guide the latent. While previous work explores these
as alternative strategies, we provide an interpretation for these methods which
highlights their complementary features, and demonstrate that it is possible to
obtain superior performance when both methods are used in concert. |
This paper presents injection loss guidance (iLGD), a training-free method for layout control in text-to-image generation using diffusion models. |
Existing methods for controlling image layout often degrade image quality or require expensive fine-tuning. iLGD addresses these limitations by combining the strengths of attention injection and loss guidance. |
iLGD biases the latent representation of the image towards a desired layout using attention injection and refines it further with a loss function applied to the attention maps. |
iLGD generates images that adhere more closely to the prescribed layout compared to using injection alone.
iLGD maintains better image quality than methods relying solely on loss guidance (e.g., BoxDiff).
iLGD achieves higher scores on perceptual quality metrics (CLIP-IQA) while maintaining comparable layout accuracy (YOLOv4) and text-image similarity (T2I-Sim) to other methods. |
The method's sensitivity to the initial random seed requires further investigation.
Exploring alternative loss functions and injection strategies could further enhance layout control and image quality. |
diffusion models, layout control, text-to-image generation, attention injection, loss guidance |
2405.14024
Report |
Two Heads are Better Than One: Neural Networks Quantization with 2D Hilbert Curve-based Output Representation |
Mykhailo Uss, Ruslan Yermolenko, Olena Kolodiazhna, Oleksii Shashko, Ivan Safonov, Volodymyr Savin, Yoonjae Yeo, Seowon Ji, Jaeyun Jeong |
Quantization is widely used to increase deep neural networks' (DNN) memory,
computation, and power efficiency. Various techniques, such as post-training
quantization and quantization-aware training, have been proposed to improve
quantization quality. We introduce a novel approach for DNN quantization that
uses a redundant representation of DNN's output. We represent the target
quantity as a point on a 2D parametric curve. The DNN model is modified to
predict 2D points that are mapped back to the target quantity at a
post-processing stage. We demonstrate that this mapping can reduce quantization
error. For the low-order parametric Hilbert curve, Depth-From-Stereo task, and
two models represented by U-Net architecture and vision transformer, we
achieved a quantization error reduction by about 5 times for the INT8 model on
both CPU and DSP delegates. This gain comes with a minimal inference time
increase (less than 7%). Our approach can be applied to other tasks, including
segmentation, object detection, and key-points prediction. |
This paper introduces a novel DNN quantization method that reduces quantization error by representing the output as a point on a 2D low-order Hilbert curve, exploiting the redundancy in this representation. |
Quantization is crucial for deploying DNNs on devices with limited resources, but it often leads to quality degradation. This method offers a way to mitigate this degradation and improve the accuracy of quantized models. |
The authors modify the DNN architecture to predict points on a Hilbert curve instead of a scalar output. They introduce a new loss function to guide the training process and utilize lookup tables for efficient mapping between 1D and 2D representations. |
The proposed method reduces quantization error by a factor of ≈5 for INT8 models on both CPU and DSP, achieving near-FP32 accuracy for the Depth-From-Stereo task.
The Hilbert curve representation effectively increases the bit-width of the output, improving the representation of spatial details in the quantized model.
The method incurs a minimal increase in inference time (<7%) without noticeable impact on power consumption. |
The approach is currently limited to models predicting bounded quantities and may not correct large quantization errors (outliers).
Further research is needed to explore its application to other tasks, quantization techniques, and higher-dimensional representations. |
quantization-aware training, space-filling curve, hilbert curve, depth-from-stereo, snapdragon neural processing engine |
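For the Hilbert-curve entry above, the lookup-table mapping between a 1D target and a 2D point can be sketched as below. This is a simplified, discrete illustration using the standard index-to-coordinate Hilbert construction; the paper's low-order parametric curve, loss function, and post-processing are not reproduced here, and the helper names (`d2xy`, `build_tables`, `encode`, `decode`) are assumptions chosen for this sketch.

```python
def d2xy(n, d):
    """Convert index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two; this is the classic iterative construction.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:  # flip the quadrant
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def build_tables(order):
    """Precompute 1D -> 2D and 2D -> 1D lookup tables for a given order."""
    n = 1 << order
    d_to_xy = [d2xy(n, d) for d in range(n * n)]
    xy_to_d = {xy: d for d, xy in enumerate(d_to_xy)}
    return d_to_xy, xy_to_d


def encode(value, d_to_xy):
    """Map a bounded scalar in [0, 1] to a grid point on the curve."""
    last = len(d_to_xy) - 1
    d = round(min(max(value, 0.0), 1.0) * last)
    return d_to_xy[d]


def decode(xy, xy_to_d):
    """Map a grid point on the curve back to a scalar in [0, 1]."""
    return xy_to_d[xy] / (len(xy_to_d) - 1)


d_to_xy, xy_to_d = build_tables(order=3)  # 8 x 8 grid, 64 curve positions
point = encode(0.3, d_to_xy)
print(point, decode(point, xy_to_d))
```

Consecutive curve positions are always grid neighbours, so nearby target values map to nearby 2D points. That locality is what makes a two-channel output a plausible stand-in for a single scalar.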
2405.13956
Report |
Attention as an RNN |
Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Mohamed Osama Ahmed, Yoshua Bengio, Greg Mori |
The advent of Transformers marked a significant breakthrough in sequence
modelling, providing a highly performant architecture capable of leveraging GPU
parallelism. However, Transformers are computationally expensive at inference
time, limiting their applications, particularly in low-resource settings (e.g.,
mobile and embedded devices). Addressing this, we (1) begin by showing that
attention can be viewed as a special Recurrent Neural Network (RNN) with the
ability to compute its many-to-one RNN output efficiently. We then (2)
show that popular attention-based models such as Transformers can be viewed as
RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models
cannot be updated efficiently with new tokens, an important property in
sequence modelling. Tackling this, we (3) introduce a new efficient method of
computing attention's many-to-many RNN output based on the parallel
prefix scan algorithm. Building on the new attention formulation, we (4)
introduce Aaren, an attention-based module that can not only (i) be
trained in parallel (like Transformers) but also (ii) be updated efficiently
with new tokens, requiring only constant memory for inference (like
traditional RNNs). Empirically, we show Aarens achieve comparable performance
to Transformers on 38 datasets spread across four popular sequential problem
settings: reinforcement learning, event forecasting, time series
classification, and time series forecasting tasks while being more time and
memory-efficient. |
This paper introduces Aaren, an attention-based module for sequence modeling that achieves comparable performance to Transformers while being more time and memory efficient. |
Transformers, while powerful, are computationally expensive at inference time, limiting their use in low-resource settings like mobile devices. Aaren addresses this limitation. |
The paper first presents attention as a special type of Recurrent Neural Network (RNN) and then introduces a new method to efficiently compute attention's RNN output based on the parallel prefix scan algorithm. Aaren builds upon this formulation. |
Aarens achieve comparable performance to Transformers across 38 datasets in four problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting.
Aarens demonstrate significant improvements in time and memory efficiency compared to Transformers, using constant memory regardless of the number of tokens processed.
Aarens can efficiently update with new tokens at inference time, making them particularly well-suited for streaming data scenarios common in sequence modeling. |
Aarens use input-independent attention queries, potentially limiting their expressiveness in large sequence models compared to Transformers.
Future work could explore applying Aarens to more complex sequence modeling tasks, such as natural language processing, to further investigate their capabilities and limitations. |
attention mechanism, sequence modeling, recurrent neural networks, parallel prefix scan, efficient inference |
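For the Aaren entry above, the view of attention as an RNN with constant-size state can be illustrated with a small sketch. It computes softmax attention for one fixed query over scalar values by carrying a running maximum, numerator, and denominator. The function names and the toy scalar setting are assumptions for illustration, and a plain sequential loop stands in for the parallel prefix scan the paper uses.

```python
import math


def attention_direct(scores, values):
    """Reference: softmax(scores)-weighted sum of scalar values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def attention_recurrent(scores, values):
    """Same quantity, computed token by token with O(1) state.

    State: m = running max score, a = numerator, c = denominator.
    Both a and c are rescaled whenever m grows, so the exponentials
    stay numerically stable. Scores are assumed to be finite.
    """
    m, a, c = float("-inf"), 0.0, 0.0
    outputs = []
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescales old state; exp(-inf) is 0.0
        w = math.exp(s - m_new)
        a = a * scale + w * v
        c = c * scale + w
        m = m_new
        outputs.append(a / c)  # attention output over the prefix so far
    return outputs


scores = [0.5, -1.0, 2.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0]
prefix_outputs = attention_recurrent(scores, values)
print(prefix_outputs[-1], attention_direct(scores, values))
```

Each new token updates only `(m, a, c)`, so memory does not grow with sequence length. The list of prefix outputs corresponds to the many-to-many output mentioned in the entry.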
2405.13951
Report |
Text Prompting for Multi-Concept Video Customization by Autoregressive Generation |
Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh |
We present a method for multi-concept customization of pretrained
text-to-video (T2V) models. Intuitively, the multi-concept customized video can
be derived from the (non-linear) intersection of the video manifolds of the
individual concepts, which is not straightforward to find. We hypothesize that
sequential and controlled walking towards the intersection of the video
manifolds, directed by text prompting, leads to the solution. To do so, we
generate the various concepts and their corresponding interactions,
sequentially, in an autoregressive manner. Our method can generate videos of
multiple custom concepts (subjects, action and background) such as a teddy bear
running towards a brown teapot, a dog playing violin and a teddy bear swimming
in the ocean. We quantitatively evaluate our method using videoCLIP and DINO
scores, in addition to human evaluation. Videos for results presented in this
paper can be found at https://github.com/divyakraman/MultiConceptVideo2024. |
This paper introduces a novel approach for multi-concept customization of pretrained text-to-video (T2V) models using autoregressive generation, allowing users to generate videos featuring multiple customized concepts and their interactions. |
Existing T2V models struggle with generating long videos featuring consistent subjects and their interactions, especially when dealing with multiple customized concepts. This work addresses this limitation by enabling more control and flexibility in video generation. |
The method involves finetuning a pretrained T2V model with adapter layers on input images/videos representing customized concepts. Then, it leverages the autoregressive nature of the model to sequentially generate video frames, introducing and controlling the appearance and interactions of multiple customized concepts over time. |
The approach effectively generates customized videos with multiple interacting subjects, demonstrating significant improvements over baseline methods.
Quantitative evaluations using videoCLIP and DINO scores, along with human evaluation, showcase the effectiveness in generating customized concepts and their interactions.
The method also proves useful for single-concept customization when compositionality is desired, offering a promising direction for future research. |
Extending the method beyond three concepts and achieving finer control over interactions remain challenging.
Improving the quality of generated videos relies heavily on advancements in video foundation models and superresolution techniques. |
text-to-video generation, multi-concept customization, autoregressive generation, video personalization, generative ai |
2405.13943
Report |
DoGaussian: Distributed-Oriented Gaussian Splatting for Large-Scale 3D Reconstruction Via Gaussian Consensus |
Yu Chen, Gim Hee Lee |
The recent advances in 3D Gaussian Splatting (3DGS) show promising results on
the novel view synthesis (NVS) task. With its superior rendering performance
and high-fidelity rendering quality, 3DGS outperforms its earlier NeRF
counterparts. The most recent 3DGS methods focus on either improving the
stability of rendering efficiency or reducing the model size. On the other
hand, the training efficiency of 3DGS on large-scale scenes has not gained much
attention. In this work, we propose DoGaussian, a method that trains 3DGS in a
distributed manner. Our method first decomposes a scene into K blocks and then
introduces the Alternating Direction Method of Multipliers (ADMM) into the
training procedure of 3DGS. During training, our DoGaussian maintains one
global 3DGS model on the master node and K local 3DGS models on the slave
nodes. The K local 3DGS models are dropped after training and we only query the
global 3DGS model during inference. The training time is reduced by scene
decomposition, and the training convergence and stability are guaranteed
through the consensus on the shared 3D Gaussians. Our method accelerates the
training of 3DGS by 6+ times when evaluated on large-scale scenes while
concurrently achieving state-of-the-art rendering quality. Our project page is
available at https://aibluefisher.github.io/DoGaussian. |
This paper introduces DoGaussian, a distributed training approach for 3D Gaussian Splatting (3DGS) aimed at improving the efficiency of large-scale scene reconstruction. |
Training 3DGS on large scenes poses challenges due to high GPU memory requirements and long training times. DoGaussian addresses these issues by enabling efficient distributed training. |
DoGaussian decomposes the scene into blocks, assigns training data to each block, and uses the Alternating Direction Method of Multipliers (ADMM) to ensure consistency among shared 3D Gaussians during training. |
DoGaussian accelerates the training of 3DGS by 6+ times compared to the original 3DGS on large-scale scenes.
The method maintains high-fidelity rendering quality, achieving state-of-the-art results in novel view synthesis.
Ablation studies demonstrate the effectiveness of individual components, such as 3D Gaussian consensus and adaptive penalty parameters. |
The current implementation relies on an RPC module for communication, which might limit flexibility compared to decentralized approaches.
Future work could explore incorporating level-of-detail (LOD) techniques to further reduce GPU memory consumption during training. |
3d gaussian splatting, large-scale 3d reconstruction, distributed training, novel view synthesis, admm |
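For the DoGaussian entry above, the consensus step that ADMM provides can be sketched on a toy problem. Here K workers each hold a local scalar objective, and a global variable is driven to agreement with all local copies. This is textbook consensus ADMM on quadratics, not the paper's 3DGS training loop: the objectives, the fixed penalty `rho`, and the function name `consensus_admm` are all assumptions for illustration.

```python
def consensus_admm(targets, rho=1.0, iterations=100):
    """Consensus ADMM for: minimize sum_k 0.5 * (x_k - targets[k])**2
    subject to x_k == z for every worker k.

    x: local copies (one per worker), z: global consensus variable,
    u: scaled dual variables that penalize disagreement with z.
    """
    k = len(targets)
    x = [0.0] * k
    u = [0.0] * k
    z = 0.0
    for _ in range(iterations):
        # Local step: each worker solves its own penalized problem
        # in closed form (this part could run in parallel).
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Global step: average the local copies (plus duals).
        z = sum(xi + ui for xi, ui in zip(x, u)) / k
        # Dual step: accumulate each worker's remaining disagreement.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z, x


z, local = consensus_admm([1.0, 2.0, 6.0])
print(z, local)
```

On this toy problem the consensus value converges to the mean of the targets. In the setting the entry describes, the shared 3D Gaussians would play the role of the consensus variable while each block's model is a local copy.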
2405.13870
Report |
FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition |
Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, Chunhua Shen |
Benefiting from large-scale pre-trained text-to-image (T2I) generative
models, impressive progress has been achieved in customized image generation,
which aims to generate user-specified concepts. Existing approaches have
extensively focused on single-concept customization and still encounter
challenges when it comes to complex scenarios that involve combining multiple
concepts. These approaches often require retraining/fine-tuning using a few
images, leading to time-consuming training processes and impeding their swift
implementation. Furthermore, the reliance on multiple images to represent a
singular concept increases the difficulty of customization. To this end, we
propose FreeCustom, a novel tuning-free method to generate customized images of
multi-concept composition based on reference concepts, using only one image per
concept as input. Specifically, we introduce a new multi-reference
self-attention (MRSA) mechanism and a weighted mask strategy that enables the
generated image to access and focus more on the reference concepts. In
addition, MRSA leverages our key finding that input concepts are better
preserved when providing images with context interactions. Experiments show
that the images produced by our method are consistent with the given concepts and
better aligned with the input text. Our method outperforms or performs on par
with other training-based methods in terms of multi-concept composition and
single-concept customization, but is simpler. Codes can be found at
https://github.com/aim-uofa/FreeCustom. |
This paper presents FreeCustom, a novel tuning-free method for generating customized images with multi-concept composition, using only one image per concept as input. |
Existing customization methods struggle with multi-concept scenarios, often requiring time-consuming retraining and exhibiting poor identity preservation. FreeCustom addresses these limitations by enabling fast, high-quality generation without any training. |
The method employs a dual-path architecture with a multi-reference self-attention (MRSA) mechanism and a weighted mask strategy. This enables the generated image to effectively integrate and focus on features from multiple input reference concepts. |
FreeCustom achieves comparable results to state-of-the-art methods in single-concept customization and shows significant advantages in multi-concept composition.
The method demonstrates high fidelity in preserving reference concept identities and strong alignment with input text prompts.
FreeCustom is significantly faster than training-based methods, achieving high-quality results in seconds without any preprocessing. |
The method currently lacks an explicit module for perceiving the structure of input reference concepts.
Future work will explore incorporating techniques like image adapters to address this limitation. |
image generation, customization, text-to-image, diffusion models, multi-concept composition |
2405.13865
Report |
ReVideo: Remake a Video with Motion and Content Control |
Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang |
Despite significant advancements in video generation and editing using
diffusion models, achieving accurate and localized video editing remains a
substantial challenge. Additionally, most existing video editing methods
primarily focus on altering visual content, with limited research dedicated to
motion editing. In this paper, we present a novel attempt to Remake a Video
(ReVideo) which stands out from existing methods by allowing precise video
editing in specific areas through the specification of both content and motion.
Content editing is facilitated by modifying the first frame, while the
trajectory-based motion control offers an intuitive user interaction
experience. ReVideo addresses a new task involving the coupling and training
imbalance between content and motion control. To tackle this, we develop a
three-stage training strategy that progressively decouples these two aspects
from coarse to fine. Furthermore, we propose a spatiotemporal adaptive fusion
module to integrate content and motion control across various sampling steps
and spatial locations. Extensive experiments demonstrate that our ReVideo has
promising performance on several accurate video editing applications, i.e., (1)
locally changing video content while keeping the motion constant, (2) keeping
content unchanged and customizing new motion trajectories, (3) modifying both
content and motion trajectories. Our method can also seamlessly extend these
applications to multi-area editing without specific training, demonstrating its
flexibility and robustness. |
This paper introduces ReVideo, a novel method for accurate and localized content and motion editing in videos. |
Existing video editing techniques struggle with precise local control, especially for motion, limiting their ability for realistic and creative edits. |
ReVideo utilizes a three-stage training strategy to decouple content and motion control, along with a spatiotemporal adaptive fusion module for harmonious integration within a diffusion model framework. |
ReVideo enables localized content changes while preserving motion or introducing custom trajectories.
It surpasses existing methods in user-specified editing accuracy, as demonstrated by both visual and quantitative comparisons.
The method shows robustness to irregular editing regions and multi-area editing tasks without specific training. |
The quality of regenerated video segments depends on the base model's capabilities, which can lead to artifacts.
Future work includes extending ReVideo to handle longer videos and address error accumulation over time. |
video editing, diffusion models, motion editing, local editing, spatiotemporal fusion |
2405.13800
Report |
Dense Connector for MLLMs |
Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang |
Do we fully leverage the potential of the visual encoder in Multimodal Large
Language Models (MLLMs)? The recent outstanding performance of MLLMs in
multimodal understanding has garnered broad attention from both academia and
industry. In the current MLLM rat race, the focus seems to be predominantly on
the linguistic side. We witness the rise of larger and higher-quality
instruction datasets, as well as the involvement of larger-sized LLMs. Yet,
scant attention has been directed towards the visual signals utilized by MLLMs,
often assumed to be the final high-level features extracted by a frozen visual
encoder. In this paper, we introduce the Dense Connector - a simple, effective,
and plug-and-play vision-language connector that significantly enhances
existing MLLMs by leveraging multi-layer visual features, with minimal
additional computational overhead. Furthermore, our model, trained solely on
images, showcases remarkable zero-shot capabilities in video understanding as
well. Experimental results across various vision encoders, image resolutions,
training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse
architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility
and scalability of our approach, achieving state-of-the-art performance
across 19 image and video benchmarks. We hope that this work will provide
valuable experience and serve as a basic module for future MLLM development. |
This paper introduces the Dense Connector, a plug-and-play module enhancing visual perception in Multimodal Large Language Models (MLLMs) by densely integrating multi-layer visual features. |
Current MLLM research focuses heavily on the language side, neglecting the potential of visual encoders. This work aims to leverage the overlooked "free lunch" of offline multi-layer features for enhanced visual representation without significant computational overhead. |
The Dense Connector leverages pre-trained vision encoders and LLMs, connected by a learnable MLP. It offers three instantiations for multi-layer feature integration: Sparse Token Integration (STI), Sparse Channel Integration (SCI), and Dense Channel Integration (DCI). The paper conducts experiments across various vision encoders, LLM sizes, datasets, and image resolutions. |
Dense Connector significantly improves MLLM performance across 11 image and 8 video benchmarks, achieving state-of-the-art results on several.
The approach demonstrates versatility and scalability across different visual encoders, LLM sizes (2.7B→70B), and training datasets.
Densely integrated multi-layer features prove more effective than solely using the final-layer features. |
The current Dense Connector instantiations do not introduce additional learnable parameters, potentially limiting their effectiveness.
Future work will explore more complex and effective Dense Connector implementations and investigate efficient visual-language model connection methods for better modality alignment. |
multimodal large language models, vision-language models, dense connector, multi-layer visual features, visual understanding |
2405.13748
Report |
Monocular Gaussian SLAM with Language Extended Loop Closure |
Tian Lan, Qinwei Lin, Haoqian Wang |
Recently, 3D Gaussian Splatting has shown great potential in visual Simultaneous
Localization And Mapping (SLAM). Existing methods have achieved encouraging
results on RGB-D SLAM, but studies of the monocular case are still scarce.
Moreover, they also fail to correct drift errors due to the lack of loop
closure and global optimization. In this paper, we present MG-SLAM, a monocular
Gaussian SLAM with a language-extended loop closure module capable of
performing drift-corrected tracking and high-fidelity reconstruction while
achieving a high-level understanding of the environment. Our key idea is to
represent the global map as 3D Gaussians and use it to guide the estimation of
the scene geometry, thus mitigating the effects of missing depth information.
Further, an additional language-extended loop closure module based on
CLIP features is designed to continually perform global optimization to correct
drift errors accumulated as the system runs. Our system shows promising results
on multiple challenging datasets in both tracking and mapping and even
surpasses some existing RGB-D methods. |
This paper presents MG-SLAM, a novel monocular Gaussian SLAM system that leverages 3D Gaussian representations for high-fidelity scene reconstruction and incorporates a language-extended loop closure module for drift-corrected tracking. |
This work addresses the limitations of existing monocular SLAM systems in achieving both accurate tracking and photo-realistic reconstruction, particularly over long sequences where drift errors accumulate. The integration of language understanding further expands the system's potential applications. |
The system builds upon DPVO, a deep-learning-based visual odometry. It initializes and optimizes 3D Gaussians using predicted patch depths and employs a sliding window strategy for training. A CLIP feature-based loop closure module detects loops and enables global optimization on a back-end graph, correcting drift errors. |
MG-SLAM achieves competitive tracking accuracy on Replica, ScanNet, TUM RGB-D, and EuRoC datasets, outperforming some existing RGB-D methods.
The system demonstrates high-fidelity rendering quality, surpassing previous NeRF-based SLAM approaches.
The language-extended loop closure module enables text-to-trajectory querying, highlighting its potential for high-level scene understanding. |
The performance of the loop closure module might degrade in highly cluttered indoor environments.
Future work could explore the integration of semantic information into the mapping process for enhanced scene understanding and navigation. |
slam, 3d gaussian splatting, scene reconstruction, loop closure, clip |
2405.13729
Report |
ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models |
Rui Xu, Jiepeng Wang, Hao Pan, Yang Liu, Xin Tong, Shiqing Xin, Changhe Tu, Taku Komura, Wenping Wang |
In this paper, we study an under-explored but important factor of diffusion
generative models, i.e., the combinatorial complexity. Data samples are
generally high-dimensional, and for various structured generation tasks, there
are additional attributes which are combined to associate with data samples. We
show that the space spanned by the combination of dimensions and attributes is
insufficiently sampled by existing training schemes of diffusion generative
models, causing degraded test time performance. We present a simple fix to this
problem by constructing stochastic processes that fully exploit the
combinatorial structures, hence the name ComboStoc. Using this simple strategy,
we show that network training is significantly accelerated across diverse data
modalities, including images and 3D structured shapes. Moreover, ComboStoc
enables a new way of test time generation which uses asynchronous time steps
for different dimensions and attributes, thus allowing for varying degrees of
control over them. |
This paper presents ComboStoc, a novel approach to enhance diffusion generative models by explicitly considering the combinatorial complexity arising from dimensions and attributes of data samples. |
Existing diffusion models lack sufficient training in regions of the path space where dimensions/attributes have asynchronous schedules, leading to poor performance when sampling these regions during inference. |
The authors introduce asynchronous time steps for different dimensions and attributes during training, enabling the network to explore a wider range of data representations and learn correlations more effectively. |
ComboStoc consistently improves FID scores compared to baseline SiT models for image generation on ImageNet.
ComboStoc proves crucial for generating structured 3D shapes, significantly outperforming baseline methods and producing meaningful results.
Asynchronous time steps enable novel applications such as controlled image generation with varying degrees of preservation and structured 3D shape completion/assembly. |
Quantifying the severity of the undersampling problem in standard diffusion models is left for future work.
Exploring better batch time step scheduling for image generation training is an area for future improvement. |
diffusion generative models, combinatorial complexity, asynchronous time steps, image generation, structured 3d shape generation |
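As a rough illustration of the asynchronous time steps described in the ComboStoc entry above, the following Python sketch contrasts one time step shared by a whole sample with an independent time step per dimension. The linear interpolant, the uniform sampling of t, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def interpolate(x0, noise, t):
    """Assumed linear interpolant x_t = (1 - t) * x0 + t * noise (t broadcasts)."""
    return (1.0 - t) * x0 + t * noise

def sample_timesteps(shape, rng, asynchronous):
    """Return an array of time steps with the given shape.

    Synchronized: one scalar t shared by every dimension of the sample.
    Asynchronous: an independent t in [0, 1) for every dimension.
    """
    if asynchronous:
        return rng.uniform(0.0, 1.0, size=shape)
    return np.full(shape, rng.uniform(0.0, 1.0))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))      # toy data sample: 4 patches x 8 features
noise = rng.normal(size=x0.shape)

t_sync = sample_timesteps(x0.shape, rng, asynchronous=False)
t_async = sample_timesteps(x0.shape, rng, asynchronous=True)

x_sync = interpolate(x0, noise, t_sync)    # every entry is equally noisy
x_async = interpolate(x0, noise, t_async)  # noise level varies per entry
```

With a shared t, training only visits the diagonal of the per-dimension time-step space; sampling t per dimension also covers the off-diagonal combinations, which is the undersampled region the entry refers to.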
2405.13722
Report |
InstaDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos |
Yujun Shi, Jun Hao Liew, Hanshu Yan, Vincent Y. F. Tan, Jiashi Feng |
Accuracy and speed are critical in image editing tasks. Pan et al. introduced
a drag-based image editing framework that achieves pixel-level control using
Generative Adversarial Networks (GANs). A flurry of subsequent studies enhanced
this framework's generality by leveraging large-scale diffusion models.
However, these methods often suffer from inordinately long processing times
(exceeding 1 minute per edit) and low success rates. Addressing these issues
head on, we present InstaDrag, a rapid approach enabling high quality
drag-based image editing in ~1 second. Unlike most previous methods, we
redefine drag-based editing as a conditional generation task, eliminating the
need for time-consuming latent optimization or gradient-based guidance during
inference. In addition, the design of our pipeline allows us to train our model
on large-scale paired video frames, which contain rich motion information such
as object translations, changing poses and orientations, zooming in and out,
etc. By learning from videos, our approach can significantly outperform
previous methods in terms of accuracy and consistency. Despite being trained
solely on videos, our model generalizes well to perform local shape
deformations not present in the training data (e.g., lengthening of hair,
twisting rainbows, etc.). Extensive qualitative and quantitative evaluations on
benchmark datasets corroborate the superiority of our approach. The code and
model will be released at https://github.com/magic-research/InstaDrag. |
InstaDrag, a fast and high-quality drag-based image editing approach that achieves results in under one second. |
Existing drag-based image editing methods suffer from slow processing times and low success rates, limiting practical use. |
Reframes drag-based editing as a conditional generation task using a reference-only architecture and point embedding network trained on large-scale video data. |
Achieves state-of-the-art accuracy in point following and appearance preservation as measured by Mean Distance and Image Fidelity metrics.
Significantly faster than previous methods, achieving editing speeds of under one second, even faster with acceleration techniques.
Generalizes well to out-of-domain editing instructions not explicitly present in the training data, such as local deformations. |
Inherits limitations of Stable Diffusion V1.5, such as difficulties with details in complex features.
Future work could explore using larger diffusion models like SDXL for improved detail. |
image editing, drag-based editing, diffusion models, video data, conditional generation |
2405.13685
Report |
Prompt Mixing in Diffusion Models using the Black Scholes Algorithm |
Divya Kothandaraman, Ming Lin, Dinesh Manocha |
We introduce a novel approach for prompt mixing, aiming to generate images at
the intersection of multiple text prompts using pre-trained text-to-image
diffusion models. At each time step during diffusion denoising, our algorithm
forecasts predictions w.r.t. the generated image and makes informed text
conditioning decisions. To do so, we leverage the connection between diffusion
models (rooted in non-equilibrium thermodynamics) and the Black-Scholes model
for pricing options in finance, and draw analogies between the variables in
both contexts to derive an appropriate algorithm for prompt mixing using the
Black-Scholes model. Specifically, the parallels between diffusion models and
the Black-Scholes model enable us to leverage properties related to the
dynamics of the Markovian model derived in the Black-Scholes algorithm. Our
prompt-mixing algorithm is data-efficient, meaning it does not need additional
training. Furthermore, it operates without human intervention or hyperparameter
tuning. We highlight the benefits of our approach by comparing it qualitatively
and quantitatively to other prompt mixing techniques, including linear
interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided
prompt selection across various scenarios such as single object per text
prompt, multiple objects per text prompt and objects against backgrounds. Code
is available at https://github.com/divyakraman/BlackScholesDiffusion2024. |
This paper introduces a novel prompt mixing technique for text-to-image diffusion models, inspired by the Black-Scholes model from finance, which dynamically conditions on the most relevant text prompt during each denoising step to generate images reflecting multiple input concepts. |
Prompt mixing is important for generating images that blend different textual concepts, going beyond simple combinations. Existing techniques often require manual effort, lack dynamic prompt prioritization, or overlook diffusion dynamics. |
The method draws an analogy between diffusion models and the Black-Scholes model, treating image generation as "asset acquisition." It uses the CLIP score as a measure of "stock price" and leverages diffusion dynamics to calculate Black-Scholes variables. At each denoising step, it conditions the model on the prompt with the lowest Black-Scholes score, indicating the concept requiring most attention. |
The proposed Black-Scholes-based method outperforms baselines like linear interpolation, alternating prompts, and CLIP-guided selection in generating realistic and concept-blending images.
It effectively preserves individual characteristics from multiple prompts while minimizing unrealistic artifacts.
Quantitative evaluation using CLIP scores demonstrates superior performance compared to other techniques. |
The reliance on CLIP scores for evaluation has limitations as it might not capture subtle quality differences and is prone to biases.
The study focuses on two-prompt mixing, limiting its generalizability to a larger number of prompts. |
prompt mixing, text-to-image diffusion, black-scholes model, clip score, generative ai |
2405.13637
Report |
Curriculum Direct Preference Optimization for Diffusion and Consistency Models |
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah |
Direct Preference Optimization (DPO) has been proposed as an effective and
efficient alternative to reinforcement learning from human feedback (RLHF). In
this paper, we propose a novel and enhanced version of DPO based on curriculum
learning for text-to-image generation. Our method is divided into two training
stages. First, a ranking of the examples generated for each prompt is obtained
by employing a reward model. Then, increasingly difficult pairs of examples are
sampled and provided to a text-to-image generative (diffusion or consistency)
model. Generated samples that are far apart in the ranking are considered to
form easy pairs, while those that are close in the ranking form hard pairs. In
other words, we use the rank difference between samples as a measure of
difficulty. The sampled pairs are split into batches according to their
difficulty levels, which are gradually used to train the generative model. Our
approach, Curriculum DPO, is compared against state-of-the-art fine-tuning
approaches on three benchmarks, outperforming the competing methods in terms of
text alignment, aesthetics and human preference. Our code is available at
https://anonymous.4open.science/r/Curriculum-DPO-EE14. |
Proposes Curriculum DPO, a novel training regime for diffusion and consistency models that enhances Direct Preference Optimization (DPO) with curriculum learning for improved text-to-image generation. |
Aims to address the limitations of existing DPO methods that randomly sample image pairs during training, leading to suboptimal performance in text alignment, aesthetics, and human preference. |
Implements a two-stage training process: 1) uses a reward model to rank generated images by preference, 2) creates easy-to-hard image pairs based on ranking difference and trains the generative model progressively with these pairs. |
Curriculum DPO outperforms state-of-the-art fine-tuning methods (DPO, DDPO) in text alignment, aesthetics, and human preference scores across three benchmarks.
Subjective human evaluation confirms Curriculum DPO generates significantly more preferred samples compared to baselines.
Ablation studies demonstrate the effectiveness of curriculum learning and the impact of hyperparameter choices. |
Introduces additional hyperparameters (e.g., number of batches, iterations per batch) requiring tuning.
Doesn't address the inherent limitation of text-to-image models in disambiguating words with multiple meanings, which can lead to poor generation results in certain cases. |
text-to-image generation, curriculum learning, direct preference optimization, diffusion models, consistency models |
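The two-stage procedure in the Curriculum DPO entry above (rank samples with a reward model, then present pairs from easy to hard) can be sketched roughly as follows. The function name, the equal-sized chunking into levels, and the toy reward values are assumptions for illustration; the source only states that the rank difference measures difficulty and that pairs are used in order of increasing difficulty.

```python
from itertools import combinations

def build_curriculum(rewards, num_levels):
    """Group preference pairs into difficulty levels, easiest first.

    rewards: reward-model scores, one per sample generated for a prompt.
    Returns num_levels lists of (preferred_idx, rejected_idx) pairs.
    """
    # Rank samples from best (rank 0) to worst.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}

    # In each pair the first sample is ranked higher; a large rank gap is an easy pair.
    pairs = [(rank[b] - rank[a], a, b) for a, b in combinations(order, 2)]
    pairs.sort(key=lambda p: p[0], reverse=True)  # largest gap first

    levels = [[] for _ in range(num_levels)]
    if not pairs:
        return levels
    # Assumed split: contiguous chunks of near-equal size.
    for i, (_, a, b) in enumerate(pairs):
        levels[i * num_levels // len(pairs)].append((a, b))
    return levels

levels = build_curriculum([0.9, 0.1, 0.5, 0.7], num_levels=3)
```

A training loop would then consume `levels[0]` first and move on to later levels, so that pairs with nearly equal rewards are seen last.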
2405.13540
Report |
Directly Denoising Diffusion Model |
Dan Zhang, Jingjing Wang, Feng Luo |
In this paper, we present the Directly Denoising Diffusion Model (DDDM): a
simple and generic approach for generating realistic images with few-step
sampling, while multistep sampling is still preserved for better performance.
DDDMs require no delicately designed samplers nor distillation on pre-trained
distillation models. DDDMs train the diffusion model conditioned on an
estimated target that was generated from previous training iterations of its
own. To generate images, samples generated from the previous time step are also
taken into consideration, guiding the generation process iteratively. We
further propose Pseudo-LPIPS, a novel metric loss that is more robust to
various hyperparameter values. Despite its simplicity, the proposed approach
can achieve strong performance on benchmark datasets. Our model achieves FID
scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling
respectively, surpassing those obtained from GANs and distillation-based
models. By extending the sampling to 1000 steps, we further reduce FID score to
1.79, aligning with state-of-the-art methods in the literature. For ImageNet
64x64, our approach stands as a competitive contender against leading models. |
This paper presents Directly Denoising Diffusion Models (DDDM), a novel approach for generating high-quality images with both single-step and multi-step sampling, without needing specially designed samplers or distillation. |
Diffusion models typically require many steps for high-quality generation, making them slow. DDDM enables both efficient single-step generation comparable to GANs and improved quality with iterative sampling. |
DDDM iteratively refines an estimate of the original data by training a neural network to approximate the solution of the probability flow ODE. It uses a novel Pseudo-LPIPS loss function for robustness. |
DDDM achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing GANs and distillation methods.
On ImageNet 64x64, DDDM is competitive with leading models, showing strong FID scores and improved precision/recall compared to iCT.
Increasing sampling steps in DDDM consistently improves FID, demonstrating the benefit of its iterative approach. |
DDDM's training incurs additional memory overhead due to storing data estimates.
Evaluation might be biased by using ImageNet features in both LPIPS and FID. |
diffusion models, image generation, fast sampling, pseudo-lpips, iterative solution |
2405.13473
Report |
Class-Conditional self-reward mechanism for improved Text-to-Image models |
Safouane El Ghazouali, Arnaud Gucciardi, Umberto Michelucci |
Self-rewarding models have emerged recently as a powerful tool in the field of
Natural Language Processing (NLP), allowing language models to generate
high-quality relevant responses by providing their own rewards during training.
This innovative technique addresses the limitations of other methods that rely
on human preferences. In this paper, we build upon the concept of
self-rewarding models and introduce its vision equivalent for Text-to-Image
generative AI models. This approach works by fine-tuning a diffusion model on a
self-generated, self-judged dataset, making the fine-tuning more automated and
improving data quality. The proposed mechanism makes use of other pre-trained
models, such as open-vocabulary object detection and image captioning, and is
conditioned on a set of objects for which the user might need to improve
generated data quality. The approach has been implemented, fine-tuned and
evaluated on Stable Diffusion and has led to a performance that has been
evaluated to be at least 60% better than existing commercial and research
text-to-image models. Additionally, the self-rewarding mechanism allowed
fully automated generation of images, while increasing the visual quality of
the generated images and improving adherence to prompt instructions.
The code used in this work is freely available on
https://github.com/safouaneelg/SRT2I. |
This paper introduces a novel 'class-conditional self-rewarding' (CCSR) mechanism for automating the optimization of Text-to-Image (T2I) models, enhancing their ability to generate images that accurately reflect specific object classes and prompts. |
Current T2I model fine-tuning often relies on human feedback and reinforcement learning, which can be resource-intensive and prone to biases. This paper aims to automate this process and improve the quality of generated images. |
The CCSR mechanism utilizes a multi-step process: 1) LLM-based prompt generation, 2) Multi-image generation from prompts, 3) Image-to-Text (I2T) based self-judging of generated images, 4) Open-vocabulary object detection for filtering, 5) Selection of optimal image-prompt pairs, 6) Fine-tuning of the T2I model (Stable Diffusion) using the selected pairs. |
The CCSR mechanism leads to a significant improvement in the quality of generated images, particularly in terms of realism, prompt adherence, and accurate depiction of the targeted object class.
Fine-tuning Stable Diffusion with the CCSR-generated dataset resulted in a higher CLIP score compared to the baseline Stable Diffusion and a fine-tuned SDXS model.
The proposed method allows for complete automation of T2I model improvement without requiring human intervention. |
The 'class-conditional' nature of the mechanism might limit its generalizability to broader image-text relationships beyond the specifically trained object classes.
Continuous application of the CCSR loop with diverse classes is suggested for enhancing the model's overall generalizability. |
text-to-image synthesis, self-rewarding models, diffusion models, image captioning, open-vocabulary object detection |
2405.13360
Report |
How to Trace Latent Generative Model Generated Images without Artificial Watermark? |
Zhenting Wang, Vikash Sehwag, Chen Chen, Lingjuan Lyu, Dimitris N. Metaxas, Shiqing Ma |
Latent generative models (e.g., Stable Diffusion) have become more and more
popular, but concerns have arisen regarding potential misuse related to images
generated by these models. It is, therefore, necessary to analyze the origin of
images by inferring if a particular image was generated by a specific latent
generative model. Most existing methods (e.g., image watermark and model
fingerprinting) require extra steps during training or generation. These
requirements restrict their usage on the generated images without such extra
operations, and the extra required operations might compromise the quality of
the generated images. In this work, we ask whether it is possible to
effectively and efficiently trace the images generated by a specific latent
generative model without the aforementioned requirements. To study this
problem, we design a latent inversion based method called LatentTracer to trace
the generated images of the inspected model by checking if the examined images
can be well-reconstructed with an inverted latent input. We leverage gradient-based
latent inversion and identify an encoder-based initialization as critical to
the success of our approach. Our experiments on the state-of-the-art latent
generative models, such as Stable Diffusion, show that our method can
distinguish the images generated by the inspected model and other images with a
high accuracy and efficiency. Our findings suggest the intriguing possibility
that images generated by today's latent generative models are naturally watermarked by
the decoder used in the source models. Code:
https://github.com/ZhentingWang/LatentTracer. |
This paper introduces LatentTracer, an alteration-free method for tracing images generated by a specific latent generative model. It leverages latent inversion, focusing on the inherent watermarking properties of the model's decoder. |
Tracing the origin of images generated by latent generative models is crucial to address potential misuse, such as the spread of harmful content or intellectual property infringement. |
LatentTracer utilizes a gradient-based optimization approach to reconstruct the examined image by inverting the latent input of the inspected model's decoder. The key innovation lies in using the encoder to initialize the optimization process, significantly enhancing effectiveness and efficiency. |
LatentTracer achieves high accuracy (over 93%) in distinguishing between images generated by the inspected model and those from other models, even with similar architectures.
The method proves effective in differentiating between model-generated images and real images.
LatentTracer exhibits efficiency, outperforming existing alteration-free methods in terms of runtime. |
The method's performance in scenarios where models share the same autoencoder requires further investigation.
Future work could explore the robustness against strong post-processing techniques that significantly alter the image while preserving visual quality. |
image origin attribution, latent generative models, latent inversion, alteration-free watermarking, responsible ai |
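The idea in the LatentTracer entry above (invert the decoder by gradient descent from an encoder-based initialization, then check how well the image is reconstructed) can be illustrated with a toy linear model. Everything here is an assumption for the sake of a runnable sketch: the linear decoder, the perturbed pseudo-inverse standing in for an encoder, the step-size rule, and the function names. Real latent generative models use non-linear networks.

```python
import numpy as np

def invert_latent(x, decoder_w, z_init, steps=300):
    """Minimize ||decoder_w @ z - x||^2 by gradient descent, starting from z_init."""
    # Step size derived from the largest singular value keeps the iteration stable.
    lr = 0.5 / np.linalg.norm(decoder_w, 2) ** 2
    z = z_init.copy()
    for _ in range(steps):
        residual = decoder_w @ z - x
        z = z - lr * 2.0 * (decoder_w.T @ residual)
    return z

def reconstruction_error(x, decoder_w, encoder_w):
    """Mean squared error after inversion initialized from the encoder output."""
    z0 = encoder_w @ x  # encoder-based initialization
    z = invert_latent(x, decoder_w, z0)
    return float(np.mean((decoder_w @ z - x) ** 2))

rng = np.random.default_rng(0)
image_dim, latent_dim = 64, 4
decoder_w = rng.normal(size=(image_dim, latent_dim))           # toy decoder
encoder_w = np.linalg.pinv(decoder_w) + 0.01 * rng.normal(
    size=(latent_dim, image_dim))                              # imperfect toy encoder

generated = decoder_w @ rng.normal(size=latent_dim)  # produced by the inspected model
other = rng.normal(size=image_dim)                   # produced elsewhere

err_generated = reconstruction_error(generated, decoder_w, encoder_w)
err_other = reconstruction_error(other, decoder_w, encoder_w)
```

In this toy setting the image that came from the decoder is reconstructed almost exactly, while the unrelated image keeps a large residual; a decision rule would threshold this error, with the threshold value left as a design choice.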
2405.13337
Report |
Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer |
Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He |
The Vision Transformer (ViT) has gained prominence for its superior
relational modeling prowess. However, its global attention mechanism's
quadratic complexity poses substantial computational burdens. A common remedy
spatially groups tokens for self-attention, reducing computational
requirements. Nonetheless, this strategy neglects semantic information in
tokens, possibly scattering semantically-linked tokens across distinct groups,
thus compromising the efficacy of self-attention intended for modeling
inter-token dependencies. Motivated by these insights, we introduce a fast and
balanced clustering method, named Semantic Equitable
Clustering (SEC). SEC clusters tokens based on their global semantic
relevance in an efficient, straightforward manner. In contrast to traditional
clustering methods requiring multiple iterations, our method achieves token
clustering in a single pass. Additionally, SEC regulates the number of tokens
per cluster, ensuring a balanced distribution for effective parallel processing
on current computational platforms without necessitating further optimization.
Capitalizing on SEC, we propose a versatile vision backbone, SecViT.
Comprehensive experiments in image classification, object detection, instance
segmentation, and semantic segmentation validate the effectiveness of
SecViT. Remarkably, SecViT attains an impressive 84.2% image
classification accuracy with only 27M parameters and 4.4G
FLOPs, without the need for additional supervision or data. Code will be
available at https://github.com/qhfan/SecViT. |
This paper introduces Semantic Equitable Clustering (SEC), a novel, efficient single-pass clustering method that groups tokens based on global semantic relevance for Vision Transformers (ViT), leading to enhanced computational efficiency and performance in various vision tasks. |
The quadratic complexity of global attention in ViTs poses significant computational challenges. While token grouping methods address this, they often overlook semantic relationships, hindering effective modeling of inter-token dependencies. SEC offers a solution by efficiently clustering tokens based on semantic information, optimizing both computational efficiency and performance. |
SEC employs global pooling to derive a global semantic token. It then calculates cosine similarity between this token and all others, sorting them based on similarity scores. Tokens with similar scores are grouped into clusters, ensuring an equal distribution for efficient parallel processing. |
SecViT, built upon SEC, consistently surpasses previous state-of-the-art models in image classification across different model scales.
Directly replacing attention mechanisms in Swin-Transformer and FasterViT with SEC leads to significant performance gains in image classification.
SecViT exhibits impressive performance on downstream tasks such as object detection, instance segmentation, and semantic segmentation. |
Computational constraints limit experimentation with larger models and datasets like ImageNet-21k.
Future work involves exploring the scalability of SEC on larger datasets and models. |
vision transformer, token clustering, semantic equitable clustering, computational efficiency, computer vision |
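The SEC procedure summarized in the entry above (pool a global token, score every token by cosine similarity to it, sort, and cut the sorted list into equal-sized groups) can be sketched in a few lines of NumPy. Mean pooling as the "global pooling" step, the function names, and the toy token shapes are assumptions for this example rather than details confirmed by the source.

```python
import numpy as np

def cosine_to_global(tokens):
    """Cosine similarity of each token to the mean-pooled global token."""
    global_token = tokens.mean(axis=0)  # assumed form of global pooling
    norms = np.linalg.norm(tokens, axis=1) * np.linalg.norm(global_token)
    return (tokens @ global_token) / (norms + 1e-12)

def semantic_equitable_clusters(tokens, num_clusters):
    """Single-pass clustering: sort tokens by similarity, split into equal groups.

    tokens: (n, d) array with n divisible by num_clusters.
    Returns an index array of shape (num_clusters, n // num_clusters).
    """
    n = tokens.shape[0]
    if n % num_clusters != 0:
        raise ValueError("token count must be divisible by num_clusters")
    order = np.argsort(-cosine_to_global(tokens), kind="stable")
    return order.reshape(num_clusters, n // num_clusters)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 toy tokens with 8 features each
clusters = semantic_equitable_clusters(tokens, num_clusters=4)
```

Because every cluster has the same number of tokens, attention within clusters can be computed as one batched operation over `tokens[clusters]`, which has shape `(4, 4, 8)` here.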
2405.13335
Report |
Vision Transformer with Sparse Scan Prior |
Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He |
In recent years, Transformers have achieved remarkable progress in computer
vision tasks. However, their global modeling often comes with substantial
computational overhead, in stark contrast to the human eye's efficient
information processing. Inspired by the human eye's sparse scanning mechanism,
we propose a Sparse Scan Self-Attention
mechanism (S³A). This mechanism predefines a series of Anchors of
Interest for each token and employs local attention to efficiently model the
spatial information around these anchors, avoiding redundant global modeling
and excessive focus on local information. This approach mirrors the human eye's
functionality and significantly reduces the computational load of vision
models. Building on S³A, we introduce the Sparse
Scan Vision Transformer (SSViT). Extensive
experiments demonstrate the outstanding performance of SSViT across a variety
of tasks. Specifically, on ImageNet classification, without additional
supervision or training data, SSViT achieves top-1 accuracies of
84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in
downstream tasks such as object detection, instance segmentation, and semantic
segmentation. Its robustness is further validated across diverse datasets. Code
will be available at https://github.com/qhfan/SSViT. |
This paper proposes Sparse Scan Self-Attention (S³A), a novel self-attention mechanism inspired by the sparse scanning mechanism of the human eye, and builds Sparse Scan Vision Transformer (SSViT) based on it. |
Existing Vision Transformer models often suffer from high computational costs associated with their self-attention mechanisms. While several strategies have been proposed to improve efficiency, they often deviate significantly from the efficient visual information processing employed by the human eye. |
The S³A mechanism defines Anchors of Interest (AoI) for each token and uses local attention to model spatial information around these anchors. This approach, mimicking the human eye, reduces computational load while effectively capturing both local and global information. Extensive experiments are conducted on ImageNet classification, object detection, instance segmentation, and semantic segmentation to demonstrate SSViT's effectiveness and efficiency. |
SSViT achieves state-of-the-art accuracy on ImageNet classification with significantly reduced computational cost compared to previous models.
The model excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation, outperforming counterparts across different benchmark datasets.
SSViT demonstrates strong robustness against out-of-distribution data, showcasing its ability to generalize well beyond the training dataset. |
Computational constraints limited the exploration of SSViT on larger models and datasets, such as ImageNet-21k.
Future work will focus on validating the performance of SSViT on such large-scale datasets. |
vision transformer, self-attention, sparse scan, computer vision, deep learning |
2405.13218
Report |
Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction |
Maciej Kilian, Varun Jampani, Luke Zettlemoyer |
Nearly every recent image synthesis approach, including diffusion,
masked-token prediction, and next-token prediction, uses a Transformer network
architecture. Despite this common backbone, there has been no direct,
compute-controlled comparison of how these approaches affect performance and
efficiency. We analyze the scalability of each approach through the lens of
compute budget measured in FLOPs. We find that token prediction methods, led by
next-token prediction, significantly outperform diffusion on prompt following.
On image quality, while next-token prediction initially performs better,
scaling trends suggest it is eventually matched by diffusion. We compare the
inference compute efficiency of each approach and find that next-token
prediction is by far the most efficient. Based on our findings we recommend
diffusion for applications targeting image quality and low latency; and
next-token prediction when prompt following or throughput is more important. |
This paper presents a compute-controlled comparison of transformer-based diffusion, masked-token prediction, and next-token prediction for latent image synthesis. |
Despite the common Transformer backbone in recent image synthesis approaches, no direct comparison of their performance and efficiency under controlled compute budgets has been available. |
The authors train a grid of models using these approaches, varying model sizes, dataset sizes, and autoencoder configurations. They evaluate the models based on training compute (FLOPs), final loss, CLIP score, and FID, analyzing scalability and trade-offs. |
Token-based methods, especially next-token prediction, outperform diffusion on prompt following (CLIP score), indicating better controllability.
While next-token prediction achieves better image quality (FID) at lower compute budgets, scaling trends suggest diffusion might eventually match it.
Next-token prediction exhibits superior inference compute efficiency but can suffer from high latency in low-volume settings due to autoregressive sampling. |
The study primarily focuses on pretraining and does not cover finetuning or distillation stages.
The analysis is limited to loss and perceptual metrics, excluding potential downstream task evaluation or comparisons with other emerging approaches. |
image synthesis, diffusion models, token prediction, transformers, compute efficiency |
2405.13195
Report |
CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers |
Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa |
We extend multimodal transformers to include 3D camera motion as a
conditioning signal for the task of video generation. Generative video models
are becoming increasingly powerful, thus focusing research efforts on methods
of controlling the output of such models. We propose to add virtual 3D camera
controls to generative video methods by conditioning generated video on an
encoding of three-dimensional camera movement over the course of the generated
video. Results demonstrate that (1) we are able to successfully control the
camera during video generation, starting from a single frame and a camera
signal, and (2) the generated 3D camera paths are accurate, as verified
using traditional computer vision methods. |
This paper introduces a method for controlling 3D camera motion in video generation models by conditioning the output on an encoding of 3D camera movement. |
This work addresses the limitations of existing video generation models, which often entangle scene dynamics and camera movement. Explicit control over camera motion enables more realistic and controllable video generation. |
The authors extend a token-based video transformer model by incorporating 3D camera path information as a new modality. They generate training data using NeRF scenes to provide ground truth camera paths and fine-tune the model to follow these paths during video generation. |
The method successfully controls the 3D camera movement during video generation, starting from a single frame and a camera signal.
Generated videos exhibit parallax and maintain the in-painting and out-painting abilities of the pre-trained video generation model.
There is a trade-off between controlling camera motion and preserving scene motion from the pre-trained model. |
The model exhibits reduced scene motion when fine-tuned for camera control, likely due to the lack of scene dynamics in the NeRF training data.
Future work could explore methods to better balance camera control and scene motion preservation, potentially by incorporating scene dynamics into the NeRF training data. |
video generation, camera control, 3d motion, nerf, video transformer |
2405.13194
Report |
KPConvX: Modernizing Kernel Point Convolution with Kernel Attention |
Hugues Thomas, Yao-Hung Hubert Tsai, Timothy D. Barfoot, Jian Zhang |
In the field of deep point cloud understanding, KPConv is a unique
architecture that uses kernel points to locate convolutional weights in space,
instead of relying on Multi-Layer Perceptron (MLP) encodings. While it
initially achieved success, it has since been surpassed by recent MLP networks
that employ updated designs and training strategies. Building upon the kernel
point principle, we present two novel designs: KPConvD (depthwise KPConv), a
lighter design that enables the use of deeper architectures, and KPConvX, an
innovative design that scales the depthwise convolutional weights of KPConvD
with kernel attention values. Using KPConvX with a modern architecture and
training strategy, we are able to outperform current state-of-the-art
approaches on the ScanObjectNN, ScanNetv2, and S3DIS datasets. We validate our
design choices through ablation studies and release our code and models. |
This paper presents KPConvX, an efficient point cloud feature extractor combining depthwise convolution and kernel attention, achieving state-of-the-art performance in semantic segmentation and shape classification. |
Existing methods for deep point cloud understanding, including those based on MLPs or transformers, often struggle to efficiently capture geometric patterns. This work addresses this limitation. |
The authors introduce KPConvD, a depthwise variant of KPConv, and further enhance it with kernel attention to create KPConvX. They design a modern deep architecture, KPConvX-L, using these novel operators. |
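As a rough illustration of the idea described above (depthwise kernel-point weights scaled by kernel attention values), the following NumPy sketch computes the output feature for a single center point. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear influence function is borrowed from the original KPConv, and the attention projection `attn_proj`, the sigmoid activation, and all function and parameter names are hypothetical.

```python
import numpy as np

def kpconvx_point(center_feat, neigh_offsets, neigh_feats,
                  kernel_pts, weights, attn_proj, sigma=1.0):
    """Depthwise kernel-point convolution with attention-scaled weights.

    center_feat:   (C,)   feature of the center point
    neigh_offsets: (N, 3) neighbor positions relative to the center
    neigh_feats:   (N, C) neighbor features
    kernel_pts:    (K, 3) kernel point positions
    weights:       (K, C) depthwise weights, one vector per kernel point
    attn_proj:     (C, K*C) hypothetical projection producing attention logits
    """
    K, C = weights.shape
    # Distance from every neighbor to every kernel point: (N, K)
    dists = np.linalg.norm(
        neigh_offsets[:, None, :] - kernel_pts[None, :, :], axis=-1)
    # Linear influence, as in the original KPConv (assumed here): (N, K)
    influence = np.maximum(0.0, 1.0 - dists / sigma)
    # Kernel attention from the center feature (sigmoid is an assumption): (K, C)
    attn = 1.0 / (1.0 + np.exp(-(center_feat @ attn_proj)))
    attn = attn.reshape(K, C)
    # Scale the depthwise weights by the attention values
    scaled = weights * attn
    # Per-neighbor, per-channel weights, then depthwise (elementwise) aggregation
    per_neighbor = influence @ scaled            # (N, C)
    return (per_neighbor * neigh_feats).sum(axis=0)  # (C,)
```

Because the weights are per-channel vectors rather than full matrices, the cost per kernel point grows linearly with the channel count, which is consistent with the "lighter design" motivation stated in the abstract.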
KPConvX-L outperforms state-of-the-art methods on the ScanObjectNN, ScanNetv2, and S3DIS datasets.
Ablation studies demonstrate the individual contributions of depthwise convolution, kernel attention, and architectural choices.
The proposed approach achieves a good balance between high performance and computational efficiency. |
Further research is needed to understand the interplay between topological and geometric feature extractors in deep learning.
Exploring the combination of topological and geometric features in a single architecture is a promising avenue. |
point cloud, deep learning, convolutional neural networks, attention mechanism, semantic segmentation |
2405.12978
Report |
Personalized Residuals for Concept-Driven Text-to-Image Generation |
Cusuh Ham, Matthew Fisher, James Hays, Nicholas Kolkin, Yuchen Liu, Richard Zhang, Tobias Hinz |
We present personalized residuals and localized attention-guided sampling for
efficient concept-driven generation using text-to-image diffusion models. Our
method first represents concepts by freezing the weights of a pretrained
text-conditioned diffusion model and learning low-rank residuals for a small
subset of the model's layers. The residual-based approach then directly enables
application of our proposed sampling technique, which applies the learned
residuals only in areas where the concept is localized via cross-attention and
applies the original diffusion weights in all other regions. Localized sampling
therefore combines the learned identity of the concept with the existing
generative prior of the underlying diffusion model. We show that personalized
residuals effectively capture the identity of a concept in ~3 minutes on a
single GPU without the use of regularization images and with fewer parameters
than previous models, and localized sampling allows using the original model as
strong prior for large parts of the image. |
This paper proposes 'personalized residuals' and 'localized attention-guided sampling' for efficient concept-driven generation using text-to-image diffusion models. |
Existing text-to-image models struggle to consistently generate specific user-defined concepts in novel contexts. This work aims to improve the efficiency and controllability of concept-driven generation. |
The method learns low-rank residuals for a subset of diffusion model layers to represent a specific concept. During sampling, these residuals can be applied locally based on cross-attention maps, allowing for better integration of the concept with the base diffusion model's generative capabilities. |
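The two ideas above (a low-rank residual on a frozen weight, applied only where cross-attention localizes the concept) can be sketched for a single linear layer as follows. This is a minimal sketch under stated assumptions, not the authors' code: the max-normalization and fixed threshold used to turn an attention map into a mask, and every function and parameter name, are hypothetical.

```python
import numpy as np

def attention_mask(attn_map, threshold=0.5):
    """Binarize a cross-attention map for the concept token.

    Normalizing by the maximum and thresholding is an assumption made
    for illustration; the source does not specify how the mask is formed.
    """
    attn_map = np.asarray(attn_map, dtype=np.float64)
    peak = attn_map.max()
    if peak <= 0:
        return np.zeros(attn_map.shape, dtype=bool)
    return (attn_map / peak) >= threshold

def localized_residual_linear(x, W, A, B, mask):
    """Apply W + A @ B where mask is True and the frozen W elsewhere.

    x:    (P, d_in)  features at P spatial locations
    W:    (d_in, d_out) frozen pretrained weight
    A, B: (d_in, r) and (r, d_out) learned low-rank factors
    mask: (P,) boolean, True where the concept is localized
    """
    base = x @ W                      # original diffusion weights
    personalized = x @ (W + A @ B)    # weights plus low-rank residual
    m = mask[:, None].astype(x.dtype)
    return m * personalized + (1.0 - m) * base
```

With rank `r` much smaller than the layer dimensions, the residual adds only `r * (d_in + d_out)` parameters per layer, which matches the summary's emphasis on minimal parameters.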
Personalized residuals effectively capture concept identity using minimal parameters and training time, without needing regularization images.
Localized attention-guided sampling enables better recontextualization of learned concepts by selectively applying personalized residuals based on attention maps.
User studies and quantitative evaluations show the approach achieves comparable or better performance than existing baselines in terms of text alignment, image alignment, and user preference. |
Localized sampling's effectiveness depends on the quality of cross-attention maps and may not be optimal for all types of prompts.
The method can be sensitive to the choice of macro class used to represent the concept. |
text-to-image generation, diffusion models, concept-driven synthesis, personalized residuals, attention-guided sampling |
2405.12970
Report |
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control |
Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, Yong Liu |
Current face reenactment and swapping methods mainly rely on GAN frameworks,
but recent focus has shifted to pre-trained diffusion models for their superior
generation capabilities. However, training these models is resource-intensive,
and the results have not yet achieved satisfactory performance levels. To
address this issue, we introduce Face-Adapter, an efficient and effective
adapter designed for high-precision and high-fidelity face editing for
pre-trained diffusion models. We observe that both face reenactment/swapping
tasks essentially involve combinations of target structure, ID and attribute.
We aim to sufficiently decouple the control of these factors to achieve both
tasks in one model. Specifically, our method contains: 1) A Spatial Condition
Generator that provides precise landmarks and background; 2) A Plug-and-play
Identity Encoder that transfers face embeddings to the text space by a
transformer decoder; 3) An Attribute Controller that integrates spatial
conditions and detailed attributes. Face-Adapter achieves comparable or even
superior performance in terms of motion control precision, ID retention
capability, and generation quality compared to fully fine-tuned face
reenactment/swapping models. Additionally, Face-Adapter seamlessly integrates
with various StableDiffusion models. |
This paper presents Face-Adapter, a plug-and-play adapter for pre-trained diffusion models that enables fine-grained control over identity and attributes for face reenactment and swapping tasks. |
Existing GAN-based face editing methods have limitations in generative capabilities, while diffusion-based methods are resource-intensive to train. Face-Adapter leverages the power of pre-trained diffusion models while remaining efficient and achieving high-quality results. |
Face-Adapter consists of three components: 1) Spatial Condition Generator predicts landmarks and adapts the foreground mask. 2) Identity Encoder transfers face embeddings to the text space. 3) Attribute Controller combines spatial and attribute information for conditional inpainting. |
Face-Adapter achieves comparable or superior results in image quality and motion control accuracy for face reenactment compared to SOTA methods.
For face swapping, Face-Adapter effectively handles large facial shape changes and large poses, outperforming existing methods in identity preservation and attribute consistency.
The method is efficient and plug-and-play, only requiring fine-tuning of the adapter while freezing the pre-trained diffusion model. |
The unified model lacks temporal stability for video face editing, which will be addressed in future work.
Potential misuse of the technology for malicious purposes is a concern. |
face reenactment, face swapping, diffusion model, conditional inpainting, face editing |
2405.12914
Report |
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation |
Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li |
One critical prerequisite for faithful text-to-image generation is the
accurate understanding of text inputs. Existing methods leverage the text
encoder of the CLIP model to represent input prompts. However, the pre-trained
CLIP model can merely encode English with a maximum token length of 77.
Moreover, the model capacity of the text encoder from CLIP is relatively
limited compared to Large Language Models (LLMs), which offer multilingual
input, accommodate longer context, and achieve superior text representation. In
this paper, we investigate LLMs as the text encoder to improve the language
understanding in text-to-image generation. Unfortunately, training
text-to-image generative model with LLMs from scratch demands significant
computational resources and data. To this end, we introduce a three-stage
training pipeline that effectively and efficiently integrates the existing
text-to-image model with LLMs. Specifically, we propose a lightweight adapter
that enables fast training of the text-to-image model using the textual
representations from LLMs. Extensive experiments demonstrate that our model
supports not only multilingual but also longer input context with superior
image generation quality. |
This paper proposes an efficient and effective three-stage training pipeline to integrate Large Language Models (LLMs) into text-to-image diffusion models for enhanced language understanding and multilingual generation. |
Existing text-to-image models often rely on CLIP's text encoder, limiting them to English input, short prompts, and potentially hindering generation quality due to CLIP's smaller capacity compared to LLMs. |
The pipeline consists of: (1) aligning LLM text features with CLIP's visual-textual space using a lightweight adapter, (2) end-to-end text-image training to optimize the adapter and the diffusion model, and (3) fine-tuning on a high-quality dataset for improved aesthetics. |
The model achieves competitive FID/CLIP scores on various benchmarks, demonstrating high synthesis quality and text alignment.
Supports multilingual text-to-image generation, including Chinese, Japanese, and Korean.
Successfully generates images from longer prompts, exceeding CLIP's limitation of 77 tokens. |
Human evaluation, while showing preference for the model's outputs, was limited in scale and inherently subjective.
The model's performance depends on the quality and diversity of the training data, potentially struggling with objects or concepts not well-represented in the data. |
text-to-image generation, large language models, diffusion models, multilingual generation, long-prompt generation |
2405.12806
Report |
MOSS: Motion-based 3D Clothed Human Synthesis from Monocular Video |
Hongsheng Wang, Xiang Cai, Xi Sun, Jinhong Yue, Shengyu Zhang, Feng Lin, Fei Wu |
Single-view clothed human reconstruction holds a central position in virtual
reality applications, especially in contexts involving intricate human motions.
It presents notable challenges in achieving realistic clothing deformation.
Current methodologies often overlook the influence of motion on surface
deformation, resulting in surfaces lacking the constraints imposed by global
motion. To overcome these limitations, we introduce an innovative framework,
Motion-Based 3D Clothed Humans Synthesis (MOSS), which employs kinematic
information to achieve motion-aware Gaussian split on the human surface. Our
framework consists of two modules: Kinematic Gaussian Locating Splatting (KGAS)
and Surface Deformation Detector (UID). KGAS incorporates matrix-Fisher
distribution to propagate global motion across the body surface. The density
and rotation factors of this distribution explicitly control the Gaussians,
thereby enhancing the realism of the reconstructed surface. Additionally, to
address local occlusions in single-view, based on KGAS, UID identifies
significant surfaces, and geometric reconstruction is performed to compensate
for these deformations. Experimental results demonstrate that MOSS achieves
state-of-the-art visual quality in 3D clothed human synthesis from monocular
videos. Notably, we improve the Human NeRF and the Gaussian Splatting by 33.94%
and 16.75% in LPIPS* respectively. Codes are available at
https://wanghongsheng01.github.io/MOSS/. |
This paper presents MOSS, a novel framework for high-quality, motion-aware 3D clothed human reconstruction from monocular videos using Gaussian Splatting. |
Existing methods struggle to realistically reconstruct fine details like clothing folds and joint deformations, especially under large-scale motions. |
MOSS uses two key modules: KGAS to control Gaussian density and orientation based on motion information extracted from the SMPL kinematic tree, and UID to detect and refine significant surface deformations. |
MOSS achieves state-of-the-art visual quality on ZJU-MoCap and MonoCap datasets, outperforming previous methods in LPIPS* and PSNR.
KGAS effectively incorporates global motion constraints, leading to realistic joint details and clothing folds.
UID enhances the reconstruction of complex surface deformations by identifying and densifying Gaussians in those areas. |
MOSS currently relies on pre-computed SMPL parameters and camera information.
Future work includes incorporating graph-based topological guidance for improved reconstruction. |
3d gaussian splatting, human reconstruction, surface deformation, motion-aware, single-view |
2405.12796
Report |
DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control |
Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu |
Generating customized content in videos has received increasing attention
recently. However, existing works primarily focus on customized text-to-video
generation for single subject, suffering from subject-missing and
attribute-binding problems when the video is expected to contain multiple
subjects. Furthermore, existing models struggle to assign the desired actions
to the corresponding subjects (action-binding problem), failing to achieve
satisfactory multi-subject generation performance. To tackle the problems, in
this paper, we propose DisenStudio, a novel framework that can generate
text-guided videos for customized multiple subjects, given few images for each
subject. Specifically, DisenStudio enhances a pretrained diffusion-based
text-to-video model with our proposed spatial-disentangled cross-attention
mechanism to associate each subject with the desired action. Then the model is
customized for the multiple subjects with the proposed motion-preserved
disentangled finetuning, which involves three tuning strategies: multi-subject
co-occurrence tuning, masked single-subject tuning, and multi-subject
motion-preserved tuning. The first two strategies guarantee the subject
occurrence and preserve their visual attributes, and the third strategy helps
the model maintain the temporal motion-generation ability when finetuning on
static images. We conduct extensive experiments to demonstrate our proposed
DisenStudio significantly outperforms existing methods in various metrics.
Additionally, we show that DisenStudio can be used as a powerful tool for
various controllable generation applications. |
This paper proposes DisenStudio, a novel framework for generating customized videos with multiple user-provided subjects and their desired actions, addressing limitations of existing single-subject methods. |
Existing methods struggle to generate videos with multiple customized subjects due to subject-missing, attribute-binding, and action-binding problems, hindering the creation of diverse and personalized video content. |
The method enhances a pretrained diffusion-based text-to-video model with spatial-disentangled cross-attention (SDCA) to independently control subjects and their actions. It then introduces motion-preserved disentangled finetuning with multi-subject co-occurrence, masked single-subject, and motion-preserved tuning strategies to ensure subject presence, preserve visual attributes, and maintain motion generation ability. |
Significantly outperforms baselines in subject fidelity (DINO), textual alignment (CLIP-T), and temporal consistency.
Enables precise control over subject actions and positions within the video.
Demonstrates potential for various applications, including storytelling with customized characters. |
Limited to the base model's video length and resolution, hindering the generation of longer videos with more complex scenarios and higher subject fidelity.
Relies on the pretrained model's motion repertoire, limiting customization of specific subject motions. |
text-to-video generation, subject customization, diffusion models, disentanglement, spatial control |
2405.12663
Report |
LAGA: Layered 3D Avatar Generation and Customization via Gaussian Splatting |
Jia Gong, Shenyu Ji, Lin Geng Foo, Kang Chen, Hossein Rahmani, Jun Liu |
Creating and customizing a 3D clothed avatar from textual descriptions is a
critical and challenging task. Traditional methods often treat the human body
and clothing as inseparable, limiting users' ability to freely mix and match
garments. In response to this limitation, we present LAyered Gaussian Avatar
(LAGA), a carefully designed framework enabling the creation of high-fidelity
decomposable avatars with diverse garments. By decoupling garments from avatar,
our framework empowers users to conveniently edit avatars at the garment level.
Our approach begins by modeling the avatar using a set of Gaussian points
organized in a layered structure, where each layer corresponds to a specific
garment or the human body itself. To generate high-quality garments for each
layer, we introduce a coarse-to-fine strategy for diverse garment generation
and a novel dual-SDS loss function to maintain coherence between the generated
garments and avatar components, including the human body and other garments.
Moreover, we introduce three regularization losses to guide the movement of
Gaussians for garment transfer, allowing garments to be freely transferred to
various avatars. Extensive experimentation demonstrates that our approach
surpasses existing methods in the generation of 3D clothed humans. |
This paper introduces LAGA, a novel framework for generating layered 3D avatars with diverse, interchangeable garments based on Gaussian Splatting. |
Existing 3D avatar generation methods often lack the ability to decompose garments from the avatar itself, limiting customization options. |
LAGA employs a layered structure, representing the body and each garment as separate layers of Gaussian points. It utilizes a coarse-to-fine strategy for diverse garment generation and a dual-SDS loss function for maintaining coherence between different layers. Furthermore, it introduces three regularization losses to enable garment transfer between avatars with different body shapes. |
LAGA generates high-quality 3D avatars with realistic textures and detailed features.
The layered structure enables convenient decomposition and customization of garments.
LAGA outperforms existing methods in qualitative and quantitative comparisons, demonstrating superior visual fidelity and text alignment. |
LAGA currently relies on a pre-trained 2D human skeleton conditioned diffusion model, limiting its generalization ability to unseen poses.
The garment transfer method could be further improved to handle extreme body shape variations. |
3d avatar generation, gaussian splatting, decomposable avatars, garment transfer, text-to-3d |
2405.12661
Report |
EmoEdit: Evoking Emotions through Image Manipulation |
Jingyuan Yang, Jiawei Feng, Weibin Luo, Dani Lischinski, Daniel Cohen-Or, Hui Huang |
Affective Image Manipulation (AIM) seeks to modify user-provided images to
evoke specific emotional responses. This task is inherently complex due to its
twofold objective: significantly evoking the intended emotion, while preserving
the original image composition. Existing AIM methods primarily adjust color and
style, often failing to elicit precise and profound emotional shifts. Drawing
on psychological insights, we extend AIM by incorporating content modifications
to enhance emotional impact. We introduce EmoEdit, a novel two-stage framework
comprising emotion attribution and image editing. In the emotion attribution
stage, we leverage a Vision-Language Model (VLM) to create hierarchies of
semantic factors that represent abstract emotions. In the image editing stage,
the VLM identifies the most relevant factors for the provided image, and guides
a generative editing model to perform affective modifications. A ranking
technique that we developed selects the best edit, balancing between emotion
fidelity and structure integrity. To validate EmoEdit, we assembled a dataset
of 416 images, categorized into positive, negative, and neutral classes. Our
method is evaluated both qualitatively and quantitatively, demonstrating
superior performance compared to existing state-of-the-art techniques.
Additionally, we showcase EmoEdit's potential in various manipulation tasks,
including emotion-oriented and semantics-oriented editing. |
EmoEdit, a novel two-stage framework for Affective Image Manipulation (AIM), modifies image content and color to evoke specific emotions while preserving original structure. |
Existing AIM methods, limited to color and style adjustments, struggle to evoke precise emotions. EmoEdit addresses this by incorporating content modification based on psychological insights. |
EmoEdit utilizes emotion factor trees built from EmoSet to map emotions to visual elements. It employs GPT-4V for factor selection and instruction generation, IP2P for editing, and a ranking technique to select the optimal result. |
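The ranking step mentioned above balances emotion fidelity against structure integrity when selecting among candidate edits. The source does not give the scoring formula, so the following is only a hypothetical sketch of one way such a trade-off could be expressed: a weighted sum of two scores, with the names `Candidate`, `emotion_score`, `structure_score`, and `alpha` all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    emotion_score: float    # hypothetical: higher means closer to the target emotion
    structure_score: float  # hypothetical: higher means closer to the source structure

def rank_candidates(candidates, alpha=0.5):
    """Sort candidate edits by a weighted sum of the two scores.

    alpha weights emotion fidelity; (1 - alpha) weights structure integrity.
    The weighted sum is an assumption, not the paper's ranking rule.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    def combined(c):
        return alpha * c.emotion_score + (1.0 - alpha) * c.structure_score

    return sorted(candidates, key=combined, reverse=True)
```

Setting `alpha` close to 1 favors edits that shift the emotion strongly even if the composition changes, while values close to 0 favor edits that stay near the original image.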
EmoEdit outperforms state-of-the-art methods in emotion fidelity, structure preservation, and user preference.
Content and color modification, along with ranking, are crucial for EmoEdit's effectiveness.
EmoEdit enables diverse editing across eight emotion categories and various manipulation levels. |
The emotion factor tree's reliance on EmoSet might introduce bias and limitations.
Fixed filtering and ranking in EmoEdit could be enhanced by incorporating user interaction. |
affective image manipulation, emotion elicitation, image editing, content modification, vision-language model |
2405.12540
Report |
Context-Enhanced Video Moment Retrieval with Large Language Models |
Weijia Liu, Bo Miao, Jiuxin Cao, Xuelin Zhu, Bo Liu, Mehwish Nasim, Ajmal Mian |
Current methods for Video Moment Retrieval (VMR) struggle to align complex
situations involving specific environmental details, character descriptions,
and action narratives. To tackle this issue, we propose a Large Language
Model-guided Moment Retrieval (LMR) approach that employs the extensive
knowledge of Large Language Models (LLMs) to improve video context
representation as well as cross-modal alignment, facilitating accurate
localization of target moments. Specifically, LMR introduces a context
enhancement technique with LLMs to generate crucial target-related context
semantics. These semantics are integrated with visual features for producing
discriminative video representations. Finally, a language-conditioned
transformer is designed to decode free-form language queries, on the fly, using
aligned video representations for moment retrieval. Extensive experiments
demonstrate that LMR achieves state-of-the-art results, outperforming the
nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights
and Charades-STA benchmarks, respectively. More importantly, the performance
gains are significantly higher for localization of complex queries. |
This paper presents LMR, a novel Video Moment Retrieval (VMR) approach that leverages the power of Large Language Models (LLMs) to enhance video context modeling and improve the accuracy of retrieving specific moments from videos based on complex textual queries. |
Existing VMR methods struggle to accurately localize target moments described by intricate queries involving environmental details, character descriptions, and action narratives. LMR addresses this challenge by integrating LLM-derived contextual information, enabling more precise alignment between videos and complex language queries. |
LMR employs an LLM to generate target-related textual descriptions for video clips, enriching their contextual representation. These descriptions, along with visual features, are processed by a language-conditioned transformer to decode free-form language queries and localize the target moment. |
LMR achieves state-of-the-art results on the QVHighlights and Charades-STA benchmarks, outperforming existing methods.
The approach demonstrates significant performance gains on complex queries, highlighting its ability to handle intricate contextual requirements.
Ablation studies validate the contributions of individual components, demonstrating the importance of LLM-based context enhancement and language-conditioned decoding. |
The current implementation relies on offline LLM processing for generating video descriptions, which could be integrated into an end-to-end trainable framework in future work.
Exploring alternative LLM architectures and prompting strategies for generating even richer and more informative video descriptions could further improve performance. |
video moment retrieval, large language models, multimodal alignment, context enhancement, language-conditioned transformer |
2405.12531
Report |
CustomText: Customized Textual Image Generation using Diffusion Models |
Shubham Paliwal, Arushi Jain, Monika Sharma, Vikram Jamwal, Lovekesh Vig |
Textual image generation spans diverse fields like advertising, education,
product packaging, social media, information visualization, and branding.
Despite recent strides in language-guided image synthesis using diffusion
models, current models excel in image generation but struggle with accurate
text rendering and offer limited control over font attributes. In this paper,
we aim to enhance the synthesis of high-quality images with precise text
customization, thereby contributing to the advancement of image generation
models. We call our proposed method CustomText. Our implementation leverages a
pre-trained TextDiffuser model to enable control over font color, background,
and types. Additionally, to address the challenge of accurately rendering
small-sized fonts, we train the ControlNet model for a consistency decoder,
significantly enhancing text-generation performance. We assess the performance
of CustomText in comparison to previous methods of textual image generation on
the publicly available CTW-1500 dataset and a self-curated dataset for
small-text generation, showcasing superior results. |
This paper introduces CustomText, a novel method leveraging diffusion models to generate images with customized text, offering control over font attributes like type, color, size, and background for seamless integration into diverse layouts. |
Existing text-to-image synthesis methods struggle with accurate text rendering and lack control over font attributes, limiting their use in applications like advertising, education, and product packaging where customized text is crucial. |
CustomText utilizes a two-stage pipeline: first generating character and conditional masks to define text position and attributes, then using a modified TextDiffuser model with a ControlNet-based consistency decoder for enhanced small-font generation. |
CustomText demonstrates superior control over text attributes, enabling customization of font type, color, size, and background.
The ControlNet-based consistency decoder significantly improves the generation of small-sized fonts compared to previous methods.
Quantitative evaluations using MSE, PSNR, SSIM, and OCR performance on CTW-1500 and a custom SmallFontSize dataset confirm the effectiveness of CustomText. |
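Two of the metrics named above, MSE and PSNR, have standard definitions that can be sketched in a few lines of NumPy. This is a generic illustration rather than the paper's evaluation script; the function names and the 8-bit default `max_val=255.0` are assumptions, and SSIM and OCR accuracy are not covered here.

```python
import numpy as np

def mse(reference, generated):
    """Mean squared error between two images of identical shape."""
    a = np.asarray(reference, dtype=np.float64)
    b = np.asarray(generated, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((a - b) ** 2))

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    err = mse(reference, generated)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / err))
```

Lower MSE and higher PSNR both indicate that the generated text region is closer, pixel by pixel, to the reference image.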
The current system only supports Latin alphabets, limiting its applicability to other languages.
The training dataset for the decoder-enhancement model is small, which may limit performance. Future work involves training on a larger dataset. |
text-to-image synthesis, diffusion models, font customization, text rendering, controlnet |
2405.12523
Report |
Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models |
Jiaqi Li, Qianshan Wei, Chuanyi Zhang, Guilin Qi, Miaozeng Du, Yongrui Chen, Sheng Bi |
Machine unlearning (MU) empowers individuals with the 'right to be forgotten' by
removing their private or sensitive information encoded in machine learning
models. However, it remains uncertain whether MU can be effectively applied to
Multimodal Large Language Models (MLLMs), particularly in scenarios of
forgetting the leaked visual data of concepts. To overcome the challenge, we
propose an efficient method, Single Image Unlearning (SIU), to unlearn the
visual recognition of a concept by fine-tuning a single associated image for
few steps. SIU consists of two key aspects: (i) Constructing Multifaceted
fine-tuning data. We introduce four targets, based on which we construct
fine-tuning data for the concepts to be forgotten; (ii) Jointly training loss.
To synchronously forget the visual recognition of concepts and preserve the
utility of MLLMs, we fine-tune MLLMs through a novel Dual Masked KL-divergence
Loss combined with Cross Entropy loss. Alongside our method, we establish
MMUBench, a new benchmark for MU in MLLMs and introduce a collection of metrics
for its evaluation. Experimental results on MMUBench show that SIU completely
surpasses the performance of existing methods. Furthermore, we surprisingly
find that SIU can avoid invasive membership inference attacks and jailbreak
attacks. To the best of our knowledge, we are the first to explore MU in MLLMs.
We will release the code and benchmark in the near future. |
This paper explores machine unlearning (MU) in Multimodal Large Language Models (MLLMs), focusing on forgetting the visual recognition of concepts and introduces a new method called Single Image Unlearning (SIU). |
This is important because existing MU methods for LLMs may not be transferable to MLLMs, especially when dealing with limited training data and potential model degradation when forgetting visual concepts. |
SIU uses a single image of a target concept to unlearn its visual recognition. It employs Multifaceted Fine-tuning Data based on four targets (aligning with unseen concepts, assigning new visual descriptions, decoupling factual knowledge, and preserving non-targeted knowledge) and a Dual Masked KL-divergence (DMK) Loss jointly trained with cross-entropy loss to refine the unlearning process and preserve model utility. |
SIU outperforms existing methods (PO, GA, GA+KL) on the proposed MMUBench benchmark in terms of efficacy, generality, specificity, fluency, and diversity.
SIU demonstrates robustness against membership inference attacks and jailbreak attacks.
The research reveals a 'positive butterfly effect' where unlearning a concept can lead to the selective retention of related knowledge, suggesting a nuanced restructuring of knowledge within the model. |
The study primarily focuses on the LLaVA model, potentially limiting the generalizability of the findings to other MLLMs.
Future work will explore new MU methods in MLLMs and evaluate unlearning for specific data points rather than concept-wise knowledge. |
machine unlearning, multimodal large language models, visual recognition, benchmarking, privacy |
2405.12490
Report |
Customize Your Own Paired Data via Few-shot Way |
Jinshu Chen, Bingchuan Li, Miao Hua, Panpan Xu, Qian He |
Existing solutions to image editing tasks suffer from several issues. Though
achieving remarkably satisfying generated results, some supervised methods
require huge amounts of paired training data, which greatly limits their
usage. Other, unsupervised methods take full advantage of large-scale
pre-trained priors and are thus strictly restricted to the domains on which the
priors were trained, behaving badly in out-of-distribution cases. The task
we focus on is how to enable users to customize their desired effects
through only a few image pairs. In our proposed framework, a novel few-shot
learning mechanism based on the directional transformations among samples is
introduced and expands the learnable space exponentially. Adopting a diffusion
model pipeline, we redesign the condition calculating modules in our model and
apply several technical improvements. Experimental results demonstrate the
capabilities of our method in various cases. |
This paper proposes a novel few-shot image editing framework allowing users to customize image editing effects with only a few image pairs. |
Existing methods either require large paired datasets or rely heavily on pre-trained models, limiting their flexibility and applicability to new editing tasks. |
The method utilizes a novel "n-source-to-n-target" learning mechanism, expanding the dataset by training on directional transformations within sample pairs. It adopts a diffusion model pipeline with redesigned condition injection modules, incorporating pixel-level transformations as conditions, and employs technical improvements like adaptive noise and skip connections for enhanced generation quality. |
The framework achieves comparable performance to existing paired-data methods with only 1% of the training data.
It avoids disentanglement issues present in latent space editing methods, preserving areas outside the editing target.
The framework is not limited by pre-trained priors, enabling the creation of new editing effects beyond existing datasets. |
The paper acknowledges limitations in handling high-resolution images due to computational constraints.
Future work includes exploring the application of the framework to other image editing tasks beyond those presented. |
image editing, few-shot learning, diffusion models, customization, paired data |
2405.12399
Report |
Diffusion for World Modeling: Visual Details Matter in Atari |
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret |
World models constitute a promising approach for training reinforcement
learning agents in a safe and sample-efficient manner. Recent world models
predominantly operate on sequences of discrete latent variables to model
environment dynamics. However, this compression into a compact discrete
representation may ignore visual details that are important for reinforcement
learning. Concurrently, diffusion models have become a dominant approach for
image generation, challenging well-established methods modeling discrete
latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a
Model Of eNvironment Dreams), a reinforcement learning agent trained in a
diffusion world model. We analyze the key design choices that are required to
make diffusion suitable for world modeling, and demonstrate how improved visual
details can lead to improved agent performance. DIAMOND achieves a mean
human-normalized score of 1.46 on the competitive Atari 100k benchmark, a new
best for agents trained entirely within a world model. To foster future research on
diffusion for world modeling, we release our code, agents and playable world
models at https://github.com/eloialonso/diamond. |
Introduces DIAMOND, a reinforcement learning agent trained within a diffusion world model for improved sample efficiency and visual fidelity. |
Addresses limitations of discrete latent-based world models, which can lose visual details crucial for complex tasks, by leveraging the strengths of diffusion models in high-fidelity image generation. |
Implements a diffusion model conditioned on past observations and actions to predict future observations, employing EDM over DDPM for stability with fewer denoising steps. Trains an actor-critic RL agent within this imagined environment. |
Achieves state-of-the-art mean human-normalized score (1.46) on the Atari 100k benchmark among world model agents.
Demonstrates greater stability over longer time horizons compared to DDPM-based world models.
Generates visually consistent and higher-quality imagined trajectories compared to discrete latent-based models like IRIS. |
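The human-normalized score cited above follows the usual Atari convention: each game's score is rescaled so that random play maps to 0 and the human reference maps to 1, and the results are averaged over games. A minimal sketch of that arithmetic follows; the game names and per-game numbers are made-up placeholders, not results from the paper.

```python
def human_normalized_score(agent, random, human):
    """Rescale a raw game score so random play = 0 and human level = 1."""
    return (agent - random) / (human - random)


def mean_hns(scores):
    """Average the human-normalized score over all games.

    `scores` maps a game name to an (agent, random, human) tuple.
    """
    values = [human_normalized_score(*triple) for triple in scores.values()]
    return sum(values) / len(values)


# Placeholder numbers for illustration only (not taken from the paper).
placeholder_scores = {
    "game_a": (150.0, 50.0, 250.0),   # halfway between random and human
    "game_b": (900.0, 100.0, 500.0),  # twice the human margin over random
}
mean_score = mean_hns(placeholder_scores)
print(mean_score)
```

A mean above 1 therefore indicates that, averaged over games, the agent exceeds the human reference margin, although a few games with large ratios can dominate the mean.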
Evaluation primarily focuses on discrete control environments (Atari), with limited exploration of continuous control tasks.
Relies on simple frame stacking for observation history, potentially limiting long-term memory and scalability compared to transformer-based architectures. |
world models, diffusion models, reinforcement learning, atari, generative vision models |
2405.12369
Report |
AtomGS: Atomizing Gaussian Splatting for High-Fidelity Radiance Field |
Rong Liu, Rui Xu, Yue Hu, Meida Chen, Andrew Feng |
3D Gaussian Splatting (3DGS) has recently advanced radiance field
reconstruction by offering superior capabilities for novel view synthesis and
real-time rendering speed. However, its strategy of blending optimization and
adaptive density control might lead to sub-optimal results; it can sometimes
yield noisy geometry and blurry artifacts due to prioritizing optimizing large
Gaussians at the cost of adequately densifying smaller ones. To address this,
we introduce AtomGS, consisting of Atomized Proliferation and Geometry-Guided
Optimization. The Atomized Proliferation constrains ellipsoid Gaussians of
various sizes into more uniform-sized Atom Gaussians. The strategy enhances the
representation of areas with fine features by placing greater emphasis on
densification in accordance with scene details. In addition, we propose a
Geometry-Guided Optimization approach that incorporates an Edge-Aware Normal
Loss. This optimization method effectively smooths flat surfaces while
preserving intricate details. Our evaluation shows that AtomGS outperforms
existing state-of-the-art methods in rendering quality. Additionally, it
achieves competitive accuracy in geometry reconstruction and offers a
significant improvement in training speed over other SDF-based methods. More
interactive demos can be found on our website
(https://rongliu-leo.github.io/AtomGS/). |
AtomGS, a novel approach for radiance field reconstruction, enhances 3D Gaussian Splatting by emphasizing uniform densification through Atomized Proliferation and refining surface details via Geometry-Guided Optimization. |
Existing 3DGS methods often prioritize optimizing large Gaussians over densifying smaller ones, leading to noisy geometry and blurry artifacts, especially in areas with fine details. This work addresses these limitations by improving the alignment of Gaussians with the underlying scene geometry. |
AtomGS introduces two key components: (1) Atomized Proliferation, which constrains ellipsoid Gaussians of various sizes into more uniformly sized Atom Gaussians to prioritize densification in detail-rich areas, and (2) Geometry-Guided Optimization, incorporating an Edge-Aware Normal Loss to smooth flat surfaces while preserving intricate details. |
AtomGS outperforms state-of-the-art methods in rendering quality on Mip-NeRF360 and Tanks & Temples datasets.
It achieves competitive accuracy in geometry reconstruction on the DTU dataset, surpassing other explicit methods and rivaling implicit SDF-based methods.
AtomGS demonstrates significant improvement in training speed compared to SDF-based methods. |
AtomGS might struggle with highly specular or semi-transparent materials.
The current pruning strategy could be further improved to achieve a more compact representation, especially in highly complex environments. |
radiance field reconstruction, 3d gaussian splatting, novel view synthesis, geometry-guided optimization, atomized proliferation |
2405.12218
Report |
Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo |
Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu |
We present MVSGaussian, a new generalizable 3D Gaussian representation
approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct
unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware
Gaussian representations and decode them into Gaussian parameters. 2) To
further enhance performance, we propose a hybrid Gaussian rendering that
integrates an efficient volume rendering design for novel view synthesis. 3) To
support fast fine-tuning for specific scenes, we introduce a multi-view
geometric consistent aggregation strategy to effectively aggregate the point
clouds generated by the generalizable model, serving as the initialization for
per-scene optimization. Compared with previous generalizable NeRF-based
methods, which typically require minutes of fine-tuning and seconds of
rendering per image, MVSGaussian achieves real-time rendering with better
synthesis quality for each scene. Compared with the vanilla 3D-GS, MVSGaussian
achieves better view synthesis with less training computational cost. Extensive
experiments on DTU, Real Forward-facing, NeRF Synthetic, and Tanks and Temples
datasets validate that MVSGaussian attains state-of-the-art performance with
convincing generalizability, real-time rendering speed, and fast per-scene
optimization. |
MVSGaussian, a novel generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS), enables efficient reconstruction of unseen scenes. |
Existing generalizable Gaussian Splatting methods are inefficient, limited to object-centric reconstruction, and restricted in input types. This work addresses these limitations by proposing an efficient framework for novel view synthesis in unseen general scenes. |
The method leverages MVS for geometry reasoning and feature encoding, establishing a pixel-aligned Gaussian representation. It then employs a hybrid Gaussian rendering approach, integrating depth-aware volume rendering for enhanced generalization. For per-scene optimization, a multi-view geometric consistent aggregation strategy provides high-quality initialization. |
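A pixel-aligned representation places one primitive per image pixel, with its 3D position obtained by back-projecting the pixel along its camera ray to the estimated depth. The sketch below shows only that back-projection step under a standard pinhole camera model; the function name and the intrinsics values are hypothetical, and the paper's feature encoding and Gaussian parameter decoding are not reproduced here.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a given depth to a camera-space 3D point.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels; depth is measured along the z axis.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

center_point = unproject_pixel(320.0, 240.0, 2.0, fx, fy, cx, cy)
offset_point = unproject_pixel(420.0, 240.0, 2.0, fx, fy, cx, cy)
print(center_point, offset_point)
```

Under this assumption, errors in the estimated depth translate directly into misplaced primitives, which is consistent with the limitation noted for weakly textured or specular regions.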
MVSGaussian outperforms other generalizable methods in terms of rendering quality and speed.
It achieves comparable or even superior performance to state-of-the-art methods after a short per-scene optimization.
The method enables real-time rendering with faster optimization compared to existing generalizable NeRFs and vanilla 3D-GS. |
The reliance on MVS for depth estimation can lead to decreased accuracy in areas with weak textures or specular reflections.
Future work may explore improving depth estimation accuracy in challenging regions. |
generalizable gaussian splatting, multi-view stereo, neural radiance field, novel view synthesis, real-time rendering |
2405.12200
Report |
Multi-View Attentive Contextualization for Multi-View 3D Object Detection |
Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, Tianfu Wu |
We present Multi-View Attentive Contextualization (MvACon), a simple yet
effective method for improving 2D-to-3D feature lifting in query-based
multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in
the field of query-based MV3D object detection, prior art often suffers from
either the lack of exploiting high-resolution 2D features in dense
attention-based lifting, due to high computational costs, or from
insufficiently dense grounding of 3D queries to multi-scale 2D features in
sparse attention-based lifting. Our proposed MvACon kills two birds with one
stone using a representationally dense yet computationally sparse attentive
feature contextualization scheme that is agnostic to specific 2D-to-3D feature
lifting approaches. In experiments, the proposed MvACon is thoroughly tested on
the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable
attention (DFA3D) variant, as well as the PETR, showing consistent detection
performance improvement, especially in enhancing performance in location,
orientation, and velocity prediction. It is also tested on the Waymo-mini
benchmark using BEVFormer with similar improvement. We qualitatively and
quantitatively show that global cluster-based contexts effectively encode dense
scene-level contexts for MV3D object detection. The promising results of our
proposed MvACon reinforce the adage in computer vision -- "(contextualized)
feature matters". |
This paper introduces Multi-View Attentive Contextualization (MvACon), a plug-and-play module designed to enhance 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. |
Existing MV3D detectors suffer from limitations in effectively capturing 3D information during feature lifting. MvACon addresses these limitations by incorporating global and semantically meaningful 3D awareness. |
MvACon utilizes a cluster-attention mechanism, adapted from PaCa (Patch-to-Cluster attention), to contextualize 2D features. It expands the traditional three-component MV3D detection pipeline to a four-component setup by adding an attentive contextualization stage. |
MvACon consistently improves the performance of various query-based MV3D detectors, including PETR and BEVFormer, on both the nuScenes and Waymo-mini benchmarks.
It significantly enhances localization, orientation, and velocity prediction in these detectors.
Qualitative analysis shows that MvACon learns stable and semantically meaningful representations of the scene, contributing to its improved performance. |
The computational cost of MvACon in the full model might be high.
Future work includes exploring alternative clustering techniques for improved efficiency. |
3d object detection, multi-view vision, attentive contextualization, feature lifting, autonomous driving |
2405.12155
Report |
Embracing Radiance Field Rendering in 6G: Over-the-Air Training and Inference with 3D Contents |
Guanlin Wu, Zhonghao Lyu, Juyong Zhang, Jie Xu |
The efficient representation, transmission, and reconstruction of
three-dimensional (3D) contents are becoming increasingly important for
sixth-generation (6G) networks that aim to merge virtual and physical worlds
for offering immersive communication experiences. Neural radiance field (NeRF)
and 3D Gaussian splatting (3D-GS) have recently emerged as two promising 3D
representation techniques based on radiance field rendering, which are able to
provide photorealistic rendering results for complex scenes. Therefore,
embracing NeRF and 3D-GS in 6G networks is envisioned to be a prominent
solution to support emerging 3D applications with enhanced quality of
experience. This paper provides a comprehensive overview of the integration of
NeRF and 3D-GS in 6G. First, we review the basics of the radiance field
rendering techniques, and highlight their applications and implementation
challenges over wireless networks. Next, we consider the over-the-air training
of NeRF and 3D-GS models over wireless networks by presenting various learning
techniques. We particularly focus on the federated learning design over a
hierarchical device-edge-cloud architecture. Then, we discuss three practical
rendering architectures of NeRF and 3D-GS models at wireless network edge. We
provide model compression approaches to facilitate the transmission of radiance
field models, and present rendering acceleration approaches and joint
computation and communication designs to enhance the rendering efficiency. In
particular, we propose a new semantic communication enabled 3D content
transmission design, in which the radiance field models are exploited as the
semantic knowledge base to reduce the communication overhead for distributed
inference. Furthermore, we present the utilization of radiance field rendering
in wireless applications like radio mapping and radio imaging. |
This paper provides a comprehensive overview of integrating Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3D-GS) rendering techniques into 6G networks for immersive communication experiences. |
NeRF and 3D-GS are revolutionary for representing and transmitting 3D content, crucial for immersive 6G applications like XR and telepresence. |
The paper explores various aspects, including centralized/distributed learning for NeRF/3D-GS, a hierarchical device-edge-cloud architecture for federated learning, model compression/acceleration, joint computation/communication design, and semantic communication for efficient rendering. |
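To make the hierarchical device-edge-cloud idea concrete, the sketch below aggregates model parameters in two stages: each edge server averages the models of its own devices, and the cloud then averages the edge models. It assumes FedAvg-style weighting by sample count, which is a common choice and not necessarily the design discussed in the paper; all names and numbers are illustrative.

```python
def weighted_average(models, weights):
    """Weighted average of parameter vectors (lists of floats)."""
    total = sum(weights)
    dim = len(models[0])
    return [
        sum(w * m[k] for m, w in zip(models, weights)) / total
        for k in range(dim)
    ]


def hierarchical_aggregate(edge_groups):
    """Two-stage aggregation: devices -> edge servers -> cloud.

    `edge_groups` is a list with one entry per edge server; each entry is
    a list of (parameters, num_samples) tuples for that server's devices.
    """
    edge_models, edge_sizes = [], []
    for group in edge_groups:
        params = [p for p, _ in group]
        sizes = [n for _, n in group]
        edge_models.append(weighted_average(params, sizes))
        edge_sizes.append(sum(sizes))
    # The cloud weights each edge model by the samples it represents.
    return weighted_average(edge_models, edge_sizes)


edge_groups = [
    [([1.0, 0.0], 1), ([3.0, 0.0], 3)],  # edge server 1, two devices
    [([10.0, 4.0], 4)],                  # edge server 2, one device
]
global_model = hierarchical_aggregate(edge_groups)
print(global_model)
```

With sample-count weighting, the two-stage result matches a single flat weighted average over all devices, so the hierarchy changes where communication happens rather than what is computed.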
Federated learning over a hierarchical architecture enables efficient training of large-scale scene radiance fields.
Model compression and algorithmic acceleration techniques enhance the transmission and rendering efficiency.
The proposed semantic communication framework for 3D content transmission, using radiance field models as the semantic knowledge base, reduces the communication overhead for distributed inference. |
The paper mainly focuses on the technical feasibility of integrating NeRF/3D-GS in 6G, without delving into specific protocol design or standardization aspects.
Future work could investigate asynchronous federated learning, generalizable models, and the use of over-the-air computation for efficient model aggregation. |
6g, immersive communications, neural radiance field (nerf), 3d gaussian splatting (3d-gs), federated learning |
2405.12110
Report |
CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization |
Jiawei Zhang, Jiahe Li, Xiaohan Yu, Lei Huang, Lin Gu, Jin Zheng, Xiao Bai |
3D Gaussian Splatting (3DGS) creates a radiance field consisting of 3D
Gaussians to represent a scene. With sparse training views, 3DGS easily suffers
from overfitting, negatively impacting the reconstruction quality. This paper
introduces a new co-regularization perspective for improving sparse-view 3DGS.
When training two 3D Gaussian radiance fields with the same sparse views of a
scene, we observe that the two radiance fields exhibit point
disagreement and rendering disagreement, which stem from the sampling
implementation in densification and can predict reconstruction quality without
supervision. We further quantify the point disagreement and rendering
disagreement by evaluating the registration between Gaussians' point
representations and calculating differences in their rendered pixels. The
empirical study demonstrates the negative correlation between the two
disagreements and accurate reconstruction, which allows us to identify
inaccurate reconstruction without accessing ground-truth information. Based on
the study, we propose CoR-GS, which identifies and suppresses inaccurate
reconstruction based on the two disagreements: (i) Co-pruning
considers Gaussians that exhibit high point disagreement to be in inaccurate
positions and prunes them. (ii) Pseudo-view co-regularization
considers pixels that exhibit high rendering disagreement to be inaccurately
rendered and suppresses the disagreement. Results on LLFF, Mip-NeRF360, DTU, and
Blender demonstrate that CoR-GS effectively regularizes the scene geometry,
reconstructs the compact representations, and achieves state-of-the-art novel
view synthesis quality under sparse training views. |
This paper investigates the behavior disagreement between two 3D Gaussian Radiance Fields (3DGRFs) trained on the same scene with sparse views, and proposes a novel co-regularization method, CoR-GS, to improve sparse-view 3D Gaussian Splatting. |
3D Gaussian Splatting (3DGS) suffers from overfitting with sparse training views, leading to degraded novel view synthesis quality. This work provides a new perspective on regularizing sparse-view 3DGS by leveraging the disagreement between different 3DGRFs. |
The authors simultaneously train two 3DGRFs with the same sparse views. They introduce "point disagreement" and "rendering disagreement" to quantify the differences between Gaussian positions and rendered results of the two fields. They then propose co-pruning to suppress point disagreement and pseudo-view co-regularization to suppress rendering disagreement. |
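The two disagreement measures can be illustrated with simple stand-ins. The sketch below scores point disagreement with a registration-style fitness (the fraction of points in one field that have a close neighbour in the other) and rendering disagreement with PSNR between the two renders. The specific metrics, the threshold `tau`, and the toy data are assumptions made for illustration and may differ from what the paper uses.

```python
import math


def fitness(points_a, points_b, tau):
    """Fraction of points in A whose nearest neighbour in B lies within tau."""
    hits = 0
    for p in points_a:
        nearest = min(math.dist(p, q) for q in points_b)
        if nearest <= tau:
            hits += 1
    return hits / len(points_a)


def psnr(img_a, img_b, max_val=1.0):
    """PSNR between two renders given as flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy Gaussian centres from two independently trained fields.
field_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
field_2 = [(0.0, 0.0, 0.01), (5.0, 5.0, 5.0)]
point_agreement = fitness(field_1, field_2, tau=0.1)

# Toy renders of the same view from the two fields.
render_agreement = psnr([0.0, 0.5, 1.0], [0.0, 0.5, 0.9])
print(point_agreement, render_agreement)
```

In this formulation, low fitness or low PSNR signals high disagreement, which marks the corresponding points or pixels as candidates for pruning or regularization.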
Two 3DGRFs trained with the same sparse views exhibit significant point and rendering disagreements, particularly during densification.
The disagreements are negatively correlated with accurate scene reconstruction, providing an unsupervised way to identify inaccurate reconstruction.
CoR-GS effectively suppresses the disagreements, reconstructing more compact geometry representations and achieving state-of-the-art novel view synthesis quality on multiple benchmarks. |
Color co-regularization implicitly handles depth information, making explicit depth co-regularization less effective.
More advanced co-regularization strategies could further improve the performance, particularly in handling complex scenes. |
3d gaussian splatting, radiance fields, novel view synthesis, sparse view reconstruction, co-regularization |
2405.12107
Report |
Imp: Highly Capable Large Multimodal Models for Mobile Devices |
Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Jiajun Ding |
By harnessing the capabilities of large language models (LLMs), recent large
multimodal models (LMMs) have shown remarkable versatility in open-world
multimodal understanding. Nevertheless, they are usually parameter-heavy and
computation-intensive, thus hindering their applicability in
resource-constrained scenarios. To this end, several lightweight LMMs have been
proposed successively to maximize the capabilities under constrained scale
(e.g., 3B). Despite the encouraging results achieved by these methods, most of
them only focus on one or two aspects of the design space, and the key design
choices that influence model capability have not yet been thoroughly
investigated. In this paper, we conduct a systematic study for lightweight LMMs
from the aspects of model architecture, training strategy, and training data.
Based on our findings, we obtain Imp -- a family of highly capable LMMs at the
2B-4B scales. Notably, our Imp-3B model steadily outperforms all the existing
lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs
at the 13B scale. With low-bit quantization and resolution reduction
techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile
chip with a high inference speed of about 13 tokens/s. |
This paper introduces Imp, a family of lightweight Large Multimodal Models (LMMs) at the 2B/3B/4B parameter scales, demonstrating that carefully designed lightweight LMMs can achieve competitive performance compared to larger counterparts. |
Building lightweight LMMs is crucial for enabling wider access to this technology for researchers with limited resources and for deployment on resource-constrained devices like PCs and mobile phones. |
The authors systematically explore the design space of lightweight LMMs, investigating the impact of model architecture (LLM and visual encoder choices), training strategy (fine-tuning mechanism and the number of training epochs), and augmented training data (OCR, chart-oriented, and GPT4V-annotated) on model performance. |
Imp-3B significantly outperforms existing open-source lightweight LMMs of similar size and achieves comparable performance to state-of-the-art 13B LMMs on various benchmarks.
The study highlights the importance of high-quality training data for lightweight LMMs, showing that quality often outweighs quantity in this context.
Imp models can be effectively deployed on mobile devices, particularly Imp-3B@196 with 4-bit quantization, which balances a small model size with low latency and strong capabilities. |
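As a rough illustration of what 4-bit quantization does to model weights, the sketch below applies symmetric round-to-nearest quantization with a single scale per weight group. This is a generic textbook scheme assumed here for illustration; the quantization format actually used for the on-device deployment is not specified in this summary.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to integers in [-8, 7].

    Returns the integer codes and the scale needed to reconstruct values.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0 for _ in weights], 1.0
    scale = max_abs / 7.0  # map the largest magnitude to code 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [1.75, -0.5, 0.25, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
print(codes, scale, restored)
```

Each weight is stored in 4 bits plus a shared scale, so memory drops to roughly a quarter of 16-bit storage at the cost of rounding error in the reconstructed values.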
The model currently only supports English inputs and requires further development for multilingual capabilities.
Future work will focus on improving performance in specific tasks like OCR and object counting, incorporating more efficient training and compression techniques, and expanding to other modalities such as audio and 3D. |
large multimodal models, lightweight models, vision-language models, model efficiency, mobile deployment |
2405.12069
Report |
Gaussian Head & Shoulders: High Fidelity Neural Upper Body Avatars with Anchor Gaussian Guided Texture Warping |
Tianhao Wu, Jing Yang, Zhilin Guo, Jingyi Wan, Fangcheng Zhong, Cengiz Oztireli |
By equipping the most recent 3D Gaussian Splatting representation with head
3D morphable models (3DMM), existing methods manage to create head avatars with
high fidelity. However, most existing methods only reconstruct a head without
the body, substantially limiting their application scenarios. We found that
naively applying Gaussians to model the clothed chest and shoulders tends to
result in blurry reconstruction and noisy floaters under novel poses. This is
because of the fundamental limitation of Gaussians and point clouds -- each
Gaussian or point can only have a single directional radiance without spatial
variance, therefore an unnecessarily large number of them is required to
represent complicated spatially varying texture, even for simple geometry. In
contrast, we propose to model the body part with a neural texture that consists
of coarse and pose-dependent fine colors. To properly render the body texture
for each view and pose without accurate geometry or UV mapping, we optimize
another sparse set of Gaussians as anchors that constrain the neural warping
field that maps image plane coordinates to the texture space. We demonstrate
that Gaussian Head & Shoulders can fit the high-frequency details on the
clothed upper body with high fidelity and potentially improve the accuracy and
fidelity of the head region. We evaluate our method with casual phone-captured
and internet videos and show our method achieves superior reconstruction
quality and robustness in both self and cross reenactment tasks. To fully
utilize the efficient rendering speed of Gaussian splatting, we additionally
propose an accelerated inference method for our trained model without
Multi-Layer Perceptron (MLP) queries and reach a stable rendering speed of
around 130 FPS for any subject. |
This paper introduces "Gaussian Head & Shoulders", a method for reconstructing high-fidelity, animatable upper body avatars from monocular videos using Gaussian Splatting for the head and a learned texture map guided by anchor Gaussians for the body. |
Existing methods struggle to realistically capture the complex textures and deformations of clothed upper bodies, limiting their use in immersive applications. |
The method combines 3D Gaussian Splatting with a neural texture map. Sparse anchor Gaussians, driven by a head 3DMM, constrain a neural warping field that maps image pixels to the texture space, enabling high-frequency detail rendering. An accelerated inference method bypasses MLP queries for real-time performance. |
Outperforms baselines in self-reenactment tasks, achieving higher fidelity and robustness, especially for subjects with intricate clothing.
Demonstrates improved expression control compared to pure Gaussian Splatting methods due to the focused modeling of the head region.
Achieves a rendering speed of around 130 FPS with the accelerated inference method, surpassing pure Gaussian Splatting for subjects with complex clothing. |
The method cannot model avatars with extreme body rotations that lead to self-occlusion.
The accelerated inference relies on rigid transformations and may not capture non-rigid body deformations accurately. |
neural avatars, gaussian splatting, texture mapping, 3d reconstruction, monocular video |
2405.11921
Report |
MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections |
Jiayue Liu, Xiao Tang, Freeman Cheng, Roy Yang, Zhihao Li, Jianzhuang Liu, Yi Huang, Jiaqi Lin, Shiyong Liu, Xiaofei Wu, Songcen Xu, Chun Yuan |
3D Gaussian Splatting showcases notable advancements in photo-realistic and
real-time novel view synthesis. However, it faces challenges in modeling mirror
reflections, which exhibit substantial appearance variations from different
viewpoints. To tackle this problem, we present MirrorGaussian, the first method
for mirror scene reconstruction with real-time rendering based on 3D Gaussian
Splatting. The key insight is grounded in the mirror symmetry between the
real-world space and the virtual mirror space. We introduce an intuitive
dual-rendering strategy that enables differentiable rasterization of both the
real-world 3D Gaussians and the mirrored counterpart obtained by reflecting the
former about the mirror plane. All 3D Gaussians are jointly optimized with the
mirror plane in an end-to-end framework. MirrorGaussian achieves high-quality
and real-time rendering in scenes with mirrors, empowering scene editing like
adding new mirrors and objects. Comprehensive experiments on multiple datasets
demonstrate that our approach significantly outperforms existing methods,
achieving state-of-the-art results. Project page:
https://mirror-gaussian.github.io/. |
MirrorGaussian is the first method to achieve high-fidelity reconstruction and real-time rendering of scenes containing mirrors using 3D Gaussian Splatting. |
Existing NVS methods struggle with reconstructing mirror reflections due to their high specularity and viewpoint variation, which are difficult to model with MLPs or SH functions. NeRF-based solutions are computationally expensive, hindering interactive applications. |
MirrorGaussian leverages the mirror symmetry between the real world and virtual mirror space. It uses a dual-rendering strategy: 1) rendering the real-world scene from 3D Gaussians, 2) rendering the mirror image by reflecting the 3D Gaussians across an estimated and optimized mirror plane. A mirror label is introduced to enable differentiable mirror mask generation from arbitrary viewpoints. |
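The geometric core of reflecting a 3D Gaussian across a plane can be written down directly. For a plane with unit normal n and offset d (points x satisfying n·x + d = 0), a mean μ maps to μ − 2(n·μ + d)n and a covariance Σ maps to HΣHᵀ with the Householder matrix H = I − 2nnᵀ. The sketch below covers only this geometry; the helper names are hypothetical, and how the method treats view-dependent colour or optimizes the plane is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def householder(n):
    """Reflection matrix H = I - 2 n n^T for a unit normal n."""
    return [[(1.0 if i == j else 0.0) - 2.0 * n[i] * n[j] for j in range(3)]
            for i in range(3)]


def reflect_mean(mu, n, d):
    """Mirror a Gaussian centre across the plane n.x + d = 0 (n unit length)."""
    signed_dist = dot(n, mu) + d
    return [mu[i] - 2.0 * signed_dist * n[i] for i in range(3)]


def reflect_covariance(cov, n):
    """Mirror a covariance matrix: H cov H^T, where H is symmetric."""
    h = householder(n)
    return matmul3(matmul3(h, cov), h)


# Mirror plane z = 1, i.e. n = (0, 0, 1) and d = -1.
n, d = [0.0, 0.0, 1.0], -1.0
mirrored_mean = reflect_mean([1.0, 2.0, 3.0], n, d)
mirrored_cov = reflect_covariance(
    [[1.0, 0.0, 0.5], [0.0, 2.0, 0.0], [0.5, 0.0, 3.0]], n)
print(mirrored_mean, mirrored_cov)
```

Because both maps are differentiable in n and d, gradients from the rendered mirror image can flow back to the plane parameters, which fits the joint optimization described above.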
MirrorGaussian significantly outperforms existing NeRF-based methods in terms of both rendering quality and speed, achieving state-of-the-art results.
It enables real-time novel view synthesis at high resolution, thanks to efficient point-based rasterization.
The explicit point cloud representation allows for scene editing, such as adding new objects and mirrors. |
MirrorGaussian requires mirror segmentation on input images for mirror plane and mask estimation.
The current dual-rendering strategy slightly decreases rendering speed, which can be further optimized. |
novel view synthesis, mirror reflections, 3d gaussian splatting, real-time rendering, scene editing |
2405.11914
Report |
PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images |
Yiheng Xiong, Angela Dai |
Generating 3D shapes from single RGB images is essential in various
applications such as robotics. Current approaches typically target images
containing clear and complete visual descriptions of the object, without
considering common realistic cases where observations of objects are
largely occluded or truncated. We thus propose a transformer-based
autoregressive model to generate the probabilistic distribution of 3D shapes
conditioned on an RGB image containing potentially highly ambiguous
observations of the object. To handle realistic scenarios such as occlusion or
field-of-view truncation, we create simulated image-to-shape training pairs
that enable improved fine-tuning for real-world scenarios. We then adopt
cross-attention to effectively identify the most relevant region of interest
from the input image for shape generation. This enables inference of sampled
shapes with reasonable diversity and strong alignment with the input image. We
train and test our model on our synthetic data then fine-tune and test it on
real-world data. Experiments demonstrate that our model outperforms state of
the art in both scenarios. |
This paper proposes a transformer-based autoregressive model for generating a probabilistic distribution of 3D shapes from a single RGB image, especially one in which the object is occluded or truncated. |
Generating 3D shapes from single RGB images is crucial for robotics and computer vision, but existing methods struggle with images containing ambiguous observations like occlusion or truncation. This work addresses this challenge by generating multiple plausible 3D shapes. |
The approach compresses 3D shapes into a low-dimensional latent representation using P-VQ-VAE. Then, a transformer model with cross-attention learns the distribution of these representations conditioned on an input image. The model is trained on a synthetic dataset with multiple ground-truth shapes per image to handle ambiguity and then fine-tuned on real-world data. |
The proposed method outperforms state-of-the-art methods in terms of shape generation quality on both synthetic and real-world datasets.
The model generates multiple plausible 3D shape hypotheses that align well with the input image, demonstrating its ability to handle ambiguity.
Pretraining on the synthetic dataset with multiple ground-truth shapes per image is shown to be effective, significantly improving performance on real-world data. |
The generation scale is currently limited to the object level, and expanding it to the scene level is left for future work.
The diversity of generated shapes, while reasonable, is not as high as some existing methods, indicating a potential trade-off between diversity and alignment with the input image. |
3d shape generation, single-view reconstruction, probabilistic modeling, transformers, cross-attention |
2405.11852
Report |
Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models |
Xiyu Wang, Yufei Wang, Satoshi Tsutsui, Weisi Lin, Bihan Wen, Alex C. Kot |
Diffusion-based models for story visualization have shown promise in
generating content-coherent images for storytelling tasks. However, how to
effectively integrate new characters into existing narratives while maintaining
character consistency remains an open problem, particularly with limited data.
Two major limitations hinder the progress: (1) the absence of a suitable
benchmark due to potential character leakage and inconsistent text labeling,
and (2) the challenge of distinguishing between new and old characters, leading
to ambiguous results. To address these challenges, we introduce the NewEpisode
benchmark, comprising refined datasets designed to evaluate generative models'
adaptability in generating new stories with fresh characters using just a
single example story. The refined dataset involves refined text prompts and
eliminates character leakage. Additionally, to mitigate the character confusion
of generated results, we propose EpicEvo, a method that customizes a
diffusion-based visual story generation model with a single story featuring the
new characters seamlessly integrating them into established character dynamics.
EpicEvo introduces a novel adversarial character alignment module to align the
generated images progressively in the diffusive process, with exemplar images
of new characters, while applying knowledge distillation to prevent forgetting
of characters and background details. Our evaluation quantitatively
demonstrates that EpicEvo outperforms existing baselines on the NewEpisode
benchmark, and qualitative studies confirm its superior customization of visual
story generation in diffusion models. In summary, EpicEvo provides an effective
way to incorporate new characters using only one example story, unlocking new
possibilities for applications such as serialized cartoons. |
This paper introduces the NewEpisode benchmark for evaluating the ability of generative models to incorporate new characters into existing narratives, and proposes EpicEvo, a method for customizing diffusion-based visual story generation models to include new characters using just a single example story. |
The ability to seamlessly integrate new characters into established stories is crucial for applications like creating new episodes of comic books or cartoons, but existing models struggle with this due to limited data and the risk of disrupting established character dynamics. |
The NewEpisode benchmark is created by refining existing datasets to include unseen characters in the test set. EpicEvo uses adversarial character alignment to encourage distinct generation of new characters and knowledge distillation to preserve the model's priors and prevent overfitting. |
EpicEvo outperforms existing baselines on the NewEpisode benchmark in terms of FID score, indicating better new character consistency.
Qualitative analysis confirms EpicEvo's superior ability to generate stories featuring new characters, both alone and interacting with existing characters.
Ablation studies demonstrate the effectiveness of both the adversarial character alignment and knowledge distillation components of EpicEvo. |
The paper primarily focuses on visual similarity metrics like FID, CLIP-I, and CLIP-T, acknowledging the need for further investigation into human perception of story coherence and character integration.
Future work could explore expanding the NewEpisode benchmark with more diverse datasets and evaluating the generalization ability of EpicEvo to characters with even fewer example images. |
generative diffusion model, story visualization, generative model customization, character consistency, few-shot learning |
2405.11794
Report |
ViViD: Video Virtual Try-on using Diffusion Models |
Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha |
Video virtual try-on aims to transfer a clothing item onto the video of a
target person. Directly applying the technique of image-based try-on to the
video domain in a frame-wise manner will cause temporal-inconsistent outcomes
while previous video-based try-on solutions can only generate low visual
quality and blurring results. In this work, we present ViViD, a novel framework
employing powerful diffusion models to tackle the task of video virtual try-on.
Specifically, we design the Garment Encoder to extract fine-grained clothing
semantic features, guiding the model to capture garment details and inject them
into the target video through the proposed attention feature fusion mechanism.
To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder
to encode pose signals, enabling the model to learn the interactions between
clothing and human posture and insert hierarchical Temporal Modules into the
text-to-image stable diffusion model for more coherent and lifelike video
synthesis. Furthermore, we collect a new dataset, which is the largest, with
the most diverse types of garments and the highest resolution for the task of
video virtual try-on to date. Extensive experiments demonstrate that our
approach is able to yield satisfactory video try-on results. The dataset,
codes, and weights will be publicly available. Project page:
https://becauseimbatman0.github.io/ViViD. |
This paper presents ViViD, a novel framework leveraging diffusion models for video virtual try-on, and introduces a new large-scale, diverse dataset for this task. |
Current video virtual try-on methods suffer from limitations such as temporal inconsistency, low visual quality, and lack of diverse training data, hindering their real-world application. |
ViViD utilizes a Garment Encoder with attention feature fusion to capture fine-grained clothing details, a Pose Encoder for spatial-temporal consistency, and temporal modules for coherent video synthesis. It is trained with an image-video joint strategy on a newly collected dataset. |
ViViD outperforms existing methods in generating high-quality try-on videos with better temporal consistency and detail preservation.
The proposed Garment Encoder and attention feature fusion mechanism effectively capture and integrate fine-grained clothing details into the generated videos.
The image-video joint training strategy proves beneficial in learning both detailed clothing representation and temporal dynamics. |
The current model does not generalize well to videos with extreme poses or rapid movements.
Future work can explore incorporating user-specific features and preferences for personalized try-on experiences. |
video virtual try-on, diffusion models, temporal consistency, garment encoder, dataset |
2405.11685
Report |
ColorFoil: Investigating Color Blindness in Large Vision and Language Models |
Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee |
With the utilization of Transformer architecture, large Vision and Language
(V&L) models have shown promising performance in even zero-shot settings.
Several studies, however, indicate a lack of robustness of the models when
dealing with complex linguistics and visual attributes. In this work, we
introduce a novel V&L benchmark - ColorFoil, by creating color-related foils to
assess the models' perception ability to detect colors like red, white, green,
etc. We evaluate seven state-of-the-art V&L models including CLIP, ViLT,
GroupViT, and BridgeTower, etc. in a zero-shot setting and present intriguing
findings from the V&L models. The experimental evaluation indicates that ViLT
and BridgeTower demonstrate much better color perception capabilities compared
to CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT
struggle to distinguish colors that are visually distinct to humans with normal
color perception ability. |
This paper introduces ColorFoil, a novel Vision and Language (V&L) benchmark, to assess the ability of V&L models to perceive and identify color attributes. |
This work investigates the robustness and generalizability of V&L models in perceiving colors, a crucial aspect of human-like visual understanding, essential for real-world applications. |
ColorFoil is constructed by creating color-related foils from MS COCO and Flickr30k datasets. The model's ability to distinguish between original captions and color-foiled versions is evaluated using accuracy and F1-score. |
BridgeTower and ViLT models demonstrate superior color perception compared to CLIP and its variants, as well as GroupViT.
CLIP-based models and GroupViT struggle to differentiate colors easily distinguishable by humans.
Model performance degrades with an increase in the number of foils, highlighting a challenge in handling complex scenarios. |
The selection of 10 common colors for foils is subjective and might not represent the full spectrum of frequently used colors.
Future work includes expanding the benchmark to assess robustness in other areas like gender, size, emotions, and negation. |
vision and language, v&l models, color perception, benchmarking, robustness |
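The foil construction described in the ColorFoil entry above can be sketched as follows: a color word in a caption is swapped for a different color, producing a caption that should score lower than the original for the same image. The color vocabulary, the function name `make_foils`, and the whitespace tokenization are assumptions for illustration; the benchmark's actual color list and text processing are not given here.

```python
import random

# Hypothetical color vocabulary (the benchmark's own list of 10 colors is not reproduced here).
COLORS = ["red", "white", "green", "blue", "black", "yellow"]

def make_foils(caption, num_foils=1, seed=0):
    """Return foiled captions in which the first color word is swapped for another color."""
    rng = random.Random(seed)
    words = caption.split()
    for i, word in enumerate(words):
        if word.lower() in COLORS:
            alternatives = [c for c in COLORS if c != word.lower()]
            chosen = rng.sample(alternatives, min(num_foils, len(alternatives)))
            return [" ".join(words[:i] + [color] + words[i + 1:]) for color in chosen]
    return []  # no color word found, so no foil can be built

foils = make_foils("a red bus parked near a tree", num_foils=2)
```

Under this setup, a model would be counted as correct on an image when it ranks the original caption above every foil, which is why accuracy tends to fall as `num_foils` grows.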
2405.11616
Report |
Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention |
Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo |
In this paper, we introduce Era3D, a novel multiview diffusion method that
generates high-resolution multiview images from a single-view image. Despite
significant advancements in multiview generation, existing methods still suffer
from camera prior mismatch, inefficacy, and low resolution, resulting in
poor-quality multiview images. Specifically, these methods assume that the
input images should comply with a predefined camera type, e.g. a perspective
camera with a fixed focal length, leading to distorted shapes when the
assumption fails. Moreover, the full-image or dense multiview attention they
employ leads to an exponential explosion of computational complexity as image
resolution increases, resulting in prohibitively expensive training costs. To
bridge the gap between assumption and reality, Era3D first proposes a
diffusion-based camera prediction module to estimate the focal length and
elevation of the input image, which allows our method to generate images
without shape distortions. Furthermore, a simple but efficient attention layer,
named row-wise attention, is used to enforce epipolar priors in the multiview
diffusion, facilitating efficient cross-view information fusion. Consequently,
compared with state-of-the-art methods, Era3D generates high-quality multiview
images with up to a 512*512 resolution while reducing computation complexity by
12x. Comprehensive experiments demonstrate that Era3D can reconstruct
high-quality and detailed 3D meshes from diverse single-view input images,
significantly outperforming baseline multiview diffusion methods. |
This paper introduces Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image by addressing the camera prior mismatch, inefficiency, and low resolution of existing methods. |
Existing multiview generation methods suffer from camera prior mismatch, inefficiency, and low resolution, leading to poor-quality multiview images and hindering high-quality 3D reconstruction. |
Era3D uses different camera models for input (arbitrary) and generated images (orthogonal with fixed viewpoints) and employs a camera prediction module to estimate focal length and elevation. It introduces row-wise attention for efficient cross-view information fusion. |
Era3D generates high-quality, consistent multiview images and normal maps at resolutions up to 512x512.
It mitigates distortion artifacts caused by inconsistent camera intrinsics.
It achieves state-of-the-art performance for single-view 3D generation. |
Era3D struggles to generate intricate geometries and open meshes due to sparse multiview generation.
Reliance on Neural SDF limits reconstruction of meshes with open surfaces. |
multiview diffusion, 3d reconstruction, row-wise attention, camera canonicalization, single-view 3d generation |
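To make the efficiency argument in the Era3D entry above concrete, the sketch below counts query-key pairs for dense multiview attention versus attention restricted to the same image row across views. The counting model (N views of S x S tokens, one attention layer, no batching or heads) and all function names are assumptions for illustration; the resulting ratio is a back-of-the-envelope figure and is not the 12x reduction reported in the abstract.

```python
def dense_pairs(num_views, size):
    """Query-key pairs when every token attends to every token in every view."""
    tokens = num_views * size * size
    return tokens * tokens

def rowwise_pairs(num_views, size):
    """Query-key pairs when a token attends only to its own row in every view."""
    tokens = num_views * size * size
    return tokens * (num_views * size)

def rowwise_keys(row, num_views, size):
    """Key positions (view, row, col) visible to any query located on the given row."""
    return [(v, row, c) for v in range(num_views) for c in range(size)]

# Hypothetical setting: 6 views of 32 x 32 latent tokens.
ratio = dense_pairs(6, 32) // rowwise_pairs(6, 32)
keys = rowwise_keys(3, 6, 32)
```

In this simplified count the saving equals the side length S, so the benefit of the row restriction grows with resolution.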
2405.11523
Report |
Diffusion-Based Hierarchical Image Steganography |
Youmin Xu, Xuanyu Zhang, Jiwen Yu, Chong Mou, Xiandong Meng, Jian Zhang |
This paper introduces Hierarchical Image Steganography, a novel method that
enhances the security and capacity of embedding multiple images into a single
container using diffusion models. HIS assigns varying levels of robustness to
images based on their importance, ensuring enhanced protection against
manipulation. It adaptively exploits the robustness of the Diffusion Model
alongside the reversibility of the Flow Model. The integration of Embed-Flow
and Enhance-Flow improves embedding efficiency and image recovery quality,
respectively, setting HIS apart from conventional multi-image steganography
techniques. This innovative structure can autonomously generate a container
image, thereby securely and efficiently concealing multiple images and text.
Rigorous subjective and objective evaluations underscore our advantage in
analytical resistance, robustness, and capacity, illustrating its expansive
applicability in content safeguarding and privacy fortification. |
This paper proposes Hierarchical Image Steganography (HIS), a novel method using diffusion models to embed multiple images into a single container image with varying levels of robustness based on image importance. |
Existing multi-image steganography methods lack robustness and don't differentiate between the importance of embedded images, making them vulnerable to degradation. |
HIS employs a tiered embedding strategy using diffusion models for robust embedding of important images (Tier-1) and flow models for high-capacity embedding of less important images (Tier-2). It further integrates Embed-Flow and Enhance-Flow to improve embedding efficiency and image recovery quality. |
HIS demonstrates superior robustness against various distortions, ensuring integrity of important images.
The tiered embedding strategy allows for high-capacity embedding while maintaining significant robustness.
HIS exhibits outstanding statistical security, effectively confusing steganalysis tools. |
The recovery quality of Tier-2 images degrades with an increasing number of embedded images.
Local tampering on the container image can lead to information loss in Tier-2 images. |
steganography, diffusion models, image hiding, robustness, security |
2405.11473
Report |
FIFO-Diffusion: Generating Infinite Videos from Text without Training |
Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han |
We propose a novel inference technique based on a pretrained diffusion model
for text-conditional video generation. Our approach, called FIFO-Diffusion, is
conceptually capable of generating infinitely long videos without training.
This is achieved by iteratively performing diagonal denoising, which
concurrently processes a series of consecutive frames with increasing noise
levels in a queue; our method dequeues a fully denoised frame at the head while
enqueuing a new random noise frame at the tail. However, diagonal denoising is
a double-edged sword as the frames near the tail can take advantage of cleaner
ones by forward reference but such a strategy induces the discrepancy between
training and inference. Hence, we introduce latent partitioning to reduce the
training-inference gap and lookahead denoising to leverage the benefit of
forward referencing. We have demonstrated the promising results and
effectiveness of the proposed methods on existing text-to-video generation
baselines. |
This paper proposes FIFO-Diffusion, a novel inference technique based on pretrained diffusion models for generating infinitely long videos without additional training. |
Long video generation remains challenging for diffusion-based models due to computational costs and limitations in capturing long-term temporal context. |
FIFO-Diffusion utilizes diagonal denoising, processing consecutive frames with increasing noise levels in a queue. It incorporates latent partitioning to reduce the training-inference gap and lookahead denoising to enhance noise prediction accuracy. |
FIFO-Diffusion can generate extremely long videos (over 10,000 frames) without quality degradation, relying solely on models trained with short clips.
It produces videos with natural and consistent motion by propagating temporal context throughout the generation process.
Qualitative comparisons and user study show that FIFO-Diffusion significantly outperforms other training-free long video generation methods. |
A training-inference gap remains due to the change in input distribution induced by diagonal denoising.
Future work includes integrating diagonal denoising into the training process to further improve the performance. |
text-to-video generation, diffusion models, long video generation, diagonal denoising, latent partitioning |
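The queue mechanics described in the FIFO-Diffusion entry above can be illustrated with a toy simulation. Each queue slot stores only a frame id and the number of denoising steps still required; the actual video diffusion model is replaced by decrementing that counter. The function name `fifo_generate`, the queue initialization, and this counter-based stand-in are assumptions for illustration, and latent partitioning and lookahead denoising are not modeled.

```python
from collections import deque

def fifo_generate(num_frames, queue_len):
    """Toy diagonal denoising: each slot holds (frame_id, remaining_noise_steps)."""
    # Noise increases from the head (1 step left) to the tail (queue_len steps left).
    queue = deque((i, i + 1) for i in range(queue_len))
    next_id = queue_len
    finished = []
    while len(finished) < num_frames:
        # One (stand-in) model call denoises every queued frame by one step.
        queue = deque((fid, steps - 1) for fid, steps in queue)
        # The head frame is now fully denoised, so it leaves the queue.
        fid, steps = queue.popleft()
        assert steps == 0
        finished.append(fid)
        # A fresh pure-noise frame enters at the tail.
        queue.append((next_id, queue_len))
        next_id += 1
    return finished

frames = fifo_generate(num_frames=5, queue_len=4)
```

Because one frame leaves and one enters per iteration, memory use stays constant no matter how many frames are requested, which is what makes arbitrarily long generation conceptually possible.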
2405.11467
Report |
AdaAugment: A Tuning-Free and Adaptive Approach to Enhance Data Augmentation |
Suorong Yang, Peijia Li, Xin Xiong, Furao Shen, Jian Zhao |
Data augmentation (DA) is widely employed to improve the generalization
performance of deep models. However, most existing DA methods use augmentation
operations with random magnitudes throughout training. While this fosters
diversity, it can also inevitably introduce uncontrolled variability in
augmented data, which may cause misalignment with the evolving training status
of the target models. Both theoretical and empirical findings suggest that this
misalignment increases the risks of underfitting and overfitting. To address
these limitations, we propose AdaAugment, an innovative and tuning-free
Adaptive Augmentation method that utilizes reinforcement learning to
dynamically adjust augmentation magnitudes for individual training samples
based on real-time feedback from the target network. Specifically, AdaAugment
features a dual-model architecture consisting of a policy network and a target
network, which are jointly optimized to effectively adapt augmentation
magnitudes. The policy network optimizes the variability within the augmented
data, while the target network utilizes the adaptively augmented samples for
training. Extensive experiments across benchmark datasets and deep
architectures demonstrate that AdaAugment consistently outperforms other
state-of-the-art DA methods in effectiveness while maintaining remarkable
efficiency. |
This paper proposes AdaAugment, a novel adaptive data augmentation method that uses reinforcement learning to dynamically adjust augmentation magnitudes for individual training samples based on real-time feedback from the target network. |
Existing data augmentation methods often employ random or predefined augmentation magnitudes, leading to potential misalignment with the evolving training status of deep models and increasing the risks of underfitting and overfitting. |
AdaAugment utilizes a dual-model architecture with a policy network and a target network. The policy network learns to determine optimal augmentation magnitudes based on real-time feedback from the target network, which is simultaneously trained using the adaptively augmented data. |
AdaAugment consistently outperforms state-of-the-art data augmentation methods across benchmark datasets (CIFAR-10/100, Tiny-ImageNet) and deep architectures.
AdaAugment demonstrates improved model transferability in transfer learning settings.
Complexity analysis reveals that AdaAugment incurs minimal parameter and computational overhead, highlighting its efficiency. |
The current study focuses on image classification tasks; future work can explore AdaAugment's applicability to other domains.
Future research can investigate the generalization of AdaAugment to a broader range of tasks beyond image classification. |
data augmentation, reinforcement learning, deep learning, image classification, adaptive methods |
2405.11442
Report |
Unifying 3D Vision-Language Understanding via Promptable Queries |
Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li |
A unified model for 3D vision-language (3D-VL) understanding is expected to
take various scene representations and perform a wide range of tasks in a 3D
scene. However, a considerable gap exists between existing methods and such a
unified model, due to the independent application of representation and
insufficient exploration of 3D multi-task training. In this paper, we introduce
PQ3D, a unified model capable of using Promptable Queries to tackle a wide
range of 3D-VL tasks, from low-level instance segmentation to high-level
reasoning and planning. This is achieved through three key innovations: (1)
unifying various 3D scene representations (i.e., voxels, point clouds,
multi-view images) into a shared 3D coordinate space by segment-level grouping,
(2) an attention-based query decoder for task-specific information retrieval
guided by prompts, and (3) universal output heads for different tasks to
support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D
demonstrates impressive performance on these tasks, setting new records on most
benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by
1.8% (AP), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and
Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with
individual or combined forms of available 3D representations, e.g., solely
voxel input. |
This paper introduces PQ3D, a unified model using Promptable Queries to manage various 3D scene representations, prompts, and outputs for numerous 3D vision-language (3D-VL) tasks. |
A unified model for 3D scene understanding is crucial for embodied agents to understand and execute human instructions in real-world scenarios, bridging the gap between low-level instance segmentation and high-level reasoning. |
PQ3D unifies point cloud, voxel, and multi-view image features into a shared 3D space, employs an attention-based query decoder for task-specific information retrieval guided by prompts, and utilizes universal output heads for predicting instance masks, task-relevance scores, and textual responses. |
PQ3D achieves state-of-the-art results on ten diverse 3D-VL datasets, setting new records on most benchmarks, including ScanNet200, ScanRefer, Multi3DRefer, and Scan2Cap.
The model demonstrates strong zero-shot capability with novel prompt types, such as using image sketches for object localization.
PQ3D shows promising results in embodied navigation and task planning, highlighting its potential as a fundamental 3D encoding module for embodied agents. |
The model's performance on tail classes in instance segmentation is less robust due to biases in the CLIP text encoder.
PQ3D's ability to handle complex spatial relations and long sentences in visual grounding and question answering can be further improved. |
3d vision-language understanding, promptable queries, unified model, embodied ai, multi-task learning |
2405.11286
Report |
Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion |
Zeyu Zhang, Yiran Wang, Biao Wu, Shuo Chen, Zhiyuan Zhang, Shiya Huang, Wenbo Zhang, Meng Fang, Ling Chen, Yang Zhao |
In recent years, there has been significant interest in creating 3D avatars
and motions, driven by their diverse applications in areas like film-making,
video games, AR/VR, and human-robot interaction. However, current efforts
primarily concentrate on either generating the 3D avatar mesh alone or
producing motion sequences, with integrating these two aspects proving to be a
persistent challenge. Additionally, while avatar and motion generation
predominantly target humans, extending these techniques to animals remains a
significant challenge due to inadequate training data and methods. To bridge
these gaps, our paper presents three key contributions. Firstly, we proposed a
novel agent-based approach named Motion Avatar, which allows for the automatic
generation of high-quality customizable human and animal avatars with motions
through text queries. The method significantly advanced the progress in dynamic
3D character generation. Secondly, we introduced an LLM planner that coordinates
both motion and avatar generation, which transforms a discriminative planning
into a customizable Q&A fashion. Lastly, we presented an animal motion dataset
named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65
animal categories and its building pipeline ZooGen, which serves as a valuable
resource for the community. See project website
https://steve-zeyu-zhang.github.io/MotionAvatar/ |
This paper introduces Motion Avatar, an LLM agent-based method for generating customizable human and animal avatars with motions based on text input. |
Current methods struggle to integrate 3D avatar mesh generation and motion generation, especially for animals due to data scarcity. This work bridges this gap and enables customizable avatar creation with realistic motions. |
The approach leverages an LLM planner to process user queries and generate prompts for motion (using MoMask) and 3D mesh generation (using Stable Diffusion XL and TripoSR). It also introduces Zoo-300K, a new animal motion dataset with 300,000 text-motion pairs across 65 animal categories, created using the ZooGen pipeline. |
The LLM planner effectively extracts motion and avatar categories from user input and generates appropriate prompts for downstream generation.
Motion Avatar generates high-quality and customizable human and animal avatars with realistic motions from text descriptions.
The Zoo-300K dataset and ZooGen pipeline provide valuable resources for future research on animal motion generation. |
Quantitative evaluation of animal motion generation is still in progress and will be included in the next revision.
Future work will focus on enhancing the LLM planner's generalization ability to encompass broader dynamic avatar generation tasks. |
text-to-motion generation, 3d avatar generation, llm agent, animal motion dataset, customizable avatar |
2405.11273
Report |
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts |
Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang |
Recent advancements in Multimodal Large Language Models (MLLMs) underscore
the significance of scalable models and data to boost performance, yet this
often incurs substantial computational costs. Although the Mixture of Experts
(MoE) architecture has been employed to efficiently scale large language and
image-text models, these efforts typically involve fewer experts and limited
modalities. To address this, our work presents the pioneering attempt to
develop a unified MLLM with the MoE architecture, named Uni-MoE that can handle
a wide array of modalities. Specifically, it features modality-specific
encoders with connectors for a unified multimodal representation. We also
implement a sparse MoE architecture within the LLMs to enable efficient
training and inference through modality-level data parallelism and expert-level
model parallelism. To enhance the multi-expert collaboration and
generalization, we present a progressive training strategy: 1) Cross-modality
alignment using various connectors with different cross-modality data, 2)
Training modality-specific experts with cross-modality instruction data to
activate experts' preferences, and 3) Tuning the Uni-MoE framework utilizing
Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate
the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
The extensive experimental results demonstrate Uni-MoE's principal advantage of
significantly reducing performance bias in handling mixed multimodal datasets,
alongside improved multi-expert collaboration and generalization. Our findings
highlight the substantial potential of MoE frameworks in advancing MLLMs and
the code is available at
https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs. |
This paper introduces Uni-MoE, a novel unified Multimodal Large Language Model (MLLM) that leverages the Mixture of Experts (MoE) architecture for efficient scaling and handling of various modalities such as video, image, text, audio, and speech. |
Scaling up MLLMs incurs high computational costs. Uni-MoE addresses this by activating only a subset of expert parameters per input, improving efficiency in training and inference. |
Uni-MoE uses modality-specific encoders and connectors to map inputs into a unified language representation. A sparse MoE layer within the LLM allows for selective expert activation. The model is trained in three stages: cross-modality alignment, modality-specific expert training, and unified MoE training with mixed multimodal data. |
Uni-MoE outperforms dense MLLMs on various benchmarks, demonstrating advantages in handling complex out-of-domain tasks, particularly long speech understanding and reasoning.
The model exhibits less performance bias across different modalities compared to dense models, even when trained on unbalanced mixed-modality data.
Pre-training experts on individual modalities enhances multi-expert collaboration and generalization compared to standard MoE tuning with identical initial expert parameters. |
Fully converting all layers to MoE does not necessarily yield the best performance and requires longer training.
Further exploration of more robust and efficient MoE architectures is needed for larger MLLMs. |
mixture of experts, multimodal large language model, unified framework, multimodal learning, cross-modal reasoning |
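The sparse expert activation described in the Uni-MoE entry above can be illustrated with a minimal top-k routing sketch. It is not the authors' implementation: the function names, the toy experts, and the choice of k=2 are assumptions made for illustration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(token, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    # Pick the k experts with the highest router scores (ties keep index order).
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalise the gate weights over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top

# Hypothetical toy experts: each scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, chosen = sparse_moe([1.0, -1.0], [0.1, 2.0, 0.3, 2.0], experts, k=2)
```

Only two of the four experts run for this token, which is the source of the efficiency gain the entry refers to.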
2405.11252
Report |
Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching |
Xingyu Miao, Haoran Duan, Varun Ojha, Jun Song, Tejal Shah, Yang Long, Rajiv Ranjan |
In this work, we propose a novel Trajectory Score Matching (TSM) method that
aims to solve the pseudo ground truth inconsistency problem caused by the
accumulated error in Interval Score Matching (ISM) when using the Denoising
Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the
inversion process of DDIM to calculate on a single path, our TSM method
leverages the inversion process of DDIM to generate two paths from the same
starting point for calculation. Since both paths start from the same starting
point, TSM can reduce the accumulated error compared to ISM, thus alleviating
the problem of pseudo ground truth inconsistency. TSM enhances the stability
and consistency of the model's generated paths during the distillation process.
We demonstrate this experimentally and further show that ISM is a special case
of TSM. Furthermore, to optimize the current multi-stage optimization process
from high-resolution text to 3D generation, we adopt Stable Diffusion XL for
guidance. In response to the issues of abnormal replication and splitting
caused by unstable gradients during the 3D Gaussian splatting process when
using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping
method. Extensive experiments show that our model significantly surpasses the
state-of-the-art models in terms of visual quality and performance. Code:
https://github.com/xingy038/Dreamer-XL. |
This paper introduces Dreamer XL, a novel text-to-3D generation method that leverages Trajectory Score Matching (TSM) and Stable Diffusion XL for high-quality and consistent 3D content creation. |
Existing text-to-3D methods suffer from limitations such as over-smoothing, low resolution, and inconsistencies in generated results. This work aims to address these issues and enhance the realism and detail of generated 3D content. |
The proposed TSM method utilizes dual paths during the DDIM inversion process to minimize accumulated errors and improve consistency. Additionally, the work incorporates Stable Diffusion XL for high-resolution guidance and introduces a pixel-by-pixel gradient clipping method to address gradient instability issues. |
Dreamer XL generates high-quality 3D content with realistic appearances and avoids over-smoothing and oversaturation.
Compared to state-of-the-art methods, Dreamer XL demonstrates superior visual quality and consistency, as evidenced by qualitative comparisons and quantitative metrics such as CLIP-Score and A-LPIPS.
Ablation studies confirm the effectiveness of the proposed TSM and gradient clipping techniques in enhancing the quality and consistency of the generated 3D models. |
The method exhibits limitations in handling light, particularly with anomalous blue reflections observed in generated scenes, potentially attributed to SDXL.
The advancements in 3D model generation might be misused for malicious purposes like deepfakes. |
text-to-3d generation, trajectory score matching, stable diffusion xl, 3d gaussian splatting, deep learning |
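The pixel-by-pixel gradient clipping mentioned in the Dreamer XL entry above can be sketched as follows. The entry does not state the exact clipping rule, so this sketch assumes the simplest variant: each pixel's channel vector is rescaled so that its norm does not exceed a threshold. The name `clip_pixelwise` and the threshold value are hypothetical.

```python
import math

def clip_pixelwise(grad, max_norm):
    """Clip an H x W x C gradient (nested lists) one pixel at a time.

    Assumption: a pixel whose channel-vector norm exceeds max_norm is
    rescaled to have norm max_norm; other pixels are left unchanged.
    """
    clipped = []
    for row in grad:
        new_row = []
        for pixel in row:
            norm = math.sqrt(sum(c * c for c in pixel))
            scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
            new_row.append([c * scale for c in pixel])
        clipped.append(new_row)
    return clipped

# One row, two pixels: the first has norm 5, the second norm 0.1.
grad = [[[3.0, 4.0, 0.0], [0.1, 0.0, 0.0]]]
clipped = clip_pixelwise(grad, max_norm=1.0)
```

Clipping per pixel, rather than over the whole gradient tensor, limits a few outlier pixels without shrinking the gradient everywhere else.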
2405.11236
Report |
TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation |
Chengcheng Feng, Mu He, Qiuyu Tian, Haojie Yin, Xiaofang Zhao, Hongwei Tang, Xingqiang Wei |
As deep learning technology continues to advance, image generation models,
especially models like Stable Diffusion, are finding increasingly widespread
application in visual arts creation. However, these models often face
challenges such as overfitting, lack of stability in generated results, and
difficulties in accurately capturing the features desired by creators during
the fine-tuning process. In response to these challenges, we propose an
innovative method that integrates Singular Value Decomposition (SVD) into the
Low-Rank Adaptation (LoRA) parameter update strategy, aimed at enhancing the
fine-tuning efficiency and output quality of image generation models. By
incorporating SVD within the LoRA framework, our method not only effectively
reduces the risk of overfitting but also enhances the stability of model
outputs, and captures subtle, creator-desired feature adjustments more
accurately. We evaluated our method on multiple datasets, and the results show
that, compared to traditional fine-tuning methods, our approach significantly
improves the model's generalization ability and creative flexibility while
maintaining the quality of generation. Moreover, this method maintains LoRA's
excellent performance under resource-constrained conditions, allowing for
significant improvements in image generation quality without sacrificing the
original efficiency and resource advantages. |
Introduces TriLoRA, an innovative method integrating Singular Value Decomposition (SVD) into the Low-Rank Adaptation (LoRA) framework for enhanced fine-tuning of text-to-image generation models. |
Addresses challenges in existing models like overfitting, output instability, and difficulty capturing nuanced style features during fine-tuning. |
Incorporates SVD within LoRA to create a triple low-rank matrix representation, enabling more precise control over feature integration during model training. |
Demonstrates superior visual quality and stability in generated images compared to traditional LoRA.
Shows greater resistance to overfitting, particularly during extended training periods.
Exhibits improved performance in user studies, achieving higher scores in textual-visual consistency and visual appeal. |
Increased model complexity leading to longer convergence times, requiring more training epochs.
Performance improvement is limited by the quality of the pre-trained model used as a foundation. |
text-to-image generation, stable diffusion, fine-tuning, low-rank adaptation (lora), singular value decomposition (svd) |
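The "triple low-rank matrix representation" in the TriLoRA entry above can be illustrated with a small sketch. It assumes the SVD-style form delta_W = U diag(s) V^T, which the entry does not spell out; the function names and shapes are hypothetical.

```python
def tri_lora_delta(U, s, Vt):
    """Compute delta_W = U @ diag(s) @ Vt with plain lists.

    U is m x r, s has length r, Vt is r x n (assumed SVD-style factors).
    """
    m, r, n = len(U), len(s), len(Vt[0])
    return [[sum(U[i][k] * s[k] * Vt[k][j] for k in range(r)) for j in range(n)]
            for i in range(m)]

def param_counts(m, n, r):
    """Trainable parameters: two-factor LoRA, three-factor variant, full matrix."""
    lora = r * (m + n)
    tri = r * (m + n) + r  # the extra r values are the diagonal of s
    return lora, tri, m * n

U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [2.0, 0.5]
Vt = [[1.0, 0.0], [0.0, 1.0]]
delta = tri_lora_delta(U, s, Vt)
counts = param_counts(768, 768, 4)
```

Under this assumption the third factor adds only r parameters per adapted layer, so the update stays far smaller than the full weight matrix.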
2405.11190
Report |
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing |
Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, Dahua Lin |
Instruction-based image editing focuses on equipping a generative model with
the capacity to adhere to human-written instructions for editing images.
Current approaches typically comprehend explicit and specific instructions.
However, they often exhibit a deficiency in executing active reasoning
capacities required to comprehend instructions that are implicit or
insufficiently defined. To enhance active reasoning capabilities and impart
intelligence to the editing model, we introduce ReasonPix2Pix, a comprehensive
reasoning-attentive instruction editing dataset. The dataset is characterized
by 1) reasoning instruction, 2) more realistic images from fine-grained
categories, and 3) increased variances between input and edited images. When
fine-tuned with our dataset under supervised conditions, the model demonstrates
superior performance in instructional editing tasks, independent of whether the
tasks require reasoning or not. The code, model, and dataset will be publicly
available. |
This paper introduces ReasonPix2Pix, a dataset for instruction-based image editing focusing on reasoning abilities, and proposes a simple framework incorporating a multi-modal large language model (MLLM) with a diffusion model to improve image editing with reasoning instructions. |
Existing instruction-based image editing models often lack active reasoning capabilities, failing to understand implicit or insufficiently defined instructions. This paper addresses this by enabling models to understand the intent behind instructions rather than just recognizing keywords. |
The authors create ReasonPix2Pix dataset by generating reasoning instructions for image pairs from existing datasets and generating new image pairs with reasoning instructions. They then fine-tune a framework with an MLLM and a diffusion model on this dataset. |
The proposed method demonstrates superior performance in instruction editing tasks, both with and without reasoning requirements.
The model successfully handles complex instructions and generates high-quality edited images.
Analysis confirms the importance of the proposed dataset and the effectiveness of integrating MLLM for improving image editing with reasoning. |
The dataset size is limited due to API costs, although researchers can expand it using the provided pipeline.
Future work could explore more complex reasoning scenarios and further enhance the model's ability to handle abstract instructions. |
image editing, instruction-based editing, reasoning, multi-modal large language model, diffusion model |
2405.11135
Report |
AquaLoRA: Toward White-box Protection for Customized Stable Diffusion Models via Watermark LoRA |
Weitao Feng, Wenbo Zhou, Jiyan He, Jie Zhang, Tianyi Wei, Guanlin Li, Tianwei Zhang, Weiming Zhang, Nenghai Yu |
Diffusion models have achieved remarkable success in generating high-quality
images. Recently, the open-source models represented by Stable Diffusion (SD)
are thriving and are accessible for customization, giving rise to a vibrant
community of creators and enthusiasts. However, the widespread availability of
customized SD models has led to copyright concerns, like unauthorized model
distribution and unconsented commercial use. To address it, recent works aim to
let SD models output watermarked content for post-hoc forensics. Unfortunately,
none of them can achieve the challenging white-box protection, wherein the
malicious user can easily remove or replace the watermarking module to fail the
subsequent verification. For this, we propose AquaLoRA as the first
implementation under this scenario. Briefly, we merge watermark information
into the U-Net of Stable Diffusion Models via a watermark Low-Rank Adaptation
(LoRA) module in a two-stage manner. For watermark LoRA module, we devise a
scaling matrix to achieve flexible message updates without retraining. To
guarantee fidelity, we design Prior Preserving Fine-Tuning (PPFT) to ensure
watermark learning with minimal impacts on model distribution, validated by
proofs. Finally, we conduct extensive experiments and ablation studies to
verify our design. |
This paper introduces AquaLoRA, a novel technique to watermark customized Stable Diffusion models for white-box protection, ensuring copyright in open-source environments. |
The open-source nature of Stable Diffusion models raises copyright concerns as customized models are easily redistributed without consent, necessitating robust watermarking solutions. |
AquaLoRA operates in two stages: (1) It pre-trains a latent watermark, optimizing robustness and fidelity with a novel Peak Regional Variation Loss. (2) It uses a scaling matrix within a Low-Rank Adaptation (LoRA) module for flexible watermark embedding and a prior preserving fine-tuning method to minimize visual impact on generated images. |
AquaLoRA achieves high fidelity, with negligible impact on image quality compared to original models.
The method exhibits robustness against various image distortions, sampling configurations, and the use of add-ons like ControlNet and LoRA.
AquaLoRA provides flexibility, allowing for easy modification of the embedded watermark without retraining. |
The current method faces limitations in handling heavy cropping and rotation distortions.
Future work will focus on extending AquaLoRA's protection to editing, inpainting, and outpainting functionalities.
The performance degradation with larger output image sizes requires further investigation and improvement. |
stable diffusion, watermarking, copyright protection, white-box protection, generative ai |
2405.11129
Report |
MotionGS : Compact Gaussian Splatting SLAM by Motion Filter |
Xinli Guo, Peng Han, Weidong Zhang, Hongtian Chen |
With their high-fidelity scene representation capability, Neural Radiance
Fields (NeRF) and 3D Gaussian Splatting (3DGS) have attracted considerable
attention in the SLAM field. Recently, there has been a surge in NeRF-based SLAM,
while 3DGS-based SLAM is sparse. A novel 3DGS-based SLAM approach with a fusion
of deep visual feature, dual keyframe selection and 3DGS is presented in this
paper. Compared with the existing methods, the proposed selective tracking is
achieved by feature extraction and motion filter on each frame. The joint
optimization of pose and 3D Gaussian runs through the entire mapping process.
Additionally, the coarse-to-fine pose estimation and compact Gaussian scene
representation are implemented by dual keyfeature selection and novel loss
functions. Experimental results demonstrate that the proposed algorithm not
only outperforms the existing methods in tracking and mapping, but also uses
less memory. |
This paper presents MotionGS, a novel dense 3D Gaussian Splatting (3DGS)-based SLAM approach that combines deep visual features, a dual keyframe selection strategy, and 3DGS for accurate real-time tracking and high-fidelity scene reconstruction. |
Existing dense visual SLAM methods, including those based on NeRF, face limitations in achieving high-fidelity representation, real-time performance, and efficient memory usage. 3DGS offers a promising alternative with faster optimization and rendering compared to NeRF. |
The approach employs a dual keyframe strategy with motion and information filters to select keyframes for tracking and mapping. A novel loss function and direct pose optimization tailored for 3DGS are introduced to refine camera poses and compactly represent the scene. |
MotionGS achieves state-of-the-art tracking accuracy on both TUM RGB-D and Replica datasets, outperforming existing NeRF-based and 3DGS-based SLAM methods.
It demonstrates superior rendering quality compared to baselines, capturing finer details and textures.
The approach significantly reduces memory usage for map representation compared to previous 3DGS-based methods. |
The lack of loop closure detection and global bundle adjustment in the monocular setting limits tracking accuracy in challenging scenarios.
Future work will focus on extending the approach to multi-sensor fusion and large-scale outdoor environments. |
slam, 3d gaussian splatting, dense visual slam, keyframe selection, scene representation |
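The motion filter in the MotionGS entry above can be sketched as a simple keyframe selector. The entry gives no formula, so this sketch assumes a frame is kept when the mean displacement of tracked features since the last kept frame reaches a threshold. The name `motion_filter`, the data layout, and the threshold are hypothetical.

```python
import math

def motion_filter(frames, min_motion):
    """Return indices of frames kept as keyframes.

    frames: one list per frame, holding (x, y) positions of the same
    tracked features in the same order (an assumption of this sketch).
    """
    kept = [0]  # the first frame is always kept
    for i in range(1, len(frames)):
        ref = frames[kept[-1]]
        disp = [math.hypot(x1 - x0, y1 - y0)
                for (x0, y0), (x1, y1) in zip(ref, frames[i])]
        # Keep the frame only if the features moved enough on average.
        if sum(disp) / len(disp) >= min_motion:
            kept.append(i)
    return kept

frames = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(1.0, 0.0), (11.0, 0.0)],
    [(3.0, 0.0), (13.0, 0.0)],
    [(3.5, 0.0), (13.5, 0.0)],
    [(7.0, 0.0), (17.0, 0.0)],
]
keyframes = motion_filter(frames, min_motion=2.5)
```

Skipping low-motion frames reduces the number of frames that reach tracking and mapping, which is consistent with the memory savings the entry reports.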
2405.10988
Report |
Flow Score Distillation for Diverse Text-to-3D Generation |
Runjie Yan, Kailu Wu, Kaisheng Ma |
Recent advancements in Text-to-3D generation have yielded remarkable
progress, particularly through methods that rely on Score Distillation Sampling
(SDS). While SDS exhibits the capability to create impressive 3D assets, it is
hindered by its inherent maximum-likelihood-seeking essence, resulting in
limited diversity in generation outcomes. In this paper, we discover that the
Denoising Diffusion Implicit Models (DDIM) generation process (i.e., PF-ODE) can be
succinctly expressed using an analogue of SDS loss. One step further, one can
see SDS as a generalized DDIM generation process. Following this insight, we
show that the noise sampling strategy in the noise addition stage significantly
restricts the diversity of generation results. To address this limitation, we
present an innovative noise sampling approach and introduce a novel text-to-3D
method called Flow Score Distillation (FSD). Our validation experiments across
various text-to-image Diffusion Models demonstrate that FSD substantially
enhances generation diversity without compromising quality. |
This paper introduces Flow Score Distillation (FSD), a novel text-to-3D generation method that leverages pre-trained 2D text-to-image Diffusion Models. FSD enhances generation diversity by introducing a new noise sampling approach within the Score Distillation Sampling (SDS) framework. |
Existing SDS-based methods, while effective in generating high-quality 3D assets, suffer from limited diversity due to their inherent maximum-likelihood-seeking nature. This limitation restricts the range of generated outputs. |
The paper first establishes a theoretical connection between SDS and the DDIM generation process, revealing SDS as a generalized DDIM process for 3D representations. Building upon this insight, it identifies the noise sampling strategy in SDS as the primary factor limiting diversity. FSD addresses this by employing a deterministic world-map noise function to generate coarsely aligned noise, promoting consistent optimization trajectories and enhancing diversity. |
FSD significantly enhances generation diversity compared to traditional SDS-based methods, producing a wider range of 3D models from the same text prompt.
The method maintains the generation quality of SDS, ensuring that the generated 3D models remain realistic and detailed.
FSD achieves diversity improvement without introducing additional training costs compared to SDS. |
While improving diversity, FSD still faces challenges in achieving the same level of diversity observed in 2D image generation using DDIM.
The deterministic noise function in FSD, while effective, relies on manual design; exploring learned or more sophisticated noise functions could further enhance diversity. |
3d generation, noise prior, diffusion models, score distillation sampling, text-to-3d |
2405.10864
Report |
Improving face generation quality and prompt following with synthetic captions |
Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou |
Recent advancements in text-to-image generation using diffusion models have
significantly improved the quality of generated images and expanded the ability
to depict a wide range of objects. However, ensuring that these models adhere
closely to the text prompts remains a considerable challenge. This issue is
particularly pronounced when trying to generate photorealistic images of
humans. Without significant prompt engineering efforts, models often produce
unrealistic images and typically fail to incorporate the full extent of the
prompt information. This limitation can be largely attributed to the nature of
captions accompanying the images used in training large scale diffusion models,
which typically prioritize contextual information over details related to the
person's appearance. In this paper we address this issue by introducing a
training-free pipeline designed to generate accurate appearance descriptions
from images of people. We apply this method to create approximately 250,000
captions for publicly available face datasets. We then use these synthetic
captions to fine-tune a text-to-image diffusion model. Our results demonstrate
that this approach significantly improves the model's ability to generate
high-quality, realistic human faces and enhances adherence to the given
prompts, compared to the baseline model. We share our synthetic captions,
pretrained checkpoints and training code. |
This paper introduces a training-free pipeline to generate detailed appearance descriptions from human face images, using these descriptions to fine-tune a text-to-image diffusion model for improved realism and prompt adherence in generating human faces. |
Existing text-to-image models struggle to generate realistic and accurate human faces due to the lack of detailed appearance information in typical image captions used for training. |
The pipeline extracts facial features (age, gender, ethnicity, emotion, hair, etc.) from images using pre-trained models. These features are converted into natural language descriptions using an LLM (Vicuna 13B). These descriptions are then used to fine-tune a Stable Diffusion 2.1 model. |
The fine-tuned model generates more realistic human faces compared to the base Stable Diffusion model.
The model demonstrates better adherence to detailed prompts in generating specific facial features.
The model exhibits some degree of identity preservation across different age, ethnicity, and emotion attributes. |
The pipeline inherits potential biases from the pre-trained face analysis models used for feature extraction.
The fine-tuned model might still exhibit biases present in the original Stable Diffusion model and the selected finetuning datasets. |
text-to-image generation, diffusion models, facial image description, synthetic captions, realistic face generation |
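The attribute-to-caption step in the entry above (extracted facial features turned into natural language by an LLM) can be sketched as prompt construction. This is an illustrative sketch only: the function name, the attribute keys, and the instruction wording are assumptions, and no LLM is called.

```python
def build_caption_prompt(attributes):
    """Turn extracted face attributes into an instruction for a captioning LLM."""
    # Sort keys so that the same attributes always produce the same prompt.
    facts = "; ".join(f"{key}: {value}" for key, value in sorted(attributes.items()))
    return ("Write one natural sentence describing the person's appearance "
            "using only these attributes. " + facts)

# Hypothetical output of pre-trained face analysis models for one image.
attrs = {"hair": "short brown", "age": "30-40", "emotion": "happy"}
prompt = build_caption_prompt(attrs)
```

The resulting string would be sent to the language model, and its reply stored as the synthetic caption paired with the image for fine-tuning.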
2405.10832
Report |
Open-Vocabulary Spatio-Temporal Action Detection |
Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, Limin Wang |
Spatio-temporal action detection (STAD) is an important fine-grained video
understanding task. Current methods require box and label supervision for all
action classes in advance. However, in real-world applications, it is very
likely to come across new action classes not seen in training because the
action category space is large and hard to enumerate. Also, the cost of data
annotation and model training for new classes is extremely high for traditional
methods, as we need to perform detailed box annotations and re-train the whole
network from scratch. In this paper, we propose a new challenging setting by
performing open-vocabulary STAD to better mimic the situation of action
detection in an open world. Open-vocabulary spatio-temporal action detection
(OV-STAD) requires training a model on a limited set of base classes with box
and label supervision, which is expected to yield good generalization
performance on novel action classes. For OV-STAD, we build two benchmarks based
on the existing STAD datasets and propose a simple but effective method based
on pretrained video-language models (VLM). To better adapt the holistic VLM for
the fine-grained action detection task, we carefully fine-tune it on the
localized video region-text pairs. This customized fine-tuning endows the VLM
with better motion understanding, thus contributing to a more accurate
alignment between video regions and texts. Local region feature and global
video feature fusion before alignment is adopted to further improve the action
detection performance by providing global context. Our method achieves a
promising performance on novel classes. |
This paper proposes a new setting for open-vocabulary spatio-temporal action detection (OV-STAD) and introduces a simple yet effective method using pretrained video-language models (VLMs). |
OV-STAD addresses the limitations of traditional STAD methods that require extensive box annotations and retraining for new action classes, making it more practical for real-world applications with a vast and dynamic action space. |
The method leverages a pretrained VLM fine-tuned on video region-text pairs to enhance local feature representation for action recognition. It also incorporates global video features for improved alignment and overfitting mitigation. |
The proposed method achieves promising results on novel classes for OV-STAD.
Video region-text alignment pretraining significantly enhances the model's capability for recognizing unseen action classes.
Fusing global and local video features effectively improves the alignment between visual features and action prompts, benefiting action recognition. |
The performance on the AVA dataset is limited, potentially due to the atomic nature of actions and reliance on object/scene cues in pretraining.
The method relies on an external human detector, which might introduce errors and limit the overall performance. |
spatio-temporal action detection, open vocabulary learning, video-language models, region-text alignment, zero-shot learning |
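The fusion of local region and global video features before text alignment, described in the OV-STAD entry above, can be sketched as follows. The entry does not specify the fusion operator, so a weighted average stands in for it here. The function names, the `alpha` weight, and the toy embeddings are hypothetical.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def classify_region(region_feat, video_feat, text_feats, alpha=0.5):
    """Fuse local and global features, then match against class text embeddings."""
    fused = normalize([alpha * r + (1 - alpha) * g
                       for r, g in zip(region_feat, video_feat)])
    scores = {name: sum(a * b for a, b in zip(fused, normalize(t)))
              for name, t in text_feats.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy 2-D embeddings; real features would come from the video-language model.
label, scores = classify_region(
    region_feat=[1.0, 0.0],
    video_feat=[0.6, 0.8],
    text_feats={"run": [1.0, 0.2], "sit": [0.0, 1.0]},
)
```

Because classes are represented only by text embeddings, a novel action can be added at test time by adding one more entry to `text_feats`.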
2405.10674
Report |
From Sora What We Can See: A Survey of Text-to-Video Generation |
Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan |
With impressive achievements made, artificial intelligence is on the path
forward to artificial general intelligence. Sora, developed by OpenAI and
capable of minute-level world simulation, can be considered a
milestone on this developmental path. However, despite its notable successes,
Sora still encounters various obstacles that need to be resolved. In this
survey, we start from the perspective of disassembling Sora in text-to-video
generation and conduct a comprehensive review of the literature, trying to
answer the question, "From Sora What We Can See". Specifically, after
basic preliminaries regarding the general algorithms are introduced, the
literature is categorized from three mutually perpendicular dimensions:
evolutionary generators, excellent pursuit, and realistic panorama.
Subsequently, the widely used datasets and metrics are organized in detail.
Last but more importantly, we identify several challenges and open problems in
this domain and propose potential future directions for research and
development. |
This paper presents a comprehensive survey of text-to-video (T2V) generation, offering a structured analysis of current research inspired by the capabilities of OpenAI's Sora. |
Sora represents a significant leap in T2V technology, demonstrating the potential for generating realistic and imaginative videos from textual descriptions, thus necessitating a focused review of this rapidly evolving field. |
The authors categorize T2V generation techniques based on the evolution of generative models (GAN/VAE, autoregressive, diffusion-based), essential video qualities (duration, resolution, quality), and realism components (motion, scenes, objects, layout). They also review commonly used datasets and evaluation metrics. |
Sora, while advanced, still exhibits limitations in generating realistic motion, consistent object appearances, and accurate physical interactions, highlighting ongoing challenges in T2V research.
The survey identifies key areas for future development, including robot learning from visual assistance, infinite 3D scene reconstruction, augmented digital twins, and the establishment of ethical and normative frameworks for AI applications.
Existing T2V techniques have achieved significant progress in generating longer, higher-resolution, and smoother videos, but challenges remain in seamlessly integrating complex elements and ensuring realism. |
The survey primarily focuses on Sora's capabilities, potentially overlooking advancements in other T2V models.
The rapid evolution of the field may lead to new breakthroughs and challenges not fully addressed in the current review. |
text-to-video generation, sora, diffusion models, generative ai, video synthesis |
2405.10577
Report |
DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection |
Zhe Huang, Yizhe Zhao, Hao Xiao, Chenyan Wu, Lingting Ge |
Recent advances in multi-view camera-only 3D object detection either rely on
an accurate reconstruction of bird's-eye-view (BEV) 3D features or on
traditional 2D perspective view (PV) image features. While both have their own
pros and cons, few have found a way to stitch them together in order to benefit
from "the best of both worlds". To this end, we explore a duo space (i.e., BEV
and PV) 3D perception framework, in conjunction with some useful duo space
fusion strategies that allow effective aggregation of the two feature
representations. To the best of our knowledge, our proposed method,
DuoSpaceNet, is the first to leverage two distinct feature spaces and achieves
the state-of-the-art 3D object detection and BEV map segmentation results on
nuScenes dataset. |
DuoSpaceNet, a novel camera-based 3D perception framework for autonomous driving, leverages both bird's-eye-view (BEV) and perspective-view (PV) features to enhance 3D object detection and map segmentation. |
Existing methods rely on either BEV or PV features, each with limitations. DuoSpaceNet bridges the gap, combining strengths of both representations for superior performance. |
DuoSpaceNet uses a duo space decoder with space-specific cross-attention layers to process and fuse BEV and PV features. It employs feature divergence enhancement for inter-space distinctiveness and a novel temporal modeling method for multi-frame settings. |
Achieves state-of-the-art 3D object detection results on nuScenes dataset, outperforming both BEV-based and PV-based methods.
Demonstrates superior map segmentation performance, achieving highest IoU for drivable area and lane boundaries.
Ablation studies confirm the effectiveness of each proposed component, highlighting the synergy of duo space features, feature divergence enhancement, and temporal modeling. |
Computational cost of feature divergence enhancement can be high.
Long-range detection capabilities are not fully explored due to the limitations of current datasets. |
3d object detection, autonomous driving, multi-view perception, bird's-eye-view (bev), perspective view (pv) |
2405.10508
Report |
ART3D: 3D Gaussian Splatting for Text-Guided Artistic Scenes Generation |
Pengzhi Li, Chengshuai Tang, Qinxuan Huang, Zhiheng Li |
In this paper, we explore the existing challenges in 3D artistic scene
generation by introducing ART3D, a novel framework that combines diffusion
models and 3D Gaussian splatting techniques. Our method effectively bridges the
gap between artistic and realistic images through an innovative image semantic
transfer algorithm. By leveraging depth information and an initial artistic
image, we generate a point cloud map, addressing domain differences.
Additionally, we propose a depth consistency module to enhance 3D scene
consistency. Finally, the 3D scene serves as initial points for optimizing
Gaussian splats. Experimental results demonstrate ART3D's superior performance
in both content and structural consistency metrics when compared to existing
methods. ART3D significantly advances the field of AI in art creation by
providing an innovative solution for generating high-quality 3D artistic
scenes. |
Introduces ART3D, a novel framework for generating high-quality 3D artistic scenes from text descriptions or reference images using diffusion models and 3D Gaussian splatting. |
Addresses the limitations of existing 3D art creation methods, particularly in bridging the domain gap between artistic and realistic images and ensuring global scene consistency. |
Employs an image semantic transfer algorithm to align the semantic information of artistic and realistic images, enabling accurate depth estimation. Uses a depth consistency module to enhance the consistency of the point cloud map across different views. Finally, optimizes a 3D Gaussian splatting representation for high-quality rendering. |
Generates 3D artistic scenes with superior style consistency and continuity compared to existing methods.
Effectively addresses the domain gap between artistic and realistic images, enabling accurate depth estimation and 3D reconstruction.
Demonstrates improved global scene consistency through the depth consistency module, resulting in more coherent and visually appealing 3D scenes. |
Relies on monocular depth estimation, which may have limitations in capturing complex scene geometry.
Limited exploration of dynamic scene generation. |
3d scene generation, diffusion models, gaussian splatting, ai art, text-to-3d |
2405.10370
Report |
Grounded 3D-LLM with Referent Tokens |
Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang |
Prior studies on 3D scene understanding have primarily developed specialized
models for specific tasks or required task-specific fine-tuning. In this study,
we propose Grounded 3D-LLM, which explores the potential of 3D large
multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a
unified generative framework. The model uses scene referent tokens as special
noun phrases to reference 3D scenes, enabling the handling of sequences that
interleave 3D and textual data. It offers a natural approach for translating 3D
vision tasks into language formats using task-specific instruction templates.
To facilitate the use of referent tokens in subsequent language modeling, we
have curated large-scale grounded language datasets that offer finer scene-text
correspondence at the phrase level by bootstrapping existing object labels.
Subsequently, we introduced Contrastive LAnguage-Scene Pre-training (CLASP) to
effectively leverage this data, thereby integrating 3D vision with language
models. Our comprehensive evaluation covers open-ended tasks like dense
captioning and 3D QA, alongside close-ended tasks such as object detection and
language grounding. Experiments across multiple 3D benchmarks reveal the
leading performance and the broad applicability of Grounded 3D-LLM. Code and
datasets will be released on the project page:
https://groundedscenellm.github.io/grounded_3d-llm.github.io. |
This paper introduces Grounded 3D-LLM, a novel framework that uses "referent tokens" to represent scene regions or object features, enabling the integration of diverse 3D vision tasks within a unified generative language modeling framework. |
Existing 3D scene understanding models are often task-specific and lack generalizability. Grounded 3D-LLM addresses this limitation by offering a unified approach to handle various tasks, such as object detection, grounding, captioning, and question answering, within a single model. |
The proposed framework utilizes two main steps: (1) Contrastive Language-Scene Pre-training (CLASP) aligns point cloud features with textual phrases at a granular level. (2) Multi-task instruction tuning, incorporating "referent tokens," enables the model to perform diverse 3D vision tasks based on textual instructions. |
Grounded 3D-LLM outperforms previous generative models in most 3D vision tasks, showcasing its potential as a unified framework.
CLASP demonstrates superior performance in 3D grounding and detection benchmarks, highlighting its ability to align textual phrases with 3D scene regions effectively.
Automated generation of a large-scale, grounded language dataset, G-SceneCap, contributes to the model's performance and offers a valuable resource for future research. |
Grounded 3D-LLM still shows performance gaps compared to the pre-trained CLASP, leaving room for improvement in bridging discriminative and generative approaches.
The model primarily focuses on indoor scenarios and may exhibit limitations in handling complex real-world environments or generating entirely accurate language outputs. |
3d vision, large language models, vision-language models, scene understanding, generative modeling |
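The phrase-level alignment attributed to CLASP in the Grounded 3D-LLM record above can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. This is a minimal sketch, not the authors' implementation: the function names, the temperature value, and the assumption that phrase `i` matches region `i` are all illustrative, and the record does not specify the actual CLASP objective.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(phrase_embs, region_embs, temperature=0.07):
    """Generic symmetric contrastive loss (assumed, not the paper's exact loss).

    phrase_embs[i] and region_embs[i] are treated as a matched pair;
    every other combination in the batch acts as a negative.
    """
    p = [normalize(v) for v in phrase_embs]
    r = [normalize(v) for v in region_embs]
    n = len(p)
    # Cosine similarity of every phrase with every region, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(pi, rj)) / temperature for rj in r] for pi in p]
    # Phrase -> region direction: each row should peak on its diagonal entry.
    loss_p2r = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Region -> phrase direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_r2p = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2r + loss_r2p)
```

The loss is near zero when each phrase embedding points in the same direction as its paired region feature, and grows when the pairs are mismatched.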
2405.10320
Report |
Toon3D: Seeing Cartoons from a New Perspective |
Ethan Weber, Riley Peterlinz, Rohan Mathur, Frederik Warburg, Alexei A. Efros, Angjoo Kanazawa |
In this work, we recover the underlying 3D structure of non-geometrically
consistent scenes. We focus our analysis on hand-drawn images from cartoons and
anime. Many cartoons are created by artists without a 3D rendering engine,
which means that any new image of a scene is hand-drawn. The hand-drawn images
are usually faithful representations of the world, but only in a qualitative
sense, since it is difficult for humans to draw multiple perspectives of an
object or scene 3D consistently. Nevertheless, people can easily perceive 3D
scenes from inconsistent inputs! In this work, we correct for 2D drawing
inconsistencies to recover a plausible 3D structure such that the newly warped
drawings are consistent with each other. Our pipeline consists of a
user-friendly annotation tool, camera pose estimation, and image deformation to
recover a dense structure. Our method warps images to obey a perspective camera
model, enabling our aligned results to be plugged into novel-view synthesis
reconstruction methods to experience cartoons from viewpoints never drawn
before. Our project page is https://toon3d.studio . |
Presents Toon3D, a pipeline for reconstructing the 3D structure of non-geometrically consistent scenes, focusing on hand-drawn images from cartoons and anime. |
Addresses the challenge of reconstructing 3D from hand-drawn images that lack geometric consistency, a problem that traditional SfM pipelines struggle with. |
Uses a three-step process: (1) sparse alignment of user-annotated correspondences backprojected using monocular depth, (2) dense alignment with 2D image and 3D depth warping, and (3) refinement using Gaussian Splatting. |
Successfully recovers camera poses and dense 3D structure from various cartoon scenes, enabling novel view synthesis.
Reveals geometric inconsistencies in hand-drawn images through the process of warping them to fit a perspective camera model.
Demonstrates applicability beyond cartoons by reconstructing scenes from sparse photo collections and paintings. |
Reliance on accurate user-provided correspondences and depth predictions.
Limited exploration of end-to-end learning-based approaches for cartoon reconstruction. |
3d reconstruction, non-geometric modeling, cartoon analysis, sparse-view reconstruction, image warping |
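Step (1) of the Toon3D record above backprojects annotated 2D correspondences into 3D using monocular depth. The sketch below shows the standard pinhole backprojection and its inverse. It assumes the intrinsics (`fx`, `fy`, `cx`, `cy`) are known and the depth is metric, whereas the pipeline described above estimates cameras as part of the alignment. The function names are hypothetical and this is not the authors' code.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the optical axis to a 3D camera-space point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates (pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)
```

Projecting a backprojected point with the same intrinsics returns the original pixel. Sparse alignment as described above would instead compare backprojected points of matched annotations across different drawings.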
2405.10317
Report |
Text-to-Vector Generation with Neural Path Representation |
Peiying Zhang, Nanxuan Zhao, Jing Liao |
Vector graphics are widely used in digital art and highly favored by
designers due to their scalability and layer-wise properties. However, the
process of creating and editing vector graphics requires creativity and design
expertise, making it a time-consuming task. Recent advancements in
text-to-vector (T2V) generation have aimed to make this process more
accessible. However, existing T2V methods directly optimize control points of
vector graphics paths, often resulting in intersecting or jagged paths due to
the lack of geometry constraints. To overcome these limitations, we propose a
novel neural path representation by designing a dual-branch Variational
Autoencoder (VAE) that learns the path latent space from both sequence and
image modalities. By optimizing the combination of neural paths, we can
incorporate geometric constraints while preserving expressivity in generated
SVGs. Furthermore, we introduce a two-stage path optimization method to improve
the visual and topological quality of generated SVGs. In the first stage, a
pre-trained text-to-image diffusion model guides the initial generation of
complex vector graphics through the Variational Score Distillation (VSD)
process. In the second stage, we refine the graphics using a layer-wise image
vectorization strategy to achieve clearer elements and structure. We
demonstrate the effectiveness of our method through extensive experiments and
showcase various applications. The project page is
https://intchous.github.io/T2V-NPR. |
This paper presents a novel text-to-vector (T2V) generation pipeline that generates high-quality vector graphics from text prompts, ensuring geometric regularity and layer-wise structure. |
Existing T2V methods either rely on image vectorization of raster T2I results, leading to complex and inaccurate vectors, or directly optimize control points, resulting in intersecting and jagged paths. This work addresses these limitations. |
The method uses a dual-branch VAE to learn a neural path representation capturing geometric properties. A two-stage optimization process then refines a set of paths: first with VSD based on a pre-trained diffusion model for text alignment, and then with a layer-wise strategy for clarity and structure. |
The method outperforms existing approaches in generating high-quality and diverse vector graphics with valid paths and layer properties.
It offers control over details and style, and enables applications like SVG customization, image-to-SVG generation, and animation.
A user study confirms its superiority in overall SVG quality and alignment with text prompts. |
The method's reliance on diffusion models may lead to inaccuracies in representing highly detailed prompts.
The current path latent space struggles to capture intricate boundaries, leading to over-simplification of complex shapes. |
vector graphics, svg, text-to-vector generation, diffusion model, neural path representation |
2405.10316
Report |
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model |
Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao |
Visual In-Context Learning (ICL) has emerged as a promising research area due
to its capability to accomplish various tasks with limited example pairs
through analogical reasoning. However, training-based visual ICL has
limitations in its ability to generalize to unseen tasks and requires the
collection of a diverse task dataset. On the other hand, existing methods in
the inference-based visual ICL category solely rely on textual prompts, which
fail to capture fine-grained contextual information from given examples and can
be time-consuming when converting from images to text prompts. To address these
challenges, we propose Analogist, a novel inference-based visual ICL approach
that exploits both visual and textual prompting techniques using a
text-to-image diffusion model pretrained for image inpainting. For visual
prompting, we propose a self-attention cloning (SAC) method to guide the
fine-grained structural-level analogy between image examples. For textual
prompting, we leverage GPT-4V's visual reasoning capability to efficiently
generate text prompts and introduce a cross-attention masking (CAM) operation
to enhance the accuracy of semantic-level analogy guided by text prompts. Our
method is out-of-the-box and does not require fine-tuning or optimization. It
is also generic and flexible, enabling a wide range of visual tasks to be
performed in an in-context manner. Extensive experiments demonstrate the
superiority of our method over existing approaches, both qualitatively and
quantitatively. |
This paper presents Analogist, an inference-based visual in-context learning (ICL) approach that combines visual and textual prompting on a text-to-image diffusion model pretrained for image inpainting, with no fine-tuning or optimization required. |
Training-based visual ICL generalizes poorly to unseen tasks and requires collecting a diverse task dataset, while existing inference-based methods rely solely on textual prompts, which miss fine-grained contextual information in the examples and make image-to-text conversion time-consuming. |
For visual prompting, self-attention cloning (SAC) guides the fine-grained structural-level analogy between image examples. For textual prompting, GPT-4V's visual reasoning generates text prompts, and a cross-attention masking (CAM) operation improves the accuracy of the semantic-level analogy guided by those prompts. |
The method works out of the box, without fine-tuning or optimization.
It is generic and flexible, allowing a wide range of visual tasks to be performed in an in-context manner.
Experiments show that it outperforms existing approaches both qualitatively and quantitatively. |
The approach depends on a text-to-image diffusion model pretrained for image inpainting.
Text prompt generation relies on GPT-4V's visual reasoning capability. |
visual in-context learning, diffusion models, image inpainting, self-attention cloning, cross-attention masking |
2405.10314
Report |
CAT3D: Create Anything in 3D with Multi-View Diffusion Models |
Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole |
Advances in 3D reconstruction have enabled high-quality 3D capture, but
require a user to collect hundreds to thousands of images to create a 3D scene.
We present CAT3D, a method for creating anything in 3D by simulating this
real-world capture process with a multi-view diffusion model. Given any number
of input images and a set of target novel viewpoints, our model generates
highly consistent novel views of a scene. These generated views can be used as
input to robust 3D reconstruction techniques to produce 3D representations that
can be rendered from any viewpoint in real-time. CAT3D can create entire 3D
scenes in as little as one minute, and outperforms existing methods for single
image and few-view 3D scene creation. See our project page for results and
interactive demos at https://cat3d.github.io . |
CAT3D is a method for creating 3D scenes from any number of generated or real images by simulating a real-world capture process with a multi-view diffusion model. |
Creating 3D content typically requires dense multi-view capture, which is time-consuming and limits accessibility. CAT3D enables 3D creation from limited input, such as a single image or text prompt. |
CAT3D first generates consistent novel views from input images using a multi-view diffusion model. These views are then used for robust 3D reconstruction with a modified NeRF pipeline. |
CAT3D produces high-quality 3D scenes in as little as one minute.
It outperforms existing methods for single-image and few-view 3D scene creation on multiple benchmarks.
The method effectively handles various input modalities, including text prompts, single images, and sparse multi-view captures. |
The trained model cannot handle cases with varying camera intrinsics across input views.
Generation quality depends on the expressivity of the base text-to-image model, potentially limiting performance on out-of-distribution content. |
3d reconstruction, novel view synthesis, diffusion models, multi-view consistency, nerf |
2405.10305
Report |
4D Panoptic Scene Graph Generation |
Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu |
We are living in a three-dimensional space while moving forward through a
fourth dimension: time. To allow artificial intelligence to develop a
comprehensive understanding of such a 4D environment, we introduce 4D Panoptic
Scene Graph (PSG-4D), a new representation that bridges the raw visual data
perceived in a dynamic 4D world and high-level visual understanding.
Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent
entities with precise location and status information, and edges, which capture
the temporal relations. To facilitate research in this new area, we build a
richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of
1M frames, each of which is labeled with 4D panoptic segmentation masks as well
as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer,
a Transformer-based model that can predict panoptic segmentation masks, track
masks along the time axis, and generate the corresponding scene graphs via a
relation component. Extensive experiments on the new dataset show that our
method can serve as a strong baseline for future research on PSG-4D. In the
end, we provide a real-world application example to demonstrate how we can
achieve dynamic scene understanding by integrating a large language model into
our PSG-4D system. |
This paper introduces 4D Panoptic Scene Graph (PSG-4D), a novel representation bridging raw visual data in dynamic environments with high-level visual understanding by abstracting sensory data into nodes (entities with location and status) and edges (temporal relations). |
Current scene understanding methods lack the ability to integrate dynamic, spatio-temporal relationships crucial for AI agents to interact with the real world. PSG-4D aims to overcome this by capturing the dynamic 4D nature of the environment. |
The authors propose PSG4DFormer, a two-stage framework. Stage 1 performs 4D panoptic segmentation, tracking objects over time. Stage 2 leverages a spatial-temporal transformer to model relations between tracked objects and generate the 4D scene graph. |
RGB-D video sequences as input generally yield better results than point cloud sequences.
Incorporating depth information significantly improves performance in 4D scene graph generation.
Temporal attention is crucial for capturing the dynamic relationships between objects in the scene. |
Current models are limited to simple scenes and struggle with complex real-world environments.
There is a need for larger and more diverse datasets for training and evaluation of 4D scene graph generation models. |
4d scene understanding, scene graph generation, panoptic segmentation, spatial-temporal transformer, robot vision |
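The PSG-4D record above describes nodes (entities with location and status) and edges (temporal relations). One possible in-memory layout is sketched below. The class and field names are hypothetical and the masks are left as opaque placeholders. It illustrates the representation only, not the dataset's actual annotation format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity tracked over time; masks maps frame index -> mask placeholder."""
    node_id: int
    category: str
    masks: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A relation between two entities that holds over an inclusive frame span."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int


@dataclass
class SceneGraph4D:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # Reject edges that reference unknown entities or have an empty span.
        if edge.subject_id not in self.nodes or edge.object_id not in self.nodes:
            raise KeyError("edge references an unknown node")
        if edge.start_frame > edge.end_frame:
            raise ValueError("start_frame must not exceed end_frame")
        self.edges.append(edge)

    def relations_at(self, frame):
        """Return (subject, predicate, object) triplets active at the given frame."""
        return [
            (self.nodes[e.subject_id].category, e.predicate, self.nodes[e.object_id].category)
            for e in self.edges
            if e.start_frame <= frame <= e.end_frame
        ]
```

Querying by frame turns the dynamic graph into an ordinary per-frame scene graph, which is one way such a structure could be serialized as text for a language model.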
2405.10300
Report |
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection |
Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang |
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object
detection models developed by IDEA Research, which aims to advance the "Edge"
of open-set object detection. The suite encompasses two models: Grounding DINO
1.5 Pro, a high-performance model designed for stronger generalization
capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an
efficient model optimized for faster speed demanded in many applications
requiring edge deployment. The Grounding DINO 1.5 Pro model advances its
predecessor by scaling up the model architecture, integrating an enhanced
vision backbone, and expanding the training dataset to over 20 million images
with grounding annotations, thereby achieving a richer semantic understanding.
The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced
feature scales, maintains robust detection capabilities by being trained on the
same comprehensive dataset. Empirical results demonstrate the effectiveness of
Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP
on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot
transfer benchmark, setting new records for open-set object detection.
Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT,
achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP
on the LVIS-minival benchmark, making it more suitable for edge computing
scenarios. Model examples and demos with API will be released at
https://github.com/IDEA-Research/Grounding-DINO-1.5-API |
The paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models including a high-performance model (Grounding DINO 1.5 Pro) and an efficient model optimized for edge devices (Grounding DINO 1.5 Edge). |
The models aim to advance the state-of-the-art in open-set object detection, providing stronger generalization and faster inference speed for wider real-world application. |
Grounding DINO 1.5 leverages a dual-encoder-single-decoder structure, incorporating a larger Vision Transformer backbone (ViT-L) for the Pro model and an efficient feature enhancer for the Edge model. Both models are trained on a large-scale dataset (Grounding-20M) with over 20 million images and grounding annotations. |
Grounding DINO 1.5 Pro achieves state-of-the-art performance on COCO and LVIS zero-shot benchmarks, surpassing previous methods significantly.
Grounding DINO 1.5 Edge, optimized with TensorRT, reaches a speed of 75.2 FPS while attaining a competitive zero-shot performance of 36.2 AP on LVIS-minival, demonstrating its suitability for edge computing.
Both models showcase robust detection capabilities in various scenarios, including common object detection, long-tailed object detection, dense object detection, and video object detection. |
The paper acknowledges limitations in the quality of category names within the ODinW benchmark.
Future work could explore the model's capabilities in real-time video object detection and further optimize its performance on edge devices with more limited computational resources. |
open-set object detection, grounding dino, vision transformer, edge computing, zero-shot learning |
2405.10185
Report |
DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data |
Chengxiang Fan, Muzhi Zhu, Hao Chen, Yang Liu, Weijia Wu, Huaqi Zhang, Chunhua Shen |
Instance segmentation is data-hungry, and as model capacity increases, data
scale becomes crucial for improving the accuracy. Most instance segmentation
datasets today require costly manual annotation, limiting their data scale.
Models trained on such data are prone to overfitting on the training set,
especially for those rare categories. While recent works have delved into
exploiting generative models to create synthetic datasets for data
augmentation, these approaches do not efficiently harness the full potential of
generative models.
To address these issues, we introduce a more efficient strategy to construct
generative datasets for data augmentation, termed DiverGen. Firstly, we provide
an explanation of the role of generative data from the perspective of
distribution discrepancy. We investigate the impact of different data on the
distribution learned by the model. We argue that generative data can expand the
data distribution that the model can learn, thus mitigating overfitting.
Additionally, we find that the diversity of generative data is crucial for
improving model performance and enhance it through various strategies,
including category diversity, prompt diversity, and generative model diversity.
With these strategies, we can scale the data to millions while maintaining the
trend of model performance improvement. On the LVIS dataset, DiverGen
significantly outperforms the strong model X-Paste, achieving +1.1 box AP and
+1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare
categories. |
This paper proposes DiverGen, an efficient strategy for constructing generative datasets to augment instance segmentation datasets and enhance model performance. |
Instance segmentation models are data-hungry and existing datasets are limited by costly manual annotation. While generative models offer a solution, current methods don't fully utilize their potential or address the distribution discrepancy between real and generative data. |
The paper analyzes the role of generative data from a distribution discrepancy perspective, finding that it expands the data distribution learnable by the model and mitigates overfitting. It proposes DiverGen, enhancing data diversity via category diversity (using LVIS and ImageNet categories), prompt diversity (ChatGPT generated prompts), and generative model diversity (Stable Diffusion and DeepFloyd-IF). It also optimizes the generation pipeline with SAM-background annotation and CLIP inter-similarity filtration. |
Data diversity is more crucial than quantity for generative data augmentation.
DiverGen outperforms previous methods, including X-Paste, on the LVIS dataset, demonstrating significant improvement in box and mask AP, particularly for rare categories.
Ablation studies validate the effectiveness of individual components like category diversity, prompt diversity, generative model diversity, SAM-background, and CLIP inter-similarity. |
The improvement from using extra categories plateaus and even declines slightly with too many, suggesting a balance is needed.
The computational cost of using ChatGPT for prompt generation is a limitation, addressed by applying it to a subset of categories. |
instance segmentation, generative data augmentation, data diversity, distribution discrepancy, long-tailed recognition |
2405.10140
Report |
Libra: Building Decoupled Vision System on Large Language Models |
Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu |
In this work, we introduce Libra, a prototype model with a decoupled vision
system on a large language model (LLM). The decoupled vision system decouples
inner-modal modeling and cross-modal interaction, yielding unique visual
information modeling and effective cross-modal comprehension. Libra is trained
through discrete auto-regressive modeling on both vision and language inputs.
Specifically, we incorporate a routed visual expert with a cross-modal bridge
module into a pretrained LLM to route the vision and language flows during
attention computing to enable different attention patterns in inner-modal
modeling and cross-modal interaction scenarios. Experimental results
demonstrate that the dedicated design of Libra achieves a strong MLLM baseline
that rivals existing works in the image-to-text scenario with merely 50 million
training data, providing a new perspective for future multimodal foundation
models. Code is available at https://github.com/YifanXu74/Libra. |
This paper introduces Libra, a new multimodal large language model (MLLM) that utilizes a decoupled vision system built upon a large language model (LLM). This approach separates inner-modal modeling from cross-modal interaction, leading to a more effective vision system for LLMs. |
Existing MLLMs often struggle with balancing the vast knowledge capacity of LLMs with the complexities of visual understanding. This work addresses this challenge by proposing a novel vision system design specifically tailored for LLMs. |
Libra employs a routed visual expert with a cross-modal bridge module. The visual expert, integrated into a pretrained LLM, allows for separate processing of visual and language information. The cross-modal bridge facilitates interaction between these modalities during attention computations. Libra is trained using discrete auto-regressive modeling with a hybrid image tokenization strategy that leverages contiguous visual signals and pretrained visual knowledge from a CLIP-based image tokenizer. |
Libra achieves strong performance on a variety of vision-language tasks, including visual question answering and image captioning, outperforming several larger models despite using less training data.
Analysis of Libra's attention patterns reveals increased diversity across layers compared to traditional MLLMs, indicating reduced learning redundancy and improved cross-modal comprehension.
The decoupled vision system in Libra exhibits strong performance on benchmarks designed to detect CLIP bias, highlighting its ability to learn unique visual representations beyond simple modality alignment. |
The routed visual expert introduces new attention mechanisms not yet fully supported by existing acceleration frameworks, limiting training and inference efficiency.
As Libra's design is based on pretrained LLMs, it inherits limitations associated with these models, including potential hallucinations and difficulties in handling long sequences. |
multimodal large language model, decoupled vision system, vision-language comprehension, discrete auto-regressive modeling, cross-modal interaction |
2405.10053
Report |
SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection |
Mingxuan Liu, Tyler L. Hayes, Elisa Ricci, Gabriela Csurka, Riccardo Volpi |
Open-vocabulary object detection (OvOD) has transformed detection into a
language-guided task, empowering users to freely define their class
vocabularies of interest during inference. However, our initial investigation
indicates that existing OvOD detectors exhibit significant variability when
dealing with vocabularies across various semantic granularities, posing a
concern for real-world deployment. To this end, we introduce Semantic Hierarchy
Nexus (SHiNe), a novel classifier that uses semantic knowledge from class
hierarchies. It runs offline in three steps: i) it retrieves relevant
super-/sub-categories from a hierarchy for each target class; ii) it integrates
these categories into hierarchy-aware sentences; iii) it fuses these sentence
embeddings to generate the nexus classifier vector. Our evaluation on various
detection benchmarks demonstrates that SHiNe enhances robustness across diverse
vocabulary granularities, achieving up to +31.9% mAP50 with ground truth
hierarchies, while retaining improvements using hierarchies generated by large
language models. Moreover, when applied to open-vocabulary classification on
ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf
OvOD detector, without incurring additional computational overhead during
inference. The code is open source. |
This paper introduces SHiNe, a novel training-free classifier that leverages semantic hierarchies to make open-vocabulary object detection (OvOD) detectors more robust to diverse vocabulary granularities. |
Existing OvOD detectors show significant performance variability when handling vocabularies with different semantic granularities, posing challenges for real-world deployment. |
SHiNe retrieves super-/sub-categories from a hierarchy for each target class, integrates them into hierarchy-aware sentences using an 'Is-A' connector, and fuses their embeddings to generate a 'nexus' classifier vector. |
SHiNe consistently improves performance across various vocabulary granularities on iNat and FSOD datasets, with gains up to +31.9% in mAP50.
It operates effectively with both ground-truth and LLM-generated hierarchies.
SHiNe generalizes to other OvOD detectors and shows resilience to mis-specified vocabularies. |
The performance gain with LLM-generated hierarchies, while significant, is not as substantial as with ground-truth hierarchies.
Future work includes exploring alternative hierarchy generation methods and extending SHiNe to other open-vocabulary tasks like segmentation. |
open-vocabulary object detection, semantic hierarchy, robustness, zero-shot learning, vision-language models |
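The three offline steps in the SHiNe record above (retrieve related categories, build hierarchy-aware sentences, fuse sentence embeddings) can be sketched as follows. Everything here is illustrative: `toy_embed` is a hypothetical hash-based stand-in for a real text encoder, the "which is a" template is one assumed form of the 'Is-A' connector, and mean fusion is an assumption about how the embeddings are combined.

```python
import hashlib
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_embed(sentence, dim=8):
    """Hypothetical text encoder: a deterministic unit vector derived from a hash."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return l2_normalize([b / 255.0 - 0.5 for b in digest[:dim]])


def hierarchy_sentences(target, supers, subs):
    """Chain sub -> target -> super categories with an Is-A style connector."""
    chains = [[sub, target] + supers for sub in subs] or [[target] + supers]
    return [", which is a ".join(chain) for chain in chains]


def nexus_vector(target, supers, subs, embed=toy_embed):
    """Fuse all hierarchy-aware sentence embeddings into one classifier vector."""
    embs = [embed(s) for s in hierarchy_sentences(target, supers, subs)]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return l2_normalize(mean)
```

Because the vector is computed once per class before deployment, swapping it in for the plain class-name embedding adds no inference-time cost, which is consistent with the training-free, offline design described above.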
2405.09879
Report |
Generative Unlearning for Any Identity |
Juwon Seo, Sung-Hoon Lee, Tae-Young Lee, Seungjun Moon, Gyeong-Moon Park |
Recent advances in generative models trained on large-scale datasets have
made it possible to synthesize high-quality samples across various domains.
Moreover, the emergence of strong inversion networks enables not only a
reconstruction of real-world images but also the modification of attributes
through various editing methods. However, in certain domains related to privacy
issues, e.g., human faces, advanced generative models along with strong
inversion methods can lead to potential misuses. In this paper, we propose an
essential yet under-explored task called generative identity unlearning, which
steers the model not to generate an image of a specific identity. In the
generative identity unlearning, we target the following objectives: (i)
preventing the generation of images with a certain identity, and (ii)
preserving the overall quality of the generative model. To satisfy these goals,
we propose a novel framework, Generative Unlearning for Any Identity (GUIDE),
which prevents the reconstruction of a specific identity by unlearning the
generator with only a single image. GUIDE consists of two parts: (i) finding a
target point for optimization that un-identifies the source latent code and
(ii) novel loss functions that facilitate the unlearning procedure while less
affecting the learned distribution. Our extensive experiments demonstrate that
our proposed method achieves state-of-the-art performance in the generative
machine unlearning task. The code is available at
https://github.com/KHU-AGI/GUIDE. |
The paper introduces Generative Unlearning for Any IDEntity (GUIDE), a novel framework designed to remove the identity information associated with a single source image from pre-trained 2D or 3D GANs, addressing privacy concerns in generative models. |
The advancement of GANs and inversion networks enables high-quality image synthesis and manipulation, raising privacy concerns as they can be misused to reconstruct and exploit individual identities even if the specific identity wasn't in the training data. |
GUIDE consists of two main components: (1) Un-identifying Face On Latent Space (UFO), which identifies a suitable target latent code by extrapolating from the source latent code away from the average latent code, encouraging a distinct identity shift. (2) Latent Target Unlearning (LTU) utilizes three novel loss functions: local unlearning loss for direct identity shift, adjacency-aware unlearning loss for unlearning the entire identity neighborhood, and global preservation loss to maintain the generator's overall performance. |
GUIDE effectively removes identities from pre-trained GANs, even for unseen, out-of-domain images.
The adjacency-aware unlearning loss in GUIDE successfully generalizes identity removal to unseen images with the same identity.
The global preservation loss effectively minimizes the distribution shift in generated images, preserving the overall quality and performance of the pre-trained GAN. |
The paper mainly focuses on face identity removal and is evaluated on face datasets. Further research is needed to generalize GUIDE for unlearning any identity in broader domains.
The current implementation requires fine-tuning the pre-trained generator for each identity to be removed. Exploring more efficient unlearning strategies without modifying the generator is a potential future direction. |
generative adversarial networks, machine unlearning, privacy protection, identity removal, image synthesis |
2405.09874
Report |
Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion |
Xinyang Li, Zhangyu Lai, Linning Xu, Jianfei Guo, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji |
We present Dual3D, a novel text-to-3D generation framework that generates
high-quality 3D assets from texts in only $1$ minute. The key component is a
dual-mode multi-view latent diffusion model. Given the noisy multi-view
latents, the 2D mode can efficiently denoise them with a single latent
denoising network, while the 3D mode can generate a tri-plane neural surface
for consistent rendering-based denoising. Most modules for both modes are tuned
from a pre-trained text-to-image latent diffusion model to circumvent the
expensive cost of training from scratch. To overcome the high rendering cost
during inference, we propose the dual-mode toggling inference strategy to use
only $1/10$ denoising steps with 3D mode, successfully generating a 3D asset in
just $10$ seconds without sacrificing quality. The texture of the 3D asset can
be further enhanced by our efficient texture refinement process in a short
time. Extensive experiments demonstrate that our method delivers
state-of-the-art performance while significantly reducing generation time. Our
project page is available at https://dual3d.github.io |
This paper presents Dual3D, a novel text-to-3D generation framework that produces high-quality 3D assets from text descriptions in just one minute. |
This research is important because it addresses the limitations of existing text-to-3D generation methods, which often suffer from slow generation speed, high training costs, and a lack of 3D consistency. |
The key component of Dual3D is a dual-mode multi-view latent diffusion model. This model leverages a pre-trained 2D latent diffusion model (LDM) and is trained on multi-view image data. It employs a dual-mode toggling inference strategy, switching between 2D and 3D modes to balance generation speed and 3D consistency. Furthermore, an efficient texture refinement process enhances the realism of the generated 3D assets. |
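The toggling strategy can be pictured as a per-step mode schedule. The sketch below assumes the 3D mode is used on every tenth denoising step, ending on a 3D step; the abstract only states that 1/10 of the steps use the 3D mode, so the placement of those steps and the names `toggling_schedule` and `every` are hypothetical.

```python
def toggling_schedule(num_steps, every=10):
    """Return one mode label per denoising step.

    Assumption: the expensive 3D (rendering-based) mode runs on every
    `every`-th step and the cheap 2D mode runs on all other steps.
    """
    if num_steps <= 0 or every <= 0:
        raise ValueError("num_steps and every must be positive")
    return ["3d" if (step + 1) % every == 0 else "2d" for step in range(num_steps)]

schedule = toggling_schedule(num_steps=50)
num_3d = schedule.count("3d")
print(f"{num_3d} of {len(schedule)} steps use the 3D mode")
```

With this layout the final step is a 3D step, so the last denoised result comes from the consistent 3D representation; other placements with the same ratio are equally compatible with the description.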
Significantly faster generation time (under a minute) compared to optimization-based methods while maintaining high quality.
Achieves state-of-the-art performance in text alignment and aesthetic quality, as evidenced by CLIP Score and user studies.
Demonstrates robust generalization capabilities, generating diverse assets from the same text prompt and handling fine-grained semantic variations. |
Limited ability to generate scenes with multiple interacting objects or highly complex geometries due to reliance on single-object multi-view data and mesh rendering during refinement.
Potential for future improvement by incorporating more diverse multi-view datasets, exploring more efficient 3D representations, and investigating parameter-efficient fine-tuning methods. |
text-to-3d generation, latent diffusion models, multi-view diffusion, 3d neural rendering, texture refinement |
2405.09818
Report |
Chameleon: Mixed-Modal Early-Fusion Foundation Models |
Chameleon Team |
We present Chameleon, a family of early-fusion token-based mixed-modal models
capable of understanding and generating images and text in any arbitrary
sequence. We outline a stable training approach from inception, an alignment
recipe, and an architectural parameterization tailored for the early-fusion,
token-based, mixed-modal setting. The models are evaluated on a comprehensive
range of tasks, including visual question answering, image captioning, text
generation, image generation, and long-form mixed modal generation. Chameleon
demonstrates broad and general capabilities, including state-of-the-art
performance in image captioning tasks, outperforms Llama-2 in text-only tasks
while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and
performs non-trivial image generation, all in a single model. It also matches
or exceeds the performance of much larger models, including Gemini Pro and
GPT-4V, according to human judgments on a new long-form mixed-modal generation
evaluation, where either the prompt or outputs contain mixed sequences of both
images and text. Chameleon marks a significant step forward in the unified
modeling of full multimodal documents. |
This paper introduces Chameleon, a family of early-fusion, token-based mixed-modal foundation models that can reason over and generate interleaved image-text documents. |
Chameleon aims to address the limitations of existing multimodal models that often process modalities separately, hindering their ability to fully integrate information and generate complex multimodal content. |
Chameleon represents both images and text as discrete tokens within a unified transformer architecture. It is trained from scratch on a massive dataset of interleaved text and image tokens (around 10 trillion). The authors also introduce architectural innovations and training techniques to overcome the challenges of stable and scalable training in this early-fusion setting. |
Chameleon achieves state-of-the-art performance on various vision-language benchmarks, including image captioning and visual question answering, while maintaining competitive performance on text-only tasks.
Human evaluations show that Chameleon outperforms strong baselines like Gemini-Pro and GPT-4V in generating mixed-modal responses to open-ended prompts.
Chameleon demonstrates new capabilities in mixed-modal reasoning and generation, effectively handling prompts that require interleaving text and images in its responses. |
The evaluation prompts used, while diverse, were crowdsourced and might not fully represent real user interactions.
The absence of other native mixed-modal models limits the scope of comparative evaluation for Chameleon's novel capabilities. |
multimodal learning, foundation models, tokenization, vision-language tasks, early fusion |
2405.09717
Report |
From NeRFs to Gaussian Splats, and Back |
Siming He, Zach Osman, Pratik Chaudhari |
For robotics applications where there is a limited number of (typically
ego-centric) views, parametric representations such as neural radiance fields
(NeRFs) generalize better than non-parametric ones such as Gaussian splatting
(GS) to views that are very different from those in the training data; GS
however can render much faster than NeRFs. We develop a procedure to convert
back and forth between the two. Our approach achieves the best of both NeRFs
(superior PSNR, SSIM, and LPIPS on dissimilar views, and a compact
representation) and GS (real-time rendering and the ability to easily modify
the representation); the computational cost of these conversions is minor
compared to training the two from scratch. |
This paper introduces a novel method for converting between implicit neural radiance fields (NeRFs) and explicit Gaussian Splatting (GS) representations, leveraging the advantages of both for robotics applications. |
This is crucial for robotics as it allows for combining the superior generalization and compactness of NeRFs with the real-time rendering and easy modification capabilities of GS, particularly beneficial in sparse view scenarios common in robotics. |
The approach involves training a modified NeRF to predict spherical harmonics, converting it to GS by generating a point cloud from the NeRF and initializing Gaussians, and optionally fine-tuning the GS. Conversely, GS can be converted back to NeRF by rendering training views from the GS and fitting a NeRF to these renderings. |
The proposed method, termed NeRFGS, achieves comparable or better results than state-of-the-art methods like Splatfacto, especially on novel views dissimilar to training data.
Conversion between representations is computationally efficient, taking only a few seconds.
The approach allows for editing the scene representation by modifying the GS and converting back to NeRF, enabling dynamic scene understanding and manipulation. |
The initial conversion from NeRF to GS can lead to a decrease in quality, highlighting room for improvement in the conversion procedure.
Future work includes exploring adaptive Gaussian scales and anisotropic Gaussians for enhanced representation accuracy. |
neural radiance fields, gaussian splatting, scene representation, robotics, view generalization |
2405.09673
Report |
LoRA Learns Less and Forgets Less |
Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham |
Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning
method for large language models. LoRA saves memory by training only low rank
perturbations to selected weight matrices. In this work, we compare the
performance of LoRA and full finetuning on two target domains, programming and
mathematics. We consider both the instruction finetuning ($\approx$100K
prompt-response pairs) and continued pretraining ($\approx$10B unstructured
tokens) data regimes. Our results show that, in most settings, LoRA
substantially underperforms full finetuning. Nevertheless, LoRA exhibits a
desirable form of regularization: it better maintains the base model's
performance on tasks outside the target domain. We show that LoRA provides
stronger regularization compared to common techniques such as weight decay and
dropout; it also helps maintain more diverse generations. We show that full
finetuning learns perturbations with a rank that is 10-100X greater than
typical LoRA configurations, possibly explaining some of the reported gaps. We
conclude by proposing best practices for finetuning with LoRA. |
This paper presents a rigorous comparison of Low-Rank Adaptation (LoRA) and full finetuning for Llama-2 language models on challenging code and math domains. |
LoRA is widely used for efficient finetuning, but its performance compared to full finetuning in demanding domains is not well-understood. |
The authors finetuned Llama-2 7B and 13B models on code and math datasets using both LoRA and full finetuning. They evaluated performance on HumanEval (coding) and GSM8K (math), and measured forgetting on language understanding, world knowledge, and reasoning tasks. |
LoRA consistently underperforms full finetuning in terms of accuracy and sample efficiency, especially for code.
LoRA exhibits better preservation of source-domain performance (less forgetting) compared to full finetuning.
Full finetuning learns weight perturbations with a rank much higher than typical LoRA configurations, challenging the assumption of low-rank updates. |
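To make the low-rank perturbation concrete, the sketch below builds an adapted weight W + scale * (B A) from two thin matrices and compares trainable parameter counts. The 4096 x 4096 layer size, the rank of 16, and the helper names are illustrative assumptions, not settings reported in the paper; `scale` stands in for the usual alpha / r factor.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W itself stays frozen during training."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
b = [[1.0], [2.0]]             # 2 x 1 factor
a = [[3.0, 4.0]]               # 1 x 2 factor, so the update has rank 1
adapted = lora_weight(w, b, a)
ratio = full_finetune_params(4096, 4096) / lora_params(4096, 4096, rank=16)
print(adapted, ratio)
```

The update B A can never exceed the chosen rank, which is the constraint the spectral finding above contrasts with the much higher-rank perturbations learned by full finetuning.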
The study primarily focuses on 7B and 13B models, leaving open the question of how LoRA scales with larger models.
The spectral analysis does not rule out the existence of low-rank solutions for the downstream tasks. |
lora, fine-tuning, large language models, code generation, math reasoning |
2405.09546
Report |
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation |
Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu |
The systematic evaluation and understanding of computer vision models under
varying conditions require large amounts of data with comprehensive and
customized labels, which real-world vision datasets rarely satisfy. While
current synthetic data generators offer a promising alternative, particularly
for embodied AI tasks, they often fall short for computer vision tasks due to
low asset and rendering quality, limited diversity, and unrealistic physical
properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and
assets to generate fully customized synthetic data for systematic evaluation of
computer vision models, based on the newly developed embodied AI benchmark,
BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene
level (e.g., lighting, object placement), the object level (e.g., joint
configuration, attributes such as "filled" and "folded"), and the camera level
(e.g., field of view, focal length). Researchers can arbitrarily vary these
parameters during data generation to perform controlled experiments. We
showcase three example application scenarios: systematically evaluating the
robustness of models across different continuous axes of domain shift,
evaluating scene understanding models on the same set of images, and training
and evaluating simulation-to-real transfer for a novel vision task: unary and
binary state prediction. Project website:
https://behavior-vision-suite.github.io/ |
BEHAVIOR Vision Suite (BVS) is a customizable data generation tool for systematic evaluation and understanding of computer vision models. It leverages an extended 3D asset library from BEHAVIOR-1K and a generator to create custom vision datasets with rich annotations. |
Real-world datasets have limitations: limited labels, fixed data distributions, and difficulties in acquiring rare event data. Synthetic data can address these limitations but often lacks realism or customizability. BVS bridges the gap by offering a customizable generator for photorealistic synthetic data. |
BVS consists of extended BEHAVIOR-1K assets (8K+ object models, 1K scene instances) and a customizable data generator built upon OmniGibson. The generator allows for scene object randomization, physically realistic pose generation, predicate-based rich labeling, camera pose sampling, and configurable rendering. |
Parametric model evaluation reveals performance variations of SOTA models (detection and segmentation) across different domain shifts (articulation, lighting, visibility, zoom, pitch).
Holistic scene understanding evaluation shows consistent relative performance between models tested on BVS's synthetic data and real datasets, highlighting the synthetic data's realism.
Training a model on BVS's synthetic data for object state and relation prediction demonstrates promising zero-shot transfer capability to real-world images. |
The current version of BVS primarily focuses on indoor scenes.
The sim2real gap, although minimized, still exists and requires further investigation, potentially through improved rendering techniques or domain adaptation methods. |
synthetic data generation, computer vision, model evaluation, sim2real transfer, 3d simulation |
2405.09426
Report |
Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images |
Memoona Aziz, Umair Rehman, Muhammad Umair Danish, Katarina Grolinger |
This paper introduces the Global-Local Image Perceptual Score (GLIPS), an
image metric designed to assess the photorealistic image quality of
AI-generated images with a high degree of alignment to human visual perception.
Traditional metrics such as FID and KID scores do not align closely with human
evaluations. The proposed metric incorporates advanced transformer-based
attention mechanisms to assess local similarity and Maximum Mean Discrepancy
(MMD) to evaluate global distributional similarity. To evaluate the performance
of GLIPS, we conducted a human study on photorealistic image quality.
Comprehensive tests across various generative models demonstrate that GLIPS
consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms
of correlation with human scores. Additionally, we introduce the Interpolative
Binning Scale (IBS), a refined scaling method that enhances the
interpretability of metric scores by aligning them more closely with human
evaluative standards. The proposed metric and scaling approach not only
provide more reliable assessments of AI-generated images but also suggest
pathways for future enhancements in image generation technologies. |
This paper introduces the Global-Local Image Perceptual Score (GLIPS), a novel image metric designed to assess the photorealistic image quality of AI-generated images, aiming for a higher alignment with human visual perception compared to traditional metrics like FID and KID. |
Existing image quality metrics often fail to accurately capture and reflect human judgments of photorealism, particularly for images generated by advanced AI models. This discrepancy highlights the need for a more reliable and human-aligned metric for evaluating AI-generated images. |
GLIPS leverages vision transformer-based attention mechanisms to extract and compare salient image patches, addressing the issue of structural differences between camera-captured and AI-generated images. It also incorporates Maximum Mean Discrepancy (MMD) to evaluate the global distributional similarity of deep features extracted from the images. A novel scaling strategy, the Interpolative Binning Scale (IBS), is introduced to ensure unbiased and interpretable comparison between human and metric scores. A human study was conducted to evaluate the correlation between GLIPS and human perception of photorealistic image quality. |
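The global term can be illustrated with a small Maximum Mean Discrepancy computation. This sketch uses the standard biased estimator of squared MMD with a Gaussian (RBF) kernel on toy 2D feature vectors; the kernel choice, the bandwidth `sigma`, and the toy features are assumptions for illustration and are not taken from the GLIPS implementation.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimator: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, sigma) for p in ps for q in qs)
        return total / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy "deep features": real ones would come from a pretrained network.
real_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
same_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted_feats = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

mmd_same = mmd_squared(real_feats, same_feats)
mmd_shifted = mmd_squared(real_feats, shifted_feats)
print(mmd_same, mmd_shifted)
```

Identical feature sets give a value of zero, while the shifted set gives a clearly positive value, which is the behavior a distributional similarity term relies on.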
GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores, demonstrating its effectiveness in capturing human-like perceptions of photorealism.
The IBS method effectively mitigates biases introduced by traditional scaling methods, enabling a fairer and more interpretable comparison between metric outputs and human judgments.
The human study confirms a strong correlation between GLIPS scores and human assessments of photorealism, validating the metric's alignment with human visual perception. |
Future work will focus on optimizing the GLIPS framework by exploring different neural network architectures and refining the kernel functions used in MMD calculation to enhance its applicability across a wider range of image types and generative models.
Further research will investigate the generalizability of GLIPS to other image domains beyond those tested in the study, ensuring its robustness and effectiveness across diverse datasets. |
photorealistic image quality, ai-generated images, image quality assessment, vision transformer, maximum mean discrepancy |
2405.09403
Report |
Identity Overlap Between Face Recognition Train/Test Data: Causing Optimistic Bias in Accuracy Measurement |
Haiyu Wu, Sicong Tian, Jacob Gutierrez, Aman Bhatta, Kağan Öztürk, Kevin W. Bowyer |
A fundamental tenet of pattern recognition is that overlap between training
and testing sets causes an optimistic accuracy estimate. Deep CNNs for face
recognition are trained for N-way classification of the identities in the
training set. Accuracy is commonly estimated as average 10-fold classification
accuracy on image pairs from test sets such as LFW, CALFW, CPLFW, CFP-FP and
AgeDB-30. Because train and test sets have been independently assembled, images
and identities in any given test set may also be present in any given training
set. In particular, our experiments reveal a surprising degree of identity and
image overlap between the LFW family of test sets and the MS1MV2 training set.
Our experiments also reveal identity label noise in MS1MV2. We compare accuracy
achieved with same-size MS1MV2 subsets that are identity-disjoint and not
identity-disjoint with LFW, to reveal the size of the optimistic bias. Using
more challenging test sets from the LFW family, we find that the size of the
optimistic bias is larger for more challenging test sets. Our results highlight
the lack of and the need for identity-disjoint train and test methodology in
face recognition research. |
This paper investigates the optimistic bias in face recognition accuracy caused by identity overlap between training and testing datasets, demonstrating the need for identity-disjoint train/test methodology. |
Current face recognition research lacks analysis of identity overlap between datasets, making it difficult to reliably compare algorithms and understand the true impact of this overlap on accuracy. |
The authors reverse-engineered the overlap between MS1MV2 (training) and LFW (testing) datasets, created identity-disjoint and identity-overlapped MS1MV2 subsets, and trained six face recognition models on these subsets to compare their performance on various test sets. |
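The overlap measurement and the identity-disjoint split can be sketched with set operations. This is a simplified illustration that matches identities by normalized name strings; the names, the `normalize` helper, and name-based matching itself are assumptions, and the authors' actual matching procedure is not described in this summary.

```python
def normalize(name):
    """Lowercase a label and unify separators so variants compare equal."""
    return " ".join(name.lower().replace("_", " ").split())

def identity_overlap(train_ids, test_ids):
    """Return the shared identities and the percentage of test identities they cover."""
    train = {normalize(n) for n in train_ids}
    test = {normalize(n) for n in test_ids}
    shared = train & test
    return shared, 100.0 * len(shared) / len(test)

def identity_disjoint_subset(train_ids, test_ids):
    """Keep only training identities that never appear in the test set."""
    test = {normalize(n) for n in test_ids}
    return [n for n in train_ids if normalize(n) not in test]

# Hypothetical labels, not real dataset entries.
train_ids = ["Alice_Smith", "bob jones", "Carol White"]
test_ids = ["alice smith", "Dave Brown"]

shared, percent = identity_overlap(train_ids, test_ids)
disjoint_train = identity_disjoint_subset(train_ids, test_ids)
print(shared, percent, disjoint_train)
```

Training one model on the full list and another on the disjoint list, with the two subsets size-matched as the abstract describes, is what isolates the optimistic bias.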
46.93% of identities in LFW are also present in MS1MV2, leading to an optimistic accuracy bias.
Cleaning identity label noise in MS1MV2, even without addressing identity overlap, improves accuracy.
The optimistic bias due to identity overlap is more pronounced on more challenging test sets. |
The study primarily focuses on the MS1MV2 training set and LFW family of test sets.
Further investigation is needed to analyze identity overlap and its impact on other training and testing datasets. |
face recognition, identity overlap, optimistic bias, dataset bias, evaluation methodology |
2405.09266
Report |
Dance Any Beat: Blending Beats with Visuals in Dance Video Generation |
Xuanchen Wang, Heng Wang, Dongnan Liu, Weidong Cai |
The task of generating dance from music is crucial, yet current methods,
which mainly produce joint sequences, lead to outputs that lack intuitiveness
and complicate data collection due to the necessity for precise joint
annotations. We introduce a Dance Any Beat Diffusion model, namely DabFusion,
that employs music as a conditional input to directly create dance videos from
still images, utilizing conditional image-to-video generation principles. This
approach pioneers the use of music as a conditioning factor in image-to-video
synthesis. Our method unfolds in two stages: training an auto-encoder to
predict latent optical flow between reference and driving frames, eliminating
the need for joint annotation, and training a U-Net-based diffusion model to
produce these latent optical flows guided by music rhythm encoded by CLAP.
Although capable of producing high-quality dance videos, the baseline model
struggles with rhythm alignment. We enhance the model by adding beat
information, improving synchronization. We introduce a 2D motion-music
alignment score (2D-MM Align) for quantitative assessment. Evaluated on the
AIST++ dataset, our enhanced model shows marked improvements in 2D-MM Align
score and established metrics. Video results can be found on our project page:
https://DabFusion.github.io. |
Introduces DabFusion, a novel diffusion-based model that generates dance videos directly from a still image and music, eliminating the need for joint annotations and pioneering the use of music as a condition in image-to-video synthesis. |
Addresses the limitations of current music-to-dance generation methods that rely on joint sequences, resulting in less intuitive outputs and complex data collection. |
Employs a two-stage approach: 1) training a latent flow auto-encoder to estimate optical flow between video frames and 2) training a U-Net-based diffusion model to generate latent flows conditioned on music (encoded by CLAP) and a starting image. Enhances rhythm alignment by incorporating beat information extracted via Librosa. |
DabFusion generates high-quality dance videos comparable to state-of-the-art unconditional video generation models.
Incorporating beat information significantly improves the synchronization between dance movements and music.
Camera angle and distance significantly influence the quality of generated videos. |
Video quality degrades with increasing length due to the accumulation of errors.
Future work includes improving arbitrary-length video generation and exploring other conditioning factors like dance style descriptions. |
image-to-video synthesis, music-to-dance generation, diffusion models, motion-music alignment, conditional video generation |
2405.09215
Report |
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model |
Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang |
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It
is designed for efficient deployment on consumer GPU servers. Our work directly
confronts a pivotal industry issue by grappling with the prohibitive service
costs that hinder the broad adoption of large-scale multimodal systems. Through
rigorous training, we have developed a 1B-scale language model from the ground
up, employing the LLaVA paradigm for modal alignment. The result, which we call
Xmodel-VLM, is a lightweight yet powerful multimodal vision language model.
Extensive testing across numerous classic multimodal benchmarks has revealed
that despite its smaller size and faster execution, Xmodel-VLM delivers
performance comparable to that of larger models. Our model checkpoints and code
are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM. |
This paper introduces Xmodel-VLM, an efficient and lightweight multimodal vision language model designed for deployment on consumer-grade GPU servers. |
The paper addresses the challenge of high operational costs associated with large-scale multimodal models, which hinders their widespread adoption. |
The authors develop a 1B-scale language model from scratch and integrate it with a CLIP ViT-L/14 vision encoder using a simple yet effective two-layer MLP projector (XDP). The model is trained using a two-stage approach: pre-training for feature alignment and fine-tuning for instruction following. |
Xmodel-VLM achieves performance comparable to larger models on various multimodal benchmarks despite its smaller size.
The model demonstrates faster inference speeds compared to LLaVA-7B.
Ablation studies highlight the effectiveness of the proposed projector design and the impact of token numbers on model performance. |
Larger language models could further improve performance.
Further optimization is needed for even faster inference. |
vision language model, multimodal learning, efficient deployment, lightweight model, cross-modal alignment |
2405.09114
Report |
SOEDiff: Efficient Distillation for Small Object Editing |
Qihe Pan, Zicheng Wang, Zhen Zhao, Yiming Wu, Sifan Long, Haoran Liang, Ronghua Liang |
In this paper, we delve into a new task known as small object editing (SOE),
which focuses on text-based image inpainting within a constrained, small-sized
area. Despite the remarkable success achieved by current image
inpainting approaches, their application to the SOE task generally results in
failure cases such as Object Missing, Text-Image Mismatch, and Distortion.
These failures stem from the limited use of small-sized objects in training
datasets and the downsampling operations employed by U-Net models, which
hinder accurate generation. To overcome these challenges, we introduce a novel
training-based approach, SOEDiff, aimed at enhancing the capability of baseline
models like StableDiffusion in editing small-sized objects while minimizing
training costs. Specifically, our method involves two key components: SO-LoRA,
which efficiently fine-tunes low-rank matrices, and Cross-Scale Score
Distillation loss, which leverages high-resolution predictions from the
pre-trained teacher diffusion model. Our method presents significant
improvements on the test dataset collected from MSCOCO and OpenImage,
validating the effectiveness of our proposed method in small object editing. In
particular, when comparing SOEDiff with the SD-I model on the OpenImage-f dataset,
we observe a 0.99 improvement in CLIP-Score and a reduction of 2.87 in FID. Our
project page can be found at https://soediff.github.io/. |
Introduces SOEDiff, a novel training-based approach for text-based small object editing (SOE) in images, enhancing the capabilities of baseline models like StableDiffusion. |
Addresses the limitations of existing image inpainting models in handling small object editing, a task crucial for subtle image manipulations. |
Employs SO-LoRA for efficient fine-tuning of low-rank matrices and a Cross-Scale Score Distillation loss leveraging high-resolution predictions from a pre-trained teacher diffusion model. |
Significantly improves text-to-image alignment and reduces object-missing, mismatch, and distortion issues in small object editing.
Outperforms baselines like SD-I and BlendedDM on MSCOCO and OpenImage datasets, showing significant gains in CLIP-Score and FID.
Demonstrates extended applications in object removal and replacement tasks beyond basic inpainting. |
Limited exploration of different crop sizes and aspect ratios for the teacher model input.
Further research on reducing the computational cost associated with VAE fine-tuning. |
small object editing, image editing, lora, score distillation, diffusion models |
2405.08911
Report |
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks |
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel |
CLIP models perform remarkably well on zero-shot classification and retrieval
tasks. But recent studies have shown that learnt representations in CLIP are
not well suited for dense prediction tasks like object detection, semantic
segmentation or depth estimation. More recently, multi-stage training methods
for CLIP models were introduced to mitigate the weak performance of CLIP on
downstream tasks. In this work, we find that simply improving the quality of
captions in image-text datasets improves the quality of CLIP's visual
representations, resulting in significant improvement on downstream dense
prediction vision tasks. In fact, we find that CLIP pretraining with good
quality captions can surpass recent supervised, self-supervised and weakly
supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as
the image encoder is trained on well-aligned image-text pairs, it obtains 12.1%
higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation
tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining
methods like Masked Autoencoder (MAE). We find that mobile architectures also
benefit significantly from CLIP pretraining. A recent mobile vision
architecture, MCi2, with CLIP pretraining obtains performance similar to
Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being
6.1$\times$ smaller. Moreover, we show that improving caption quality results
in $10\times$ data efficiency when finetuning for dense prediction tasks. |
This paper investigates the impact of caption quality on CLIP's performance in downstream dense prediction tasks, showing that CLIP with high-quality captions outperforms many supervised, self-supervised, and weakly supervised methods. |
While CLIP excels in zero-shot classification and retrieval, its performance in dense prediction tasks has lagged behind other methods. This work demonstrates that caption quality is crucial for CLIP's performance in these tasks. |
The authors compare the performance of CLIP models pretrained on datasets with varying caption quality (ALIGN, DataComp, DataCompDR). They fine-tune and evaluate these models on ImageNet-1K, MS COCO, ADE20k, and NYUv2 benchmarks. |
CLIP pretrained on DataCompDR, a dataset with high-quality captions, achieves state-of-the-art results on dense prediction tasks, outperforming methods like MAE and MAWS.
Improving caption quality leads to better data efficiency, with CLIP models trained on smaller subsets of DataCompDR matching the performance of models trained on larger subsets of DataComp.
CLIP pretraining significantly benefits mobile architectures, achieving accuracy comparable to larger models like Swin-L on semantic segmentation. |
The study primarily focuses on ViT-B/16 architecture, and further investigation is needed to assess the impact of caption quality on larger CLIP models.
Future work could explore the development of more advanced captioning methods to further improve CLIP's performance in dense prediction tasks. |
clip, image captioning, dense prediction, self-supervised learning, mobile architectures |
2405.08733
Report |
A Simple Approach to Differentiable Rendering of SDFs |
Zichen Wang, Xi Deng, Ziyi Zhang, Wenzel Jakob, Steve Marschner |
We present a simple algorithm for differentiable rendering of surfaces
represented by Signed Distance Fields (SDF), which makes it easy to integrate
rendering into gradient-based optimization pipelines. To tackle
visibility-related derivatives that make rendering non-differentiable, existing
physically based differentiable rendering methods often rely on elaborate
guiding data structures or reparameterization with a global impact on variance.
In this article, we investigate an alternative that embraces nonzero bias in
exchange for low variance and architectural simplicity. Our method expands the
lower-dimensional boundary integral into a thin band that is easy to sample
when the underlying surface is represented by an SDF. We demonstrate the
performance and robustness of our formulation in end-to-end inverse rendering
tasks, where it obtains results that are competitive with or superior to
existing work. |
This paper introduces a simple and robust algorithm for differentiable rendering of surfaces represented by Signed Distance Fields (SDFs), enabling easier integration of rendering into gradient-based optimization pipelines. |
Differentiable rendering, crucial for applications like inverse rendering and 3D reconstruction, often suffers from visibility discontinuities. Existing solutions either rely on complex data structures or increase gradient variance. This method offers an alternative that embraces a small, controlled bias in exchange for low variance and simplicity. |
The core idea is to relax the strict visibility boundary to a thin band around the object silhouette. This transforms the challenging lower-dimensional boundary integral into a simpler area integral that can be efficiently estimated using standard Monte Carlo sampling. |
The method achieves high-quality inverse rendering results, comparable to or surpassing existing techniques.
It exhibits robustness to the choice of SDF threshold, a key parameter controlling the relaxation.
The simplicity of the approach makes it easy to implement and integrate into existing rendering systems. |
The method introduces a small bias due to the relaxation of the visibility boundary.
The optimal SDF threshold may require tuning depending on the scene scale. |
differentiable rendering, signed distance functions, inverse rendering, 3d reconstruction, monte carlo methods |
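The thin-band idea in the entry above can be illustrated with a toy 2D analogue. This is a minimal sketch under stated assumptions, not the paper's estimator: the circle SDF, the function names (`circle_sdf`, `band_estimate`), the sampling domain, and the band half-width `eps` are all hypothetical choices made for illustration. It replaces a boundary integral over the zero level set with an area integral over the band `|sdf| < eps`, estimated by plain Monte Carlo sampling and divided by the band width `2 * eps`. The division relies on the SDF having unit gradient magnitude.

```python
import math
import random


def circle_sdf(x, y, radius=1.0):
    """Signed distance to a circle centred at the origin (negative inside)."""
    return math.hypot(x, y) - radius


def band_estimate(f, sdf, eps, n_samples, half_width=2.0, seed=0):
    """Monte Carlo estimate of the integral of f over the zero level set of sdf.

    Uniform samples are drawn in the square [-half_width, half_width]^2 and
    only samples inside the band |sdf| < eps contribute. Dividing by the band
    width 2 * eps converts the area integral into a boundary-integral estimate.
    """
    rng = random.Random(seed)
    domain_area = (2.0 * half_width) ** 2
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-half_width, half_width)
        y = rng.uniform(-half_width, half_width)
        if abs(sdf(x, y)) < eps:
            total += f(x, y)
    return domain_area * total / n_samples / (2.0 * eps)


# Integrating f = 1 over the boundary should approximate the perimeter 2 * pi.
perimeter = band_estimate(lambda x, y: 1.0, circle_sdf, eps=0.05, n_samples=200_000)
```

For a non-constant integrand the band average differs slightly from the true boundary value, which is the toy counterpart of the bias the entry mentions; shrinking `eps` reduces that bias but leaves fewer samples inside the band.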
2405.08720
Report |
The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective |
Andrew Shin, Yusuke Mori, Kunitake Kaneko |
The text-to-video generation task has witnessed notable progress, with the
generated outcomes reflecting the text prompts with high fidelity and
impressive visual qualities. However, current text-to-video generation models
are invariably focused on conveying the visual elements of a single scene, and
have so far been indifferent to another important potential of the medium,
namely storytelling. In this paper, we examine text-to-video generation from
a storytelling perspective, which has hardly been investigated, and make
empirical remarks that spotlight the limitations of current text-to-video
generation schemes. We also propose an evaluation framework for storytelling
aspects of videos, and discuss the potential future directions. |
This paper investigates the capabilities and limitations of current text-to-video generation models in storytelling, a largely unexplored area. |
Current text-to-video models excel at generating visually appealing single scenes or movements but struggle to weave coherent narratives. This paper aims to bridge this gap and explore storytelling potential in video generation. |
The authors generate videos from three types of text prompts: 1) short stories, 2) scripts with dialogue, and 3) existing captions from a video storytelling dataset. They then evaluate these videos using established visual quality metrics (FVD, Inception Score), a novel cyclical evaluation framework (T2Vid2T) that assesses text-video alignment, and human evaluations focusing on story components (character, setting, plot) and overall comprehensibility. |
Current text-to-video generation models struggle to maintain narrative coherence across multiple scenes, often resulting in visually appealing but narratively disjointed videos.
Videos generated from factual descriptions (captions) show better visual quality and story coherence than those generated from short stories or scripts, highlighting a potential bias in training data.
Adding narration to videos generally improves story comprehension, but mismatches between generated visuals and narration can hinder understanding. |
The study relies heavily on manual evaluation for storytelling aspects due to the lack of standardized automatic metrics.
The research primarily focuses on visual storytelling, leaving exploration of incorporating audio cues (e.g., dialogue, sound effects) for future work. |
text-to-video generation, storytelling, video evaluation, narrative coherence, ai and creativity |
2405.08055
Report |
DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation |
Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu |
Generating diverse and high-quality 3D assets automatically poses a
fundamental yet challenging task in 3D computer vision. Despite extensive
efforts in 3D generation, existing optimization-based approaches struggle to
produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods
often focus on generating only a single category or a few categories, limiting
their generalizability. Therefore, we introduce a diffusion-based feed-forward
framework to address these challenges with a single model. To handle the large
diversity and complexity in geometry and texture across categories efficiently,
we 1) adopt improved triplane to guarantee efficiency; 2) introduce the
3D-aware transformer to aggregate the generalized 3D knowledge with specialized
3D features; and 3) devise the 3D-aware encoder/decoder to enhance the
generalized 3D knowledge. Building upon our 3D-aware Diffusion model with
TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e.,
DiffTF++. It boils down to two parts: multi-view reconstruction loss and
triplane refinement. Specifically, we utilize multi-view reconstruction loss to
fine-tune the diffusion model and triplane decoder, thereby avoiding the
negative influence caused by reconstruction errors and improving texture
synthesis. By eliminating the mismatch between the two stages, the generative
performance is enhanced, especially in texture. Additionally, a 3D-aware
refinement process is introduced to filter out artifacts and refine triplanes,
resulting in the generation of more intricate and reasonable details. Extensive
experiments on ShapeNet and OmniObject3D convincingly demonstrate the
effectiveness of our proposed modules and the state-of-the-art 3D object
generation performance with large diversity, rich semantics, and high quality. |
Presents DiffTF++, a diffusion-based feed-forward framework for generating diverse 3D objects across many categories using a single model. |
Addresses limitations of existing optimization-based and feed-forward 3D generation methods in efficiency, generalizability, and handling diverse object appearances. |
Employs triplane representation, 3D-aware transformer for global 3D knowledge and specialized feature extraction, 3D-aware encoder/decoder for enhanced semantic understanding, multi-view reconstruction loss for consistency between stages, and 3D-aware refinement for artifact elimination and detail enhancement. |
Achieves state-of-the-art performance on ShapeNet and OmniObject3D datasets in terms of 2D and 3D metrics.
Generates high-quality 3D objects with realistic topology, rich texture, and fine details, outperforming previous methods.
Demonstrates strong generalization ability for large-vocabulary 3D object generation, handling diverse categories with complex geometry and textures. |
Current implementation is limited to relatively low-resolution triplanes.
Future work could explore text-guided generation capabilities for more controllable and diverse 3D object synthesis. |
3d generation, diffusion models, transformer, triplane representation, large-vocabulary generation |
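As a rough illustration of the triplane representation mentioned in the entry above, the sketch below looks up a feature for a 3D point by projecting it onto three axis-aligned planes. It is a generic triplane query under stated assumptions, not DiffTF++'s implementation: the names (`to_index`, `query_triplane`, `constant_plane`), the nearest-cell lookup, and the choice to sum the three plane features are all hypothetical simplifications.

```python
def to_index(coord, resolution):
    """Map a coordinate in [-1, 1] to a cell index in [0, resolution - 1]."""
    idx = int((coord + 1.0) / 2.0 * resolution)
    return min(max(idx, 0), resolution - 1)


def query_triplane(planes, point):
    """Return the feature of a 3D point from three 2D feature planes.

    planes maps "xy", "xz", "yz" to square grids of feature vectors.
    The point is projected onto each plane, the nearest cell is read, and
    the three feature vectors are summed channel by channel (an assumption;
    concatenation is another common choice).
    """
    x, y, z = point
    res = len(planes["xy"])
    ix, iy, iz = (to_index(c, res) for c in (x, y, z))
    feats = (planes["xy"][ix][iy], planes["xz"][ix][iz], planes["yz"][iy][iz])
    return [sum(channel) for channel in zip(*feats)]


def constant_plane(resolution, feature):
    """Build a plane whose every cell holds a copy of the same feature vector."""
    return [[list(feature) for _ in range(resolution)] for _ in range(resolution)]


planes = {
    "xy": constant_plane(4, [1.0, 0.0]),
    "xz": constant_plane(4, [0.0, 2.0]),
    "yz": constant_plane(4, [0.5, 0.5]),
}
feature = query_triplane(planes, (0.0, 0.5, -1.0))
```

Three planes of resolution R store on the order of 3R² cells instead of the R³ a dense voxel grid would need, which is the efficiency argument usually given for triplanes.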
2405.08054
Report |
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning |
Wenqi Dong, Bangbang Yang, Lin Ma, Xiao Liu, Liyuan Cui, Hujun Bao, Yuewen Ma, Zhaopeng Cui |
As humans, we aspire to create media content that is both freely willed and
readily controlled. Thanks to the prominent development of generative
techniques, we now can easily utilize 2D diffusion methods to synthesize images
controlled by raw sketch or designated human poses, and even progressively
edit/regenerate local regions with masked inpainting. However, similar
workflows in 3D modeling tasks are still unavailable due to the lack of
controllability and efficiency in 3D generation. In this paper, we present a
novel controllable and interactive 3D assets modeling framework, named Coin3D.
Coin3D allows users to control the 3D generation using a coarse geometry proxy
assembled from basic shapes, and introduces an interactive generation workflow
to support seamless local part editing while delivering responsive 3D object
previewing within a few seconds. To this end, we develop several techniques,
including the 3D adapter that applies volumetric coarse shape control to the
diffusion model, proxy-bounded editing strategy for precise part editing,
progressive volume cache to support responsive preview, and volume-SDS to
ensure consistent mesh reconstruction. Extensive experiments of interactive
generation and editing on diverse shape proxies demonstrate that our method
achieves superior controllability and flexibility in the 3D assets generation
task. |
Coin3D is a novel controllable and interactive 3D asset modeling framework that uses coarse geometry proxies, assembled from basic shapes, to guide the generation of detailed 3D objects. |
Existing 3D generative methods lack controllability and efficiency, relying on text prompts or images that inadequately represent 3D shapes. Coin3D addresses this by offering a user-friendly approach for creating and editing 3D assets with precise 3D control. |
Coin3D leverages a 3D adapter module to integrate a voxelized 3D proxy into a multiview diffusion process. It employs a proxy-bounded editing strategy for precise local adjustments and a progressive volume caching mechanism for responsive preview. |
Coin3D enables generating 3D objects with faithful adherence to user-provided coarse shapes, outperforming image-based generation methods in quality and user studies.
Compared to existing controllable generation methods, Coin3D shows superior control and avoids issues like overgrowth or incomplete details, while being significantly faster in providing feedback.
The interactive workflow allows for seamlessly adding, adjusting, or regenerating specific parts of the object with responsive preview, making it suitable for an iterative design process. |
The initial 2D image candidate generation, while providing a quick preview, depends on prompt engineering and might require further enhancement for complex textures or backgrounds.
The resolution of generated details is limited by the base diffusion model, and future work could explore high-resolution optimization or material-disentangled models. |
3d object generation, controllable generation, interactive modeling, diffusion models, 3d-aware conditioning |
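The entry above describes a coarse proxy assembled from basic shapes that is voxelized before conditioning the diffusion model. A minimal sketch of that voxelization step, assuming a unit cube domain and two hypothetical primitives (`sphere`, `box`); none of these names or parameters come from Coin3D itself:

```python
def sphere(center, radius):
    """Return an inside-test for a sphere."""
    return lambda p: sum((a - b) ** 2 for a, b in zip(p, center)) <= radius ** 2


def box(lo, hi):
    """Return an inside-test for an axis-aligned box."""
    return lambda p: all(l <= a <= h for a, l, h in zip(p, lo, hi))


def voxelize(shapes, resolution, lo=-1.0, hi=1.0):
    """Build a boolean occupancy grid for the union of the given shapes.

    Each voxel is tested at its centre; grid[i][j][k] indexes x, y, z.
    """
    step = (hi - lo) / resolution
    centres = [lo + (i + 0.5) * step for i in range(resolution)]
    return [
        [[any(shape((x, y, z)) for shape in shapes) for z in centres] for y in centres]
        for x in centres
    ]


def count_occupied(grid):
    """Count occupied voxels in a 3D boolean grid."""
    return sum(cell for plane in grid for row in plane for cell in row)


# A slab with a small sphere resting above it, as a stand-in for a coarse proxy.
slab = box((-1.0, -1.0, -1.0), (1.0, 1.0, 0.0))
knob = sphere((0.0, 0.0, 0.5), 0.3)
proxy_grid = voxelize([slab, knob], resolution=8)
```

Editing the proxy then amounts to changing the primitive list and re-voxelizing, which is cheap compared with regenerating the asset.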
2405.07992
Report |
MambaOut: Do We Really Need Mamba for Vision? |
Weihao Yu, Xinchao Wang |
Mamba, an architecture with RNN-like token mixer of state space model (SSM),
was recently introduced to address the quadratic complexity of the attention
mechanism and subsequently applied to vision tasks. Nevertheless, the
performance of Mamba for vision is often underwhelming when compared with
convolutional and attention-based models. In this paper, we delve into the
essence of Mamba, and conceptually conclude that Mamba is ideally suited for
tasks with long-sequence and autoregressive characteristics. For vision tasks,
as image classification does not align with either characteristic, we
hypothesize that Mamba is not necessary for this task. Detection and
segmentation tasks are also not autoregressive, yet they adhere to the
long-sequence characteristic, so we believe it is still worthwhile to explore
Mamba's potential for these tasks. To empirically verify our hypotheses, we
construct a series of models named MambaOut through stacking Mamba blocks while
removing their core token mixer, SSM. Experimental results strongly support our
hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models
on ImageNet image classification, indicating that Mamba is indeed unnecessary
for this task. As for detection and segmentation, MambaOut cannot match the
performance of state-of-the-art visual Mamba models, demonstrating the
potential of Mamba for long-sequence visual tasks. The code is available at
https://github.com/yuweihao/MambaOut |
This paper investigates the necessity of Mamba, an RNN-like architecture, for visual recognition tasks, arguing that it is not essential for image classification but potentially beneficial for detection and segmentation. |
The quadratic complexity of attention in Transformers poses challenges for long sequences, motivating the exploration of alternative token mixers like Mamba, particularly for vision tasks where their efficacy remains unclear. |
The authors analyze the suitability of Mamba for long-sequence and autoregressive tasks, then examine the characteristics of visual recognition tasks against these criteria. They introduce MambaOut models, which remove the core SSM component of Mamba, to empirically evaluate its necessity. |
MambaOut models, despite lacking SSM, consistently outperform visual Mamba models on ImageNet image classification, supporting the hypothesis that SSM is unnecessary for this task.
In contrast, MambaOut models fall short of state-of-the-art visual Mamba models in object detection and semantic segmentation, highlighting the potential benefits of SSM for long-sequence visual tasks.
Visual Mamba models, while showing promise for long sequences, still lag behind state-of-the-art convolution and attention-based models in visual recognition tasks, indicating a need for further development. |
The study primarily focuses on conceptual analysis and empirical verification of Mamba's efficacy for visual tasks, leaving the exploration of RNN and Transformer integration for future work.
The paper acknowledges computational resource limitations and suggests further investigation into Mamba and RNN concepts for large language models (LLMs) and large multimodal models (LMMs) as future directions. |
mamba, vision transformer, image classification, object detection, semantic segmentation |
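The entry above describes MambaOut as a Mamba block with the SSM token mixer removed, which leaves a gated convolutional block. The sketch below shows that structure on a 1D token sequence in plain Python. It is an illustration under stated assumptions rather than the released model: normalization is omitted, the convolution kernel is shared across channels, the SiLU gate is assumed, and all names (`gated_cnn_block`, `w_value`, `w_gate`, `w_out`) are hypothetical.

```python
import math


def matmul(x, w):
    """Multiply a [tokens][c_in] matrix by a [c_in][c_out] matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def silu(v):
    """SiLU activation, v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def depthwise_conv1d(x, kernel):
    """Convolve each channel along the token axis with zero padding.

    The same kernel is reused for every channel to keep the sketch short.
    """
    half = len(kernel) // 2
    n_tokens, n_channels = len(x), len(x[0])
    out = []
    for t in range(n_tokens):
        row = []
        for c in range(n_channels):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < n_tokens:
                    acc += w * x[s][c]
            row.append(acc)
        out.append(row)
    return out


def gated_cnn_block(x, w_value, w_gate, w_out, kernel):
    """Residual gated block: conv token mixing on one branch, a gate on the other.

    No SSM is involved; the convolution is the only token mixer.
    """
    value = depthwise_conv1d(matmul(x, w_value), kernel)
    gate = [[silu(v) for v in row] for row in matmul(x, w_gate)]
    mixed = [[a * b for a, b in zip(rv, rg)] for rv, rg in zip(value, gate)]
    out = matmul(mixed, w_out)
    return [[xi + oi for xi, oi in zip(rx, ro)] for rx, ro in zip(x, out)]


tokens = [[1.0]]
result = gated_cnn_block(tokens, [[2.0]], [[1.0]], [[1.0]], kernel=[0.0, 1.0, 0.0])
```

Reinserting an SSM between the convolution and the gating product would recover the shape of a Mamba block, which is the comparison the paper's experiments rest on.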
2405.07919
Report |
Exploring the Low-Pass Filtering Behavior in Image Super-Resolution |
Haoyu Deng, Zijing Xu, Yule Duan, Xiao Wu, Wenjie Shu, Liang-Jian Deng |
Deep neural networks for image super-resolution (ISR) have shown significant
advantages over traditional approaches like interpolation. However, they
are often criticized as 'black boxes' compared to traditional approaches with
solid mathematical foundations. In this paper, we attempt to interpret the
behavior of deep neural networks in ISR using theories from the field of signal
processing. First, we report an intriguing phenomenon, referred to as "the sinc
phenomenon." It occurs when an impulse input is fed to a neural network. Then,
building on this observation, we propose a method named Hybrid Response
Analysis (HyRA) to analyze the behavior of neural networks in ISR tasks.
Specifically, HyRA decomposes a neural network into a parallel connection of a
linear system and a non-linear system and demonstrates that the linear system
functions as a low-pass filter while the non-linear system injects
high-frequency information. Finally, to quantify the injected high-frequency
information, we introduce a metric for image-to-image tasks called Frequency
Spectrum Distribution Similarity (FSDS). FSDS reflects the distribution
similarity of different frequency components and can capture nuances that
traditional metrics may overlook. Code, videos and raw experimental results for
this paper can be found in: https://github.com/RisingEntropy/LPFInISR. |
This paper unveils the "sinc phenomenon," showing that the impulse response of image super-resolution (ISR) networks resembles a sinc function, the signature of a low-pass filter, and introduces Hybrid Response Analysis (HyRA) to interpret ISR networks by separating them into linear (low-pass filter) and non-linear (high-frequency injection) components. |
This work enhances the interpretability of ISR networks, typically criticized as "black boxes," by linking them to traditional signal processing theories. |
The authors analyze impulse responses of various ISR networks, visualize feature maps, and compare performance with traditional low-pass filters. They also propose a new metric, Frequency Spectrum Distribution Similarity (FSDS), to quantify high-frequency information injection. |
The impulse responses of many ISR networks, regardless of CNN or transformer-based architecture, resemble sinc functions, suggesting an inherent low-pass filtering behavior.
HyRA demonstrates that the non-linear component of ISR networks injects high-frequency details, compensating for the low-pass filtering effect.
FSDS effectively captures high-frequency distortions, unlike PSNR, SSIM, or LPIPS, highlighting its sensitivity and necessity in evaluating ISR quality. |
The "sinc phenomenon" is not universally observed, particularly in networks trained with adversarial loss, suggesting a connection to loss function choices.
Future work includes investigating the impact of different window functions on impulse responses and exploring why certain networks treat specific high-frequency information as low-frequency. |
image super-resolution, deep learning interpretability, signal processing, low-pass filtering, frequency spectrum analysis |
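The link the entry above draws between sinc-shaped impulse responses and low-pass filtering follows from a standard signal-processing fact: the inverse DFT of an ideal low-pass frequency response is a periodic sinc (Dirichlet kernel). The sketch below checks this numerically. The function names and the sizes `N = 64`, `K = 8` are illustrative assumptions, and nothing here reproduces the paper's HyRA or FSDS code.

```python
import cmath
import math


def ideal_lowpass_impulse_response(n_samples, cutoff_bin):
    """Inverse DFT of a response that is 1 for |k| <= cutoff_bin, else 0."""
    response = []
    for n in range(n_samples):
        acc = 0j
        for k in range(-cutoff_bin, cutoff_bin + 1):
            acc += cmath.exp(2j * math.pi * k * n / n_samples)
        response.append((acc / n_samples).real)
    return response


def periodic_sinc(n, n_samples, cutoff_bin):
    """Closed form of the same impulse response (Dirichlet kernel)."""
    if n % n_samples == 0:
        return (2 * cutoff_bin + 1) / n_samples
    numerator = math.sin(math.pi * (2 * cutoff_bin + 1) * n / n_samples)
    return numerator / (n_samples * math.sin(math.pi * n / n_samples))


N, K = 64, 8
h = ideal_lowpass_impulse_response(N, K)
max_error = max(abs(h[n] - periodic_sinc(n, N, K)) for n in range(N))
```

The main lobe at `n = 0` and the oscillating, partly negative side lobes are the pattern one would look for when feeding an impulse image to an ISR network.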
2405.07913
Report |
CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models |
Nick Stracke, Stefan Andreas Baumann, Joshua M. Susskind, Miguel Angel Bautista, Björn Ommer |
Text-to-image generative models have become a prominent and powerful tool
that excels at generating high-resolution realistic images. However, guiding
the generative process of these models to consider detailed forms of
conditioning reflecting style and/or structure information remains an open
problem. In this paper, we present LoRAdapter, an approach that unifies both
style and structure conditioning under the same formulation using a novel
conditional LoRA block that enables zero-shot control. LoRAdapter is an
efficient, powerful, and architecture-agnostic approach to condition
text-to-image diffusion models, which enables fine-grained control conditioning
during generation and outperforms recent state-of-the-art approaches. |
This paper introduces LoRAdapter, a novel approach for conditioning text-to-image diffusion models that unifies style and structure conditioning under the same formulation using conditional LoRA blocks. |
LoRAdapter addresses the open problem of guiding the generative process of text-to-image models to consider detailed forms of conditioning reflecting both style and structure information in a zero-shot manner. |
LoRAdapter leverages the low-rank property of LoRAs to regularize conditioning and applies a conditional affine transformation to the low-dimensional intermediate embedding in the LoRA. This allows for efficient adaptation of both convolutional and attention layers in diffusion models for local (structure) and global (style) conditioning. |
LoRAdapter achieves state-of-the-art performance on CLIP-I and CLIP-T scores for style conditioning, outperforming both dedicated adapters and some models trained from scratch.
LoRAdapter demonstrates superior adherence to structure guidance compared to existing methods like ControlNet and T2I-Adapter, as evidenced by quantitative metrics on depth and HED map reconstruction tasks.
Ablation studies highlight the modularity of LoRAdapter, showing that adapting cross-attention layers yields the best performance for style conditioning and allows for logical fusion with text prompts. |
The effectiveness of LoRAdapter has only been demonstrated on text-to-image diffusion models based on Stable Diffusion; further investigation on fully transformer-based diffusion models and large language models is needed.
While improving control over image generation, LoRAdapter could potentially be misused to generate more believable disinformation or harmful content. |
text-to-image generation, diffusion models, conditional image synthesis, low-rank adaptation (lora), style and structure control |
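The conditional affine transformation described in the entry above can be sketched for a single linear layer. This is a hedged illustration of the idea, not the authors' code: the scale and shift are assumed to come from hypothetical linear maps `w_scale` and `w_shift` applied to a conditioning vector, and all names are made up for the example.

```python
def matvec(matrix, vector):
    """Multiply a [rows][cols] matrix by a vector of length cols."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def conditional_lora_forward(x, w_frozen, lora_down, lora_up, cond, w_scale, w_shift):
    """Frozen layer plus a LoRA branch whose rank-r embedding is modulated.

    The low-rank embedding lora_down @ x is scaled and shifted by values
    computed from the conditioning vector, then projected back by lora_up.
    """
    base = matvec(w_frozen, x)
    low_rank = matvec(lora_down, x)
    scale = matvec(w_scale, cond)
    shift = matvec(w_shift, cond)
    modulated = [s * h + t for s, h, t in zip(scale, low_rank, shift)]
    delta = matvec(lora_up, modulated)
    return [b + d for b, d in zip(base, delta)]


output = conditional_lora_forward(
    x=[1.0, 2.0],
    w_frozen=[[1.0, 0.0], [0.0, 1.0]],
    lora_down=[[1.0, 1.0]],        # rank 1
    lora_up=[[1.0], [-1.0]],
    cond=[2.0],
    w_scale=[[0.5]],
    w_shift=[[1.0]],
)
```

Because the conditioning only touches the rank-r embedding, the number of condition-dependent parameters stays small, which matches the entry's point about the low-rank property acting as a regularizer. With `lora_up` set to zero the layer reduces to the frozen base layer.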
2405.07813
Report |
Localizing Task Information for Improved Model Merging and Compression |
Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard |
Model merging and task arithmetic have emerged as promising scalable
approaches to merge multiple single-task checkpoints to one multi-task model,
but their applicability is reduced by significant performance loss. Previous
works have linked these drops to interference in the weight space and erasure
of important task-specific features. Instead, in this work we show that the
information required to solve each task is still preserved after merging as
different tasks mostly use non-overlapping sets of weights. We propose
TALL-masks, a method to identify these task supports given a collection of task
vectors and show that one can retrieve >99% of the single task accuracy by
applying our masks to the multi-task vector, effectively compressing the
individual checkpoints. We study the statistics of intersections among
constructed masks and reveal the existence of selfish and catastrophic weights,
i.e., parameters that are, respectively, important exclusively to one task, or
irrelevant to all tasks but detrimental to multi-task fusion. For this reason, we propose
Consensus Merging, an algorithm that eliminates such weights and improves the
general performance of existing model merging approaches. Our experiments in
vision and NLP benchmarks with up to 20 tasks show that Consensus Merging
consistently improves existing approaches. Furthermore, our proposed
compression scheme reduces storage from 57Gb to 8.2Gb while retaining 99.7% of
original performance. |
This paper introduces TALL-masks, a method to localize task-specific information in multi-task vectors generated from merging fine-tuned models, enabling both model compression and improved model merging. |
Model merging and compression are crucial for efficiently leveraging and deploying large, fine-tuned models, but existing methods suffer from performance loss due to task interference. |
TALL-masks identifies task-specific weight subsets by minimizing the L1 distance between the original task vector and a masked version of the multi-task vector. This enables the creation of compressed models or improved merged models by eliminating catastrophic and selfish weights. |
Task-specific information is preserved in merged models, and TALL-masks can effectively recover near-original performance.
TALL-masks enables compression of fine-tuned models to a fraction of their original size (e.g., 13.7% for a 20-task benchmark) with minimal performance loss.
Consensus Merging, which leverages TALL-masks to eliminate detrimental weights, consistently improves the performance of existing model merging methods like Task Arithmetic and TIES across vision and NLP tasks. |
The optimal weight-pruning threshold for Consensus Merging varies depending on factors like the model merging method and the number of tasks.
Further research can explore the impact of different merging strategies on weight profiles and optimize for specific applications. |
model merging, model compression, task arithmetic, weight interpolation, task interference |
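The L1 criterion in the entry above has a simple per-weight solution: keeping a weight of the multi-task vector costs `|tau_mtl - tau_t|`, while zeroing it costs `|tau_t|`, so the mask keeps a weight whenever the first cost is no larger. The sketch below implements that rule and a consensus filter. It is a minimal illustration under assumptions: the multi-task vector is taken as the plain sum of task vectors, and the scaling factor `lam` and the threshold `min_tasks` are hypothetical knobs, not values from the paper.

```python
def merge_task_vectors(task_vectors):
    """Multi-task vector as the element-wise sum of task vectors (an assumption)."""
    return [sum(column) for column in zip(*task_vectors)]


def tall_mask(task_vector, multitask_vector, lam=1.0):
    """Keep weight i when |tau_t[i]| >= lam * |tau_mtl[i] - tau_t[i]|.

    With lam = 1 this minimizes the L1 distance between the task vector
    and the masked multi-task vector, one coordinate at a time.
    """
    return [abs(t) >= lam * abs(m - t) for t, m in zip(task_vector, multitask_vector)]


def l1_distance(task_vector, multitask_vector, mask):
    """L1 distance between the task vector and the masked multi-task vector."""
    return sum(
        abs((m if keep else 0.0) - t)
        for t, m, keep in zip(task_vector, multitask_vector, mask)
    )


def consensus_mask(masks, min_tasks=2):
    """Keep weights selected by at least min_tasks task masks.

    Weights used by no task or by a single task are dropped before merging.
    """
    return [sum(column) >= min_tasks for column in zip(*masks)]


tau_a = [1.0, 0.0, 0.5, -0.25]
tau_b = [0.0, 2.0, 0.5, 0.25]
tau_mtl = merge_task_vectors([tau_a, tau_b])
mask_a = tall_mask(tau_a, tau_mtl)
mask_b = tall_mask(tau_b, tau_mtl)
shared = consensus_mask([mask_a, mask_b])
```

Storing one binary mask per task alongside a single multi-task vector is what makes the compression claim in the entry possible, since each mask costs one bit per weight.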
2405.07648
Report |
CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution |
Qingguo Liu, Chenyi Zhuang, Pan Gao, Jie Qin |
Existing Blind image Super-Resolution (BSR) methods focus on estimating
either kernel or degradation information, but have long overlooked the
essential content details. In this paper, we propose a novel BSR approach,
Content-aware Degradation-driven Transformer (CDFormer), to capture both
degradation and content representations. However, low-resolution images cannot
provide enough content details, and thus we introduce a diffusion-based module
$CDFormer_{diff}$ to first learn Content Degradation Prior (CDP) in both low-
and high-resolution images, and then approximate the real distribution given
only low-resolution information. Moreover, we apply an adaptive SR network
$CDFormer_{SR}$ that effectively utilizes CDP to refine features. Compared to
previous diffusion-based SR methods, we treat the diffusion model as an
estimator that can overcome the limitations of expensive sampling time and
excessive diversity. Experiments show that CDFormer can outperform existing
methods, establishing a new state-of-the-art performance on various benchmarks
under blind settings. Codes and models will be available at
https://github.com/I2-Multimedia-Lab/CDFormer. |
This paper proposes CDFormer, a novel Content-aware Degradation-driven Transformer network for Blind image Super-Resolution (BSR). CDFormer leverages a two-stage training strategy to capture both degradation and content representations through a Content Degradation Prior (CDP) generation module and a CDP-guided SR network. |
Existing BSR methods typically focus solely on estimating kernel or degradation information, neglecting crucial content details. This can lead to suboptimal performance, particularly in challenging scenarios with complex degradations. |
The method utilizes a two-stage training approach. Stage 1: A ground-truth encoder (E_GT) learns CDP from paired HR and LR images to guide the SR network. Stage 2: An LR encoder (E_LR) and a diffusion model generate CDP solely from LR images. |
CDFormer achieves state-of-the-art performance on various BSR benchmarks under blind settings.
The introduction of CDP enables CDFormer to reconstruct SR images with sharper and more harmonious textures, even in cases of severe degradation.
The diffusion model effectively recreates CDP from LR images, demonstrating its potential in super-resolution tasks. |
The performance improvement of CDFormer is limited when dealing with LR images with high noise levels.
Future work could explore the integration of other techniques like contrastive learning to further enhance the robustness and accuracy of CDFormer. |
blind image super-resolution, diffusion models, transformer networks, content degradation prior, deep learning |
2405.07392
Report |
NGD-SLAM: Towards Real-Time SLAM for Dynamic Environments without GPU |
Yuhao Zhang |
Accurate and robust camera tracking in dynamic environments presents a
significant challenge for visual SLAM (Simultaneous Localization and Mapping).
Recent progress in this field often involves the use of deep learning
techniques to generate masks for dynamic objects, which usually require GPUs to
operate in real-time (30 fps). Therefore, this paper proposes a novel visual
SLAM system for dynamic environments that obtains real-time performance on CPU
by incorporating a mask prediction mechanism, which allows the deep learning
method and the camera tracking to run entirely in parallel at different
frequencies such that neither waits for the result from the other. Based on
this, it further introduces a dual-stage optical flow tracking approach and
employs a hybrid usage of optical flow and ORB features, which significantly
enhance the efficiency and robustness of the system. Compared with
state-of-the-art methods, this system maintains high localization accuracy in
dynamic environments while achieving a tracking frame rate of 56 fps on a
single laptop CPU without any hardware acceleration, thus proving that deep
learning methods are still feasible for dynamic SLAM even without GPU support.
Based on the available information, this is the first SLAM system to achieve
this. |
This paper presents NGD-SLAM, a real-time visual SLAM system for dynamic environments that achieves real-time performance on CPU by incorporating a novel mask prediction mechanism and dual-stage optical flow tracking. |
Accurate and robust camera tracking in dynamic environments is challenging for visual SLAM. Existing methods often rely on computationally expensive deep learning models, requiring GPUs for real-time performance. |
The system uses a mask prediction mechanism based on previous segmentation results and a dual-stage tracking approach employing optical flow for both dynamic and static feature tracking, coupled with ORB features for keyframe tracking. |
NGD-SLAM achieves localization accuracy comparable to state-of-the-art methods in dynamic environments.
It maintains high efficiency, achieving a tracking frame rate of 56 fps on a single laptop CPU without hardware acceleration.
To the author's knowledge, the proposed system is the first to demonstrate real-time performance on CPU for dynamic SLAM with deep learning-based dynamic object detection. |
The mask prediction mechanism may fail when a new dynamic object suddenly enters the scene.
Future work will focus on improving the system's robustness in handling complex and large-scale dynamic environments and exploring other lightweight deep learning models for improved efficiency. |
visual slam, dynamic environments, deep learning, mask prediction, optical flow tracking, real-time |
2405.07346
Report |
Understanding and Evaluating Human Preferences for AI Generated Images with Instruction Tuning |
Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min |
Artificial Intelligence Generated Content (AIGC) has grown rapidly in recent
years, among which AI-based image generation has gained widespread attention
due to its efficient and imaginative image creation ability. However,
AI-generated Images (AIGIs) may not satisfy human preferences due to their
unique distortions, which highlights the necessity to understand and evaluate
human preferences for AIGIs. To this end, in this paper, we first establish a
novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+,
which provides human visual preference scores and detailed preference
explanations from three perspectives including quality, authenticity, and
correspondence. Then, based on the constructed AIGCIQA2023+ database, this
paper presents a MINT-IQA model to evaluate and explain human preferences for
AIGIs from Multi-perspectives with INstruction Tuning. Specifically, the
MINT-IQA model first learns and evaluates human preferences for AI-generated
Images from multiple perspectives, then via the vision-language instruction tuning
strategy, MINT-IQA attains powerful understanding and explanation ability for
human visual preference on AIGIs, which can be used for feedback to further
improve the assessment capabilities. Extensive experimental results demonstrate
that the proposed MINT-IQA model achieves state-of-the-art performance in
understanding and evaluating human visual preferences for AIGIs, and the
proposed model also achieves competitive results on traditional IQA tasks
compared with state-of-the-art IQA models. The AIGCIQA2023+ database and
MINT-IQA model will be released to facilitate future research. |
This paper introduces AIGCIQA2023+, an extended dataset for evaluating human preferences in AI-generated images, and proposes MINT-IQA, a novel method for evaluating and explaining these preferences from multiple perspectives using instruction tuning. |
Understanding human preferences for AI-generated images is crucial for improving the quality of generated content and bridging the gap between human expectations and AI capabilities. |
The authors construct AIGCIQA2023+ with fine-grained preference annotations and develop MINT-IQA, which leverages a multi-modal Q-Former for representation learning, score regression for preference prediction, and vision-language instruction tuning for detailed explanation. |
MINT-IQA achieves state-of-the-art performance on three AIGC IQA datasets, demonstrating its effectiveness in evaluating human preferences from multiple perspectives.
The model also demonstrates superior performance on traditional IQA databases, highlighting its versatility in assessing image quality.
Ablation studies validate the contribution of each module in MINT-IQA, emphasizing the importance of instruction tuning and multi-perspective evaluation. |
The current model is limited by the scale of the AIGCIQA2023+ dataset.
Future work can focus on expanding the dataset and exploring different modalities beyond text and images. |
artificial intelligence generated content (aigc), image quality assessment (iqa), human visual preference, instruction tuning, multi-perspective evaluation |
2405.07306
Report |
Point Resampling and Ray Transformation Aid to Editable NeRF Models |
Zhenyang Li, Zilong Chen, Feifan Qu, Mingqing Wang, Yizhou Zhao, Kai Zhang, Yifan Peng |
In NeRF-aided editing tasks, object movement presents difficulties in
supervision generation due to the introduction of variability in object
positions. Moreover, the removal operations of certain scene objects often lead
to empty regions, presenting challenges for NeRF models in inpainting them
effectively. We propose an implicit ray transformation strategy, allowing for
direct manipulation of the 3D object's pose by operating on the neural-point in
NeRF rays. To address the challenge of inpainting potential empty regions, we
present a plug-and-play inpainting module, dubbed differentiable neural-point
resampling (DNR), which interpolates those regions in 3D space at the original
ray locations within the implicit space, thereby facilitating object removal &
scene inpainting tasks. Importantly, employing DNR effectively narrows the gap
between ground truth and predicted implicit features, potentially increasing
the mutual information (MI) of the features across rays. Then, we leverage DNR
and ray transformation to construct a point-based editable NeRF pipeline
PR^2T-NeRF. Results primarily evaluated on 3D object removal & inpainting tasks
indicate that our pipeline achieves state-of-the-art performance. In addition,
our pipeline supports high-quality rendering visualization for diverse editing
operations without necessitating extra supervision. |
This paper introduces a novel approach for object removal and scene inpainting in neural radiance fields (NeRFs) by combining implicit ray transformation with a differentiable neural-point resampling (DNR) strategy. |
Object manipulation in NeRFs, particularly removal and inpainting, presents challenges due to the need for precise supervision and the potential for artifacts in the edited regions. This work addresses these issues by directly manipulating rays and developing a method for consistent inpainting. |
The method involves: 1) Implicit ray transformation for object manipulation (rotation, translation, scaling, removal). 2) Target object segmentation using a pretrained SAM model and depth estimation. 3) Differentiable Neural-Point Resampling (DNR) to interpolate features in empty regions, enhancing consistency and visual quality. 4) Fine-tuning with a combination of reconstruction, perceptual, depth, and sparse losses. |
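Step 1 above can be illustrated with a minimal sketch. It assumes one common way of editing a pose in a radiance field: the field is left untouched and the sample points on each ray are mapped through the inverse of the desired object transform before the field is queried. Whether PR^2T-NeRF does exactly this is an assumption, and every function name below is hypothetical.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return [c * x - s * y, s * x + c * y, z]


def object_to_world(point, angle, translation):
    """Desired edit of the object: rotate about z, then translate."""
    rotated = rotate_z(point, angle)
    return [r + t for r, t in zip(rotated, translation)]


def world_to_object(point, angle, translation):
    """Inverse edit, applied to ray samples instead of to the scene itself."""
    shifted = [p - t for p, t in zip(point, translation)]
    return rotate_z(shifted, -angle)


def transformed_ray_samples(origin, direction, depths, angle, translation):
    """Sample points along a ray, mapped back into the unedited object's frame."""
    samples = [[o + d * t for o, d in zip(origin, direction)] for t in depths]
    return [world_to_object(s, angle, translation) for s in samples]


# Hypothetical edit: rotate the object by 90 degrees and shift it along x.
angle, shift = math.pi / 2, [1.0, 0.0, 0.0]
queries = transformed_ray_samples(
    origin=[0.0, 0.0, 0.0], direction=[1.0, 0.0, 0.0],
    depths=[0.5, 1.0, 2.0], angle=angle, translation=shift,
)
```

The points in `queries` are where the unedited field would be evaluated, so the object appears moved without retraining the field itself.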
The proposed method achieves state-of-the-art performance on scene object removal and inpainting benchmarks.
DNR strategies, particularly GWFA, are shown to significantly improve inpainting quality and convergence speed.
Theoretical analysis and experimental validation demonstrate that DNR effectively increases mutual information among rays, leading to better feature consistency and inpainting results. |
The method's reliance on pretrained models for depth estimation, object segmentation, and inpainting could introduce limitations depending on their performance.
Future work includes jointly optimizing depth estimation with object masks and integrating DNR directly into the NeRF rendering process. |
neural radiance fields, scene editing, object removal, scene inpainting, differentiable rendering |
2405.07288
Report |
Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning |
Masane Fuchi, Tomohiro Takagi |
Generating images from text has become easier because of the scaling of
diffusion models and advancements in the field of vision and language. These
models are trained using vast amounts of data from the Internet. Hence, they
often contain undesirable content such as copyrighted material. As it is
challenging to remove such data and retrain the models, methods for erasing
specific concepts from pre-trained models have been investigated. We propose a
novel concept-erasure method that updates the text encoder using few-shot
unlearning in which a few real images are used. The discussion regarding the
generated images after erasing a concept has been lacking. While there are
methods for specifying the transition destination for concepts, the validity of
the specified concepts is unclear. Our method implicitly achieves this by
transitioning to the latent concepts inherent in the model or the images. Our
method can erase a concept within 10 s, making concept erasure more accessible
than ever before. Implicitly transitioning to related concepts leads to more
natural concept erasure. We applied the proposed method to various concepts and
confirmed that concept erasure can be achieved tens to hundreds of times faster
than with current methods. By varying the parameters to be updated, we obtained
results suggesting that, like previous research, knowledge is primarily
accumulated in the feed-forward networks of the text encoder. |
This paper proposes a novel, fast method for erasing specific concepts from text-to-image diffusion models by updating the text encoder using few-shot unlearning. |
Existing methods for removing undesirable concepts from pre-trained text-to-image models are computationally expensive and often lead to a decrease in generation quality. This paper addresses these limitations with a faster, more efficient approach. |
The proposed method leverages few-shot unlearning by maximizing the stable diffusion loss with a reversed gradient, focusing on the text encoder while keeping the U-Net parameters unchanged. This forces the model to 'forget' the target concept represented by the text. |
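The reversed-gradient idea can be illustrated with a toy sketch. It assumes a hypothetical scalar model `w * x` with a squared-error loss, standing in for the text encoder and the diffusion loss; none of the names come from the paper. Only `w` is updated, and the update moves along the gradient rather than against it, so the loss on the few concept samples grows.

```python
def loss(w, samples):
    """Mean squared error of the toy model w * x on (x, y) samples."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


def grad(w, samples):
    """Analytic gradient of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def unlearn(w, samples, lr=0.05, steps=5):
    """Gradient ascent: step along +grad so the loss on `samples` increases."""
    for _ in range(steps):
        w = w + lr * grad(w, samples)  # reversed sign compared with training
    return w


# A few hypothetical samples of the concept to forget (roughly y = 2x).
concept_samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
w_trained = 2.1  # slightly off the optimum, so the gradient is non-zero
w_erased = unlearn(w_trained, concept_samples)
```

In the setting described above, the frozen U-Net would correspond to everything in the loss other than `w`, which stays fixed during the ascent steps.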
The method achieves a significant speedup (60-900 times) compared to baseline methods, enabling concept erasure within 10 seconds.
Concept erasure is achieved by providing only a few real images related to the target concept.
The method implicitly transitions to semantically similar concepts, leading to more natural concept erasure without requiring explicit anchor concepts. |
The method may face challenges erasing concepts with large semantic spaces.
Future work includes developing more robust evaluation metrics for concept erasure and exploring alternative methods like saliency map-based approaches. |
concept erasure, text-to-image diffusion models, few-shot unlearning, text encoder, stable diffusion |
2405.07145
Report |
Stable Signature is Unstable: Removing Image Watermark from Diffusion Models |
Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong |
Watermarking has been widely deployed by industry to detect AI-generated images.
A recent watermarking framework called Stable Signature (proposed by
Meta) roots a watermark into the parameters of a diffusion model's decoder such
that its generated images are inherently watermarked. Stable Signature makes it
possible to watermark images generated by open-source diffusion models
and was claimed to be robust against removal attacks. In this work, we propose
a new attack to remove the watermark from a diffusion model by fine-tuning it.
Our results show that our attack can effectively remove the watermark from a
diffusion model such that its generated images are non-watermarked, while
maintaining the visual quality of the generated images. Our results highlight
that Stable Signature is not as stable as previously thought. |
This paper introduces a new model-targeted attack method to remove in-generation watermarks from open-source diffusion models by fine-tuning the decoder. |
The misuse of AI-generated images presents risks of misinformation, making watermarking crucial for detection. Existing methods are vulnerable in open-source settings, and current removal attacks are either inefficient or significantly degrade image quality. |
The attack involves two steps: 1) Estimating the denoised latent vector for non-watermarked images in an attacking dataset, with different approaches for encoder-aware and encoder-agnostic scenarios. 2) Fine-tuning the decoder using the estimated latent vectors and non-watermarked images to minimize reconstruction error and fool a discriminator. |
The attack successfully evades watermark detection with high evasion rates and low bitwise accuracy.
It maintains significantly better image quality (FID) than the existing model purification attack.
The attack is more efficient than per-image-based removal attacks when processing a large number of images. |
The fine-tuning process in the encoder-agnostic scenario can be time-consuming.
Future work includes exploring more robust watermarking methods for open-source diffusion models. |
image watermarking, diffusion models, watermark removal, generative ai, adversarial attacks |
2405.07023
Report |
Efficient Real-world Image Super-Resolution Via Adaptive Directional Gradient Convolution |
Long Peng, Yang Cao, Renjing Pei, Wenbo Li, Jiaming Guo, Xueyang Fu, Yang Wang, Zheng-Jun Zha |
Real-SR endeavors to produce high-resolution images with rich details while
mitigating the impact of multiple degradation factors. Although existing
methods have achieved impressive results in detail recovery, they still
fall short when addressing regions with complex gradient arrangements due to
the intensity-based linear weighting feature extraction manner. Moreover, the
stochastic artifacts introduced by degradation cues during the imaging process
in real LR increase the disorder of the overall image details, further
complicating the perception of intrinsic gradient arrangement. To address these
challenges, we innovatively introduce kernel-wise differential operations
within the convolutional kernel and develop several learnable directional
gradient convolutions. These convolutions are integrated in parallel with a
novel linear weighting mechanism to form an Adaptive Directional Gradient
Convolution (DGConv), which adaptively weights and fuses the basic directional
gradients to improve the gradient arrangement perception capability for both
regular and irregular textures. Coupled with DGConv, we further devise a novel
equivalent parameter fusion method for DGConv that maintains its rich
representational capabilities while keeping computational costs consistent with
a single Vanilla Convolution (VConv), enabling DGConv to improve the
performance of existing super-resolution networks without incurring additional
computational expenses. To better leverage the superiority of DGConv, we
further develop an Adaptive Information Interaction Block (AIIBlock) to adeptly
balance the enhancement of texture and contrast while meticulously
investigating the interdependencies, culminating in the creation of a DGPNet
for Real-SR through simple stacking. Comparative results with 15 SOTA methods
across three public datasets underscore the effectiveness and efficiency of our
proposed approach. |
This paper introduces DGConv, a novel 'plug-and-play' convolutional unit that enhances detail and contrast representation in real-world image super-resolution (Real-SR) without increasing computational cost. |
Real-world low-resolution images suffer from complex degradations that disrupt texture arrangements and statistical properties, making detail and contrast restoration challenging. Existing methods struggle to address complex gradient arrangements and often introduce computational overhead. |
DGConv integrates learnable directional gradient and aggregation operations to enhance perception of regular and irregular textures, and image contrast. An equivalent parameter fusion method maintains computational cost comparable to Vanilla Convolution (VConv). An Adaptive Information Interaction Block (AIIBlock) balances texture and contrast enhancement. These components are combined in the Directional Gradient Perceiving Network (DGPNet). |
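The equivalent parameter fusion can be illustrated in one dimension. The sketch below assumes a single central-difference branch next to a vanilla branch, with fixed fusion weights `alpha` and `beta`; the paper's actual directional gradient operators and weighting mechanism may differ. Because both branches are linear in the input, their weighted sum collapses into a single 3-tap kernel.

```python
def conv1d(x, kernel):
    """Valid-mode 1D cross-correlation with a 3-tap kernel."""
    return [sum(kernel[k] * x[i + k] for k in range(3)) for i in range(len(x) - 2)]


def diff_conv1d(x, kernel):
    """Central-difference conv: each tap sees x[i + k] minus the centre sample."""
    return [
        sum(kernel[k] * (x[i + k] - x[i + 1]) for k in range(3))
        for i in range(len(x) - 2)
    ]


def fuse(vanilla_k, diff_k, alpha, beta):
    """Fold alpha * vanilla + beta * difference branches into one 3-tap kernel."""
    fused = [alpha * v + beta * d for v, d in zip(vanilla_k, diff_k)]
    fused[1] -= beta * sum(diff_k)  # the subtracted centre term moves into the centre tap
    return fused


# Hypothetical signal, kernels, and fusion weights.
x = [1.0, 4.0, 2.0, 8.0, 5.0]
vanilla_k = [0.2, 0.5, -0.1]
diff_k = [1.0, 0.3, -0.7]
alpha, beta = 0.6, 0.4

two_branch = [
    alpha * a + beta * b
    for a, b in zip(conv1d(x, vanilla_k), diff_conv1d(x, diff_k))
]
single = conv1d(x, fuse(vanilla_k, diff_k, alpha, beta))
```

`single` matches `two_branch` up to floating-point error, which is why inference in this sketch costs the same as one vanilla convolution.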
DGPNet outperforms 15 state-of-the-art Real-SR methods on benchmark datasets, achieving superior detail recovery and contrast enhancement with low computational complexity.
Replacing VConv with DGConv in five classical SR methods consistently improves performance.
Ablation studies confirm the contribution of each component in DGConv and the effectiveness of using local statistical mean for gradient and aggregation operations. |
Future work includes exploring additional directional arrangement convolutions to further enhance DGConv's representation capacity.
Future work also includes validating and extending DGConv to other image and video super-resolution and restoration tasks. |
image super-resolution, real-world image super-resolution, deep learning, convolutional neural networks, directional gradient convolution |
2405.06948
Report |
Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation |
Shengyuan Liu, Bo Wang, Ye Ma, Te Yang, Xipeng Cao, Quan Chen, Han Li, Di Dong, Peng Jiang |
Existing subject-driven text-to-image generation models suffer from tedious
fine-tuning steps and struggle to maintain both text-image alignment and
subject fidelity. When generating compositional subjects, they often encounter
problems such as object missing and attribute mixing, where some subjects in
the input prompt are not generated or their attributes are incorrectly
combined. To address these limitations, we propose a subject-driven generation
framework and introduce training-free guidance to intervene in the generative
process during inference time. This approach strengthens the attention map,
allowing for precise attribute binding and feature injection for each subject.
Notably, our method exhibits exceptional zero-shot generation ability,
especially in the challenging task of compositional generation. Furthermore, we
propose a novel metric GroundingScore to evaluate subject alignment thoroughly.
The obtained quantitative results serve as compelling evidence showcasing the
effectiveness of our proposed method. The code will be released soon. |
This paper introduces SE-Guidance, a training-free method for subject-driven text-to-image generation that enhances attention maps to improve attribute binding and feature injection for each subject, particularly in compositional generation. |
Existing subject-driven generation models often require tedious fine-tuning and struggle with object missing and attribute mixing in compositional generation. This method addresses these limitations by providing a training-free approach. |
The method utilizes an image prompt adapter to inject subject representations and employs SE-Guidance during inference. SE-Guidance extracts subject attention maps, injects subject representations in the forward process, and enhances attention to subjects in the backward process. |
The method achieves comparable results to fine-tuned methods in single-concept generation while demonstrating superior text-image alignment.
In compositional generation, SE-Guidance effectively addresses object missing and attribute mixing, surpassing baseline methods in preserving subject fidelity and text alignment.
A novel metric, GroundingScore, is introduced for a more accurate evaluation of subject alignment in compositional generation. |
The method's effectiveness is limited by the expressive power of the underlying generative model, particularly for unique or rare objects.
Addressing fine-grained composition relations with more than two subjects remains challenging and requires further exploration. |
text-to-image generation, diffusion models, subject-driven generation, compositional generation, attention mechanisms |
2405.06914
Report |
Non-confusing Generation of Customized Concepts in Diffusion Models |
Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang |
We tackle the common challenge of inter-concept visual confusion in
compositional concept generation using text-guided diffusion models (TGDMs). It
becomes even more pronounced in the generation of customized concepts, due to
the scarcity of user-provided concept visual examples. By revisiting the two
major stages leading to the success of TGDMs -- 1) contrastive image-language
pre-training (CLIP) for text encoder that encodes visual semantics, and 2)
training TGDM that decodes the textual embeddings into pixels -- we point out that
existing customized generation methods only focus on fine-tuning the second
stage while overlooking the first one. To this end, we propose a simple yet
effective solution called CLIF: contrastive image-language fine-tuning.
Specifically, given a few samples of customized concepts, we obtain
non-confusing textual embeddings of a concept by fine-tuning CLIP via
contrasting a concept and the over-segmented visual regions of other concepts.
Experimental results demonstrate the effectiveness of CLIF in preventing the
confusion of multi-customized concept generation. |
This paper introduces CLIF, a novel approach to prevent inter-concept visual confusion in composing multiple customized concepts using text-guided diffusion models. |
Existing methods for customized concept generation often lead to visual confusion, especially in complex compositions, hindering the generation of novel and distinct concepts. |
CLIF employs a two-stage fine-tuning approach: 1) Contrastive fine-tuning of the text encoder with an over-segmented concept dataset to distinguish textual embeddings. 2) Fine-tuning the text-to-image decoder to synthesize non-confusing images using the decoupled concept embeddings. |
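The contrastive stage can be illustrated with a generic InfoNCE-style loss. This is a sketch under the assumption that the concept's text embedding is pulled towards a region of its own concept and pushed away from regions of other concepts; the exact loss used by CLIF is not given here, and the embeddings below are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[0]


# Hypothetical 2D embeddings: one concept's text token and three image regions.
text_emb = [1.0, 0.0]
own_region = [0.9, 0.1]
other_regions = [[0.0, 1.0], [-0.5, 0.8]]

loss_separated = info_nce(text_emb, own_region, other_regions)
loss_confused = info_nce(text_emb, other_regions[0], [own_region, other_regions[1]])
```

The loss is small when the text embedding already matches its own region and large when it is paired with another concept's region, which is the signal a fine-tuning step would minimise.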
CLIF effectively mitigates identity loss, attribute leaking, and concept missing in multi-concept customization.
CLIF demonstrates superior performance in both qualitative and quantitative evaluations compared to state-of-the-art methods.
Ablation studies confirm the importance of global, regional, and mix augmentation in enhancing identity preservation, attribute binding, and concept attendance respectively. |
Generating more than 2 customized concepts simultaneously, though achievable, has limitations requiring further research.
The potential misuse of CLIF for creating deepfakes necessitates robust ethical guidelines and monitoring. |
text-guided diffusion models, concept customization, multi-concept generation, contrastive learning, image generation |
2405.06535
Report |
Controllable Image Generation With Composed Parallel Token Prediction |
Jamie Stirling, Noura Al-Moubayed |
Compositional image generation requires models to generalise well in
situations where two or more input concepts do not necessarily appear together
in training (compositional generalisation). Despite recent progress in
compositional image generation via composing continuous sampling processes such
as diffusion and energy-based models, composing discrete generative processes
has remained an open challenge, with the promise of providing improvements in
efficiency, interpretability and simplicity. To this end, we propose a
formulation for controllable conditional generation of images via composing the
log-probability outputs of discrete generative models of the latent space. Our
approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art
generation accuracy in three distinct settings (FFHQ, Positional CLEVR and
Relational CLEVR) while attaining competitive Fréchet Inception Distance
(FID) scores. Our method attains an average generation accuracy of 80.71%
across the studied settings. Our method also outperforms the next-best approach
(ranked by accuracy) in terms of FID in seven out of nine experiments, with an
average FID of 24.23 (an average improvement of -9.58). Furthermore, our
method offers a 2.3x to 12x speedup over comparable continuous
compositional methods on our hardware. We find that our method can generalise
to combinations of input conditions that lie outside the training data (e.g.
more objects per image) in addition to offering an interpretable dimension of
controllability via concept weighting. We further demonstrate that our approach
can be readily applied to an open pre-trained discrete text-to-image model
without any fine-tuning, allowing for fine-grained control of text-to-image
generation. |
This paper introduces a novel method for controllable conditional image generation by composing discrete iterative generative processes, achieving state-of-the-art accuracy. |
Compositional generalisation in image generation, particularly the ability to handle unfamiliar combinations of concepts, is crucial for creating AI with human-like intelligence. |
The method involves: (1) deriving formulae for logical operations on probabilistic outputs of discrete models, (2) adapting this for parallel token prediction, and (3) employing concept weighting for enhanced control. |
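The composition of log-probability outputs can be sketched for a single token position. The sketch assumes the standard product form that follows from Bayes' rule when conditions are independent given the token, log p(x | c_1..c_n) = log p(x) + sum_i w_i (log p(x | c_i) - log p(x)) + const, with `w_i` as concept weights. Whether this matches the paper's exact formulae is an assumption, and all function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Normalise unnormalised log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def compose_and(log_p_uncond, log_p_conds, weights=None):
    """Combine per-condition token log-probs into one normalised distribution."""
    if weights is None:
        weights = [1.0] * len(log_p_conds)
    scores = []
    for k, base in enumerate(log_p_uncond):
        s = base + sum(w * (lp[k] - base) for w, lp in zip(weights, log_p_conds))
        scores.append(s)
    return log_softmax(scores)


# Toy vocabulary of three codebook entries for one token position.
uncond = log_softmax([0.0, 0.0, 0.0])
cond_a = log_softmax([2.0, 0.0, 0.0])   # condition A favours token 0
cond_b = log_softmax([1.0, 1.0, -1.0])  # condition B favours tokens 0 and 1
composed = compose_and(uncond, [cond_a, cond_b], weights=[1.0, 1.0])
probs = [math.exp(v) for v in composed]
```

In parallel token prediction the same combination would be applied independently at every position being predicted in a step. Raising a weight above 1 strengthens that condition's influence.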
The method achieves state-of-the-art generation accuracy across three datasets (FFHQ, Positional CLEVR, Relational CLEVR), outperforming existing techniques.
It offers competitive Fréchet Inception Distance (FID) scores, indicating good image quality.
The approach is computationally efficient, demonstrating a 2.3x to 12x speedup compared to similar continuous methods. |
The method requires multiple feed-forward operations, potentially impacting efficiency, although this is mitigated by fast convergence.
The assumption of independent input conditions might not always hold in real-world scenarios, potentially limiting generalisation.
Future work could explore learned concept-weighting policies for enhanced controllability. |
image generation, compositional generalisation, discrete generative models, parallel token prediction, controllable generation |
2405.06525
Report |
Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation |
Xiaowen Ma, Zhenliang Ni, Xinghao Chen |
Vanilla pixel-level classifiers for semantic segmentation are based on a
certain paradigm, involving the inner product of fixed prototypes obtained from
the training set and pixel features in the test image. This approach, however,
encounters significant limitations, i.e., feature deviation in the semantic
domain and information loss in the spatial domain. The former struggles with
large intra-class variance among pixel features from different images, while
the latter fails to utilize the structured information of semantic objects
effectively. This leads to blurred mask boundaries as well as a deficiency of
fine-grained recognition capability. In this paper, we propose a novel Semantic
and Spatial Adaptive (SSA) classifier to address the above challenges.
Specifically, we employ the coarse masks obtained from the fixed prototypes as
a guide to adjust the fixed prototype towards the center of the semantic and
spatial domains in the test image. The adapted prototypes in semantic and
spatial domains are then simultaneously considered to accomplish classification
decisions. In addition, we propose an online multi-domain distillation learning
strategy to improve the adaptation process. Experimental results on three
publicly available benchmarks show that the proposed SSA significantly improves
the segmentation performance of the baseline models with only a minimal
increase in computational cost. Code is available at
https://github.com/xwmaxwma/SSA. |
This paper presents a Semantic and Spatial Adaptive (SSA) classifier designed to enhance pixel-level classification for semantic segmentation. |
Vanilla pixel-level classifiers suffer from limitations like feature deviation in the semantic domain and information loss in the spatial domain, leading to inaccurate segmentation. |
The SSA classifier uses coarse masks to guide the adaptation of fixed prototypes towards semantic and spatial centers in test images, capturing both semantic and spatial relationships for classification. It also employs online multi-domain distillation learning to refine feature representation and constrain prototype adaptation. |
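The semantic half of the prototype adaptation can be sketched as follows. The sketch assumes that the coarse mask is a softmax over inner products with the fixed prototypes and that the adapted prototype is the mask-weighted mean of the test image's pixel features. The spatial branch and the distillation strategy are omitted, and all names and values are hypothetical.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_masks(features, prototypes):
    """Per-pixel class probabilities from inner products with fixed prototypes."""
    return [softmax([dot(f, p) for p in prototypes]) for f in features]


def adapt_prototypes(features, masks, num_classes):
    """Mask-weighted mean of pixel features per class (semantic centre of this image)."""
    dim = len(features[0])
    adapted = []
    for c in range(num_classes):
        weight = sum(m[c] for m in masks)
        centre = [
            sum(m[c] * f[d] for m, f in zip(masks, features)) / weight
            for d in range(dim)
        ]
        adapted.append(centre)
    return adapted


# Four hypothetical pixel features from a test image and two fixed prototypes.
features = [[2.0, 0.1], [1.8, -0.1], [0.2, 1.5], [0.0, 1.7]]
fixed = [[1.0, 0.0], [0.0, 1.0]]

masks = coarse_masks(features, fixed)
adapted = adapt_prototypes(features, masks, num_classes=2)
labels = [max(range(2), key=lambda c: dot(f, adapted[c])) for f in features]
```

The adapted prototypes move from the training-set directions towards where this image's features actually lie, which is the behaviour the report attributes to SSA.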
SSA significantly improves segmentation performance on ADE20K, PASCAL-Context, and COCO-Stuff-10K datasets with minimal computational overhead.
It outperforms other state-of-the-art classifiers like GMMSeg and CAC.
The method enables lightweight models to achieve state-of-the-art performance in real-time segmentation tasks. |
The method requires 1.1 times more training time compared to the baseline due to the use of a teacher classifier.
Future work can explore the integration of attention mechanisms into SSA to further enhance feature representation. |
semantic segmentation, pixel-level classification, adaptive classifier, multi-domain distillation, spatial reasoning |
2405.06461
Report |
SketchDream: Sketch-based Text-to-3D Generation and Editing |
Feng-Lin Liu, Hongbo Fu, Yu-Kun Lai, Lin Gao |
Existing text-based 3D generation methods generate attractive results but
lack detailed geometry control. Sketches, known for their conciseness and
expressiveness, have contributed to intuitive 3D modeling but are confined to
producing texture-less mesh models within predefined categories. Integrating
sketch and text simultaneously for 3D generation promises enhanced control over
geometry and appearance but faces challenges from 2D-to-3D translation
ambiguity and multi-modal condition integration. Moreover, further editing of
3D models in arbitrary views will give users more freedom to customize their
models. However, it is difficult to achieve high generation quality, preserve
unedited regions, and manage proper interactions between shape components. To
solve the above issues, we propose a text-driven 3D content generation and
editing method, SketchDream, which supports NeRF generation from given
hand-drawn sketches and achieves free-view sketch-based local editing. To
tackle the 2D-to-3D ambiguity challenge, we introduce a sketch-based multi-view
image generation diffusion model, which leverages depth guidance to establish
spatial correspondence. A 3D ControlNet with a 3D attention module is utilized
to control multi-view images and ensure their 3D consistency. To support local
editing, we further propose a coarse-to-fine editing approach: the coarse phase
analyzes component interactions and provides 3D masks to label edited regions,
while the fine stage generates realistic results with refined details by local
enhancement. Extensive experiments validate that our method generates
higher-quality results compared with a combination of 2D ControlNet and
image-to-3D generation techniques and achieves detailed control compared with
existing diffusion-based 3D editing approaches. |
This paper introduces SketchDream, a novel method for text-driven 3D content generation and editing that leverages user-provided sketches to enable fine-grained control over object geometry and appearance. |
Existing text-to-3D generation methods lack detailed control over geometry, while sketch-based methods are limited in generating textured 3D models. SketchDream combines the expressiveness of sketches with the semantic richness of text prompts to allow users to create and edit high-quality 3D content with greater precision. |
The method utilizes a sketch-based multi-view image generation diffusion model to generate realistic multi-view images from input sketches and text prompts. It employs depth-guided warping to establish spatial correspondence and a 3D attention module for cross-view consistency. A coarse-to-fine editing framework is introduced for local editing, refining the initial results with a precise 3D mask and a local rendering strategy for enhanced quality and sketch faithfulness. |
SketchDream generates higher-quality 3D content than existing sketch-based text-to-3D baselines, achieving better geometry and appearance realism.
The method enables detailed control over 3D model generation by combining text prompts for appearance and sketches for shape and texture.
SketchDream outperforms existing sketch-based 3D editing approaches, offering more realistic editing results while preserving unedited regions. |
The generation and editing quality may be degraded for objects that are rare in the training dataset.
The current implementation is computationally expensive, limiting interactive generation and editing. |
sketch-based interaction, diffusion models, neural radiance fields, 3d generation, 3d editing |
2405.06408
Report |
I3DGS: Improve 3D Gaussian Splatting from Multiple Dimensions |
Jinwei Lin |
3D Gaussian Splatting is a novel method for 3D view synthesis, which can achieve
better rendering results than traditional implicit neural rendering technology
while keeping high-definition output and a fast rendering speed.
However, it is still difficult to achieve sufficient efficiency with 3D Gaussian
Splatting for practical applications. To address this issue, we propose
I3DS, a solution for evaluating and experimentally testing performance
improvements of the synthetic model. Across multiple important levels and
dimensions of the original 3D Gaussian Splatting, we ran more than two thousand
experiments of various kinds to test how different items and components
affect the training efficiency of the 3D Gaussian Splatting model. In this
paper, we share extensive practical experience and methods for improving
training and performance, and describe the impact of different items
of the model. We present a special integer compression in base 95 and a
floating-point compression in base 94, with an ASCII encoding and decoding
mechanism. Many real experimental results and observed phenomena are
recorded. After a series of reasonable fine-tuning steps, I3DS
achieves substantial performance improvements over the original model. The project code
is available as open source. |
This paper proposes I3DS, a method to improve the training efficiency of 3D Gaussian Splatting models for 3D view synthesis. |
3D Gaussian Splatting is a promising technique for 3D view synthesis, offering high resolution and fast rendering speed. However, its training efficiency requires improvement for practical applications. |
The paper explores improvements from multiple dimensions: analyzing the impact of color components and backgrounds, optimizing learning rates, and introducing data compression techniques. |
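The base-95 integer compression mentioned in the abstract can be illustrated with a generic positional encoding over the 95 printable ASCII characters (code points 32 to 126). This is a sketch of the general idea only: the paper's actual scheme is not specified here, the base-94 floating-point variant is not covered, and the function names are hypothetical.

```python
FIRST = 32  # ' ' is the first printable ASCII character
BASE = 95   # printable ASCII spans code points 32..126


def encode_int(n):
    """Encode a non-negative integer as a base-95 string of printable ASCII."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    digits = []
    while True:
        n, r = divmod(n, BASE)
        digits.append(chr(FIRST + r))
        if n == 0:
            break
    return "".join(reversed(digits))


def decode_int(text):
    """Invert encode_int by accumulating base-95 digits."""
    n = 0
    for ch in text:
        n = n * BASE + (ord(ch) - FIRST)
    return n


encoded = encode_int(1_000_000)
decoded = decode_int(encoded)
```

Each character carries log2(95), roughly 6.57 bits, so a seven-digit decimal integer fits in four characters in this sketch.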
Removing color components during training significantly improves speed (5-8x) but adding them back is challenging and requires further investigation.
Setting the maximum degree of Spherical Harmonics coefficients to 0 speeds up training (16.57% improvement) without significantly impacting the rendering quality.
Customizing learning rates, particularly the XYZ learning rate and scaling learning rate, demonstrably enhances training speed. |
Adding back color information after training without color is a significant challenge and requires further research to achieve satisfactory results.
The proposed ASCII encoding-decoding compression offers modest speed improvements (around 6%) and further optimization is needed for compressing color matrices. |
3d gaussian splatting, 3d view synthesis, training efficiency, spherical harmonics, data compression |
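The I3DS abstract mentions an integer compression in base 95 with ASCII encoding and decoding but does not spell out the scheme. The block below is a minimal sketch of what such an encoding could look like, assuming the 95 printable ASCII characters (codes 32 to 126) serve as the digits. The function names `encode_base95` and `decode_base95` are hypothetical and are not taken from the project code.

```python
# Hypothetical sketch: base-95 integer encoding over printable ASCII.
# Assumption: the digits are the 95 printable ASCII characters, codes 32..126.
BASE = 95
OFFSET = 32  # ord(" "), the first printable ASCII character


def encode_base95(n: int) -> str:
    """Encode a non-negative integer as a base-95 ASCII string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return chr(OFFSET)
    digits = []
    while n > 0:
        n, remainder = divmod(n, BASE)
        digits.append(chr(OFFSET + remainder))
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(digits))


def decode_base95(s: str) -> int:
    """Decode a base-95 ASCII string back into an integer."""
    n = 0
    for ch in s:
        digit = ord(ch) - OFFSET
        if not 0 <= digit < BASE:
            raise ValueError(f"character {ch!r} is not a base-95 digit")
        n = n * BASE + digit
    return n
```

Under this assumption, one million fits in four characters rather than seven decimal digits, which illustrates where the size reduction for text-serialized integers would come from.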
2405.06241
Report |
MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization |
Pengcheng Zhu, Yaoming Zhuang, Baoquan Chen, Li Li, Chengdong Wu, Zhanlin Liu |
This letter introduces a novel framework for dense Visual Simultaneous
Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, Gaussian
Splatting-based SLAM has yielded promising results, but it relies on RGB-D input and
is weak in tracking. To address these limitations, we integrate
advanced sparse visual odometry with a dense Gaussian Splatting scene
representation for the first time, thereby eliminating the dependency on depth
maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking
robustness. Here, the sparse visual odometry tracks camera poses in the RGB stream,
while Gaussian Splatting handles map reconstruction. These components are
interconnected through a Multi-View Stereo (MVS) depth estimation network. We
also propose a depth smooth loss to reduce the negative effect of estimated depth
maps. Furthermore, the consistency in scale between the sparse visual odometry
and the dense Gaussian map is preserved by Sparse-Dense Adjustment Ring (SDAR).
We have evaluated our system across various synthetic and real-world datasets.
The accuracy of our pose estimation surpasses existing methods and achieves
state-of-the-art performance. Additionally, it outperforms previous monocular
methods in terms of novel view synthesis fidelity, matching the results of
neural SLAM systems that utilize RGB-D input. |
Introduces MGS-SLAM, a novel monocular dense SLAM system that combines sparse visual odometry with 3D Gaussian Splatting for the first time. |
Addresses limitations of existing Gaussian Splatting-based SLAM systems that rely on RGB-D input and suffer from weak tracking, enabling dense mapping with only RGB images. |
Integrates sparse visual odometry (DPVO) with a dense Gaussian Splatting scene representation. Employs a pre-trained MVS depth estimation network to bridge the two components and proposes a depth smooth loss and Sparse-Dense Adjustment Ring (SDAR) to ensure geometric accuracy and scale consistency. |
Achieves state-of-the-art pose estimation accuracy, outperforming previous monocular and some RGB-D methods on TUM and Replica datasets.
Demonstrates robust tracking on large-scale datasets like Replica, unlike previous monocular Gaussian Splatting-based SLAM systems.
Produces high-fidelity novel view synthesis results, comparable to neural SLAM systems using RGB-D input. |
Real-time performance is still limited compared to some traditional SLAM methods.
Further research on incorporating loop closure and global optimization techniques could improve performance in challenging scenarios. |
slam, gaussian splatting, monocular vision, dense mapping, differentiable rendering |
2405.06147
Report |
State-Free Inference of State-Space Models: The Transfer Function Approach |
Rom N. Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T. H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon, Taiji Suzuki, Atsushi Yamashita, Michael Poli |
We approach designing a state-space model for deep learning applications
through its dual representation, the transfer function, and uncover a highly
efficient sequence parallel inference algorithm that is state-free: unlike
other proposed algorithms, state-free inference does not incur any significant
memory or computational cost with an increase in state size. We achieve this
using properties of the proposed frequency domain transfer function
parametrization, which enables direct computation of its corresponding
convolutional kernel's spectrum via a single Fast Fourier Transform. Our
experimental results across multiple sequence lengths and state sizes
illustrate, on average, a 35% training speed improvement over S4 layers --
parametrized in time-domain -- on the Long Range Arena benchmark, while
delivering state-of-the-art downstream performances over other attention-free
approaches. Moreover, we report improved perplexity in language modeling over a
long convolutional Hyena baseline, by simply introducing our transfer function
parametrization. Our code is available at https://github.com/ruke1ire/RTF. |
Presents Rational Transfer Function (RTF), a novel parametrization of state-space models (SSM) for sequence processing based on a frequency domain representation. |
Addresses limitations of existing SSMs like restricted expressiveness due to diagonal state transition matrices and high memory cost in parallel scan-based inference. |
Leverages the transfer function, the dual of impulse response, to design a state-free parallel inference algorithm based on the Fast Fourier Transform (FFT). |
Achieves state-of-the-art accuracy among attention-free models on the Long Range Arena benchmark.
Demonstrates faster training speed compared to S4 and S4D layers across different state sizes.
Shows improved perplexity over a long convolutional Hyena baseline in language modeling by introducing the transfer function parametrization. |
RTF with small state sizes struggled to learn a policy beyond random guessing on the Path-X task of the LRA benchmark.
Directly training RTF on language modeling exhibited instability issues, necessitating the use of parameter constraints and specific initialization schemes. |
state-space model, transfer function, sequence modeling, parallel inference, frequency domain |
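The record above describes obtaining a convolution kernel's spectrum from a transfer function with the FFT. As a hedged illustration of the general idea (not the RTF authors' implementation), the sketch below evaluates a rational transfer function H(z) = B(z)/A(z) at the roots of unity by transforming the zero-padded coefficient vectors, then inverts the ratio to get a kernel. The coefficient layout (powers of z^-1, with a[0] = 1) and all function names are assumptions. The result is the time-aliased impulse response, which is close to the first `length` taps only when the response decays well within `length`.

```python
import cmath


def _fft(x, sign):
    """Recursive radix-2 transform; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = _fft(x[0::2], sign)
    odd = _fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft(x):
    return _fft(x, -1)


def ifft(x):
    return [v / len(x) for v in _fft(x, +1)]


def kernel_from_transfer_function(b, a, length):
    """Kernel of H(z) = B(z)/A(z); b, a are coefficients of z^-k, a[0] == 1.

    The cost depends on `length`, not on the number of coefficients.
    """
    b_pad = list(b) + [0.0] * (length - len(b))
    a_pad = list(a) + [0.0] * (length - len(a))
    spectrum = [bk / ak for bk, ak in zip(fft(b_pad), fft(a_pad))]
    return [v.real for v in ifft(spectrum)]


def kernel_by_recurrence(b, a, length):
    """Reference: unroll h[k] = b[k] - sum_j a[j] * h[k - j] step by step."""
    h = []
    for k in range(length):
        bk = b[k] if k < len(b) else 0.0
        acc = sum(a[j] * h[k - j] for j in range(1, len(a)) if k - j >= 0)
        h.append(bk - acc)
    return h
```

For a stable example such as b = [1.0, 0.5], a = [1.0, -0.5] with length 64, the two routines should agree to within floating-point error, since the aliased tail is negligible.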
2405.05967
Report |
Distilling Diffusion Models into Conditional GANs |
Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park |
We propose a method to distill a complex multistep diffusion model into a
single-step conditional GAN student model, dramatically accelerating inference,
while preserving image quality. Our approach interprets diffusion distillation
as a paired image-to-image translation task, using noise-to-image pairs of the
diffusion model's ODE trajectory. For efficient regression loss computation, we
propose E-LatentLPIPS, a perceptual loss operating directly in the diffusion
model's latent space, utilizing an ensemble of augmentations. Furthermore, we
adapt a diffusion model to construct a multi-scale discriminator with a text
alignment loss to build an effective conditional GAN-based formulation.
E-LatentLPIPS converges more efficiently than many existing distillation
methods, even accounting for dataset construction costs. We demonstrate that
our one-step generator outperforms cutting-edge one-step diffusion distillation
models - DMD, SDXL-Turbo, and SDXL-Lightning - on the zero-shot COCO benchmark. |
This paper introduces Diffusion2GAN, a method to distill a multi-step diffusion model into a single-step conditional GAN, accelerating inference while preserving image quality. |
Diffusion models excel in image synthesis but suffer from slow inference due to multi-step sampling. This work addresses this limitation for real-time applications. |
The method interprets distillation as paired image-to-image translation using noise-image pairs from the diffusion ODE trajectory. It leverages E-LatentLPIPS, a proposed efficient perceptual loss in latent space, and a multi-scale conditional diffusion discriminator. |
Diffusion2GAN outperforms one-step diffusion distillation models like DMD, SDXL-Turbo, and SDXL-Lightning on the zero-shot COCO benchmark.
The proposed E-LatentLPIPS significantly accelerates training and improves performance compared to pixel-based perceptual losses.
The multi-scale diffusion discriminator, initialized from a pre-trained diffusion model, further enhances image quality and text alignment. |
The current method uses a fixed classifier-free guidance scale, limiting control over text adherence.
The performance of the distilled model is limited by the quality of the teacher diffusion model. |
diffusion models, generative adversarial networks, knowledge distillation, image synthesis, text-to-image generation |
2405.05953
Report |
Frame Interpolation with Consecutive Brownian Bridge Diffusion |
Zonglin Lyu, Ming Li, Jianbo Jiao, Chen Chen |
Recent work in Video Frame Interpolation (VFI) tries to formulate VFI as a
diffusion-based conditional image generation problem, synthesizing the
intermediate frame given a random noise and neighboring frames. Due to the
relatively high resolution of videos, Latent Diffusion Models (LDMs) are
employed as the conditional generation model, where the autoencoder compresses
images into latent representations for diffusion and then reconstructs images
from these latent representations. Such a formulation poses a crucial
challenge: VFI expects that the output is deterministically equal to the ground
truth intermediate frame, but LDMs randomly generate a diverse set of different
images when the model runs multiple times. The reason for the diverse
generation is that the cumulative variance (variance accumulated at each step
of generation) of generated latent representations in LDMs is large. This makes
the sampling trajectory random, resulting in diverse rather than deterministic
generations. To address this problem, we propose our unique solution: Frame
Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we
propose consecutive Brownian Bridge diffusion that takes a deterministic
initial value as input, resulting in a much smaller cumulative variance of
generated latent representations. Our experiments suggest that our method can
improve together with the improvement of the autoencoder and achieve
state-of-the-art performance in VFI, leaving strong potential for further
enhancement. |
This paper introduces a novel consecutive Brownian Bridge diffusion model for Video Frame Interpolation (VFI), aiming to address the deterministic output requirement of VFI, which is not fulfilled by the random generation nature of traditional Latent Diffusion Models (LDMs). |
VFI requires deterministic output for a given input frame pair, while traditional LDMs generate diverse images due to high cumulative variance in the generation process, leading to difficulties in accurate intermediate frame interpolation. |
This work formulates VFI as a two-stage process: autoencoder and ground truth estimation. It proposes consecutive Brownian Bridge diffusion, which transitions among three deterministic endpoints (previous, intermediate, next frames) to minimize cumulative variance. An autoencoder with warped feature pyramids from neighboring frames further enhances detail preservation in the generated frames. |
The proposed consecutive Brownian Bridge diffusion model demonstrates superior ground truth estimation compared to traditional conditional generation diffusion models.
The improved autoencoder effectively reduces overlaid image artifacts commonly observed in previous LDM-based VFI methods.
The method achieves state-of-the-art performance on standard VFI benchmarks, particularly excelling in motion consistency metrics like FloLPIPS. |
The current method utilizes a bisection-like approach for multi-frame interpolation, limiting its ability to directly interpolate arbitrary time steps.
Future work could explore more sophisticated autoencoder architectures or diffusion model designs to further enhance interpolation quality. |
video frame interpolation, diffusion models, brownian bridge, autoencoder, conditional image generation |
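The argument in the record above rests on a Brownian bridge having deterministic endpoints and bounded variance in between. The sketch below shows the marginal distribution of a standard Brownian bridge between two fixed values: the mean interpolates linearly and the variance is t(T - t)/T, which is zero at both ends. This is a scalar illustration of the textbook process, not the paper's consecutive formulation over three frames. The function names are hypothetical, and in a latent diffusion setting the same formula would apply per latent element.

```python
import math
import random


def bridge_variance(t, T=1.0):
    """Variance of a standard Brownian bridge on [0, T] at time t."""
    return t * (T - t) / T


def sample_bridge(a, b, t, T=1.0, rng=random):
    """Draw X_t for a Brownian bridge pinned at a (time 0) and b (time T)."""
    mean = a + (t / T) * (b - a)
    std = math.sqrt(bridge_variance(t, T))
    return mean + std * rng.gauss(0.0, 1.0)
```

Because the variance vanishes at t = 0 and t = T and peaks at T/4 in the middle, samples stay tied to the endpoints, unlike a process that starts from unconstrained Gaussian noise.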
2405.05949
Report |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts |
Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen |
Recent advancements in Multimodal Large Language Models (LLMs) have focused
primarily on scaling by increasing text-image pair data and enhancing LLMs to
improve performance on multimodal tasks. However, these scaling approaches are
computationally expensive and overlook the significance of improving model
capabilities from the vision side. Inspired by the successful applications of
Mixture-of-Experts (MoE) in LLMs, which improves model scalability during
training while keeping inference costs similar to those of smaller models, we
propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated
Mixture-of-experts blocks into both the vision encoder and the MLP connector,
thereby enhancing the multimodal LLMs with minimal additional activated
parameters during inference. CuMo first pre-trains the MLP blocks and then
initializes each expert in the MoE block from the pre-trained MLP block during
the visual instruction tuning stage. Auxiliary losses are used to ensure a
balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs
across various VQA and visual-instruction-following benchmarks using models
within each model size group, all while training exclusively on open-sourced
datasets. The code and model weights for CuMo are open-sourced at
https://github.com/SHI-Labs/CuMo. |
CuMo enhances multimodal Large Language Models (LLMs) by incorporating co-upcycled sparsely-gated Mixture-of-Experts (MoE) blocks into the vision encoder and MLP connector. |
Scaling multimodal LLMs via increasing data and model size is computationally expensive. CuMo improves efficiency by enhancing visual capabilities with minimal additional parameters during inference. |
CuMo employs a three-stage training process: MLP connector pre-training, pre-finetuning the whole model, and visual instruction tuning with co-upcycled MoE blocks. Auxiliary losses ensure balanced expert loading. |
Outperforms state-of-the-art multimodal LLMs on various VQA and visual-instruction-following benchmarks.
Achieves performance comparable to larger models while using a smaller LLM size.
Demonstrates effectiveness of co-upcycled MoE blocks and training recipe through ablation studies. |
Hallucinations observed in responses, requiring further investigation for mitigation.
Limited exploration of scaling vision encoders beyond the used CLIP architecture. |
multimodal llm, mixture-of-experts, vision-language model, co-upcycling, visual instruction tuning |
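The CuMo record describes initializing every expert of a Top-K sparsely-gated MoE block from a pre-trained MLP. The sketch below illustrates that idea in plain Python under simplifying assumptions: the "MLP" is a single linear layer with ReLU, the router is a linear map, and gate weights are a softmax over the selected experts only. None of these details are confirmed by the text above, the function names are hypothetical, and the auxiliary load-balancing losses are omitted.

```python
import copy
import math


def mlp_forward(weights, x):
    """One linear layer followed by ReLU; a stand-in for a full MLP block."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def upcycle(pretrained_weights, num_experts):
    """Initialize every expert as an independent copy of the pre-trained MLP."""
    return [copy.deepcopy(pretrained_weights) for _ in range(num_experts)]


def top_k_moe(experts, router_weights, x, k=2):
    """Route x to the k experts with the highest router logits."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (an assumption of this sketch).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    out = [0.0] * len(experts[0])
    for i, e in zip(top, exps):
        y = mlp_forward(experts[i], x)
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out, top
```

One consequence of this initialization is that, right after upcycling, all experts are identical and the gate weights sum to one, so the MoE block reproduces the pre-trained MLP's output. Only k experts run per input, which is why the activated parameter count stays close to that of the dense block.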
2405.05945
Report |
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers |
Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li |
Sora unveils the potential of scaling Diffusion Transformer for generating
photorealistic images and videos at arbitrary resolutions, aspect ratios, and
durations, yet it still lacks sufficient implementation details. In this
technical report, we introduce the Lumina-T2X family - a series of Flow-based
Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized
attention, as a unified framework designed to transform noise into images,
videos, multi-view 3D objects, and audio clips conditioned on text
instructions. By tokenizing the latent spatial-temporal space and incorporating
learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X
seamlessly unifies the representations of different modalities across various
spatial-temporal resolutions. This unified approach enables training within a
single framework for different modalities and allows for flexible generation of
multimodal data at any resolution, aspect ratio, and length during inference.
Advanced techniques like RoPE, RMSNorm, and flow matching enhance the
stability, flexibility, and scalability of Flag-DiT, enabling models of
Lumina-T2X to scale up to 7 billion parameters and extend the context window to
128K tokens. This is particularly beneficial for creating ultra-high-definition
images with our Lumina-T2I model and long 720p videos with our Lumina-T2V
model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT,
requires only 35% of the training computational costs of a
600-million-parameter naive DiT. Our further comprehensive analysis underscores
Lumina-T2X's preliminary capability in resolution extrapolation,
high-resolution editing, generating consistent 3D views, and synthesizing
videos with seamless transitions. We expect that the open-sourcing of
Lumina-T2X will further foster creativity, transparency, and diversity in the
generative AI community. |
Introduces Lumina-T2X, a unified framework based on Flow-based Large Diffusion Transformers (Flag-DiT) for generating various modalities (images, videos, 3D objects, audio) from text at arbitrary resolutions and lengths. |
Addresses limitations of previous models like Sora and Stable Diffusion 3 by providing a unified framework, detailed implementation instructions, and publicly available pre-trained checkpoints. |
Employs Flag-DiT, incorporating improvements like RoPE, RMSNorm, KQ-norm, and flow matching formulation for scalability and stability. Utilizes learnable placeholders like '[nextline]' and '[nextframe]' tokens for handling arbitrary resolutions and lengths. |
Flag-DiT significantly outperforms existing models on ImageNet benchmark, demonstrating faster convergence and higher sample quality with increasing model size.
Lumina-T2I achieves superior visual quality and text alignment in image generation, enabling resolution extrapolation, style-consistent generation, compositional generation, and high-resolution editing in a training-free manner.
Lumina-T2V, Lumina-T2MV, and Lumina-T2Speech show promising preliminary results in generating temporally and spatially consistent videos, multi-view 3D objects, and speech from text prompts, respectively. |
Current version trains each modality separately due to data imbalance and diverse latent space distributions, hindering joint learning.
Limited data coverage leads to challenges in generating complex real-world details, such as human hands or intricate scenes. |
diffusion models, text-to-image generation, text-to-video generation, multi-modal generation, resolution extrapolation |
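The Lumina-T2X record mentions [nextline] and [nextframe] placeholder tokens for handling arbitrary resolutions and lengths. The sketch below shows one way a 2D or 3D grid of latent tokens could be flattened into a 1D sequence with such placeholders, so that row and frame boundaries remain recoverable at any size. The exact placement of the placeholders is an assumption, strings stand in for what the abstract calls learnable placeholders, and `flatten_frames` is a hypothetical name.

```python
NEXTLINE = "[nextline]"
NEXTFRAME = "[nextframe]"


def flatten_frames(frames):
    """Flatten frames (each a list of token rows) into one token sequence.

    A NEXTLINE marker closes every row and a NEXTFRAME marker closes every
    frame, so rows and frames may have any size.
    """
    sequence = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)
        sequence.append(NEXTFRAME)
    return sequence
```

Under this layout, F frames of H rows and W columns produce F * (H * (W + 1) + 1) tokens, and a single image is simply the case F = 1.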
2405.05858
Report |
Free-Moving Object Reconstruction and Pose Estimation with Virtual Camera |
Haixin Shi, Yinlin Hu, Daniel Koguciuk, Juan-Ting Lin, Mathieu Salzmann, David Ferstl |
We propose an approach for reconstructing a free-moving object from a monocular
RGB video. Most existing methods either assume a scene prior, a hand pose prior,
or an object category pose prior, or rely on local optimization with multiple
sequence segments. We propose a method that allows free interaction with the
object in front of a moving camera without relying on any prior, and optimizes
the sequence globally without any segments. We progressively optimize the
object shape and pose simultaneously based on an implicit neural
representation. A key aspect of our method is a virtual camera system that
reduces the search space of the optimization significantly. We evaluate our
method on the standard HO3D dataset and a collection of egocentric RGB
sequences captured with a head-mounted device. We demonstrate that our approach
outperforms most methods significantly, and is on par with recent techniques
that assume prior information. |
This paper presents a novel approach for reconstructing and estimating the pose of a rigid, dynamic object from a monocular RGB video, without relying on prior information like hand poses, object categories, or scene geometry. |
Existing methods for 3D object reconstruction often rely on restrictive assumptions such as static objects, hand-held rotations, or prior knowledge of object categories, limiting their applicability in general scenarios with free-moving objects. |
The method leverages a virtual camera system guided by 2D object masks to simplify the optimization process. It optimizes object shape and pose progressively with a single network, using a simplified 4-DOF pose representation and incorporating 2D matches between frames. Finally, it refines the results in the real camera coordinate system using a PnP solver. |
The method outperforms existing pose-free methods on the HO3D dataset and achieves comparable performance to methods relying on ground-truth poses or depth information.
It effectively handles free-moving objects and generalizes well to egocentric sequences captured with a head-mounted device.
The proposed virtual camera system is shown to significantly improve optimization stability and accuracy. |
The method struggles with objects that are heavily occluded for extended periods during capture.
Reconstruction of small, texture-less objects remains challenging due to the lack of distinctive features. |
3d reconstruction, pose estimation, virtual camera, implicit neural representation, monocular rgb video |
2405.05846
Report |
Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models |
Zhe Ma, Xuhong Zhang, Qingming Li, Tianyu Du, Wenzhi Chen, Zonghui Wang, Shouling Ji |
The past few years have witnessed substantial advancement in text-guided
image generation powered by diffusion models. However, it was shown that
text-to-image diffusion models are vulnerable to training image memorization,
raising concerns on copyright infringement and privacy invasion. In this work,
we perform practical analysis of memorization in text-to-image diffusion
models. Targeting a set of images to protect, we conduct quantitative analysis on
them without needing to collect any prompts. Specifically, we first formally
define image memorization and identify three necessary conditions of
memorization: similarity, existence, and probability. We then
reveal the correlation between the model's prediction error and image
replication. Based on the correlation, we propose to utilize inversion
techniques to verify the safety of target images against memorization and
measure the extent to which they are memorized. Model developers can utilize
our analysis method to discover memorized images or reliably claim safety
against memorization. Extensive experiments on Stable Diffusion, a popular
open-source text-to-image diffusion model, demonstrate the effectiveness of our
analysis method. |
This paper presents a practical, image-based method for analyzing and measuring memorization in text-to-image diffusion models, aiming to help developers identify and address potential copyright and privacy risks. |
Memorization in text-to-image models raises significant concerns about copyright infringement and privacy violation, as these models can potentially replicate images from their training data. |
The authors define three necessary conditions for memorization: similarity, existence, and probability. They leverage the model's prediction error as a metric for image replication and propose prompt and noise inversion techniques to analyze the existence and probability of memorization for target images. |
The model's prediction error is highly correlated with image replication, providing a reliable metric for identifying memorized images.
Unconditional diffusion models trained on large-scale datasets show resilience against memorization and can serve as a baseline for measuring memorization in conditional models.
The proposed method can effectively quantify the extent of memorization for a given image, enabling developers to assess and address potential risks. |
The hard prompt inversion algorithm, while more effective than existing methods, needs improvement for higher accuracy and applicability to a wider range of memorized images.
Future work should extend the analysis to different conditional diffusion models beyond text-to-image generation and explore corresponding regularization techniques. |
memorization, text-to-image diffusion models, privacy, copyright, inversion techniques |
2405.05806
Report |
MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation |
Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo |
Text-to-image (T2I) diffusion models have shown significant success in
personalized text-to-image generation, which aims to generate novel images with
human identities indicated by the reference images. Although promising identity
fidelity has been achieved by several tuning-free methods, they usually suffer
from overfitting issues. The learned identity tends to entangle with irrelevant
information, resulting in unsatisfactory text controllability, especially on
faces. In this work, we present MasterWeaver, a test-time tuning-free method
designed to generate personalized images with both faithful identity fidelity
and flexible editability. Specifically, MasterWeaver adopts an encoder to
extract identity features and steers the image generation through additionally
introduced cross attention. To improve editability while maintaining identity
fidelity, we propose an editing direction loss for training, which aligns the
editing directions of our MasterWeaver with those of the original T2I model.
Additionally, a face-augmented dataset is constructed to facilitate
disentangled identity learning, and further improve the editability. Extensive
experiments demonstrate that our MasterWeaver can not only generate
personalized images with faithful identity, but also exhibit superiority in
text controllability. Our code will be publicly available at
https://github.com/csyxwei/MasterWeaver. |
Proposes MasterWeaver, a tuning-free method for personalized text-to-image generation that balances identity fidelity and editability. |
Existing methods struggle to balance faithful identity preservation with flexible control over attributes and context in generated images. |
Uses an encoder to extract identity features from a reference image, injects these features into a Stable Diffusion model via cross-attention, and employs an editing direction loss and face-augmented dataset during training to enhance editability. |
Generates high-quality personalized images with faithful identity preservation in diverse scenarios.
Demonstrates superior text controllability compared to state-of-the-art methods, enabling flexible editing of attributes, clothing, background, and style.
Achieves competitive inference speed, generating an image in 4 seconds on a single V100 GPU. |
Limited ability to generate images with multiple personalized identities.
Challenges in achieving precise control over attributes due to the coarse granularity of text representations. |
text-to-image generation, personalized image synthesis, identity preservation, text controllability, diffusion models |
2405.05800
Report |
DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation |
Sitian Shen, Jing Xu, Yuheng Yuan, Xingyi Yang, Qiuhong Shen, Xinchao Wang |
User-friendly 3D object editing is a challenging task that has attracted
significant attention recently. The limitations of direct 3D object editing
without 2D prior knowledge have prompted increased attention towards utilizing
2D generative models for 3D editing. While existing methods like Instruct
NeRF-to-NeRF offer a solution, they often lack user-friendliness, particularly
due to semantic guided editing. In the realm of 3D representation, 3D Gaussian
Splatting emerges as a promising approach for its efficiency and natural
explicit property, facilitating precise editing tasks. Building upon these
insights, we propose DragGaussian, a 3D object drag-editing framework based on
3D Gaussian Splatting, leveraging diffusion models for interactive image
editing with open-vocabulary input. This framework enables users to perform
drag-based editing on pre-trained 3D Gaussian object models, producing modified
2D images through multi-view consistent editing. Our contributions include the
introduction of a new task, the development of DragGaussian for interactive
point-based 3D editing, and comprehensive validation of its effectiveness
through qualitative and quantitative experiments. |
Introduces DragGaussian, a novel 3D object drag-editing framework based on 3D Gaussian Splatting that leverages diffusion models for interactive editing with open-vocabulary input. |
Existing 3D editing methods using 2D generative models often lack user-friendliness due to their reliance on semantic guided editing. DragGaussian addresses this by enabling intuitive drag-based editing on 3D Gaussian models. |
DragGaussian uses a user interface for drag point selection, projects them onto multi-view 2D images, employs a fine-tuned multi-view diffusion model for consistent editing, and refines the original 3D Gaussian model with the edited 2D images. |
DragGaussian enables interactive point-based manipulation of 3D Gaussian objects.
Multi-view consistent editing ensures coherent modifications across different viewpoints.
Fine-tuning the diffusion model with multi-view LoRA enhances identity preservation and editing accuracy. |
Reliance on diffusion models prevents real-time editing.
Constraints of the pre-trained MVDream network may limit the quality of editing results on certain datasets. |
3d object editing, 3d gaussian splatting, diffusion models, multi-view consistency, drag-based editing |
2405.05768
Report |
FastScene: Text-Driven Fast 3D Indoor Scene Generation via Panoramic Gaussian Splatting |
Yikun Ma, Dandan Zhan, Zhi Jin |
Text-driven 3D indoor scene generation holds broad applications, ranging from
gaming and smart homes to AR/VR applications. Fast and high-fidelity scene
generation is paramount for ensuring user-friendly experiences. However,
existing methods are characterized by lengthy generation processes or
necessitate the intricate manual specification of motion parameters, which
introduces inconvenience for users. Furthermore, these methods often rely on
narrow-field viewpoint iterative generations, compromising global consistency
and overall scene quality. To address these issues, we propose FastScene, a
framework for fast and higher-quality 3D scene generation, while maintaining
the scene consistency. Specifically, given a text prompt, we generate a
panorama and estimate its depth, since the panorama encompasses information
about the entire scene and exhibits explicit geometric constraints. To obtain
high-quality novel views, we introduce the Coarse View Synthesis (CVS) and
Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene
consistency and view quality. Subsequently, we utilize Multi-View Projection
(MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for
scene reconstruction. Comprehensive experiments demonstrate FastScene surpasses
other methods in both generation speed and quality with better scene
consistency. Notably, guided only by a text prompt, FastScene can generate a 3D
scene within a mere 15 minutes, which is at least one hour faster than
state-of-the-art methods, making it a paradigm for user-friendly scene
generation. |
This paper presents FastScene, a novel framework for fast and high-quality text-driven 3D indoor scene generation that prioritizes scene consistency. |
Existing methods for 3D indoor scene generation are either slow, require manual specification of motion parameters, or struggle to maintain global consistency. Fast and high-fidelity scene generation is crucial for user-friendly experiences in various applications like gaming, smart homes, and AR/VR. |
FastScene first generates a panorama from a text prompt and estimates its depth. It then uses Coarse View Synthesis (CVS) to generate novel panoramic views with holes, which are filled using Progressive Novel View Inpainting (PNVI) on cubemap representations. Finally, Multi-View Projection (MVP) converts panoramas to perspective views for 3D Gaussian Splatting (3DGS) reconstruction. |
FastScene outperforms existing methods in terms of both generation speed and visual quality, while maintaining better scene consistency.
FastScene can generate a 3D scene from a text prompt in just 15 minutes, at least one hour faster than state-of-the-art methods.
The proposed PNVI and MVP techniques are adaptable to existing panoramic datasets, enabling high-quality novel view synthesis and 3D reconstruction from various sources. |
The reliance on depth estimation accuracy can affect the quality of novel view synthesis.
Future work includes exploring 3D scene editing capabilities and incorporating multimodal learning. |
3d scene generation, text-to-3d, novel view synthesis, panorama, 3d gaussian splatting |
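The Multi-View Projection step in the FastScene summary above converts panoramas into perspective views. The geometry behind such a conversion can be sketched as below. This is a minimal illustration and not FastScene's implementation: the axis convention (+z forward, +x right, +y down), the square pinhole camera, and the function names `direction_to_equirect` and `perspective_ray` are all assumptions made for the example.

```python
import numpy as np

def direction_to_equirect(direction, width, height):
    """Map a 3D viewing direction to (u, v) pixel coordinates in an
    equirectangular panorama of size width x height.
    Assumed convention: the panorama centre looks along +z."""
    x, y, z = np.asarray(direction, dtype=float) / np.linalg.norm(direction)
    lon = np.arctan2(x, z)   # longitude in [-pi, pi]
    lat = np.arcsin(y)       # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return u, v

def perspective_ray(px, py, size, fov_deg):
    """Ray through pixel (px, py) of a hypothetical square pinhole view
    of side `size` pixels, looking along +z with the given field of view."""
    focal = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)
    return np.array([px - size / 2.0, py - size / 2.0, focal])
```

Sampling the panorama at the (u, v) returned for every pixel ray of a virtual camera yields one perspective view; rotating the rays before the lookup yields views in other directions.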
2405.05702
Report |
NGM-SLAM: Gaussian Splatting SLAM with Radiance Field Submap |
Mingrui Li, Jingwei Huang, Lei Sun, Aaron Xuxiang Tian, Tianchen Deng, Hongyu Wang |
SLAM systems based on Gaussian Splatting have garnered attention due to their
capabilities for rapid real-time rendering and high-fidelity mapping. However,
current Gaussian Splatting SLAM systems usually struggle with large scene
representation and lack effective loop closure detection. To address these
issues, we introduce NGM-SLAM, the first 3DGS based SLAM system that utilizes
neural radiance field submaps for progressive scene expression, effectively
integrating the strengths of neural radiance fields and 3D Gaussian Splatting.
We utilize neural radiance field submaps as supervision and achieve
high-quality scene expression and online loop closure adjustments through
Gaussian rendering of fused submaps. Our results on multiple real-world scenes
and large-scale scene datasets demonstrate that our method can achieve accurate
hole filling and high-quality scene expression, supporting monocular, stereo,
and RGB-D inputs, and achieving state-of-the-art scene reconstruction and
tracking performance. |
This paper introduces NGM-SLAM, a novel dense Gaussian splatting SLAM system that leverages neural radiance field submaps for progressive scene representation, effectively addressing limitations of current 3DGS-SLAM systems in handling large scenes and loop closures. |
Current Gaussian Splatting SLAM systems often struggle with representing extensive scenes and lack robust loop closure detection, hindering their applicability in large-scale environments. This work aims to overcome these limitations and improve the performance of dense SLAM systems. |
The proposed NGM-SLAM system employs neural radiance field submaps as priors for 3D Gaussian rendering, progressively constructing the scene. It incorporates a local-to-global loop closure detection and optimization process, utilizing submaps for supervision and achieving real-time error correction during mapping. |
NGM-SLAM demonstrates superior performance compared to state-of-the-art NeRF/GS-based SLAM methods in terms of rendering and tracking accuracy on various datasets, including Replica, ScanNet, TUM RGB-D, and EuRoC.
The system effectively addresses the issue of scene gaps by leveraging neural submaps for guidance, achieving more complete and detailed scene reconstruction compared to methods relying solely on 3DGS.
NGM-SLAM exhibits robust tracking and reconstruction capabilities in large-scale scenes, effectively mitigating drift through its loop closure mechanism and enabling real-time operation even with limited computational resources. |
Limited real-time reconstruction ability in extremely large-scale environments like city-level scenarios due to current memory and computational constraints.
Future work could explore porting the system to CUDA programming for enhanced mesh extraction and higher-quality mesh generation. |
slam, 3d gaussian splatting, neural radiance fields, loop closure, dense reconstruction |
2405.05691
Report |
StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework |
Yiheng Huang, Hui Yang, Chuanchen Luo, Yuxi Wang, Shibiao Xu, Zhaoxiang Zhang, Man Zhang, Junran Peng |
Thanks to the powerful generative capacity of diffusion models, recent years
have witnessed rapid progress in human motion generation. Existing
diffusion-based methods employ disparate network architectures and training
strategies. The effect of the design of each component is still unclear. In
addition, the iterative denoising process consumes considerable computational
overhead, which is prohibitive for real-time scenarios such as virtual
characters and humanoid robots. For this reason, we first conduct a
comprehensive investigation into network architectures, training strategies,
and inference processes. Based on the profound analysis, we tailor each
component for efficient high-quality human motion generation. Despite the
promising performance, the tailored model still suffers from foot skating which
is a ubiquitous issue in diffusion-based solutions. To eliminate footskate, we
identify foot-ground contact and correct foot motions along the denoising
process. By organically combining these well-designed components together, we
present StableMoFusion, a robust and efficient framework for human motion
generation. Extensive experimental results show that our StableMoFusion
performs favorably against current state-of-the-art methods. Project page:
https://h-y1heng.github.io/StableMoFusion-page/ |
Presents StableMoFusion, a robust and efficient diffusion-based motion generation framework that leverages Conv1D UNet architecture and novel training/inference strategies. |
Addresses limitations of existing diffusion-based motion generation methods, including lack of systematic analysis, long inference time, and foot skating issues. |
Conducts comprehensive analysis of network architectures, training strategies, and inference processes, incorporating efficient samplers, text caching, parallel CFG computation, low-precision inference, and a footskate cleanup mechanism. |
Achieves state-of-the-art results in FID and R-Precision on HumanML3D dataset.
Significantly reduces inference time compared to previous methods, achieving an average of 0.5 seconds per motion.
Effectively mitigates foot skating issues in generated motions. |
Current inference speed, while improved, does not yet meet real-time industry standards.
Future work will focus on further acceleration through model scaling and reducing single-step latency. |
motion generation, diffusion models, text-to-motion, efficient inference, footskate cleanup |
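The "parallel CFG computation" listed in the StableMoFusion summary above can be illustrated with a small sketch. Classifier-free guidance needs a conditional and an unconditional prediction at every denoising step; stacking both into one batch replaces two network calls with one. The `toy_denoiser` below is a hypothetical stand-in for the real network, and the function names are assumptions for illustration only.

```python
import numpy as np

def toy_denoiser(x, cond):
    # Hypothetical stand-in for a denoising network; any batched function works.
    return 0.5 * x + cond

def cfg_sequential(x, cond, null_cond, scale):
    """Reference version: two separate forward passes."""
    eps_cond = toy_denoiser(x, cond)
    eps_uncond = toy_denoiser(x, null_cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def cfg_parallel(x, cond, null_cond, scale):
    """Batched version: one forward pass over a doubled batch."""
    x2 = np.concatenate([x, x], axis=0)
    c2 = np.concatenate([cond, null_cond], axis=0)
    eps_cond, eps_uncond = np.split(toy_denoiser(x2, c2), 2, axis=0)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Both functions return the same guided prediction; the batched variant trades extra memory per step for fewer sequential network calls.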
2405.05663
Report |
RPBG: Towards Robust Neural Point-based Graphics in the Wild |
Qingtian Zhu, Zizhuang Wei, Zhongtian Zheng, Yifan Zhan, Zhuyu Yao, Jiawang Zhang, Kejian Wu, Yinqiang Zheng |
Point-based representations have recently gained popularity in novel view
synthesis, for their unique advantages, e.g., intuitive geometric
representation, simple manipulation, and faster convergence. However, based on
our observation, these point-based neural re-rendering methods are only
expected to perform well under ideal conditions and suffer from noisy, patchy
points and unbounded scenes, which are challenging to handle but de facto common
in real applications. To this end, we revisit one such influential method,
known as Neural Point-based Graphics (NPBG), as our baseline, and propose
Robust Point-based Graphics (RPBG). We analyze in depth the factors that
prevent NPBG from achieving satisfactory renderings on generic datasets, and
accordingly reform the pipeline to make it more robust to varying datasets
in-the-wild. Inspired by the practices in image restoration, we greatly enhance
the neural renderer to enable the attention-based correction of point
visibility and the inpainting of incomplete rasterization, with only acceptable
overheads. We also seek a simple and lightweight alternative for
environment modeling and an iterative method to alleviate the problem of poor
geometry. By thorough evaluation on a wide range of datasets with different
shooting conditions and camera trajectories, RPBG stably outperforms the
baseline by a large margin, and exhibits its great robustness over
state-of-the-art NeRF-based variants. Code available at
https://github.com/QT-Zhu/RPBG. |
This paper presents RPBG (Robust Point-based Graphics), a novel method for robust neural point-based re-rendering that enhances the existing NPBG method to handle generic, in-the-wild datasets. |
Existing point-based neural re-rendering methods, despite their advantages like intuitive representation and faster convergence, struggle with noisy and patchy point clouds and unbounded scenes common in real-world applications. This work aims to address these limitations and enhance robustness for wider applicability. |
RPBG improves upon NPBG by: (1) Introducing a Downgrade-aware Convolution (DAC) module in the neural renderer to accurately determine point visibility and inpaint incomplete rasterizations. (2) Using a lightweight, trainable feature vector for environment modeling instead of a computationally expensive environment map. (3) Employing a point cloud augmentation technique using pseudo densities to refine poorly triangulated point clouds. (4) Implementing a collaborative end-to-end optimization of neural textures and renderer parameters. |
RPBG significantly outperforms the baseline NPBG and achieves state-of-the-art performance on various datasets with challenging scene types, including unbounded, inside-out, large-scale, and sparse-view scenes.
RPBG exhibits robustness and generalizability by achieving high-quality renderings across diverse datasets using a single set of hyperparameters, eliminating the need for per-scene tuning.
The method proves to be computationally efficient and scalable, handling large-scale scenes with limited memory compared to some existing point-based methods. |
While computationally efficient in terms of rendering, RPBG requires more storage for CNN parameters and neural textures compared to lightweight RF-based methods.
The enhanced context exchange among points achieved by the DAC module can slightly decrease the editability of individual points in the scene. |
point-based graphics, novel view synthesis, neural rendering, 3d reconstruction, robustness |
2405.05615
Report |
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning |
Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, Yunhe Wang |
Current solutions for efficiently constructing large vision-language (VL)
models follow a two-step paradigm: projecting the output of pre-trained vision
encoders to the input space of pre-trained language models as visual prompts;
and then transferring the models to downstream VL tasks via end-to-end
parameter-efficient fine-tuning (PEFT). However, this paradigm still exhibits
inefficiency since it significantly increases the input length of the language
models. In this paper, in contrast to integrating visual prompts into inputs,
we regard visual prompts as additional knowledge that facilitates language
models in addressing tasks associated with visual information. Motivated by the
finding that Feed-Forward Network (FFN) of language models acts as "key-value
memory", we introduce a novel approach termed memory-space visual prompting
(MemVP), wherein visual prompts are concatenated with the weights of FFN for
visual knowledge injection. Experimental results across various VL tasks and
language models reveal that MemVP significantly reduces the training time and
inference latency of the finetuned VL models and surpasses the performance of
previous PEFT methods. Code: https://github.com/JieShibo/MemVP |
This paper introduces Memory-Space Visual Prompting (MemVP), a novel approach for efficient vision-language (VL) fine-tuning that integrates visual prompts as knowledge into the weights of Feed-Forward Networks (FFNs) in language models. |
Existing VL fine-tuning methods often extend input length with visual prompts, leading to inefficiency. MemVP addresses this limitation by treating visual prompts as external knowledge, injecting them directly into the memory space of language models. |
MemVP projects image features into visual prompts, adds positional embeddings, and concatenates them with the FFN weight matrices. This enables retrieval of visual knowledge during text generation without increasing input length. |
MemVP outperforms previous Parameter-Efficient Fine-Tuning (PEFT) baselines on various VL benchmarks, including VQAv2, GQA, ScienceQA, and COCO Captions.
MemVP significantly reduces training time and inference latency compared to input-space visual prompting methods.
Visualization experiments confirm that MemVP successfully injects visual knowledge into language model memory, enabling retrieval of relevant visual information during text generation. |
The inference speed advantage of MemVP is less pronounced for generating long texts, as its main impact is during the generation of the first token.
MemVP might inherit drawbacks of pre-trained language models, such as inherent biases, misinformation, or potential copyright violation. |
vision-language models, parameter-efficient fine-tuning, visual prompting, feed-forward networks, knowledge injection |
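The MemVP summary above describes concatenating visual prompts with the FFN weight matrices. A minimal sketch of that idea follows, under stated assumptions: biases are omitted, ReLU stands in for whatever activation the language model uses, and `keys` and `values` are hypothetical placeholders for the projected image features (with positional embeddings) that the paper derives; `lam` is an assumed scaling factor.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn(x, w1, w2):
    """Plain two-layer FFN, viewed as a key-value memory: w1 holds keys, w2 holds values."""
    return relu(x @ w1) @ w2

def ffn_with_visual_memory(x, w1, w2, keys, values, lam=1.0):
    """FFN whose weights are extended with visual entries.
    x: (tokens, d_model); w1: (d_model, d_ff); w2: (d_ff, d_model)
    keys, values: (n_visual, d_model), assumed to come from image features."""
    w1_ext = np.concatenate([w1, lam * keys.T], axis=1)   # (d_model, d_ff + n_visual)
    w2_ext = np.concatenate([w2, lam * values], axis=0)   # (d_ff + n_visual, d_model)
    return relu(x @ w1_ext) @ w2_ext
```

Because the activation is elementwise, the extended FFN equals the original FFN output plus a term retrieved from the visual entries, and the input sequence length is unchanged.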
2405.05538
Report |
A Survey on Personalized Content Synthesis with Diffusion Models |
Xulu Zhang, Xiao-Yong Wei, Wengyu Zhang, Jinlin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li |
Recent advancements in generative models have significantly impacted content
creation, leading to the emergence of Personalized Content Synthesis (PCS).
With a small set of user-provided examples, PCS aims to customize the subject
of interest to specific user-defined prompts. Over the past two years, more
than 150 methods have been proposed. However, existing surveys mainly focus on
text-to-image generation, with few providing up-to-date summaries on PCS. This
paper offers a comprehensive survey of PCS, with a particular focus on the
diffusion models. Specifically, we introduce the generic frameworks of PCS
research, which can be broadly classified into optimization-based and
learning-based approaches. We further categorize and analyze these
methodologies, discussing their strengths, limitations, and key techniques.
Additionally, we delve into specialized tasks within the field, such as
personalized object generation, face synthesis, and style personalization,
highlighting their unique challenges and innovations. Despite encouraging
progress, we also present an analysis of the challenges such as overfitting and
the trade-off between subject fidelity and text alignment. Through this
detailed overview and analysis, we propose future directions to advance the
development of PCS. |
This paper presents a comprehensive survey of Personalized Content Synthesis (PCS) with diffusion models, focusing on generating customized images from user-provided references and text prompts. |
PCS is rapidly growing, with over 150 methods proposed in two years, highlighting its significance in content creation and the need for a consolidated overview. |
The paper categorizes PCS methods into optimization-based and learning-based approaches, analyzing their strengths, limitations, and key techniques. It further examines specialized tasks like personalized object and face generation, style transfer, and multi-subject composition. |
Optimization-based methods excel in subject fidelity but require fine-tuning for each subject, leading to high storage demands.
Learning-based methods offer fast inference without fine-tuning but often struggle to capture fine-grained details and might exhibit limited generalization ability.
Despite progress, PCS still faces challenges such as overfitting on limited references, balancing subject fidelity with text alignment, and the lack of standardized evaluation metrics and datasets. |
The overfitting problem in PCS, particularly for non-rigid subjects or semantically similar backgrounds, requires further investigation and solutions.
Achieving a balance between high subject fidelity and flexible text alignment remains a challenge, demanding innovative model architectures and training strategies. |
generative models, diffusion models, personalized content synthesis, image generation, text-to-image synthesis |
2405.05446
Report |
GDGS: Gradient Domain Gaussian Splatting for Sparse Representation of Radiance Fields |
Yuanhao Gong |
The 3D Gaussian splatting methods are getting popular. However, they work
directly on the signal, leading to a dense representation of the signal. Even
with some techniques such as pruning or distillation, the results are still
dense. In this paper, we propose to model the gradient of the original signal.
The gradients are much sparser than the original signal. Therefore, the
gradients require far fewer Gaussian splats, leading to more efficient storage
and thus higher computational performance during both training and rendering.
Thanks to the sparsity, during the view synthesis, only a small number of pixels
are needed, leading to much higher computational performance ($100\sim
1000\times$ faster). And the 2D image can be recovered from the gradients via
solving a Poisson equation with linear computation complexity. Several
experiments are performed to confirm the sparseness of the gradients and the
computation performance of the proposed method. The method can be applied to
various applications, such as human body modeling and indoor environment
modeling. |
This paper proposes GDGS, a novel gradient domain Gaussian splatting method for sparse radiance field representation. |
Existing 3D Gaussian splatting methods, while popular, struggle with dense signal representation, impacting storage and computational efficiency. Gradient domain processing offers a sparser representation, potentially addressing these limitations. |
The method involves three steps: 1) approximating the Laplacian field of the signal with Gaussian splats, 2) projecting these splats onto the image plane to obtain the 2D Laplacian field, and 3) reconstructing the image from this field by solving a Poisson equation using a U-Net architecture. |
GDGS achieves higher accuracy (0.6-1dB PSNR improvement) compared to original 3D Gaussian splatting.
The gradient domain representation in GDGS results in significantly sparser representations (up to 100 times fewer particles).
The sparsity leads to faster rendering speeds due to reduced computational demands. |
The PSNR improvement can be influenced by factors like image resolution, scene complexity, and lighting conditions.
Future work could explore integrating techniques like importance sampling and adaptive splatting to further enhance efficiency. |
gaussian splatting, gradient domain, radiance fields, sparse representation, view synthesis |
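The final step of the GDGS summary above recovers an image from its Laplacian field by solving a Poisson equation. The sketch below shows one standard way to do this and is not the paper's solver: it swaps in an FFT-based solve with periodic boundary conditions (O(n log n) rather than the linear complexity the abstract claims), and the function names are assumptions. The image mean must be supplied separately because the Laplacian discards it.

```python
import numpy as np

def laplacian_periodic(u):
    """5-point discrete Laplacian with periodic (wrap-around) boundaries."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
            + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)

def solve_poisson_periodic(lap, mean=0.0):
    """Recover u from its periodic Laplacian, up to the given mean value."""
    h, w = lap.shape
    # Eigenvalues of the periodic 5-point Laplacian in the Fourier basis.
    ey = 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(h))
    ex = 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(w))
    eig = ey[:, None] + ex[None, :] - 4.0
    eig[0, 0] = 1.0                      # avoid dividing by zero at the DC term
    u_hat = np.fft.fft2(lap) / eig
    u_hat[0, 0] = 0.0                    # zero-mean solution; mean is added back below
    return np.fft.ifft2(u_hat).real + mean
```

For a natural image most Laplacian entries are near zero, which is the sparsity the method exploits; the solve then restores the dense image from that sparse field.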
2405.05252
Report |
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models |
Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K. Jha, Yuchen Liu |
Diffusion Models (DMs) have exhibited superior performance in generating
high-quality and diverse images. However, this exceptional performance comes at
the cost of expensive architectural design, particularly due to the attention
module heavily used in leading models. Existing works mainly adopt a retraining
process to enhance DM efficiency. This is computationally expensive and not
very scalable. To this end, we introduce the Attention-driven Training-free
Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to
perform run-time pruning of redundant tokens, without the need for any
retraining. Specifically, for single-denoising-step pruning, we develop a novel
ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify
redundant tokens, and a similarity-based recovery method to restore tokens for
the convolution operation. In addition, we propose a Denoising-Steps-Aware
Pruning (DSAP) approach to adjust the pruning budget across different denoising
timesteps for better generation quality. Extensive evaluations show that AT-EDM
performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs
saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining
nearly the same FID and CLIP scores as the full model. Project webpage:
https://atedm.github.io. |
This paper introduces AT-EDM, a training-free framework for accelerating diffusion models (DMs) by pruning redundant tokens in attention blocks during run-time, leveraging attention map information. |
DMs excel at image generation but are computationally expensive, hindering their application on resource-constrained devices. Existing efficiency methods rely on retraining, which is computationally costly and inflexible for diverse deployment settings. This work offers a training-free approach, enabling dynamic and efficient DM acceleration without retraining. |
The method uses a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), derived from attention maps to identify and prune redundant tokens within each denoising step. To maintain image quality, a similarity-based token recovery method utilizes attention map information to restore pruned tokens for convolution operations. Furthermore, a Denoising-Steps-Aware Pruning (DSAP) approach adjusts the pruning ratio across denoising steps based on attention map variance analysis, preserving crucial information in early steps. |
AT-EDM achieves comparable image quality with a 38.8% FLOPs reduction compared to the full Stable Diffusion XL (SD-XL) model.
It outperforms the state-of-the-art training-free method, ToMe, in terms of both FID (image quality) and CLIP (text-image alignment) scores under various FLOPs budgets.
The DSAP schedule is shown to improve image quality significantly and is generalizable to other run-time acceleration techniques. |
The performance of AT-EDM is inherently upper bounded by the full-sized pre-trained model.
Accessing attention maps in efficiently implemented DMs might require additional computation due to fused operations in attention libraries. |
diffusion models, model compression, training-free acceleration, attention mechanism, text-to-image generation |
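The AT-EDM summary above ranks tokens from the attention map, prunes the low-ranked ones, and later restores them for convolution. The sketch below illustrates the general pattern with a simplified power-iteration ranking; it is not the paper's exact G-WPR formula, and using the attention weight as the similarity measure for recovery is an assumption. Function names are hypothetical.

```python
import numpy as np

def rank_tokens(attn, iters=5):
    """Score tokens by how much attention they receive, propagated iteratively.
    attn: (n, n) row-stochastic attention map, attn[i, j] = weight token i gives token j."""
    n = attn.shape[0]
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = attn.T @ scores      # a token is important if important tokens attend to it
        scores = scores / scores.sum()
    return scores

def prune_with_recovery(attn, keep):
    """Return the kept token indices and, for every token, the kept token to copy from."""
    scores = rank_tokens(attn)
    kept = np.sort(np.argsort(-scores)[:keep])
    source = np.arange(attn.shape[0])
    pruned = np.setdiff1d(source, kept)
    if pruned.size:
        # Each pruned token is restored from the kept token it attends to most.
        source[pruned] = kept[np.argmax(attn[np.ix_(pruned, kept)], axis=1)]
    return kept, source
```

After the attention block runs on the kept tokens only, indexing its output with `source` would rebuild a full-length token grid for the following convolution.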
2405.05224
Report |
Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation |
Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, Ali Thabet |
Diffusion models are a powerful generative framework, but come with expensive
inference. Existing acceleration methods often compromise image quality or fail
under complex conditioning when operating in an extremely low-step regime. In
this work, we propose a novel distillation framework tailored to enable
high-fidelity, diverse sample generation using just one to three steps. Our
approach comprises three key components: (i) Backward Distillation, which
mitigates training-inference discrepancies by calibrating the student on its
own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically
adapts knowledge transfer based on the current time step; and (iii) Noise
Correction, an inference-time technique that enhances sample quality by
addressing singularities in noise prediction. Through extensive experiments, we
demonstrate that our method outperforms existing competitors in quantitative
metrics and human evaluations. Remarkably, it achieves performance comparable
to the teacher model using only three denoising steps, enabling efficient
high-quality generation. |
Imagine Flash is a novel distillation framework for text-to-image diffusion models that enables high-fidelity image generation in just one to three steps. |
Diffusion models are powerful but computationally expensive. Existing acceleration methods often sacrifice quality or struggle with complex prompts in ultra-low step regimes. |
The framework uses three key components: (i) Backward Distillation to align training and inference, (ii) Shifted Reconstruction Loss to dynamically transfer knowledge from teacher to student, and (iii) Noise Correction for improved initial sample quality. |
Imagine Flash achieves comparable quality to the baseline Emu model using only three steps.
It outperforms state-of-the-art distillation methods (Step Distillation, LCM, ADD) in FID and CLIP scores.
Human evaluations show a clear preference for images generated by Imagine Flash over those from ADD and Lightning. |
Human evaluation, while extensive, is subjective and may vary with different prompts and annotators.
Like other text-to-image models, there's a risk of generating biased or offensive content despite efforts to ensure fairness and safety. |
generative ai, efficient diffusion, image synthesis, text-to-image, distillation |
2405.05216
Report |
FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models |
Jinglin Xu, Yijie Guo, Yuxin Peng |
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to
predict human joint coordinates in 3D space. Despite recent advancements in
deep learning-based methods, they mostly ignore the capability of coupling
accessible texts and naturally feasible knowledge of humans, missing out on
valuable implicit supervision to guide the 3D HPE task. Moreover, previous
efforts often study this task from the perspective of the whole human body,
neglecting fine-grained guidance hidden in different body parts. To this end,
we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model
for 3D HPE, named FinePOSE. It consists of three core blocks enhancing
the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt
learning (FPP) block constructs fine-grained part-aware prompts via coupling
accessible texts and naturally feasible knowledge of body parts with learnable
prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication
(FPC) block establishes fine-grained communications between learned part-aware
prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp
Stylization (PTS) block integrates learned prompt embedding and temporal
information related to the noise level to enable adaptive adjustment at each
denoising step. Extensive experiments on public single-human pose estimation
datasets show that FinePOSE outperforms state-of-the-art methods. We further
extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE
on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with
complex multi-human scenarios. Code is available at
https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024. |
This paper proposes FinePOSE, a novel fine-grained prompt-driven denoiser based on a diffusion model for 3D human pose estimation. |
Existing 3D HPE methods struggle with depth ambiguity, human body complexity, and generalizing to diverse actions. FinePOSE addresses these challenges by leveraging accessible texts and natural human knowledge to guide pose estimation. |
FinePOSE uses three core blocks: 1) Fine-grained Part-aware Prompt learning (FPP) to construct prompts capturing action class, body part movements, and kinematic information, 2) Fine-grained Prompt-pose Communication (FPC) to enhance denoising by injecting prompt embedding into noisy poses, and 3) Prompt-driven Timestamp Stylization (PTS) for adaptive adjustment at each denoising step using prompt embedding and noise level. |
FinePOSE achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP datasets for 3D human pose estimation.
Fine-grained part-aware prompt learning significantly improves denoising quality and estimation accuracy.
The proposed method shows promising results on multi-human pose estimation using a post-integration strategy. |
FinePOSE is not specifically designed for multi-person scenarios.
The diffusion model-based approach is computationally expensive. |
3d human pose estimation, diffusion models, prompt learning, denoising, computer vision |
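The FinePOSE abstract above reports accuracy as MPJPE (mean per-joint position error). For reference, the metric in its commonly used form is the average Euclidean distance between predicted and ground-truth joints. The sketch below assumes both poses are already expressed in the same coordinate frame and units (for example millimetres) and does not reproduce any dataset-specific alignment protocol.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error.
    pred, target: arrays of shape (..., joints, 3) in the same units."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())
```

A value of 34.3 under this definition would mean that predicted joints lie on average 34.3 units from their ground-truth positions.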
2405.05173
Report |
A Survey on Occupancy Perception for Autonomous Driving: The Information Fusion Perspective |
Huaiyuan Xu, Junliang Chen, Shiyu Meng, Yi Wang, Lap-Pui Chau |
3D occupancy perception technology aims to observe and understand dense 3D
environments for autonomous vehicles. Owing to its comprehensive perception
capability, this technology is emerging as a trend in autonomous driving
perception systems, and is attracting significant attention from both industry
and academia. Similar to traditional bird's-eye view (BEV) perception, 3D
occupancy perception has the nature of multi-source input and the necessity for
information fusion. However, the difference is that it captures vertical
structures that are ignored by 2D BEV. In this survey, we review the most
recent works on 3D occupancy perception, and provide in-depth analyses of
methodologies with various input modalities. Specifically, we summarize general
network pipelines, highlight information fusion techniques, and discuss
effective network training. We evaluate and analyze the occupancy perception
performance of the state-of-the-art on the most popular datasets. Furthermore,
challenges and future research directions are discussed. We hope this paper
will inspire the community and encourage more research work on 3D occupancy
perception. A comprehensive list of studies in this survey is publicly
available in an active repository that continuously collects the latest work:
https://github.com/HuaiyuanXu/3D-Occupancy-Perception. |
This paper presents a comprehensive survey of recent advancements in 3D occupancy perception for autonomous driving, focusing on the crucial role of information fusion. |
3D occupancy perception provides a dense, 3D understanding of the environment, surpassing traditional BEV perception by capturing height information. It facilitates a range of downstream applications in autonomous driving, like object detection, tracking, and motion planning. |
The survey categorizes occupancy perception methods based on input modalities: LiDAR-centric, vision-centric, and multi-modal. It dissects core methodological issues including network pipelines, spatial and temporal information fusion techniques, and training strategies (strong, weak, semi, and self-supervised learning). |
LiDAR-centric methods currently achieve higher accuracy than vision-centric approaches due to precise depth information from LiDAR.
Vision-centric occupancy perception is rapidly advancing, driven by the cost-effectiveness of cameras and advancements in deep learning.
Multi-modal occupancy perception shows promising potential but requires further research to fully leverage the benefits of fusing different data modalities. |
Current occupancy methods struggle to achieve real-time performance for deployment on autonomous driving systems.
Robustness and generalization of occupancy perception models in complex, real-world scenarios remain open challenges. |
autonomous driving, occupancy perception, information fusion, lidar, computer vision |
2405.05027
Report |
StyleMamba : State Space Model for Efficient Text-driven Image Style Transfer |
Zijia Wang, Zhi-Song Liu |
We present StyleMamba, an efficient image style transfer framework that
translates text prompts into corresponding visual styles while preserving the
content integrity of the original images. Existing text-guided stylization
requires hundreds of training iterations and takes a lot of computing
resources. To speed up the process, we propose a conditional State Space Model
for Efficient Text-driven Image Style Transfer, dubbed StyleMamba, that
sequentially aligns the image features to the target text prompts. To enhance
the local and global style consistency between text and image, we propose
masked and second-order directional losses to optimize the stylization
direction to significantly reduce the training iterations by 5 times and the
inference time by 3 times. Extensive experiments and qualitative evaluation
confirm the robust and superior stylization performance of our methods compared
to the existing baselines. |
This paper presents StyleMamba, an efficient text-driven image style transfer framework that leverages a conditional State Space Model within an AutoEncoder architecture to rapidly translate text prompts into corresponding visual styles while preserving content integrity. |
Existing text-guided stylization methods are computationally expensive, requiring hundreds of training iterations and significant GPU resources.
StyleMamba addresses this limitation by significantly speeding up the process through an innovative framework and novel loss functions. |
StyleMamba utilizes a pretrained VAE for encoding and decoding, a Style Fusion Module with AdaLN and Mamba for efficient style fusion, and a SigLIP Module for enhanced text-image alignment. It introduces masked and second-order directional losses to expedite training convergence and improve style fidelity. |
StyleMamba demonstrates superior performance over state-of-the-art methods in terms of stylization quality, content preservation, and computational efficiency, achieving significant speedups in both training and inference.
Ablation studies confirm the effectiveness of the proposed Style Fusion Module, the use of SigLIP for text-image alignment, and the impact of the novel loss functions on stylization quality and training speed.
StyleMamba exhibits strong generalization capabilities, extending its applications to diverse creative domains, including multiple style transfer, product design, painting assistance, UI design, cinematic style transformation, and fashion design. |
While StyleMamba excels in many areas, it currently exhibits limitations in understanding content-guided or less commonly used text prompts, particularly in scenarios involving face editing or object manipulation.
Future research will focus on addressing these limitations by improving the model's handling of diverse facial features and expanding its comprehension of novel and abstract concepts for enhanced style transfer capabilities. |
text-driven image style transfer, state space model, autoencoder, masked directional loss, second-order directional loss |
2405.05010
Report |
${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields |
Ning Wang, Lefei Zhang, Angel X Chang |
Neural fields (NeRF) have emerged as a promising approach for representing
continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs
poses a significant challenge for scene decomposition. To address this
challenge, we present a single model, Multi-Modal Decomposition NeRF
(${M^2D}$NeRF), that is capable of both text-based and visual patch-based
edits. Specifically, we use multi-modal feature distillation to integrate
teacher features from pretrained visual and language models into 3D semantic
feature volumes, thereby facilitating consistent 3D editing. To enforce
consistency between the visual and language features in our 3D feature volumes,
we introduce a multi-modal similarity constraint. We also introduce a
patch-based joint contrastive loss that helps to encourage object-regions to
coalesce in the 3D feature space, resulting in more precise boundaries.
Experiments on various real-world scenes show superior performance in 3D scene
decomposition tasks compared to prior NeRF-based methods. |
This paper proposes Multi-Modal Decomposition NeRF (${M^2D}$NeRF), a novel NeRF-based method that uses multi-modal feature distillation to enable both text-based and visual patch-based 3D scene decomposition. |
NeRF-based 3D scene decomposition often lacks object-level awareness and struggles with semantic ambiguity at object boundaries. Existing methods either rely on expensive 3D annotations or have difficulty generalizing to real-world scenes. This paper leverages the power of pretrained foundation models to enable more accurate and flexible decomposition for real-world scenes. |
The ${M^2D}$NeRF model extends the NeRF model with visual and language feature branches, trained via multi-modal feature distillation using DINO and CLIP-LSeg as teacher models, respectively. It further introduces a multi-modal similarity constraint and a patch-based joint contrastive loss to encourage consistency and distinct boundaries between objects. |
Outperforms existing distillation-based scene decomposition methods (DFF and N3F) in both quantitative and qualitative evaluation on the LLFF dataset.
Achieves comparable segmentation performance to annotation-based NeRF-SOS.
Supports both image patch and text queries for flexible object extraction and editing. |
The density-based representations can lead to noise in the decomposition.
Lacks 3D inpainting capabilities, potentially struggling with scenes containing occlusions or missing parts. |
neural radiance fields, 3d scene decomposition, multi-modal learning, feature distillation, contrastive learning |
2405.04834
Report |
FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation |
Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang |
Controllable text-to-image (T2I) diffusion models generate images conditioned
on both text prompts and semantic inputs of other modalities like edge maps.
Nevertheless, current controllable T2I methods commonly face challenges related
to efficiency and faithfulness, especially when conditioning on multiple inputs
from either the same or diverse modalities. In this paper, we propose a novel
Flexible and Efficient method, FlexEControl, for controllable T2I generation.
At the core of FlexEControl is a unique weight decomposition strategy, which
allows for streamlined integration of various input types. This approach not
only enhances the faithfulness of the generated image to the control, but also
significantly reduces the computational overhead typically associated with
multimodal conditioning. Our approach achieves a reduction of 41% in trainable
parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it
doubles data efficiency and can flexibly generate images under the guidance of
multiple input conditions of various modalities. |
The paper introduces FlexEControl, a novel approach for multimodal control in text-to-image generation that enhances efficiency without compromising the flexibility or controllability of existing methods. |
Existing methods for incorporating structural conditions in text-to-image generation often require extensive training, leading to inefficiencies in model development and deployment. FlexEControl addresses this limitation by enabling efficient training while maintaining controllability and flexibility. |
FlexEControl builds upon the architecture of Uni-ControlNet, incorporating a multi-scale condition injection strategy with learnable convolutional layers. It leverages pretrained Stable Diffusion weights and employs efficient training techniques. |
FlexEControl achieves comparable or superior performance compared to state-of-the-art baselines like T2I-Adapter, PHM, Uni-ControlNet, and LoRA across various structural conditions including edge maps, sketches, pose information, depth maps, and segmentation maps.
The method demonstrates a significant reduction in training time and computational resources, enhancing efficiency without sacrificing performance.
FlexEControl excels in generating images that adhere to both textual prompts and structural conditions, showcasing its efficacy in multimodal control for text-to-image generation. |
The paper acknowledges the potential limitations of the selected structural condition extraction methods, which may impact the overall performance.
Future work aims to explore more advanced architectures and training techniques to further enhance the efficiency and controllability of the approach. |
text-to-image generation, multimodal control, efficient training, stable diffusion, structural conditions |
2405.04682
Report |
TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation |
Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang |
Recent advances in diffusion-based generative modeling have led to the
development of text-to-video (T2V) models that can generate high-quality videos
conditioned on a text prompt. Most of these T2V models often produce
single-scene video clips that depict an entity performing a particular action
(e.g., `a red panda climbing a tree'). However, it is pertinent to generate
multi-scene videos since they are ubiquitous in the real-world (e.g., `a red
panda climbing a tree' followed by `the red panda sleeps on the top of the
tree'). To generate multi-scene videos from the pretrained T2V model, we
introduce Time-Aligned Captions (TALC) framework. Specifically, we enhance the
text-conditioning mechanism in the T2V architecture to recognize the temporal
alignment between the video scenes and scene descriptions. For instance, we
condition the visual features of the earlier and later scenes of the generated
video with the representations of the first scene description (e.g., `a red
panda climbing a tree') and second scene description (e.g., `the red panda
sleeps on the top of the tree'), respectively. As a result, we show that the
T2V model can generate multi-scene videos that adhere to the multi-scene text
descriptions and be visually consistent (e.g., entity and background). Further,
we finetune the pretrained T2V model with multi-scene video-text data using the
TALC framework. We show that the TALC-finetuned model outperforms the baseline
methods by 15.5 points in the overall score, which averages visual consistency
and text adherence using human evaluation. The project website is
https://talc-mst2v.github.io/. |
The paper proposes Time-Aligned Captions (TALC), a framework for generating multi-scene videos from text using pre-trained text-to-video diffusion models by conditioning parts of the video on corresponding scene descriptions. |
Most existing text-to-video models struggle to generate coherent multi-scene videos, limiting their applicability to real-world scenarios where such videos are common. |
TALC modifies the text conditioning mechanism of diffusion models to align visual features of specific video segments with embeddings of corresponding scene descriptions. The paper also introduces a method to create a multi-scene video-text dataset using Gemini-Pro-Vision for fine-tuning. |
TALC, without fine-tuning, outperforms baselines like merging captions or videos in terms of visual consistency and text adherence.
Fine-tuning the model with TALC and a multi-scene dataset further improves performance, particularly text adherence, as measured by automatic and human evaluation.
The proposed approach maintains visual quality comparable to the base model, unlike fine-tuning with naively merged captions. |
The performance of multi-scene video generation decreases as the number of scenes increases.
The reliance on scene detection for multi-scene data generation could introduce errors if the detection is inaccurate. |
text-to-video generation, diffusion models, multi-scene video, video generation, text conditioning |
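The time-aligned conditioning described in the TALC record above can be illustrated with a minimal sketch. It assumes the frames of a clip are split into contiguous, near-equal segments, one per scene description; the record only states that earlier and later scenes are conditioned on the first and second descriptions, so the equal split is an assumption. The function names `assign_frames_to_scenes` and `time_aligned_captions` are hypothetical and are not taken from the TALC code base.

```python
from typing import List


def assign_frames_to_scenes(num_frames: int, num_scenes: int) -> List[int]:
    """Map each frame index to a scene index using contiguous, near-equal segments.

    Assumption: scenes occupy equal shares of the clip, in caption order.
    """
    if num_scenes <= 0 or num_frames < num_scenes:
        raise ValueError("need at least one frame per scene")
    return [f * num_scenes // num_frames for f in range(num_frames)]


def time_aligned_captions(num_frames: int, captions: List[str]) -> List[str]:
    """Return, for every frame, the caption that frame would be conditioned on."""
    scene_ids = assign_frames_to_scenes(num_frames, len(captions))
    return [captions[s] for s in scene_ids]


captions = [
    "a red panda climbing a tree",
    "the red panda sleeps on the top of the tree",
]
# Frames 0-3 are paired with the first caption, frames 4-7 with the second.
per_frame = time_aligned_captions(8, captions)
```

In a real T2V model the per-frame caption would be replaced by that caption's text embedding inside the text-conditioning mechanism; the sketch only shows the frame-to-scene bookkeeping.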
2405.04533
Report |
ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning |
Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black |
Numerous methods have been proposed to detect, estimate, and analyze
properties of people in images, including the estimation of 3D pose, shape,
contact, human-object interaction, emotion, and more. Each of these methods
works in isolation instead of synergistically. Here we address this problem and
build a language-driven human understanding system -- ChatHuman, which combines
and integrates the skills of many different methods. To do so, we finetune a
Large Language Model (LLM) to select and use a wide variety of existing tools
in response to user inputs. In doing so, ChatHuman is able to combine
information from multiple tools to solve problems more accurately than the
individual tools themselves and to leverage tool output to improve its ability
to reason about humans. The novel features of ChatHuman include leveraging
academic publications to guide the application of 3D human-related tools,
employing a retrieval-augmented generation model to generate
in-context-learning examples for handling new tools, and discriminating and
integrating tool results to enhance 3D human understanding. Our experiments
show that ChatHuman outperforms existing models in both tool selection accuracy
and performance across multiple 3D human-related tasks. ChatHuman is a step
towards consolidating diverse methods for human analysis into a single,
powerful, system for 3D human reasoning. |
ChatHuman is a multi-modal Large Language Model (LLM) specialized for 3D human understanding. It leverages a wide range of existing 3D human analysis tools to perform tasks like pose estimation, shape measurement, and contact reasoning. |
Existing 3D human analysis methods often work in isolation. ChatHuman integrates these disparate tools into a single system, enabling more accurate and comprehensive 3D human reasoning. |
ChatHuman uses a paper-based Retrieval-Augmented Generation (RAG) mechanism to understand tool functions by reading relevant academic papers. It is fine-tuned on a dataset of instruction-following data constructed with GPT-4V, learning to select, use, discriminate, and integrate tool results. |
ChatHuman outperforms existing LLM-based methods in tool selection and usage accuracy, especially for tools unseen during training.
It achieves state-of-the-art performance on various 3D human understanding tasks, including pose estimation, shape measurement, and human-object interaction detection.
The model demonstrates the ability to combine tool outputs with its general knowledge to solve complex reasoning tasks, like reasoning-based pose estimation and speculative pose generation. |
ChatHuman's performance depends on the clarity of user requests and the capabilities of existing tools.
It currently primarily focuses on text and image modalities, with limited exploration of video and motion analysis. |
3d human understanding, large language models, tool use, retrieval-augmented generation, multi-modal learning |
2405.04496
Report |
Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing |
Yi Zuo, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Shuyuan Yang, Yuwei Guo |
Existing diffusion-based video editing methods have achieved impressive
results in motion editing. Most of the existing methods focus on the motion
alignment between the edited video and the reference video. However, these
methods do not constrain the background and object content of the video to
remain unchanged, which makes it possible for users to generate unexpected
videos. In this paper, we propose a one-shot video motion editing method called
Edit-Your-Motion that requires only a single text-video pair for training.
Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to
decouple spatio-temporal features in space-time diffusion models. DPL separates
learning object content and motion into two training stages. In the first
training stage, we focus on learning the spatial features (the features of
object content) and breaking down the temporal relationships in the video
frames by shuffling them. We further propose Recurrent-Causal Attention
(RC-Attn) to learn the consistent content features of the object from unordered
video frames. In the second training stage, we restore the temporal
relationship in video frames to learn the temporal feature (the features of the
background and object's motion). We also adopt the Noise Constraint Loss to
smooth out inter-frame differences. Finally, in the inference stage, we inject
the content features of the source object into the editing branch through a
two-branch structure (editing branch and reconstruction branch). With
Edit-Your-Motion, users can edit the motion of objects in the source video to
generate more exciting and diverse videos. Comprehensive qualitative
experiments, quantitative experiments and user preference studies demonstrate
that Edit-Your-Motion performs better than other methods. |
Proposes Edit-Your-Motion, a one-shot video motion editing method that decouples spatio-temporal features in space-time diffusion models for accurate motion editing while preserving object content and background. |
Existing video motion editing methods struggle to maintain content and background consistency due to entangled spatial and temporal features in the diffusion models. |
Introduces Detailed Prompt-Guided Learning Strategy (DPL) with two training stages: (1) learning spatial features from shuffled, background-masked frames using Recurrent-Causal Attention, and (2) learning temporal features from ordered frames using Temporal Attention and Noise Constraint Loss. Employs a two-branch structure during inference to inject spatial features from the reconstruction branch into the editing branch. |
Achieves accurate motion alignment with the reference video while preserving the source video's object content and background.
Outperforms state-of-the-art methods in both qualitative and quantitative evaluations, including CLIP similarity and LPIPS metrics.
Demonstrates superior performance in user studies, with participants preferring Edit-Your-Motion for text alignment, content alignment, and motion alignment. |
Two-stage training demands considerable computational resources.
Further exploration is needed to enable video motion editing with limited computational power. |
video motion editing, space-time diffusion model, detailed prompt-guided learning, recurrent-causal attention, noise constraint loss |
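The two-stage data handling described in the Edit-Your-Motion record above (shuffled frames in stage one, ordered frames in stage two) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helpers `shuffle_frames` and `restore_order` are hypothetical, frames are stood in for by strings, and background masking and the attention modules are omitted.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_frames(frames: Sequence[str], seed: int) -> Tuple[List[str], List[int]]:
    """Stage one: break temporal relationships by permuting the frame order.

    Returns the shuffled frames and the permutation that produced them.
    """
    perm = list(range(len(frames)))
    random.Random(seed).shuffle(perm)
    return [frames[i] for i in perm], perm


def restore_order(shuffled: Sequence[str], perm: Sequence[int]) -> List[str]:
    """Stage two: recover the original temporal order from the permutation."""
    restored = [""] * len(shuffled)
    for position, original_index in enumerate(perm):
        restored[original_index] = shuffled[position]
    return restored


frames = [f"frame_{i}" for i in range(6)]
shuffled, perm = shuffle_frames(frames, seed=0)
ordered = restore_order(shuffled, perm)
```

In practice stage two could simply reuse the original ordered clip; `restore_order` is included only to show that the shuffle loses no frame content, only temporal order.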
2405.04404
Report |
Vision Mamba: A Comprehensive Survey and Taxonomy |
Xiao Liu, Chenxu Zhang, Lei Zhang |
State Space Model (SSM) is a mathematical model used to describe and analyze
the behavior of dynamic systems. This model has witnessed numerous applications
in several fields, including control theory, signal processing, economics and
machine learning. In the field of deep learning, state space models are used to
process sequence data, such as time series analysis, natural language
processing (NLP) and video understanding. By mapping sequence data to state
space, long-term dependencies in the data can be better captured. In
particular, modern SSMs have shown strong representational capabilities in NLP,
especially in long sequence modeling, while maintaining linear time complexity.
Notably, based on the latest state-space models, Mamba merges time-varying
parameters into SSMs and formulates a hardware-aware algorithm for efficient
training and inference. Given its impressive efficiency and strong long-range
dependency modeling capability, Mamba is expected to become a new AI
architecture that may outperform Transformer. Recently, a number of works have
attempted to study the potential of Mamba in various fields, such as general
vision, multi-modal, medical image analysis and remote sensing image analysis,
by extending Mamba from natural language domain to visual domain. To fully
understand Mamba in the visual domain, we conduct a comprehensive survey and
present a taxonomy study. This survey focuses on Mamba's application to a
variety of visual tasks and data types, and discusses its predecessors, recent
advances and far-reaching impact on a wide range of domains. Since Mamba is now
on an upward trend, please notify us if you have new findings; new
progress on Mamba will be included in this survey in a timely manner and
updated on the Mamba project at
https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy. |
This paper presents a comprehensive survey of Vision Mamba, a recent advancement in deep learning that leverages State Space Models (SSMs) for visual tasks and is positioned as a potential alternative to traditional CNN and Transformer architectures. |
Vision Mamba is gaining increasing attention due to its superior performance in visual tasks, particularly its ability to efficiently process long sequences and handle high-resolution images, making it a potential game-changer in the field. |
The paper systematically categorizes Vision Mamba variants based on their applications, such as general vision, multi-modal learning, and vertical domains like remote sensing and medical image analysis. It provides a detailed taxonomy, principles, and technical details of each variant. |
Vision Mamba models exhibit remarkable computational efficiency and effectiveness in various tasks, including image classification, object detection, semantic segmentation, video analysis, and image restoration.
They excel in handling high-resolution inputs and complex data dependencies, achieving superior performance with lower computational costs compared to CNNs and Transformers.
The survey highlights the advancements in scanning mechanisms and synergistic hybrid architectures that contribute to Vision Mamba's success. |
Further research is needed to design more sophisticated scanning mechanisms to optimize Vision Mamba's performance for specific visual tasks, especially for capturing intricate spatial relationships.
Exploring the combination of Vision Mamba with other architectures like Transformers, while addressing their inherent differences in sequence modeling, holds potential for further performance improvement. |
state space model, vision mamba, computer vision, deep learning, multi-modal learning, remote sensing, medical image analysis |
2405.04356
Report |
Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation |
Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, Kwanghoon Sohn |
We present a new multi-modal face image generation method that converts a
text prompt and a visual input, such as a semantic mask or scribble map, into a
photo-realistic face image. To do this, we combine the strengths of Generative
Adversarial networks (GANs) and diffusion models (DMs) by employing the
multi-modal features in the DM into the latent space of the pre-trained GANs.
We present a simple mapping and a style modulation network to link two models
and convert meaningful representations in feature maps and attention maps into
latent codes. With GAN inversion, the estimated latent codes can be used to
generate 2D or 3D-aware facial images. We further present a multi-step training
strategy that reflects textual and structural representations into the
generated image. Our proposed network produces realistic 2D, multi-view, and
stylized face images, which align well with inputs. We validate our method by
using pre-trained 2D and 3D GANs, and our results outperform existing methods.
Our project page is available at
https://github.com/1211sh/Diffusion-driven_GAN-Inversion/. |
This paper introduces a novel multi-modal face image generation method that leverages the strengths of both diffusion models (DMs) and Generative Adversarial Networks (GANs). The proposed approach uses a pre-trained DM as an encoder to extract multi-modal features from text prompts and visual inputs (e.g., semantic masks, scribbles) and then maps them into the latent space of a pre-trained GAN for high-quality face image synthesis. |
Existing methods for multi-modal face image generation struggle to effectively combine text and visual inputs, often leading to inconsistencies between the generated image and the input conditions. This work addresses these limitations by introducing a novel framework that effectively bridges the gap between DMs and GANs, enabling more accurate and controllable face image generation. |
The proposed method utilizes a mapping network to connect the latent spaces of the pre-trained DM and GAN. An attention-based style modulation network refines the mapped latent code by leveraging multi-scale features and cross-attention maps from the DM decoder, capturing fine-grained details from the input text and visual conditions. The model is trained across multiple denoising steps to effectively capture the evolving semantic representations in the DM. |
The method generates high-quality 2D and 3D-aware face images that are consistent with both text prompts and visual inputs, outperforming existing GAN-based and DM-based methods in terms of visual quality and semantic accuracy.
The proposed approach demonstrates superior performance in preserving the identity of the input image while modifying facial attributes based on the given text prompt, as evidenced by quantitative metrics like ID similarity.
Ablation studies validate the effectiveness of each component, highlighting the importance of the mapping network, the attention-based style modulation network, and the multi-step training strategy. |
The method currently faces limitations in transferring significantly distinct styles from artistic domains to the photo-realistic domain of GANs.
Future work will explore mapping diffusion features related to pose into the GAN latent space to enhance 3D-aware face style transfer. |
multi-modal image generation, face image synthesis, diffusion models, gan inversion, attention mechanisms |
2405.04312
Report |
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer |
Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang |
Diffusion models have shown remarkable performance in image generation in
recent years. However, due to a quadratic increase in memory during generating
ultra-high-resolution images (e.g. 4096*4096), the resolution of generated
images is often limited to 1024*1024. In this work, we propose a unidirectional
block attention mechanism that can adaptively adjust the memory overhead during
the inference process and handle global dependencies. Building on this module,
we adopt the DiT structure for upsampling and develop an infinite
super-resolution model capable of upsampling images of various shapes and
resolutions. Comprehensive experiments show that our model achieves SOTA
performance in generating ultra-high-resolution images in both machine and
human evaluation. Compared to commonly used UNet structures, our model can save
more than 5x memory when generating 4096*4096 images. The project URL is
https://github.com/THUDM/Inf-DiT. |
The paper presents Inf-DiT, an infinite-resolution diffusion model capable of memory-efficient upsampling of images of various shapes and resolutions, especially ultra-high-resolution images. |
Existing image diffusion models struggle to generate ultra-high-resolution images due to quadratic memory increase, limiting their application in various fields. |
The paper proposes a Unidirectional Block Attention (UniBA) algorithm to reduce memory consumption from O(N^2) to O(N) by dividing the image into blocks and processing them sequentially, while maintaining global consistency. This allows for generating parts of the image in parallel based on memory restrictions. Additionally, global and local consistency techniques are employed using CLIP image embedding and nearby LR cross-attention. |
Inf-DiT achieves state-of-the-art performance in ultra-high-resolution image generation on HPDv2 dataset, outperforming baselines in FID and FIDcrop metrics.
It excels in classic super-resolution tasks on DIV2K dataset, surpassing other models in perceptual and fidelity metrics.
Human evaluation confirms Inf-DiT’s superiority in detail authenticity, global coherence, and consistency with low-resolution input. |
Correcting inaccuracies from earlier upsampling stages in iterative upsampling needs further exploration.
The model's performance with different block sizes and their impact on memory and generation quality require further investigation. |
diffusion models, ultra-high-resolution generation, super-resolution, unidirectional block attention, memory efficiency |
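The block-wise, one-directional processing described in the Inf-DiT record above can be illustrated with a small dependency check. The sketch assumes each block depends only on its top-left, top, and left neighbours and that blocks are visited in raster order; the record itself only says that the image is divided into blocks that are processed sequentially, so both choices are assumptions. All names (`dependencies`, `raster_order`, `is_valid_order`) are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (row, column) index of an image block


def dependencies(row: int, col: int) -> List[Block]:
    """Blocks that must already exist before block (row, col) is generated.

    Assumption: only the top-left, top, and left neighbours are attended to.
    """
    candidates = [(row - 1, col - 1), (row - 1, col), (row, col - 1)]
    return [(r, c) for r, c in candidates if r >= 0 and c >= 0]


def raster_order(rows: int, cols: int) -> List[Block]:
    """Visit blocks left to right, top to bottom."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def is_valid_order(order: List[Block]) -> bool:
    """Check that every block appears after all of its dependencies."""
    seen = set()
    for block in order:
        if any(dep not in seen for dep in dependencies(*block)):
            return False
        seen.add(block)
    return True


order = raster_order(3, 4)
valid = is_valid_order(order)
```

Because dependencies point in one direction only, a block that is no longer needed by any pending block can be dropped from memory, which is the intuition behind generating very large images without holding every block at once.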
2405.04233
Report |
Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models |
Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu |
We introduce Vidu, a high-performance text-to-video generator that is capable
of producing 1080p videos up to 16 seconds in a single generation. Vidu is a
diffusion model with U-ViT as its backbone, which unlocks the scalability and
the capability for handling long videos. Vidu exhibits strong coherence and
dynamism, and is capable of generating both realistic and imaginative videos,
as well as understanding some professional photography techniques, on par with
Sora -- the most powerful reported text-to-video generator. Finally, we perform
initial experiments on other controllable video generation, including
canny-to-video generation, video prediction and subject-driven generation,
which demonstrate promising results. |
Vidu is a high-performance text-to-video generator that produces 1080p videos up to 16 seconds long in a single generation, using a U-ViT backbone for scalability and long sequence modeling. |
Breaks duration limitations of previous video generation models that primarily relied on U-Net backbones and focused on shorter durations. |
Employs a video autoencoder for dimensionality reduction, and a U-ViT model for noise prediction, trained on a vast dataset of text-video pairs automatically annotated using a high-performance video captioner. |
Generates coherent and dynamic videos with 3D consistency, cuts, transitions, camera movements, lighting effects, and emotional portrayal.
Exhibits imaginative ability, generating scenes beyond real-world scenarios.
Shows promising results in controllable video generation tasks like canny-to-video, video prediction, and subject-driven generation. |
Occasional flaws in details and interactions between subjects.
Limited exploration of controllable generation at higher resolutions. |
text-to-video generation, diffusion models, u-vit, video synthesis, controllable generation |
2405.04007
Report |
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing |
Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan |
In this technical report, we introduce SEED-Data-Edit: a unique hybrid
dataset for instruction-guided image editing, which aims to facilitate image
manipulation using open-form language. SEED-Data-Edit is composed of three
distinct types of data: (1) High-quality editing data produced by an automated
pipeline, ensuring a substantial volume of diverse image editing pairs. (2)
Real-world scenario data collected from the internet, which captures the
intricacies of user intentions for promoting the practical application of image
editing in the real world. (3) High-precision multi-turn editing data annotated
by humans, which involves multiple rounds of edits for simulating iterative
editing processes. The combination of these diverse data sources makes
SEED-Data-Edit a comprehensive and versatile dataset for training
language-guided image editing model. We fine-tune a pretrained Multimodal Large
Language Model (MLLM) that unifies comprehension and generation with
SEED-Data-Edit. The instruction tuned model demonstrates promising results,
indicating the potential and effectiveness of SEED-Data-Edit in advancing the
field of instructional image editing. The datasets are released in
https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit. |
Introduces SEED-Data-Edit, a hybrid dataset for instruction-guided image editing, combining automated, real-world, and multi-turn data. |
Addresses the lack of high-quality, large-scale datasets for training models in the challenging field of instruction-guided image editing. |
Combines three data sources: 1) Automated pipeline generating 'remove' and 'add' edits and style/object changes, 2) Real-world editing requests from photography websites, 3) Human-annotated multi-turn edits simulating iterative editing. |
SEED-Data-Edit contains 3.7M image pairs and 21K multi-turn sequences (up to 5 rounds).
SEED-X-Edit, an MLLM fine-tuned on the dataset, shows promising results in following editing instructions.
SEED-X-Edit outperforms baseline models, demonstrating the dataset's potential in advancing instructional image editing. |
Current model training utilizes multi-turn data in a single-turn way.
Future work will explore true multi-turn image editing. |
image editing, instruction-guided, dataset, multimodal, large language model |
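The limitation noted in the SEED-Data-Edit entry above, that multi-turn data is currently consumed in a single-turn way, can be illustrated with a small sketch. The record layout (a list of image names plus a list of instructions) and the function name `flatten_multi_turn` are assumptions made for illustration, not the dataset's actual schema.

```python
from typing import List, Tuple


def flatten_multi_turn(
    images: List[str], instructions: List[str]
) -> List[Tuple[str, str, str]]:
    """Split one multi-turn sequence into independent single-turn pairs.

    images[i] is edited by instructions[i] to produce images[i + 1], so
    there must be exactly one more image than there are instructions.
    """
    if len(images) != len(instructions) + 1:
        raise ValueError("expected len(images) == len(instructions) + 1")
    return [
        (images[i], instructions[i], images[i + 1])
        for i in range(len(instructions))
    ]


# Hypothetical two-round editing sequence.
sequence_images = ["img0.jpg", "img1.jpg", "img2.jpg"]
sequence_instructions = ["remove the chair", "add a lamp"]
pairs = flatten_multi_turn(sequence_images, sequence_instructions)
```

Each resulting triple can be trained on in isolation, which discards the editing history that a true multi-turn model would condition on.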
2405.03958
Report |
Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model |
Joo Young Choi, Jaesung R. Park, Inkyu Park, Jaewoong Cho, Albert No, Ernest K. Ryu |
Current state-of-the-art diffusion models employ U-Net architectures
containing convolutional and (qkv) self-attention layers. The U-Net processes
images while being conditioned on the time embedding input for each sampling
step and the class or caption embedding input corresponding to the desired
conditional generation. Such conditioning involves scale-and-shift operations
to the convolutional layers but does not directly affect the attention layers.
While these standard architectural choices are certainly effective, not
conditioning the attention layers feels arbitrary and potentially suboptimal.
In this work, we show that simply adding LoRA conditioning to the attention
layers without changing or tuning the other parts of the U-Net architecture
improves the image generation quality. For example, a drop-in addition of LoRA
conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for
unconditional and class-conditional CIFAR-10 generation, improving upon the
baseline of 1.97/1.79. |
This paper introduces a novel method for conditioning attention layers in diffusion models using Low-Rank Adaptation (LoRA), improving image generation quality. |
Current state-of-the-art diffusion models lack direct conditioning on attention layers, potentially hindering performance optimization. This method addresses this gap by incorporating conditioning directly into attention mechanisms. |
The authors implement various LoRA conditioning methods including TimeLoRA, ClassLoRA for discrete-time settings, and UC-LoRA for continuous SNR settings. They evaluate these methods on popular diffusion model architectures like IDDPM and EDM trained on CIFAR-10, FFHQ64, and ImageNet datasets. |
Adding LoRA conditioning to attention layers consistently improves FID scores across different models and datasets, demonstrating its effectiveness.
LoRA conditioning alone, even without conventional scale-and-shift conditioning on convolutional layers, achieves comparable FID scores, highlighting its capability.
The method exhibits robustness in extrapolating conditioning information, showing potential for broader applications beyond class conditioning. |
The paper acknowledges limited exploration of optimal LoRA rank and the number of bases due to computational constraints.
Further research is needed to investigate the full potential of LoRA conditioning, particularly in large-scale diffusion models and text-to-image generation. |
diffusion models, low-rank adaptation (lora), attention mechanism, image generation, conditioning methods |
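As a rough illustration of the LoRA-conditioning entry above, the sketch below adds a condition-dependent low-rank update to a fixed projection, y = W x + s * B_c (A_c x). It is written loosely in the spirit of class-indexed LoRA; the function names, the dictionary layout, and the choice of one (A, B) pair per condition are assumptions, not the authors' implementation.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_conditioned_projection(
    w: Matrix,
    lora_a: Dict[int, Matrix],
    lora_b: Dict[int, Matrix],
    x: Vector,
    condition: int,
    scale: float = 1.0,
) -> Vector:
    """Return W x + scale * B_c (A_c x) for the given condition index c."""
    base = matvec(w, x)
    low_rank = matvec(lora_b[condition], matvec(lora_a[condition], x))
    return [u + scale * v for u, v in zip(base, low_rank)]


# Toy setup: 2-dimensional features, rank-1 adapters, two conditions.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = {0: [[1.0, 0.0]], 1: [[0.0, 1.0]]}
# Condition 0 keeps B at zero (the usual LoRA initialisation), so the
# projection output is unchanged; condition 1 has a non-zero B.
lora_b = {0: [[0.0], [0.0]], 1: [[1.0], [0.0]]}

x = [2.0, 3.0]
out_unchanged = lora_conditioned_projection(w, lora_a, lora_b, x, condition=0)
out_adapted = lora_conditioned_projection(w, lora_a, lora_b, x, condition=1)
```

Because the update is additive, the base weights `w` stay untouched, which is what makes this kind of conditioning a drop-in addition.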
2405.03894
Report |
MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View |
Emmanuelle Bourigault, Pauline Bourigault |
Generating consistent multiple views for 3D reconstruction tasks is still a
challenge to existing image-to-3D diffusion models. Generally, incorporating 3D
representations into diffusion models decreases the model's speed as well as
generalizability and quality. This paper proposes a general framework to
generate consistent multi-view images from one or more input images by leveraging a scene
representation transformer and a view-conditioned diffusion model. In the model,
we introduce epipolar geometry constraints and multi-view attention to enforce
3D consistency. From as few as one image input, our model is able to generate
3D meshes surpassing baseline methods in evaluation metrics, including PSNR,
SSIM and LPIPS. |
This paper presents MVDiff, a novel multi-view diffusion model for consistent image generation and 3D reconstruction using epipolar geometry constraints and multi-view attention within a transformer-based architecture. |
Existing image-to-3D diffusion models struggle with generating consistent multiple views, limiting their use in tasks requiring accurate 3D understanding. MVDiff addresses this by enhancing consistency and enabling high-quality 3D reconstruction from limited input. |
MVDiff leverages a scene representation transformer (SRT) to learn a latent 3D representation from input views. It then employs a view-conditioned latent diffusion model guided by epipolar geometry and multi-view attention to generate consistent novel views. These views are used for 3D reconstruction via techniques like NeuS. |
MVDiff achieves superior novel view synthesis performance on the GSO dataset compared to baselines like Zero123-XL, exhibiting significant improvements in PSNR, SSIM, and LPIPS.
The model demonstrates strong 3D reconstruction capabilities, outperforming methods like One-2-3-45 and SyncDreamer in Chamfer Distance and Volume IoU.
Ablation studies confirm the importance of both epipolar and multi-view attention mechanisms for achieving consistent and high-fidelity results. |
The model's computational cost, particularly during inference, presents a limitation. Future work could explore efficiency improvements.
While MVDiff shows promising results, the generation of implausible meshes remains a challenge. Expanding the training dataset and refining data curation could alleviate this. |
multi-view synthesis, 3d reconstruction, diffusion models, epipolar geometry, scene representation transformer |
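The epipolar geometry constraint mentioned in the MVDiff entry above rests on a standard identity: for a relative pose X2 = R X1 + t, corresponding normalized image points satisfy x2^T E x1 = 0 with E = [t]_x R. The sketch below checks that identity on a toy example. It illustrates only the textbook constraint, not how MVDiff injects it into attention; all function names are illustrative.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def skew(t: Sequence[float]) -> Matrix:
    """Cross-product matrix [t]_x, so that skew(t) @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """3x3 matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def essential_matrix(rotation: Matrix, translation: Sequence[float]) -> Matrix:
    """E = [t]_x R for the relative pose X2 = R X1 + t."""
    return matmul(skew(translation), rotation)


def epipolar_residual(
    e: Matrix, x1: Sequence[float], x2: Sequence[float]
) -> float:
    """x2^T E x1; zero when x1 and x2 are consistent projections."""
    ex1 = [sum(e[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * ex1[i] for i in range(3))


# Toy pose: no rotation, unit translation along the x axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e = essential_matrix(identity, (1.0, 0.0, 0.0))

# 3D point (0.5, 0.2, 2.0) seen in camera 1 and, after the pose, camera 2.
x1 = (0.5 / 2.0, 0.2 / 2.0, 1.0)
x2_match = (1.5 / 2.0, 0.2 / 2.0, 1.0)
x2_wrong = (1.5 / 2.0, 0.3, 1.0)  # moved off the epipolar line

residual_match = epipolar_residual(e, x1, x2_match)
residual_wrong = epipolar_residual(e, x1, x2_wrong)
```

A consistent pair gives a residual of zero up to rounding, while the perturbed point does not, which is the signal a consistency mechanism can exploit.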
2405.03689
Report |
Pose Priors from Language Models |
Sanjay Subramanian, Evonne Ng, Lea Müller, Dan Klein, Shiry Ginosar, Trevor Darrell |
We present a zero-shot pose optimization method that enforces accurate
physical contact constraints when estimating the 3D pose of humans. Our central
insight is that since language is often used to describe physical interaction,
large pretrained text-based models can act as priors on pose estimation.
We can thus leverage this insight to improve pose estimation by converting
natural language descriptors, generated by a large multimodal model (LMM), into
tractable losses to constrain the 3D pose optimization. Despite its simplicity,
our method produces surprisingly compelling pose reconstructions of people in
close contact, correctly capturing the semantics of the social and physical
interactions. We demonstrate that our method rivals more complex
state-of-the-art approaches that require expensive human annotation of contact
points and training specialized models. Moreover, unlike previous approaches,
our method provides a unified framework for resolving self-contact and
person-to-person contact. |
This paper presents ProsePose, a zero-shot pose optimization method that leverages large multimodal models (LMMs) to improve 3D human pose estimation by enforcing accurate physical contact constraints. |
Accurately capturing physical contact (self-contact and person-to-person) in 3D pose estimation is crucial for understanding human behavior and social interactions, but existing methods often struggle to do so accurately without expensive contact annotations. |
ProsePose uses an LMM to generate natural language descriptions of contact points from an image. These descriptions are then converted into mathematical constraints, and a loss function based on these constraints is used to optimize the 3D pose estimates from a pose regressor. |
ProsePose produces more accurate 3D pose reconstructions than previous zero-shot methods on multiple datasets of one- and two-person interactions.
The method accurately captures semantically relevant contact points, improving both joint error and the percentage of correct contact points (PCC).
This work demonstrates that LMMs have implicit knowledge of human pose and can be used as effective priors for 3D pose estimation. |
The method's performance depends on the accuracy and consistency of the LMM's outputs, as LMM hallucination of contact points can occur.
Future work could explore using multiple LMMs or developing more robust methods for handling LMM uncertainty and potential biases. |
3d pose estimation, contact inference, large multimodal models, language priors, zero-shot learning |
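The ProsePose entry above describes turning language descriptions of contact into losses. One plausible form, sketched here under stated assumptions, penalizes the minimum distance between the vertex sets of the two body parts named in each contact pair. The part names, the point-set representation, and both function names are hypothetical; the paper's exact loss may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def min_distance(points_a: Sequence[Point], points_b: Sequence[Point]) -> float:
    """Smallest Euclidean distance between any point of A and any point of B."""
    return min(math.dist(p, q) for p in points_a for q in points_b)


def contact_loss(
    body_parts: Dict[str, List[Point]],
    contact_pairs: Sequence[Tuple[str, str]],
) -> float:
    """Sum of minimum distances over all body-part pairs said to touch."""
    return sum(
        min_distance(body_parts[a], body_parts[b]) for a, b in contact_pairs
    )


# Hypothetical vertex subsets for two body parts (units: meters).
body_parts = {
    "left hand": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
    "right shoulder": [(0.1, 0.3, 0.0), (0.5, 0.5, 0.5)],
}
# A contact pair such as a language model might describe.
contact_pairs = [("left hand", "right shoulder")]

loss = contact_loss(body_parts, contact_pairs)
```

The loss reaches zero once the closest vertices of the two parts coincide, so minimizing it pulls the described parts into contact.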
2405.03685
Report |
Language-Image Models with 3D Understanding |
Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone |
Multi-modal large language models (MLLMs) have shown incredible capabilities
in a variety of 2D vision and language tasks. We extend MLLMs' perceptual
capabilities to ground and reason about images in 3-dimensional space. To that
end, we first develop a large-scale pre-training dataset for 2D and 3D called
LV3D by combining multiple existing 2D and 3D recognition datasets under a
common task formulation: as multi-turn question-answering. Next, we introduce a
new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data
scaling yields strong 3D perception capability without 3D-specific
architectural design or training objective. Cube-LLM exhibits intriguing
properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting
to improve 3D understanding from 2D context information. (2) Cube-LLM can
follow complex and diverse instructions and adapt to versatile input and output
formats. (3) Cube-LLM can be visually prompted, e.g., with a 2D box or a set of
candidate 3D boxes from specialists. Our experiments on outdoor benchmarks
demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3
points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7
points on the DriveLM dataset for complex reasoning about driving scenarios,
respectively. Cube-LLM also shows competitive results in general MLLM
benchmarks such as refCOCO for 2D grounding with an average score of 87.0, as well
as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for
complex reasoning. Our project is available at
https://janghyuncho.github.io/Cube-LLM. |
This work introduces Cube-LLM, a multi-modal large language model (MLLM) capable of reasoning in both 2D and 3D for image understanding, by leveraging a new large-scale pretraining dataset and a unified training framework. |
Extending the perceptual capabilities of MLLMs from 2D image coordinates to 3D view coordinates enables them to perceive and reason about visual input closer to how humans perceive the world, which is crucial for applications like autonomous driving. |
The authors create LV3D, a large-scale 2D and 3D pretraining dataset by unifying existing datasets and formulating tasks as multi-turn question-answering. They decompose 3D labels into simpler components (point, depth, size, orientation), enabling versatile input/output formats and inducing 2D to 3D generalization. They also introduce visual chain-of-thought prompting and specialist model prompting to improve reasoning. |
Cube-LLM significantly outperforms existing methods on 3D grounded reasoning tasks, achieving 21.3 points higher AP_BEV on the Talk2Car dataset.
Cube-LLM demonstrates strong performance in complex reasoning about driving scenarios, improving the overall score by 17.7 points on the DriveLM dataset.
Cube-LLM achieves state-of-the-art results in 2D referring expression comprehension (87.0 average score on refCOCO) and maintains competitive performance in standard MLLM benchmarks (VQAv2, GQA), showing that 3D reasoning is an expansion, not a trade-off. |
Cube-LLM currently does not employ resampling methods to reduce the number of vision tokens, limiting its input resolution.
Cube-LLM only supports single frame input, lacking the ability to reason about the dynamics of the environment from videos. |
multi-modal large language models, 3d scene understanding, foundation models, autonomous driving, visual grounding |
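The Cube-LLM entry above mentions decomposing 3D labels into point, depth, size, and orientation and posing them as multi-turn question-answering. A minimal sketch of that idea follows, assuming a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`. The question wording, dictionary keys, and function names are made up for illustration and do not reflect the LV3D format.

```python
from typing import Dict, List, Tuple


def decompose_box(
    center_xyz: Tuple[float, float, float],
    size_lwh: Tuple[float, float, float],
    yaw: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Dict[str, object]:
    """Split a 3D box label into a 2D point, depth, size, and orientation."""
    x, y, z = center_xyz
    if z <= 0:
        raise ValueError("box center must lie in front of the camera")
    return {
        "point_2d": (fx * x / z + cx, fy * y / z + cy),  # pinhole projection
        "depth": z,
        "size": size_lwh,
        "orientation": yaw,
    }


def to_dialogue(name: str, parts: Dict[str, object]) -> List[Tuple[str, str]]:
    """Order the components from easiest (2D) to hardest (3D) as QA turns."""
    return [
        (f"Where is the {name} in the image?", str(parts["point_2d"])),
        (f"How far away is the {name}?", str(parts["depth"])),
        (f"How large is the {name}?", str(parts["size"])),
        (f"Which way is the {name} facing?", str(parts["orientation"])),
    ]


parts = decompose_box(
    center_xyz=(1.0, 0.5, 10.0),
    size_lwh=(4.5, 1.8, 1.5),
    yaw=0.0,
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
dialogue = to_dialogue("car", parts)
```

Putting the 2D question first mirrors the idea of using 2D context as an intermediate step before the 3D answer.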
2405.03682
Report |
An Empty Room is All We Want: Automatic Defurnishing of Indoor Panoramas |
Mira Slavcheva, Dave Gausebeck, Kevin Chen, David Buchhofer, Azwad Sabik, Chen Ma, Sachal Dhillon, Olaf Brandt, Alan Dolhasz |
We propose a pipeline that leverages Stable Diffusion to improve inpainting
results in the context of defurnishing -- the removal of furniture items from
indoor panorama images. Specifically, we illustrate how increased context,
domain-specific model fine-tuning, and improved image blending can produce
high-fidelity inpaints that are geometrically plausible without needing to rely
on room layout estimation. We demonstrate qualitative and quantitative
improvements over other furniture removal techniques. |
This paper presents a pipeline for defurnishing indoor panorama images, leveraging Stable Diffusion for improved inpainting results. |
Defurnishing is crucial for digital twins in real estate, enabling personalized layouts, interior design experimentation, and property evaluation. |
The pipeline involves furniture segmentation, context maximization via rolling and padding, robust unfurnished space inpainting using a fine-tuned Stable Diffusion model trained on a dataset of unfurnished panoramas with synthetic furniture and shadows, superresolution, and a custom blending strategy. |
The fine-tuned Stable Diffusion model effectively reduces hallucinations of furniture in empty spaces.
The method produces high-fidelity inpaints that are geometrically plausible without requiring room layout estimation.
Quantitative and qualitative evaluations demonstrate superior performance compared to existing techniques like LaMa and LGPN-Net. |
The method may occasionally exhibit structural alterations or lingering hallucinations.
The reliance on synthetic data for training may lead to domain shift issues, impacting the quality of results on real-world images. |
image inpainting, stable diffusion, defurnishing, panorama images, digital twins |
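The defurnishing entry above lists "context maximization via rolling" as a pipeline step. Since a 360-degree panorama wraps around horizontally, one way to read this is: shift the columns circularly so that the region to inpaint sits in the middle of the image, where it has context on both sides. The sketch below shows that reading on a single row of pixels; the function names and the centering rule are assumptions, not the authors' procedure.

```python
from typing import List, Sequence


def centering_shift(mask_center_col: int, width: int) -> int:
    """Circular shift that moves mask_center_col to the middle column."""
    return (width // 2 - mask_center_col) % width


def roll_row(row: Sequence[int], shift: int) -> List[int]:
    """Circularly shift a row to the right by `shift` columns."""
    width = len(row)
    return [row[(i - shift) % width] for i in range(width)]


# One row of an 8-column panorama; values stand in for pixel columns.
row = [0, 1, 2, 3, 4, 5, 6, 7]
mask_center_col = 6  # the furniture mask is centered near the right edge

shift = centering_shift(mask_center_col, len(row))
rolled = roll_row(row, shift)
restored = roll_row(rolled, -shift)  # undo the roll after inpainting
```

Applying the inverse shift after inpainting returns the panorama to its original orientation.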
2405.03673
Report |
MemoryMamba: Memory-Augmented State Space Model for Defect Recognition |
Qianning Wang, He Hu, Yucheng Zhou |
As automation advances in manufacturing, the demand for precise and
sophisticated defect detection technologies grows. Existing vision models for
defect recognition are insufficient for handling the complexities and
variations of defects in contemporary manufacturing settings. These models
especially struggle in scenarios involving limited or imbalanced defect data.
In this work, we introduce MemoryMamba, a novel memory-augmented state space
model (SSM), designed to overcome the limitations of existing defect
recognition models. MemoryMamba integrates the state space model with the
memory augmentation mechanism, enabling the system to maintain and retrieve
essential defect-specific information in training. Its architecture is designed
to capture dependencies and intricate defect characteristics, which are crucial
for effective defect detection. In the experiments, MemoryMamba was evaluated
across four industrial datasets with diverse defect types and complexities. The
model consistently outperformed other methods, demonstrating its capability to
adapt to various defect recognition scenarios. |
This paper introduces MemoryMamba, a novel memory-augmented state space model (SSM) for defect recognition, designed to overcome limitations of existing models in handling complexities and variations of defects, especially with limited or imbalanced data. |
Accurate defect recognition is crucial in manufacturing for quality control, production efficiency, cost reduction, and product reliability. Existing methods struggle with limited or imbalanced defect data, common in industrial settings. |
MemoryMamba integrates SSMs with memory augmentation, enabling it to retain and retrieve defect-specific information. It uses coarse- and fine-grained memory networks with a fusion module, optimized by contrastive learning and mutual information maximization. |
MemoryMamba consistently outperforms existing models (ResNet, DeiT, Swin, Vmamba) in accuracy, precision, recall, and F1 score across four industrial datasets.
Ablation studies confirm the essential role of coarse-grained memory networks, fine-grained memory networks, and the fusion module in achieving superior performance.
The choice of similarity metric in the fusion module and memory size for both memory networks significantly impacts the model's effectiveness, with cosine similarity and specific sizes showing better results depending on the dataset. |
The optimal memory size is dataset-dependent and requires careful tuning.
The model's performance with other memory augmentation techniques or optimization strategies is yet to be explored. |
defect recognition, state space models, memory augmentation, computer vision, manufacturing |
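The MemoryMamba entry above reports that cosine similarity works well for matching features against memory. The sketch below shows a generic cosine-similarity lookup over a small memory bank. The memory contents, the query, and the function names are illustrative assumptions and do not reproduce the paper's fusion module.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(
    query: Sequence[float], memory: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index of the most similar memory slot and all similarities."""
    sims = [cosine_similarity(query, slot) for slot in memory]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Hypothetical memory bank with three stored defect prototypes.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.1]

best_index, similarities = retrieve(query, memory)
```

Cosine similarity ignores vector length, so a query with a large norm still matches the slot that points in the same direction.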
2405.03659
Report |
A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose |
Kaiwen Jiang, Yang Fu, Mukund Varma T, Yash Belhe, Xiaolong Wang, Hao Su, Ravi Ramamoorthi |
Novel view synthesis from a sparse set of input images is a challenging
problem of great practical interest, especially when camera poses are absent or
inaccurate. Direct optimization of camera poses and usage of estimated depths
in neural radiance field algorithms usually do not produce good results because
of the coupling between poses and depths, and inaccuracies in monocular depth
estimation. In this paper, we leverage the recent 3D Gaussian splatting method
to develop a novel construct-and-optimize method for sparse view synthesis
without camera poses. Specifically, we construct a solution progressively by
using monocular depth and projecting pixels back into the 3D world. During
construction, we optimize the solution by detecting 2D correspondences between
training views and the corresponding rendered images. We develop a unified
differentiable pipeline for camera registration and adjustment of both camera
poses and depths, followed by back-projection. We also introduce a novel notion
of an expected surface in Gaussian splatting, which is critical to our
optimization. These steps enable a coarse solution, which can then be low-pass
filtered and refined using standard optimization methods. We demonstrate
results on the Tanks and Temples and Static Hikes datasets with as few as three
widely-spaced views, showing significantly better quality than competing
methods, including those with approximate camera pose information. Moreover,
our results improve with more views and outperform previous InstantNGP and
Gaussian Splatting algorithms even when using half the dataset. |
This paper introduces a novel construct-and-optimize approach for sparse view synthesis using 3D Gaussian splatting, eliminating the need for known camera poses. |
Existing NeRF-based methods struggle with sparse view synthesis, especially when camera poses are unknown or inaccurate. This work addresses this challenge by constructing a solution based on monocular depth and optimizing it using correspondences. |
The method constructs a coarse solution by progressively registering and adjusting camera poses and depths via a differentiable pipeline that leverages 2D correspondences. It introduces a novel concept of an expected surface in Gaussian splatting for accurate correspondence matching. This coarse solution is then refined using standard optimization. |
Achieves state-of-the-art results on Tanks and Temples and Static Hikes datasets with as few as three views.
Significantly outperforms competing methods, including those using approximate camera pose information.
Performance improves with more views, outperforming previous methods even when using half the dataset. |
Constructing the coarse solution depends on the scale-consistent assumption of estimated monocular depth, which doesn't always hold for complex scenes.
Assumes overlap between consecutive frames, limiting its applicability to unordered image collections. |
view synthesis, 3d gaussian splatting, camera optimization, sparse view, correspondence matching |
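The construct-and-optimize entry above builds its solution by projecting pixels back into the 3D world using monocular depth. The basic back-projection step can be sketched as follows, assuming a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a camera-to-world pose given as a rotation matrix and translation. This covers only the geometric step, not the paper's registration or optimization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def back_project(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Point:
    """Lift pixel (u, v) with known depth into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def camera_to_world(
    point_cam: Sequence[float],
    rotation: List[List[float]],
    translation: Sequence[float],
) -> Point:
    """Apply X_world = R X_cam + t."""
    x, y, z = (
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


# Toy camera: identity rotation, shifted one unit along the z axis.
rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
translation = (0.0, 0.0, 1.0)

point_cam = back_project(
    u=740.0, v=410.0, depth=10.0,
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
point_world = camera_to_world(point_cam, rotation, translation)
```

Errors in the estimated depth or pose move `point_world` directly, which is why the entry stresses adjusting poses and depths jointly.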
2405.03486
Report |
UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images |
Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, Yang Zhang |
Image safety classifiers play an important role in identifying and mitigating
the spread of unsafe images online (e.g., images including violence, hateful
rhetoric, etc.). At the same time, with the advent of text-to-image models and
increasing concerns about the safety of AI models, developers are increasingly
relying on image safety classifiers to safeguard their models. Yet, the
performance of current image safety classifiers remains unknown for real-world
and AI-generated images. To bridge this research gap, in this work, we propose
UnsafeBench, a benchmarking framework that evaluates the effectiveness and
robustness of image safety classifiers. First, we curate a large dataset of 10K
real-world and AI-generated images that are annotated as safe or unsafe based
on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.).
Then, we evaluate the effectiveness and robustness of five popular image safety
classifiers, as well as three classifiers that are powered by general-purpose
visual language models. Our assessment indicates that existing image safety
classifiers are not comprehensive and effective enough in mitigating the
multifaceted problem of unsafe images. Also, we find that classifiers trained
only on real-world images tend to have degraded performance when applied to
AI-generated images. Motivated by these findings, we design and implement a
comprehensive image moderation tool called PerspectiveVision, which effectively
identifies 11 categories of real-world and AI-generated unsafe images. The best
PerspectiveVision model achieves an overall F1-Score of 0.810 on six evaluation
datasets, which is comparable with closed-source and expensive state-of-the-art
models like GPT-4V. UnsafeBench and PerspectiveVision can aid the research
community in better understanding the landscape of image safety classification
in the era of generative AI. |
This paper introduces UnsafeBench, a benchmarking framework for evaluating image safety classifiers on both real-world and AI-generated images. |
Image safety classifiers are crucial for mitigating the spread of harmful content online, but their performance on diverse and AI-generated imagery remains underexplored. |
The authors curate a dataset of 10K real-world and AI-generated images, labeled across 11 unsafe categories. They evaluate the effectiveness and robustness of 5 conventional and 3 VLM-based image safety classifiers. Additionally, they develop PerspectiveVision, a comprehensive image moderation tool. |
Existing image safety classifiers show imbalanced performance across unsafe categories and struggle with AI-generated images.
Classifiers trained on real-world images experience performance degradation on AI-generated images, likely due to distinct characteristics in the latter, such as artistic representations and grid layouts.
PerspectiveVision, the proposed image moderation tool, achieves comparable performance to GPT-4V in identifying unsafe images, showcasing its potential as a benchmark tool. |
The UnsafeBench dataset exhibits bias towards unsafe content and limited diversity in AI-generated images, potentially affecting the generalizability of findings.
The generalizability of PerspectiveVision is evaluated on a limited range of unsafe categories due to a lack of labeled datasets. |
image safety, content moderation, ai-generated images, benchmarking framework, visual language models |
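The UnsafeBench entry above reports classifier quality as an F1-Score. For reference, the metric for a binary safe/unsafe classifier can be computed as below. The label encoding (1 = unsafe, 0 = safe) and the toy predictions are assumptions for illustration and are unrelated to the benchmark's data.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = unsafe, 0 = safe.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
score = f1_score(y_true, y_pred)
```

Here there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is F1.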
2405.03436
Report |
DBDH: A Dual-Branch Dual-Head Neural Network for Invisible Embedded Regions Localization |
Chengxin Zhao, Hefei Ling, Sijing Xie, Nan Sun, Zongyi Li, Yuxuan Shi, Jiazhong Chen |
Embedding invisible hyperlinks or hidden codes in images to replace QR codes
has become a hot topic recently. This technology requires first localizing the
embedded region in the captured photos before decoding. Existing methods that
train models to find the invisible embedded region struggle to obtain accurate
localization results, leading to degraded decoding accuracy. This limitation is
primarily because CNNs are sensitive to low-frequency signals, while
the embedded signal is typically in the high-frequency form. Based on this,
this paper proposes a Dual-Branch Dual-Head (DBDH) neural network tailored for
the precise localization of invisible embedded regions. Specifically, DBDH uses
a low-level texture branch containing 62 high-pass filters to capture the
high-frequency signals induced by embedding. A high-level context branch is
used to extract discriminative features between the embedded and normal
regions. DBDH employs a detection head to directly detect the four vertices of
the embedding region. In addition, we introduce an extra segmentation head to
segment the mask of the embedding region during training. The segmentation head
provides pixel-level supervision for model learning, facilitating better
learning of the embedded signals. Based on two state-of-the-art invisible
offline-to-online messaging methods, we construct two datasets and augmentation
strategies for training and testing localization models. Extensive experiments
demonstrate the superior performance of the proposed DBDH over existing
methods. |
This paper proposes DBDH, a dual-branch dual-head neural network for accurate localization of invisible embedded regions in images. |
Accurate localization of embedded regions is crucial for decoding messages in invisible offline-to-online messaging systems, but existing methods struggle to accurately locate these regions due to the high-frequency nature of embedded signals. |
DBDH uses a low-level texture branch with high-pass filters to capture high-frequency embedded signals and a high-level context branch to extract discriminative features. It employs a vertex detection head for localization and a segmentation head during training for region-wise supervision. |
DBDH outperforms existing methods like StegaStamp and Invisible Markers in localization accuracy.
The use of high-pass filters in the texture branch is shown to be effective for capturing embedded signals.
The addition of the segmentation head during training improves localization performance. |
The model is evaluated on datasets based on only two specific offline-to-online messaging schemes.
Further research could explore the generalization of DBDH to other embedding methods and real-world scenarios. |
offline-to-online messaging, invisible embedded regions localization, high-pass filter, segmentation, keypoint detection |
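The DBDH entry above relies on high-pass filters to expose the high-frequency signal left by embedding. The sketch below applies a single generic 3x3 Laplacian-style kernel to show the principle: flat regions produce no response, while abrupt local changes do. The kernel is a textbook example and is not one of the 62 filters used in the paper.

```python
from typing import List

Image = List[List[int]]

# Generic Laplacian-style high-pass kernel; its weights sum to zero.
HIGH_PASS = [
    [0, -1, 0],
    [-1, 4, -1],
    [0, -1, 0],
]


def filter2d_valid(image: Image, kernel: Image) -> Image:
    """Slide the kernel over the image without padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A flat 4x4 patch (pure low-frequency content).
flat = [[5] * 4 for _ in range(4)]
# A 3x3 patch with a single bright pixel (high-frequency content).
spike = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]

flat_response = filter2d_valid(flat, HIGH_PASS)
spike_response = filter2d_valid(spike, HIGH_PASS)
```

Because the kernel weights sum to zero, any constant region is suppressed, leaving only the local variations that an embedded signal would introduce.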
2405.03349
Report |
Retinexmamba: Retinex-based Mamba for Low-light Image Enhancement |
Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, Xiaofeng Zhang |
In the field of low-light image enhancement, both traditional Retinex methods
and advanced deep learning techniques such as Retinexformer have shown distinct
advantages and limitations. Traditional Retinex methods, designed to mimic the
human eye's perception of brightness and color, decompose images into
illumination and reflection components but struggle with noise management and
detail preservation under low light conditions. Retinexformer enhances
illumination estimation through traditional self-attention mechanisms, but
faces challenges with insufficient interpretability and suboptimal enhancement
effects. To overcome these limitations, this paper introduces the RetinexMamba
architecture. RetinexMamba not only captures the physical intuitiveness of
traditional Retinex methods but also integrates the deep learning framework of
Retinexformer, leveraging the computational efficiency of State Space Models
(SSMs) to enhance processing speed. This architecture features innovative
illumination estimators and damage restorer mechanisms that maintain image
quality during enhancement. Moreover, RetinexMamba replaces the IG-MSA
(Illumination-Guided Multi-Head Attention) in Retinexformer with a
Fused-Attention mechanism, improving the model's interpretability. Experimental
evaluations on the LOL dataset show that RetinexMamba outperforms existing deep
learning approaches based on Retinex theory in both quantitative and
qualitative metrics, confirming its effectiveness and superiority in enhancing
low-light images. |
This paper presents RetinexMamba, a novel architecture for low-light image enhancement leveraging Retinex theory and State Space Models (SSMs). |
Traditional Retinex methods and deep learning techniques, while effective, have limitations in noise management, detail preservation, interpretability, and computational efficiency in low-light image enhancement. |
RetinexMamba combines an Illumination Estimator (inspired by traditional Retinex) with a Damage Restorer based on Illumination Fusion Visual Mamba (IFVM). IFVM utilizes Illumination Fusion State Space Model (IFSSM) featuring 2D Selective Scanning (SS2D) for linear computational efficiency and Illumination Fusion Attention (IFA) for improved interpretability. |
RetinexMamba outperforms existing deep learning methods based on Retinex theory on the LOL dataset in quantitative metrics like PSNR and RMSE.
Qualitative results show RetinexMamba effectively controls exposure, reduces color distortion, and minimizes noise compared to other SOTA algorithms.
Ablation studies demonstrate the benefits of individual components like SS2D and IFA in improving performance. |
Despite reduced complexity in SS2D, the overall parameter count is increased, leading to higher computational resource consumption.
Future work will focus on reducing the total number of parameters while preserving computational efficiency. |
retinex, low-light enhancement, fused-attention, retinexformer, state space model |
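The RetinexMamba entry above builds on Retinex theory, which models an image as the element-wise product of reflectance and illumination, I = R * L. The sketch below recovers a reflectance estimate by dividing each pixel by a crude illumination estimate (the maximum over its color channels). That estimator is a simple classical prior chosen for illustration; it is not the learned Illumination Estimator described in the entry.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]


def estimate_illumination(pixel: Pixel, eps: float = 1e-6) -> float:
    """Crude per-pixel illumination: the brightest channel, floored at eps."""
    return max(max(pixel), eps)


def retinex_reflectance(image: List[List[Pixel]]) -> List[List[Pixel]]:
    """Estimate reflectance R = I / L pixel by pixel."""
    result = []
    for row in image:
        out_row = []
        for pixel in row:
            light = estimate_illumination(pixel)
            r, g, b = (channel / light for channel in pixel)
            out_row.append((r, g, b))
        result.append(out_row)
    return result


# A one-row, two-pixel dark image with values in [0, 1].
dark_image = [[(0.1, 0.2, 0.05), (0.0, 0.0, 0.0)]]
reflectance = retinex_reflectance(dark_image)
```

Dividing by the illumination brightens the dark pixel while keeping its channel ratios, but it also amplifies any noise, which is the weakness a restoration stage has to address.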
2405.03318
Report |
Enhancing DETRs Variants through Improved Content Query and Similar Query Aggregation |
Yingying Zhang, Chuangji Shi, Xin Guo, Jiangwei Lao, Jian Wang, Jiaotuan Wang, Jingdong Chen |
The design of the query is crucial for the performance of DETR and its
variants. Each query consists of two components: a content part and a
positional one. Traditionally, the content query is initialized with a zero or
learnable embedding, lacking essential content information and resulting in
sub-optimal performance. In this paper, we introduce a novel plug-and-play
module, Self-Adaptive Content Query (SACQ), to address this limitation. The
SACQ module utilizes features from the transformer encoder to generate content
queries via self-attention pooling. This allows candidate queries to adapt to
the input image, resulting in a more comprehensive content prior and better
focus on target objects. However, this improved concentration poses a challenge
for the training process that utilizes the Hungarian matching, which selects
only a single candidate and suppresses other similar ones. To overcome this, we
propose a query aggregation strategy to cooperate with SACQ. It merges similar
predicted candidates from different queries, easing the optimization. Our
extensive experiments on the COCO dataset demonstrate the effectiveness of our
proposed approaches across six different DETR variants with multiple
configurations, achieving an average improvement of over 1.0 AP. |
This paper introduces a novel plug-and-play module named Self-Adaptive Content Query (SACQ) for improving object detection in DETR and its variants by optimizing the content aspect of object queries. |
The content query, crucial for DETR's performance, is often initialized with limited information, leading to sub-optimal results. This paper addresses this limitation to enhance object detection accuracy. |
The SACQ module utilizes self-attention pooling to generate content queries from transformer encoder features. It uses global pooling for initialization and local pooling for refinement. A Query Aggregation (QA) strategy is also proposed to merge similar predictions, further boosting performance. |
SACQ consistently improves performance across six different DETR variants with an average gain of over 1.0 AP on the COCO dataset.
The method shows effectiveness in both iterative bounding box refinement and two-stage Deformable-DETR settings.
Visualization of attention maps confirms that SACQ accurately focuses on target objects, demonstrating its effectiveness in content query enhancement. |
The performance gain on the state-of-the-art DINO method is not significant, suggesting further research on joint optimization of content query and matching strategies.
The influence of different thresholds in query aggregation and its interaction with SACQ requires further investigation. |
object detection, detr, transformer, content query, self-attention |
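The self-attention pooling that the SACQ report above describes can be illustrated with a minimal sketch. It covers only the global pooling step, and it assumes a single hypothetical learnable pooling vector (`pool_query`) that attends over toy encoder features with scaled dot-product attention; the paper's actual layer shapes, the local pooling used for refinement, and the query aggregation step are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, features):
    """Pool encoder features into one content query.

    query:    list of d floats (hypothetical learnable pooling vector)
    features: list of encoder tokens, each a list of d floats
    """
    d = len(query)
    # Scaled dot-product score between the pooling vector and every token.
    scores = [
        sum(q * f for q, f in zip(query, token)) / math.sqrt(d)
        for token in features
    ]
    weights = softmax(scores)
    # Weighted sum of tokens: the image-dependent content query.
    pooled = [
        sum(w * token[i] for w, token in zip(weights, features))
        for i in range(d)
    ]
    return pooled, weights


# Toy encoder output (assumption): 3 tokens with feature dimension 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pool_query = [2.0, 0.0]

content_query, weights = attention_pool(pool_query, features)
print(content_query, weights)
```

Because the weights depend on the encoder features, the resulting content query changes with the input image, unlike a zero or fixed learnable embedding.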
2405.03243
Report |
Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data |
Leonhard Hennicke, Christian Medeiros Adriano, Holger Giese, Jan Mathias Koehler, Lukas Schott |
Generative foundation models like Stable Diffusion comprise a diverse
spectrum of knowledge in computer vision with the potential for transfer
learning, e.g., via generating data to train student models for downstream
tasks. This could circumvent the necessity of collecting labeled real-world
data, thereby presenting a form of data-free knowledge distillation. However,
the resultant student models show a significant drop in accuracy compared to
models trained on real data. We investigate possible causes for this drop and
focus on the role of the different layers of the student model. By training
these layers using either real or synthetic data, we reveal that the drop
mainly stems from the model's final layers. Further, we briefly investigate
other factors, such as differences in data-normalization between synthetic and
real, the impact of data augmentations, texture vs. shape learning, and
assuming oracle prompts. While we find that some of those factors can have an
impact, they are not sufficient to close the gap towards real data. Building
upon our insights that mainly later layers are responsible for the drop, we
investigate the data-efficiency of fine-tuning a synthetically trained model
with real data applied to only those last layers. Our results suggest an
improved trade-off between the amount of real training data used and the
model's accuracy. Our findings contribute to the understanding of the gap
between synthetic and real data and indicate solutions to mitigate the scarcity
of labeled real data. |
The study investigates transfer learning capabilities between models trained on real and synthetic ImageNet-100 datasets, particularly focusing on freezing initial layers pre-trained on one dataset and training remaining layers on the other. |
Understanding how knowledge transfers between real and synthetic datasets is crucial for leveraging synthetic data's potential in training models for real-world applications, especially when real data is scarce or expensive. |
The authors systematically experiment with freezing varying numbers of initial layers in a ResNet-like model. They pre-train models on either real or synthetic ImageNet-100, then train the remaining layers on the other dataset. Performance is evaluated on real ImageNet-100 validation data. |
Transferring knowledge from real to synthetic data is less effective than vice-versa.
Freezing a significant number of initial layers pre-trained on synthetic data shows comparable results to models trained entirely on real data.
Models with initial layers pre-trained on synthetic data exhibit better resilience to reductions in the amount of real training data. |
The study focuses solely on ImageNet-100; generalizability to other datasets needs further investigation.
Exploring the impact of different synthetic data generation techniques on transfer learning could be beneficial. |
transfer learning, synthetic data, imagenet, deep learning, computer vision |
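The layer-freezing protocol in the report above (pre-train on one dataset, freeze a varying number of initial layers, train the rest on the other dataset) can be sketched as a sweep over freeze depths. The layer names and helper functions below are hypothetical placeholders, not the authors' code; a real experiment would set the trainable flag on framework parameters instead of on strings.

```python
def freeze_initial_layers(layer_names, n_frozen):
    """Return (name, trainable) pairs with the first n_frozen layers frozen."""
    if not 0 <= n_frozen <= len(layer_names):
        raise ValueError("n_frozen must be between 0 and the number of layers")
    return [(name, index >= n_frozen) for index, name in enumerate(layer_names)]


def trainable_names(plan):
    """Names of the layers that would still receive gradient updates."""
    return [name for name, trainable in plan if trainable]


# Hypothetical ResNet-like stack, ordered from input to output.
LAYERS = ["stem", "stage1", "stage2", "stage3", "classifier"]

# One configuration per freeze depth: 0 frozen layers up to all frozen.
sweep = {
    n: trainable_names(freeze_initial_layers(LAYERS, n))
    for n in range(len(LAYERS) + 1)
}

for n_frozen, names in sweep.items():
    print(n_frozen, names)
```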
2405.03150
Report |
Video Diffusion Models: A Survey |
Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter |
Diffusion generative models have recently become a robust technique for
producing and modifying coherent, high-quality video. This survey offers a
systematic overview of critical elements of diffusion models for video
generation, covering applications, architectural choices, and the modeling of
temporal dynamics. Recent advancements in the field are summarized and grouped
into development trends. The survey concludes with an overview of remaining
challenges and an outlook on the future of the field. Website:
https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models |
This paper presents a survey of video diffusion models, focusing on applications, architectural choices, temporal dynamic modeling, and training approaches. |
Video diffusion models have the potential to revolutionize video generation, editing, and simulation, making this survey timely and relevant. |
The paper systematically categorizes existing work on video diffusion models by application, architectural choices for temporal modeling, training strategies, and benchmarks. It summarizes notable papers in each category and discusses their contributions. |
The survey identifies key architectural trends such as the use of UNets, Vision Transformers, cascaded, and latent diffusion models for video generation.
It highlights various methods for modeling temporal dynamics, including spatio-temporal attention mechanisms, temporal upsampling techniques, and structure preservation.
The paper discusses ongoing challenges including training data limitations, computational costs, and modeling long-term dependencies, while providing directions for future research. |
The rapid evolution of video diffusion models may lead to some discussed approaches becoming quickly outdated.
The survey primarily focuses on technical aspects and might benefit from a deeper discussion of ethical considerations surrounding generative video models. |
video diffusion models, generative models, video generation, video editing, deep learning |
2405.03121
Report |
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding |
Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu |
The paper introduces AniTalker, an innovative framework designed to generate
lifelike talking faces from a single portrait. Unlike existing models that
primarily focus on verbal cues such as lip synchronization and fail to capture
the complex dynamics of facial expressions and nonverbal cues, AniTalker
employs a universal motion representation. This innovative representation
effectively captures a wide range of facial dynamics, including subtle
expressions and head movements. AniTalker enhances motion depiction through two
self-supervised learning strategies: the first involves reconstructing target
video frames from source frames within the same identity to learn subtle motion
representations, and the second develops an identity encoder using metric
learning while actively minimizing mutual information between the identity and
motion encoders. This approach ensures that the motion representation is
dynamic and devoid of identity-specific details, significantly reducing the
need for labeled data. Additionally, the integration of a diffusion model with
a variance adapter allows for the generation of diverse and controllable facial
animations. This method not only demonstrates AniTalker's capability to create
detailed and realistic facial movements but also underscores its potential in
crafting dynamic avatars for real-world applications. Synthetic results can be
viewed at https://github.com/X-LANCE/AniTalker. |
AniTalker is a novel framework that generates realistic and diverse talking face animations from a single portrait image by decoupling identity and motion. |
Existing methods often fail to capture the complex dynamics of facial expressions and nonverbal cues, limiting their ability to create truly lifelike avatars. |
The framework utilizes a self-supervised learning approach with a universal motion encoder, metric learning for identity recognition, mutual information minimization for disentanglement, and a diffusion model with a variance adapter for generating diverse and controllable facial animations. |
AniTalker outperforms existing methods in generating realistic and expressive talking face animations, as evidenced by both quantitative metrics (PSNR, SSIM, LPIPS, CSIM) and qualitative assessments.
The framework demonstrates strong identity preservation capabilities, effectively separating motion from appearance even in cross-driven scenarios where the source and target identities differ.
The motion representation learned by AniTalker exhibits strong generalization ability, enabling animation of diverse facial structures including cartoons and sculptures. |
The current rendering network generates frames individually, which can lead to temporal inconsistencies, especially in complex backgrounds.
Extreme head poses can lead to blurring artifacts due to limitations of the warping technique. |
talking face, self-supervised learning, motion encoding, disentanglement, diffusion models |
2405.03025
Report |
Matten: Video Generation with Mamba-Attention |
Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, Lin Ma |
In this paper, we introduce Matten, a cutting-edge latent diffusion model
with Mamba-Attention architecture for video generation. With minimal
computational cost, Matten employs spatial-temporal attention for local video
content modeling and bidirectional Mamba for global video content modeling. Our
comprehensive experimental evaluation demonstrates that Matten performs
competitively with current Transformer-based and GAN-based models on
benchmarks, achieving superior FVD scores and efficiency.
Additionally, we observe a direct positive correlation between the complexity
of our designed model and the improvement in video quality, indicating the
excellent scalability of Matten. |
This paper introduces Matten, a novel latent diffusion model for video generation that leverages the Mamba-Attention architecture for efficient and high-quality video synthesis. |
The development of efficient and effective video generation models is crucial due to the increasing demand for high-quality video content in various applications. Existing methods often suffer from high computational costs or limitations in capturing complex spatio-temporal dynamics in videos. |
The study explored four variants of the Matten model, each employing different combinations of Mamba and attention mechanisms for capturing local and global spatio-temporal relationships in videos. The models were trained and evaluated on four benchmark datasets: FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. |
Matten achieves competitive FVD scores compared to state-of-the-art video generation models, demonstrating its effectiveness in generating high-quality videos.
The study found that combining global Mamba scans for temporal modeling with attention mechanisms for local spatio-temporal modeling yielded the best performance.
Matten exhibits good scalability, with larger model sizes leading to improved video generation quality. |
The lack of pre-trained Mamba-based image generation models necessitates training Matten from scratch, potentially limiting its initial performance.
Further research can explore incorporating pre-trained Mamba models and advanced techniques like distillation to enhance Matten's efficiency and performance. |
video generation, diffusion models, mamba, attention mechanism, spatio-temporal modeling |
2405.03008
Report |
DVMSR: Distillated Vision Mamba for Efficient Super-Resolution |
Xiaoyan Lei, Wenlong Zhang, Weifeng Cao |
Efficient Image Super-Resolution (SR) aims to accelerate SR network inference
by minimizing computational complexity and network parameters while preserving
performance. Existing state-of-the-art Efficient Image Super-Resolution methods
are based on convolutional neural networks. Few attempts have been made with
Mamba to harness its long-range modeling capability and efficient computational
complexity, which have shown impressive performance on high-level vision tasks.
In this paper, we propose DVMSR, a novel lightweight Image SR network that
incorporates Vision Mamba and a distillation strategy. The network of DVMSR
consists of three modules: feature extraction convolution, multiple stacked
Residual State Space Blocks (RSSBs), and a reconstruction module. Specifically,
the deep feature extraction module is composed of several residual state space
blocks (RSSB), each of which has several Vision Mamba Modules (ViMM) together
with a residual connection. To improve efficiency while maintaining
comparable performance, we apply a distillation strategy to the Vision Mamba
network. Specifically, we leverage the rich
representation knowledge of the teacher network as additional supervision for the
output of lightweight student networks. Extensive experiments have demonstrated
that our proposed DVMSR can outperform state-of-the-art efficient SR methods in
terms of model parameters while maintaining the performance of both PSNR and
SSIM. The source code is available at https://github.com/nathan66666/DVMSR.git |
This paper introduces DVMSR, a lightweight image super-resolution network leveraging Vision Mamba and a knowledge distillation strategy to achieve efficient inference and maintain high performance. |
Efficient image super-resolution aims to improve image quality with minimal computational cost and parameter usage, which is crucial for applications on resource-constrained devices. This work explores the potential of Mamba networks for efficient SR. |
DVMSR consists of feature extraction, multiple Residual State Space Blocks (RSSBs) with Vision Mamba Modules, and a reconstruction module. A distillation strategy is employed where a larger, pre-trained Mamba network guides the learning of the smaller DVMSR. |
DVMSR outperforms state-of-the-art efficient SR methods in terms of parameter count while achieving comparable or even better PSNR and SSIM scores.
The use of Vision Mamba modules enables long-range dependency modeling, leading to improved image details in the reconstruction process.
The distillation strategy effectively transfers knowledge from the teacher network to DVMSR, further enhancing its performance. |
The current study primarily focuses on the final model architecture without extensive exploration of parameter optimization.
Further research is needed to investigate the balance point between teacher and student model performance in the distillation process. |
image super-resolution, efficient deep learning, vision mamba, state space models, knowledge distillation |
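The distillation strategy in the DVMSR report above uses the teacher's output as additional supervision for the student. A minimal sketch of such an objective follows, assuming an L1 reconstruction term plus an L1 teacher-matching term weighted by a hypothetical factor `alpha`; the report does not state the loss form or the weighting, so both are illustrative choices.

```python
def l1(a, b):
    """Mean absolute error between two equally sized flat signals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Ground-truth reconstruction loss plus a teacher-matching term.

    alpha is a hypothetical weight on the teacher supervision.
    """
    return l1(student_out, target) + alpha * l1(student_out, teacher_out)


# Toy flattened pixel values (assumption): ground truth, teacher, student.
target = [0.0, 1.0, 2.0, 3.0]
teacher = [0.1, 1.0, 2.0, 2.9]
student = [0.5, 1.0, 2.0, 2.5]

loss = distillation_loss(student, teacher, target)
print(loss)
```

Setting `alpha=0` recovers plain supervised training, which makes it easy to compare runs with and without the teacher term.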
2405.02982
Report |
Paintings and Drawings Aesthetics Assessment with Rich Attributes for Various Artistic Categories |
Xin Jin, Qianqian Qiao, Yi Lu, Shan Gao, Heng Huang, Guangdong Li |
Image aesthetic evaluation is a highly prominent research domain in the field
of computer vision. In recent years, there has been a proliferation of datasets
and corresponding evaluation methodologies for assessing the aesthetic quality
of photographic works, leading to the establishment of a relatively mature
research environment. However, in contrast to the extensive research in
photographic aesthetics, the field of aesthetic evaluation for paintings and
drawings has received limited attention until the introduction of the BAID dataset
in March 2023. This dataset solely comprises overall scores for high-quality
artistic images. Our research marks the pioneering introduction of a
multi-attribute, multi-category dataset specifically tailored to the field of
painting: Aesthetics of Paintings and Drawings Dataset (APDD). The construction
of APDD received active participation from 28 professional artists worldwide,
along with dozens of students specializing in the field of art. This dataset
encompasses 24 distinct artistic categories and 10 different aesthetic
attributes. Each image in APDD has been evaluated by six professionally trained
experts in the field of art, including assessments for both total aesthetic
scores and aesthetic attribute scores. The final APDD dataset comprises a total
of 4985 images, with an annotation count exceeding 31100 entries. Concurrently,
we propose an innovative approach: Art Assessment Network for Specific Painting
Styles (AANSPS), designed for the assessment of aesthetic attributes in
mixed-attribute art datasets. Through this research, our goal is to catalyze
advancements in the field of aesthetic evaluation for paintings and drawings,
while enriching the available resources and methodologies for its further
development and application. |
This paper introduces APDD, the first multi-attribute, multi-category dataset for aesthetic evaluation of paintings and drawings, addressing the limitations of existing datasets that primarily focus on photographic images or lack attribute annotations. |
The development of aesthetic evaluation models for paintings and drawings is hampered by the lack of comprehensive datasets that consider diverse artistic categories, styles, and aesthetic attributes. APDD fills this gap and enables research on more nuanced and interpretable aesthetic assessment. |
A team of 28 professional artists and students constructed APDD by: 1) Defining 24 artistic categories based on painting type, style, and subject matter; 2) Identifying 10 relevant aesthetic attributes; 3) Collecting 4,985 images from various sources; 4) Developing a scoring system and criteria; 5) Annotating images for overall aesthetic score and attribute scores with at least 6 evaluations per image. |
APDD is the first multi-attribute, multi-category dataset for painting aesthetic evaluation, encompassing 24 categories, 10 attributes, and over 31,100 annotations.
The paper proposes AANSPS, a novel network for assessing both total and attribute-specific aesthetic scores in paintings, outperforming existing methods on APDD.
The research provides a clear framework for considering aesthetic components in paintings, classifying artistic categories, and defining scoring criteria for aesthetic attributes. |
The current categorization and attributes in APDD, while extensive, are not exhaustive and can be expanded in future work to encompass the full breadth of painting styles and aesthetic qualities.
Future work will focus on increasing the size of APDD, adding annotations for more attributes, and incorporating detailed language comments to enhance score interpretability. |
computer vision, computational aesthetics, image aesthetic assessment, painting dataset, deep learning |
2405.02945
Report |
Invertible Residual Rescaling Models |
Jinmin Li, Tao Dai, Yaohua Zha, Yilu Luo, Longfei Lu, Bin Chen, Zhi Wang, Shu-Tao Xia, Jingyun Zhang |
Invertible Rescaling Networks (IRNs) and their variants have witnessed
remarkable achievements in various image processing tasks like image rescaling.
However, we observe that IRNs with deeper networks are difficult to train, thus
hindering the representational ability of IRNs. To address this issue, we
propose Invertible Residual Rescaling Models (IRRM) for image rescaling by
learning a bijection between a high-resolution image and its low-resolution
counterpart with a specific distribution. Specifically, we propose IRRM to
build a deep network, which contains several Residual Downscaling Modules
(RDMs) with long skip connections. Each RDM consists of several Invertible
Residual Blocks (IRBs) with short connections. In this way, RDM allows rich
low-frequency information to be bypassed by skip connections and forces models
to focus on extracting high-frequency information from the image. Extensive
experiments show that our IRRM performs significantly better than other
state-of-the-art methods with far fewer parameters and lower complexity.
In particular, IRRM achieves PSNR gains of at least 0.3 dB over both
HCFlow and IRN in x4 rescaling while using only 60% of the parameters and 50% of the
FLOPs. The code will be available at https://github.com/THU-Kingmin/IRRM. |
This paper proposes Invertible Residual Rescaling Models (IRRM) for highly accurate image rescaling by learning a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. |
Existing IRNs face training difficulties with deep networks, hindering their representational ability, and previous methods struggle with high-frequency information recovery. |
IRRM employs Residual Downscaling Modules (RDMs) with long skip connections to facilitate training and focus on high-frequency information. Each RDM comprises Invertible Residual Blocks (IRBs) with short skip connections to enhance non-linear representation. |
IRRM significantly outperforms state-of-the-art methods in PSNR and SSIM with fewer parameters and lower complexity.
The residual connections in IRRM enhance model extensibility, enabling stable training with deeper networks.
IRRM exhibits robustness to variations in the sampled latent variable 'z', ensuring accurate detail preservation. |
The paper primarily focuses on image rescaling and may not directly generalize to other image processing tasks.
Further investigation into alternative loss functions or network architectures within IRRM could potentially yield additional performance improvements. |
image rescaling, invertible neural networks, residual learning, deep learning, image processing |
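The bijection that the IRRM report above relies on can be illustrated with a generic additive coupling block, a standard way to build exactly invertible residual-style layers. This is a sketch of that general technique, not the paper's Invertible Residual Block: the sub-networks `f` and `g` are hypothetical stand-ins, and the report does not specify the real block's internals.

```python
def f(x):
    """Hypothetical sub-network; it does not need to be invertible itself."""
    return [0.5 * v + 1.0 for v in x]


def g(x):
    """Second hypothetical sub-network, deliberately non-linear."""
    return [v * v for v in x]


def forward(x1, x2):
    """Additive coupling: each half is shifted by a function of the other."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Exact inverse: undo the two shifts in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


# Toy input split into two halves (assumption).
x1, x2 = [1.0, -2.0], [0.5, 3.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(r1, r2)
```

The input passes through each step as an identity shortcut plus a learned shift, which is why the inverse only needs subtraction and never has to invert `f` or `g`.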
2405.02859
Report |
MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior |
Honghua Chen, Chen Change Loy, Xingang Pan |
Despite the emergence of successful NeRF inpainting methods built upon
explicit RGB and depth 2D inpainting supervisions, these methods are inherently
constrained by the capabilities of their underlying 2D inpainters. This is due
to two key reasons: (i) independently inpainting constituent images results in
view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure
high-quality geometry completion and alignment with inpainted RGB images.
To overcome these limitations, we propose a novel approach called MVIP-NeRF
that harnesses the potential of diffusion priors for NeRF inpainting,
addressing both appearance and geometry aspects. MVIP-NeRF performs joint
inpainting across multiple views to reach a consistent solution, which is
achieved via an iterative optimization process based on Score Distillation
Sampling (SDS). Apart from recovering the rendered RGB images, we also extract
normal maps as a geometric representation and define a normal SDS loss that
motivates accurate geometry inpainting and alignment with the appearance.
Additionally, we formulate a multi-view SDS score function to distill
generative priors simultaneously from different view images, ensuring
consistent visual completion when dealing with large view variations. Our
experimental results show better appearance and geometry recovery than previous
NeRF inpainting methods. |
Presents MVIP-NeRF, a novel method for multiview-consistent inpainting on NeRF scenes using diffusion priors for appearance and geometry completion. |
Existing NeRF inpainting methods depend on explicit 2D inpainting, leading to inconsistencies and inaccurate geometry. MVIP-NeRF overcomes these limitations by leveraging diffusion priors for joint multiview inpainting. |
Uses a masked NeRF training scheme with an appearance SDS loss for RGB images and a normal SDS loss for geometry, both guided by diffusion priors. Introduces multi-view score distillation for consistency in large view variations. |
Achieves better appearance and geometry recovery compared to existing NeRF inpainting techniques on two real-world datasets.
Demonstrates the effectiveness of appearance and geometry diffusion priors over using explicit 2D inpainting results.
Shows the benefit of multi-view score distillation in improving consistency for scenes with large view changes. |
Efficiency is impacted by the iterative detail recovery process using diffusion priors.
Requires tuning of hyper-parameters related to diffusion priors (e.g., CFGs). |
nerf, inpainting, diffusion models, score distillation sampling, multiview consistency |
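For reference, the generic Score Distillation Sampling gradient that the MVIP-NeRF report above builds on is commonly written as below. This is the standard single-view form from the SDS literature, not the paper's normal-map or multi-view variants: $\theta$ denotes the NeRF parameters, $x$ a rendered image, $y$ the conditioning prompt, $\hat{\epsilon}_\phi$ the pretrained diffusion model's noise prediction, and $w(t)$ a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)
```

Each optimization step requires a forward pass of the diffusion model on a noised rendering, which is consistent with the efficiency limitation noted in the report.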
2405.02844
Report |
SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion |
Ziyun Qian, Zeyu Xiao, Zhenyi Wu, Dingkang Yang, Mingcheng Li, Shunli Wang, Shuaibing Wang, Dongliang Kou, Lihua Zhang |
Motion style transfer is a significant research direction in multimedia
applications. It enables the rapid switching of different styles of the same
motion for virtual digital humans, thus vastly increasing the diversity and
realism of movements. It is widely applied in multimedia scenarios such as
movies, games, and the Metaverse. However, most of the current work in this
field adopts GANs, which may lead to instability and convergence issues,
making the final generated motion sequence somewhat chaotic and unable to
reflect a highly realistic and natural style. To address these problems, we
consider style motion as a condition and propose the Style Motion Conditioned
Diffusion (SMCD) framework for the first time, which can more comprehensively
learn the style features of motion. Moreover, we apply the Mamba model for the
first time in the motion style transfer field, introducing the Motion Style
Mamba (MSM) module to handle longer motion sequences. Thirdly, aiming at the
SMCD framework, we propose Diffusion-based Content Consistency Loss and Content
Consistency Loss to assist the overall framework's training. Finally, we
conduct extensive experiments. The results reveal that our method surpasses
state-of-the-art methods in both qualitative and quantitative comparisons,
capable of generating more realistic motion sequences. |
This paper introduces the Style Motion Conditioned Diffusion (SMCD) framework, a novel approach for motion style transfer that utilizes diffusion models with style motion as a condition, aiming to enhance the realism and naturalness of generated motion sequences. |
Existing GAN-based methods for motion style transfer suffer from instability and convergence issues, hindering the generation of high-fidelity motion sequences. The SMCD framework addresses these limitations by leveraging the stability and convergence benefits of diffusion models. |
The SMCD framework utilizes a diffusion model with style motion as a condition to learn motion features and variations comprehensively. It also incorporates a Motion Style Mamba (MSM) module, inspired by the Mamba model, to capture temporal information and preserve long-term dependencies within motion sequences. Additionally, Diffusion-based Content Consistency Loss and Diffusion-based Style Consistency Loss functions are introduced to constrain the content and style of generated motions. |
SMCD generates more realistic motion sequences compared to state-of-the-art methods, as demonstrated by visual comparisons and quantitative metrics like FID, KID, and Diversity.
The framework exhibits strong generalizability, effectively transferring styles to unseen motion categories.
Ablation studies confirm the importance of each component within the SMCD framework, including the MSM module and the proposed loss functions, in achieving superior performance. |
The definition of 'style' in motion style transfer remains an open question and requires further exploration.
Future research can focus on expanding the application of diffusion-based methods in motion style transfer to further enhance performance. |
motion style transfer, diffusion models, motion generation, mamba model, multimedia applications |
2405.02843
Report |
Residual-Conditioned Optimal Transport: Towards Structure-Preserving Unpaired and Paired Image Restoration |
Xiaole Tang, Xin Hu, Xiang Gu, Jian Sun |
Deep learning-based image restoration methods generally struggle with
faithfully preserving the structures of the original image. In this work, we
propose a novel Residual-Conditioned Optimal Transport (RCOT) approach, which
models image restoration as an optimal transport (OT) problem for both unpaired
and paired settings, introducing the transport residual as a unique
degradation-specific cue for both the transport cost and the transport map.
Specifically, we first formalize a Fourier residual-guided OT objective by
incorporating the degradation-specific information of the residual into the
transport cost. We further design the transport map as a two-pass RCOT map that
comprises a base model and a refinement process, in which the transport
residual is computed by the base model in the first pass and then encoded as a
degradation-specific embedding to condition the second-pass restoration. By
duality, the RCOT problem is transformed into a minimax optimization problem,
which can be solved by adversarially training neural networks. Extensive
experiments on multiple restoration tasks show that RCOT achieves competitive
performance in terms of both distortion measures and perceptual quality,
restoring images with more faithful structures as compared with
state-of-the-art methods. |
This paper proposes Residual-Conditioned Optimal Transport (RCOT), modeling image restoration as an Optimal Transport problem. RCOT introduces a "transport residual" that captures degradation-specific information, improving structure preservation in restored images. |
Current deep learning-based image restoration methods struggle to balance removing distortions and preserving original image structures. This new approach aims to address this challenge by incorporating degradation-specific knowledge. |
The method leverages a two-pass process: 1) A base model generates an initial restored image and calculates the "transport residual." 2) The residual is encoded as an embedding, conditioning a second restoration pass for structure preservation. This is framed as a minimax optimization problem, solved by adversarially training neural networks. |
RCOT achieves competitive performance on benchmark datasets for denoising, deraining, dehazing, and super-resolution.
The method excels in preserving structural details compared to existing techniques.
Ablation studies confirm the contribution of the residual conditioning and the Fourier residual-guided OT objective. |
The handcrafted priors used to characterize the Fourier residual may not be optimal for all degradation types.
Future work aims to explore automatic and adaptive learning of these priors and extend RCOT to an all-in-one restoration framework. |
image restoration, optimal transport, structure preservation, deep learning, residual learning |
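The two-pass map in the RCOT report above (first pass computes the transport residual, which is then encoded as an embedding that conditions the second pass) can be sketched on a toy 1D signal. Everything concrete here is an assumption for illustration: the moving-average `base_model`, the residual defined as degraded minus first-pass output, the scalar `encode_residual` embedding, and the blending rule in `refine`. The paper's networks, its Fourier residual-guided cost, and the adversarial training are not reproduced.

```python
def base_model(degraded):
    """Hypothetical first-pass restorer: 3-tap moving average."""
    n = len(degraded)
    restored = []
    for i in range(n):
        window = degraded[max(0, i - 1):min(n, i + 2)]
        restored.append(sum(window) / len(window))
    return restored


def encode_residual(residual):
    """Hypothetical embedding: mean absolute residual as one scalar."""
    return sum(abs(r) for r in residual) / len(residual)


def refine(degraded, embedding):
    """Hypothetical second pass: smooth more when the embedding is larger."""
    smooth = base_model(degraded)
    strength = min(1.0, embedding)
    return [(1.0 - strength) * d + strength * s for d, s in zip(degraded, smooth)]


def two_pass_restore(degraded):
    """Pass 1 yields the residual; pass 2 is conditioned on its embedding."""
    first = base_model(degraded)
    residual = [d - f for d, f in zip(degraded, first)]
    embedding = encode_residual(residual)
    return refine(degraded, embedding), embedding


clean = [1.0] * 5
noisy = [1.0, 1.6, 0.4, 1.5, 0.6]

restored_clean, emb_clean = two_pass_restore(clean)
restored_noisy, emb_noisy = two_pass_restore(noisy)
print(emb_clean, emb_noisy)
```

In this toy version an undegraded input produces a zero residual and passes through unchanged, which mirrors the idea of the residual acting as a degradation-specific cue.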
2405.02793
Report |
ImageInWords: Unlocking Hyper-Detailed Image Descriptions |
Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut |
Despite the longstanding adage "an image is worth a thousand words," creating
accurate and hyper-detailed image descriptions for training Vision-Language
models remains challenging. Current datasets typically have web-scraped
descriptions that are short, low-granularity, and often contain details
unrelated to the visual content. As a result, models trained on such data
generate descriptions replete with missing information, visual inconsistencies,
and hallucinations. To address these issues, we introduce ImageInWords (IIW), a
carefully designed human-in-the-loop annotation framework for curating
hyper-detailed image descriptions and a new dataset resulting from this
process. We validate the framework through evaluations focused on the quality
of the dataset and its utility for fine-tuning with considerations for
readability, comprehensiveness, specificity, hallucinations, and
human-likeness. Our dataset significantly improves across these dimensions
compared to recently released datasets (+66%) and GPT-4V outputs (+48%).
Furthermore, models fine-tuned with IIW data excel by +31% against prior work
along the same human evaluation dimensions. Given our fine-tuned models, we
also evaluate text-to-image generation and vision-language reasoning. Our
model's descriptions can generate images closest to the original, as judged by
both automated and human metrics. We also find our model produces more
compositionally rich descriptions, outperforming the best baseline by up to 6%
on ARO, SVO-Probes, and Winoground datasets. |
The paper introduces ImageInWords (IIW), a novel human-in-the-loop framework for creating hyper-detailed, hallucination-free image descriptions, along with a new dataset produced using this method. |
Existing image description datasets are limited by short, noisy web-scraped captions, hindering the development of vision-language models capable of generating rich, accurate descriptions. |
IIW combines human annotations with machine-generated seeds in a sequential process. First, object-level descriptions are generated and refined. Then, these are used to create a detailed image description, iteratively improved by multiple annotators. |
Human evaluation shows IIW descriptions are significantly preferred over those from existing datasets (DCI, DOCCI) and GPT-4V outputs.
Models fine-tuned on IIW generate higher-quality descriptions, enabling better image reconstruction with T2I models.
IIW descriptions improve compositional reasoning accuracy on ARO, SVO-Probes, and Winoground, demonstrating their richness and detail. |
The seeded, sequential nature of the framework may introduce biases or inefficiencies depending on initial annotation quality.
Human side-by-side evaluations, while comprehensive, were limited to hundreds of samples due to their cost and complexity. |
image description, vision-language models, dataset, human-in-the-loop, compositional reasoning |
2405.02730
Report |
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers |
Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang |
Diffusion Transformers (DiTs) introduce the transformer architecture to
diffusion tasks for latent-space image generation. With an isotropic
architecture that chains a series of transformer blocks, DiTs demonstrate
competitive performance and good scalability; but meanwhile, the abandonment of
U-Net by DiTs and their following improvements is worth rethinking. To this
end, we conduct a simple toy experiment by comparing a U-Net architectured DiT
with an isotropic one. It turns out that the U-Net architecture only gains a
slight advantage amid the U-Net inductive bias, indicating potential
redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net
backbone features are low-frequency-dominated, we perform token downsampling on
the query-key-value tuple for self-attention and bring further improvements
despite a considerable amount of reduction in computation. Based on
self-attention with downsampled tokens, we propose a series of U-shaped DiTs
(U-DiTs) in the paper and conduct extensive experiments to demonstrate the
extraordinary performance of U-DiT models. The proposed U-DiT could outperform
DiT-XL/2 with only 1/6 of its computation cost. Codes are available at
https://github.com/YuchuanTian/U-DiT. |
The paper proposes U-shaped Diffusion Transformers (U-DiTs) for latent-space image generation by leveraging the U-Net architecture with downsampled self-attention to reduce redundancy and enhance performance. |
Recent Diffusion Transformer (DiT) models utilize isotropic architectures, neglecting the potential benefits of the U-Net structure commonly employed in diffusion models. |
The authors first investigate a naive U-Net-style DiT (DiT-UNet), then introduce token downsampling for self-attention to improve efficiency. They scale up this approach, incorporating cosine similarity attention, RoPE2D, depthwise convolution in FFN, and re-parametrization. |
U-DiTs significantly outperform isotropic DiTs, achieving comparable or superior performance with reduced computational costs.
U-DiT-B surpasses DiT-XL/2 in FID score with only 1/6th of its computation cost.
U-DiTs demonstrate consistent performance improvements with extended training steps (up to 1 million). |
Further exploration of larger U-DiT models and extended training iterations is limited by computational resources.
Future work may involve exploring the application of U-DiTs in other generative tasks beyond image synthesis. |
diffusion models, vision transformers, u-net, image generation, latent space |
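The token-downsampling step in the U-DiT method summary can be illustrated with a minimal NumPy sketch. It is one possible reading, not the authors' implementation: it assumes the feature map is split into four interleaved sub-grids (a 2x downsample), single-head attention runs inside each sub-grid, and the results are scattered back. The function name `downsampled_self_attention` and the projection matrices `wq`, `wk`, `wv` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def downsampled_self_attention(x, wq, wk, wv):
    """Self-attention on 2x-downsampled token grids (illustrative assumption).

    x: (H, W, C) feature map with even H and W.
    wq, wk, wv: (C, C) projection matrices (hypothetical names).
    Each of the four interleaved sub-grids attends only within itself, so
    every attention matrix has (H*W/4)^2 entries instead of (H*W)^2.
    """
    H, W, C = x.shape
    out = np.empty((H, W, C), dtype=float)
    for di in (0, 1):
        for dj in (0, 1):
            sub = x[di::2, dj::2].reshape(-1, C)        # (H*W/4, C) tokens
            q, k, v = sub @ wq, sub @ wk, sub @ wv
            attn = softmax(q @ k.T / np.sqrt(C))        # (H*W/4, H*W/4)
            out[di::2, dj::2] = (attn @ v).reshape(H // 2, W // 2, C)
    return out
```

Under this assumption the four attention maps together hold a quarter of the entries of full self-attention over all H*W tokens, which is where the computation saving in the sketch comes from.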
2405.02700
Report |
Towards a Scalable Identification of Novel Modes in Generative Models |
Jingwei Zhang, Mohammad Jalali, Cheuk Ting Li, Farzan Farnia |
An interpretable comparison of generative models requires the identification
of sample types produced more frequently by each of the involved models. While
several quantitative scores have been proposed in the literature to rank
different generative models, such score-based evaluations do not reveal the
nuanced differences between the generative models in capturing various sample
types. In this work, we propose a method called Fourier-based Identification of
Novel Clusters (FINC) to identify modes produced by a generative model with a
higher frequency in comparison to a reference distribution. FINC provides a
scalable stochastic algorithm based on random Fourier features to estimate the
eigenspace of kernel covariance matrices of two generative models and utilize
the principal eigendirections to detect the sample types present more
dominantly in each model. We demonstrate the application of the FINC method to
standard computer vision datasets and generative model frameworks. Our
numerical results suggest the scalability and efficiency of the developed
Fourier-based method in highlighting the sample types captured with different
frequencies by widely-used generative models. |
The paper proposes FINC, a scalable algorithm for identifying and clustering novel sample types generated by a generative model with a higher frequency compared to a reference distribution. |
This addresses the limitations of score-based generative model evaluations, which fail to provide nuanced comparisons of how differently models capture various sample types. |
FINC uses random Fourier features to approximate the eigenspace of kernel covariance matrices of two generative models. It leverages principal eigendirections to detect dominant sample types in each model. |
FINC effectively identifies novel modes between real datasets (e.g., AFHQ vs. ImageNet-dogs) and between generative models (e.g., LDM vs. others).
Theoretical analysis shows FINC's scalability, requiring a logarithmic number of Fourier features relative to the data size.
Empirical evaluation demonstrates FINC's efficiency and accuracy on large-scale image datasets like ImageNet. |
The paper focuses on Gaussian kernels; exploring other kernels is left for future work.
Extending the framework to compare more than two generative models simultaneously is an interesting research direction. |
generative models, differential clustering, random fourier features, novelty detection, scalability |
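The FINC method summary (random Fourier features, kernel covariance matrices, principal eigendirections) can be illustrated with a small NumPy sketch. It assumes a Gaussian kernel with bandwidth `sigma` and takes the top eigenvector of the difference between the test and reference feature covariances. The function names, the weighting parameter `rho`, and the toy usage are assumptions for illustration, not the paper's code.

```python
import numpy as np

def rff_map(x, omegas):
    """Random Fourier features whose inner product approximates a Gaussian kernel."""
    proj = x @ omegas.T                                  # (n, r)
    r = omegas.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)

def novel_mode_scores(test, ref, sigma=1.0, r=256, rho=1.0, seed=0):
    """Score test samples along the top eigendirection of cov_test - rho * cov_ref.

    Frequencies are drawn from N(0, sigma^-2 I), which corresponds to the
    kernel exp(-||x - y||^2 / (2 sigma^2)). rho is a hypothetical weighting.
    """
    rng = np.random.default_rng(seed)
    omegas = rng.normal(scale=1.0 / sigma, size=(r, test.shape[1]))
    phi_t, phi_r = rff_map(test, omegas), rff_map(ref, omegas)
    cov_t = phi_t.T @ phi_t / len(test)                  # (2r, 2r)
    cov_r = phi_r.T @ phi_r / len(ref)
    eigvals, eigvecs = np.linalg.eigh(cov_t - rho * cov_r)
    top = eigvecs[:, -1]                                 # largest eigenvalue is last
    return eigvals[-1], phi_t @ top, omegas
```

Samples with a large absolute score lie in a region that the test distribution covers more heavily than the reference. The eigendecomposition is over a 2r x 2r matrix, so its cost depends on the number of features rather than on the number of samples.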
2405.02696
Report |
DiffuseTrace: A Transparent and Flexible Watermarking Scheme for Latent Diffusion Model |
Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu |
Latent Diffusion Models (LDMs) enable a wide range of applications but raise
ethical concerns regarding illegal utilization. Adding watermarks to generative
model outputs is a vital technique employed for copyright tracking and
mitigating potential risks associated with AI-generated content. However,
post-hoc watermarking techniques are susceptible to evasion. Existing
watermarking methods for LDMs can only embed fixed messages. Watermark message
alteration requires model retraining. The stability of the watermark is
influenced by model updates and iterations. Furthermore, the current
reconstruction-based watermark removal techniques utilizing variational
autoencoders (VAE) and diffusion models have the capability to remove a
significant portion of watermarks. Therefore, we propose a novel technique
called DiffuseTrace. The goal is to embed invisible watermarks in all generated
images for future detection semantically. The method establishes a unified
representation of the initial latent variables and the watermark information
through training an encoder-decoder model. The watermark information is
embedded into the initial latent variables through the encoder and integrated
into the sampling process. The watermark information is extracted by reversing
the diffusion process and utilizing the decoder. DiffuseTrace does not rely on
fine-tuning of the diffusion model components. The watermark is embedded into
the image space semantically without compromising image quality. The
encoder-decoder can be utilized as a plug-in in arbitrary diffusion models. We
validate through experiments the effectiveness and flexibility of DiffuseTrace.
DiffuseTrace holds an unprecedented advantage in combating the latest attacks
based on variational autoencoders and Diffusion Models. |
DiffuseTrace is a plug-in multi-bit watermarking module for latent diffusion models that protects copyright and enables semantic tracing of generated images. |
The illicit use of text-to-image models necessitates robust watermarking techniques for copyright protection, user tracing, and mitigating harmful content. |
DiffuseTrace embeds watermarks into the initial latent variables of the model, subtly influencing the sampling phase without post-processing. It utilizes an encoder-decoder model for watermark embedding and extraction, ensuring semantic consistency and image quality. |
DiffuseTrace exhibits superior robustness against various image processing techniques and state-of-the-art watermark removal attacks, including VAE and diffusion-based methods.
It maintains high image quality and semantic consistency, as evidenced by NIQE, PIQE, and CLIP score metrics.
DiffuseTrace is flexible, allowing for watermark message modification without retraining or fine-tuning the model, and generalizable across different diffusion model versions. |
The paper acknowledges the potential for bit errors at the edges of watermark regions due to diffusion inversion and image processing.
Future work may explore further optimization of error correction techniques to address this limitation. |
latent diffusion model, model watermarking, copyright protection, image generation, deep learning |
2405.02608
Report |
UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model |
Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, Denis Demandolx |
Traditional unsupervised optical flow methods are vulnerable to occlusions
and motion boundaries due to lack of object-level information. Therefore, we
propose UnSAMFlow, an unsupervised flow network that also leverages object
information from the latest foundation model Segment Anything Model (SAM). We
first include a self-supervised semantic augmentation module tailored to SAM
masks. We also analyze the poor gradient landscapes of traditional smoothness
losses and propose a new smoothness definition based on homography instead. A
simple yet effective mask feature module has also been added to further
aggregate features on the object level. With all these adaptations, our method
produces clear optical flow estimation with sharp boundaries around objects,
which outperforms state-of-the-art methods on both KITTI and Sintel datasets.
Our method also generalizes well across domains and runs very efficiently. |
This paper introduces UnSAMFlow, a novel unsupervised optical flow network that leverages object information from the Segment Anything Model (SAM) to enhance flow estimation accuracy, particularly around object boundaries and occlusion regions. |
Traditional unsupervised optical flow methods struggle with occlusions and motion boundaries due to their reliance on low-level information and lack of object-level understanding. This paper addresses this by integrating SAM, a powerful image segmentation model, into the flow estimation process. |
The paper proposes three key adaptations: (1) a self-supervised semantic augmentation module utilizing SAM masks, (2) a regional smoothness loss based on homography to enforce smooth motion within SAM segments, (3) a mask feature module to aggregate features from the same SAM mask for robustness. |
UnSAMFlow outperforms state-of-the-art unsupervised methods on both KITTI and Sintel benchmarks, achieving 7.83% test error on KITTI-2015.
The method produces clear optical flow estimations with sharp boundaries around objects, demonstrating the effectiveness of incorporating object-level information.
UnSAMFlow generalizes well across different datasets and runs efficiently. |
The performance of UnSAMFlow is dependent on the accuracy of SAM masks, which can be affected by factors like lighting conditions and motion blur.
The lack of semantic class information in SAM outputs presents a limitation, suggesting an area for future improvement. |
unsupervised optical flow, segment anything model (sam), semantic augmentation, homography smoothness loss, mask feature module |
2405.02568
Report |
ActiveNeuS: Active 3D Reconstruction using Neural Implicit Surface Uncertainty |
Hyunseo Kim, Hyeonseo Yang, Taekyung Kim, YoonSung Kim, Jin-Hwa Kim, Byoung-Tak Zhang |
Active learning in 3D scene reconstruction has been widely studied, as
selecting informative training views is critical for the reconstruction.
Recently, Neural Radiance Fields (NeRF) variants have shown performance
increases in active 3D reconstruction using image rendering or geometric
uncertainty. However, the simultaneous consideration of both uncertainties in
selecting informative views remains unexplored, while utilizing different types
of uncertainty can reduce the bias that arises in the early training stage with
sparse inputs. In this paper, we propose ActiveNeuS, which evaluates candidate
views considering both uncertainties. ActiveNeuS provides a way to accumulate
image rendering uncertainty while avoiding the bias that the estimated
densities can introduce. ActiveNeuS computes the neural implicit surface
uncertainty, providing the color uncertainty along with the surface
information. It efficiently handles the bias by using the surface information
and a grid, enabling the fast selection of diverse viewpoints. Our method
outperforms previous works on popular datasets, Blender and DTU, showing that
the views selected by ActiveNeuS significantly improve performance. |
Proposes ActiveNeuS, an active 3D reconstruction framework that improves next-best view selection by considering both geometric and image rendering uncertainty using a novel acquisition function. |
Existing methods for active 3D reconstruction using neural implicit representations typically consider only one type of uncertainty (color or density), leading to biased uncertainty integration and suboptimal view selection. |
Introduces 'neural implicit surface uncertainty' to measure color prediction confidence. Leverages a surface grid and uncertainty grid to efficiently integrate color entropy, prioritizing views with high uncertainty in regions of incomplete reconstruction. |
Outperforms previous methods in image rendering and mesh reconstruction on Blender and DTU datasets.
Selects more diverse viewpoints, leading to better coverage of the scene.
Significantly faster next-best view selection compared to ActiveNeRF. |
Does not address combining uncertainties from different networks (e.g., NeRF and NeuS) for scenes with backgrounds.
Future work includes applying ActiveNeuS to robotic active 3D reconstruction. |
active learning, 3d reconstruction, neural implicit surface, uncertainty estimation, next-best view |
2405.02386
Report |
Rip-NeRF: Anti-aliasing Radiance Fields with Ripmap-Encoded Platonic Solids |
Junchen Liu, Wenbo Hu, Zhuo Yang, Jianteng Chen, Guoliang Wang, Xiaoxue Chen, Yantong Cai, Huan-ang Gao, Hao Zhao |
Despite significant advancements in Neural Radiance Fields (NeRFs), the
renderings may still suffer from aliasing and blurring artifacts, since it
remains a fundamental challenge to effectively and efficiently characterize
anisotropic areas induced by the cone-casting procedure. This paper introduces
a Ripmap-Encoded Platonic Solid representation to precisely and efficiently
featurize 3D anisotropic areas, achieving high-fidelity anti-aliasing
renderings. Central to our approach are two key components: Platonic Solid
Projection and Ripmap encoding. The Platonic Solid Projection factorizes the 3D
space onto the non-parallel faces of a certain Platonic solid, such that the
anisotropic 3D areas can be projected onto planes with distinguishable
characterization. Meanwhile, each face of the Platonic solid is encoded by the
Ripmap encoding, which is constructed by anisotropically pre-filtering a
learnable feature grid, to enable featurizing the projected anisotropic areas
both precisely and efficiently by the anisotropic area-sampling. Extensive
experiments on both well-established synthetic datasets and a newly captured
real-world dataset demonstrate that our Rip-NeRF attains state-of-the-art
rendering quality, particularly excelling in the fine details of repetitive
structures and textures, while maintaining relatively swift training times. |
This paper presents Rip-NeRF, a novel method employing Ripmap-encoded Platonic solids for anti-aliasing in neural radiance fields. |
Existing NeRF methods struggle to effectively characterize anisotropic areas, leading to aliasing and blurring artifacts. This method aims to provide high-fidelity, anti-aliased renderings by accurately representing these areas. |
The method uses Platonic Solid Projection to factorize 3D space onto 2D planes, then employs Ripmap Encoding, an anisotropic area-sampling technique, to accurately featurize projected anisotropic areas on these planes. |
Rip-NeRF achieves state-of-the-art rendering quality on both synthetic and real-world datasets.
It excels in rendering fine details in challenging areas, like specular highlights and repetitive structures.
The method offers a flexible trade-off between quality and efficiency through the choice of Platonic solids. |
The representation faces challenges for unbounded scenes due to self-occlusion and space warping.
Future work could explore more advanced 3D to 2D mapping functions to address limitations in unbounded scenes. |
neural radiance fields, anti-aliasing, anisotropic area-sampling, platonic solid projection, ripmap encoding |
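To make the Ripmap idea in the Rip-NeRF summary concrete, here is a minimal NumPy sketch of a generic ripmap: a 2D feature grid pre-filtered independently along each axis by powers of two. It assumes a simple box filter and a dictionary layout keyed by the two reduction levels. The helper names are hypothetical, and the paper's actual filtering and sampling may differ.

```python
import numpy as np

def halve(grid, axis):
    """Average adjacent pairs along one axis (box pre-filter plus 2x downsample)."""
    n = grid.shape[axis] // 2
    even = np.take(grid, np.arange(0, 2 * n, 2), axis=axis)
    odd = np.take(grid, np.arange(1, 2 * n, 2), axis=axis)
    return 0.5 * (even + odd)

def build_ripmap(grid, levels):
    """Return {(i, j): grid reduced by 2**i along rows and 2**j along columns}.

    grid: (H, W, C) feature grid; H and W should be divisible by 2**levels.
    Unlike an isotropic mipmap, levels with i != j keep anisotropic footprints.
    """
    ripmap = {(0, 0): grid}
    for i in range(levels + 1):
        for j in range(levels + 1):
            if (i, j) == (0, 0):
                continue
            if j > 0:
                ripmap[(i, j)] = halve(ripmap[(i, j - 1)], axis=1)
            else:
                ripmap[(i, j)] = halve(ripmap[(i - 1, 0)], axis=0)
    return ripmap
```

A query with an elongated footprint, for example wide along columns but narrow along rows, would read from a level such as (0, 2) rather than from an isotropic level that blurs both axes equally.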
2405.02280
Report |
DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos |
Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki |
View-predictive generative models provide strong priors for lifting
object-centric images and videos into 3D and 4D through rendering and score
distillation objectives. A question then remains: what about lifting complete
multi-object dynamic scenes? There are two challenges in this direction: First,
rendering error gradients are often insufficient to recover fast object motion,
and second, view predictive generative models work much better for objects than
whole scenes, so score distillation objectives cannot currently be applied at
the scene level directly. We present DreamScene4D, the first approach to
generate 3D dynamic scenes of multiple objects from monocular videos via
360-degree novel view synthesis. Our key insight is a "decompose-recompose"
approach that factorizes the video scene into the background and object tracks,
while also factorizing object motion into 3 components: object-centric
deformation, object-to-world-frame transformation, and camera motion. Such
decomposition permits rendering error gradients and object view-predictive
models to recover object 3D completions and deformations while bounding box
tracks guide the large object movements in the scene. We show extensive results
on challenging DAVIS, Kubric, and self-captured videos with quantitative
comparisons and a user preference study. Besides 4D scene generation,
DreamScene4D obtains accurate 2D persistent point tracks by projecting the
inferred 3D trajectories to 2D. We will release our code and hope our work will
stimulate more research on fine-grained 4D understanding from videos. |
DreamScene4D is the first video-to-4D scene generation approach to produce realistic 4D scene representations from complex multi-object videos with large object motion. |
Existing video-to-4D methods struggle with multi-object scenes exhibiting fast motion, limiting their real-world applicability for tasks like robot perception and augmented reality. |
DreamScene4D employs a "decompose-recompose" strategy: 1) decompose the video into background and object tracks, 2) lift objects to 3D using Gaussian Splatting and Score Distillation Sampling, 3) factorize and optimize object motion (object-centric, object-to-world, camera), 4) recompose the scene using monocular depth guidance. |
Significantly outperforms state-of-the-art methods in generating 4D scenes from challenging DAVIS and Kubric videos, as well as self-captured videos with fast motion.
Shows superior performance in user preference studies, highlighting its ability to generate more realistic and consistent 4D representations.
Achieves accurate 2D persistent point tracking by projecting inferred 3D trajectories, even surpassing methods specifically trained for point tracking. |
SDS prior struggles with videos captured from steep elevation angles.
Scene composition can be suboptimal if rendered and estimated depths misalign.
Future work includes exploring data-driven approaches for video-to-4D generation to overcome limitations. |
video-to-4d, 4d scene generation, novel view synthesis, gaussian splatting, score distillation sampling |
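The motion factorization and the 2D point tracking described in the DreamScene4D summary can be sketched as a chain of transforms followed by a pinhole projection. This is an illustrative sketch under assumed conventions (4x4 homogeneous matrices, a 3x3 intrinsics matrix `K`, points in front of the camera). All names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def project_track(points_obj, deform, obj_to_world, world_to_cam, K):
    """Compose the three motion components and project to 2D pixel tracks.

    points_obj:   (N, 3) canonical points in the object frame.
    deform:       (T, N, 3) per-frame object-centric deformation offsets.
    obj_to_world: (T, 4, 4) object-to-world-frame transforms.
    world_to_cam: (T, 4, 4) camera motion as world-to-camera transforms.
    K:            (3, 3) pinhole intrinsics.
    Returns (T, N, 2) pixel coordinates.
    """
    T, N, _ = deform.shape
    tracks = np.zeros((T, N, 2))
    for t in range(T):
        p = points_obj + deform[t]                       # object-centric deformation
        p_h = np.hstack([p, np.ones((N, 1))])            # homogeneous coordinates
        p_cam = (world_to_cam[t] @ obj_to_world[t] @ p_h.T).T[:, :3]
        uvw = (K @ p_cam.T).T
        tracks[t] = uvw[:, :2] / uvw[:, 2:3]             # perspective divide
    return tracks
```

Keeping the three components separate means that a large object displacement can be handled by `obj_to_world` alone, while `deform` only has to explain small residual motion in the object frame.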
2405.02246
Report |
What matters when building vision-language models? |
Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh |
The growing interest in vision-language models (VLMs) has been driven by
improvements in large language models and vision transformers. Despite the
abundance of literature on this subject, we observe that critical decisions
regarding the design of VLMs are often not justified. We argue that these
unsupported decisions impede progress in the field by making it difficult to
identify which choices improve model performance. To address this issue, we
conduct extensive experiments around pre-trained models, architecture choice,
data, and training methods. Our consolidation of findings includes the
development of Idefics2, an efficient foundational VLM of 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across
various multimodal benchmarks, and is often on par with models four times its
size. We release the model (base, instructed, and chat) along with the datasets
created for its training. |
This paper investigates critical design choices in Vision-Language Models (VLMs) through extensive experiments and introduces Idefics2, an efficient 8B parameter VLM achieving state-of-the-art performance in its size category. |
Unsupported design decisions in VLMs hinder progress by obscuring performance drivers. This work aims to address this by systematically comparing design choices and their impact on performance, efficiency, and training stability. |
The authors conduct ablations on various VLM components including pre-trained models, architectures, and data. They analyze the impact of these choices on performance across benchmarks like VQAv2, TextVQA, OKVQA, and COCO. |
The quality of the language model backbone has a greater impact than the vision backbone on the final VLM performance.
While cross-attention architectures excel with frozen backbones, fully autoregressive architectures outperform them when backbones are trained (using techniques like LoRA for stability).
Reducing visual tokens with learned pooling and adapting pre-trained vision encoders to preserve aspect ratio/resolution improve efficiency without sacrificing performance. |
The lack of a large, well-trained, open-source vision encoder is identified as a limitation in the field.
Future work includes investigating more nuanced evaluation metrics for open-ended visual question answering tasks to better reflect model capabilities. |
vision-language models, multimodal learning, model efficiency, benchmarking, open-source models |
2405.02066
Report |
WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights |
Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, Sangpil Kim |
The advances in the Neural Radiance Fields (NeRF) research offer extensive
applications in diverse domains, but protecting their copyrights has not yet
been researched in depth. Recently, NeRF watermarking has been considered one
of the pivotal solutions for safely deploying NeRF-based 3D representations.
However, existing methods are designed to apply only to implicit or explicit
NeRF representations. In this work, we introduce an innovative watermarking
method that can be employed in both representations of NeRF. This is achieved
by fine-tuning NeRF to embed binary messages in the rendering process. In
detail, we propose utilizing the discrete wavelet transform in the NeRF space
for watermarking. Furthermore, we adopt a deferred back-propagation technique
and introduce a combination with the patch-wise loss to improve rendering
quality and bit accuracy with minimum trade-offs. We evaluate our method in
three different aspects: capacity, invisibility, and robustness of the embedded
watermarks in the 2D-rendered images. Our method achieves state-of-the-art
performance with faster training speed over the compared state-of-the-art
methods. |
This paper introduces a novel watermarking method applicable to both implicit and explicit Neural Radiance Fields (NeRF) representations. |
Protecting the copyright of NeRF-based 3D representations is crucial with their increasing use in various applications, and existing methods are limited to a single representation type. |
The method fine-tunes a pre-trained NeRF model to embed binary messages in the rendering process, utilizing discrete wavelet transform in the NeRF space and a deferred back-propagation technique with patch-wise loss. |
The method achieves state-of-the-art performance in bit accuracy, exceeding previous methods, especially for longer message lengths.
It maintains a good balance between watermark invisibility and reconstruction quality, evidenced by high PSNR, SSIM, and low LPIPS scores.
The method exhibits robustness against various image distortions, including cropping, brightness changes, and JPEG compression. |
Training the watermark decoder is time-consuming.
The current implementation only allows for a single message per model, requiring retraining for each new message. |
neural radiance fields, nerf, watermarking, copyright protection, discrete wavelet transform |
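The WateRF summary relies on the discrete wavelet transform. As a reminder of what that transform computes, here is a minimal single-level 2D Haar DWT and its inverse in NumPy. It only illustrates the transform on a rendered image array; which sub-band the watermark decoder reads, and the wavelet actually used, are not stated in the summary and are left open. Sub-band naming conventions also vary between libraries.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level orthonormal 2D Haar DWT of an (H, W) array with even H, W.

    Returns (ll, lh, hl, hh): the low-pass approximation and three detail
    sub-bands, each of shape (H/2, W/2).
    """
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2      # differences between neighbouring columns
    hl = (a + b - c - d) / 2      # differences between neighbouring rows
    hh = (a - b - c + d) / 2      # diagonal differences
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; reconstructs the (H, W) array exactly."""
    h, w = ll.shape
    img = np.empty((2 * h, 2 * w), dtype=float)
    img[0::2, 0::2] = (ll + lh + hl + hh) / 2
    img[0::2, 1::2] = (ll - lh + hl - hh) / 2
    img[1::2, 0::2] = (ll + lh - hl - hh) / 2
    img[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return img
```

Because this transform is orthonormal, a small perturbation of the sub-band coefficients changes the reconstructed image by the same L2 amount, which makes the size of a coefficient change easy to reason about.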
2405.02005
Report |
HoloGS: Instant Depth-based 3D Gaussian Splatting with Microsoft HoloLens 2 |
Miriam Jäger, Theodor Kapler, Michael Feßenbecker, Felix Birkelbach, Markus Hillemann, Boris Jutzi |
In the fields of photogrammetry, computer vision and computer graphics, the
task of neural 3D scene reconstruction has led to the exploration of various
techniques. Among these, 3D Gaussian Splatting stands out for its explicit
representation of scenes using 3D Gaussians, making it appealing for tasks like
3D point cloud extraction and surface reconstruction. Motivated by its
potential, we address the domain of 3D scene reconstruction, aiming to leverage
the capabilities of the Microsoft HoloLens 2 for instant 3D Gaussian Splatting.
We present HoloGS, a novel workflow utilizing HoloLens sensor data, which
bypasses the need for pre-processing steps like Structure from Motion by
instantly accessing the required input data, i.e., the images, camera poses and
the point cloud from depth sensing. We provide comprehensive investigations,
including the training process and the rendering quality, assessed through the
Peak Signal-to-Noise Ratio, and the geometric 3D accuracy of the densified
point cloud from Gaussian centers, measured by Chamfer Distance. We evaluate
our approach on two self-captured scenes: An outdoor scene of a cultural
heritage statue and an indoor scene of a fine-structured plant. Our results
show that the HoloLens data, including RGB images, corresponding camera poses,
and depth sensing based point clouds to initialize the Gaussians, are suitable
as input for 3D Gaussian Splatting. |
This paper introduces HoloGS, a novel workflow for instant 3D scene reconstruction using 3D Gaussian Splatting with data directly acquired from Microsoft HoloLens 2 sensors, eliminating the need for pre-processing steps like Structure from Motion. |
This work is important because it leverages the real-time capabilities of HoloLens 2 for 3D Gaussian Splatting, potentially enabling instant 3D scene reconstruction and point cloud extraction without time-consuming pre-processing. |
HoloGS utilizes HoloLens 2 sensor data, including RGB images, camera poses, and depth maps, to initialize and optimize 3D Gaussians for scene representation. The authors evaluate their approach by analyzing rendering quality, PSNR, and geometric accuracy of the densified point cloud extracted from Gaussian centers. |
HoloGS with internal HoloLens data leads to relatively smooth convergence of 3D Gaussian Splatting, enabling the rendering of novel views that reasonably reflect the scene's geometry and appearance.
The quality of results obtained using internal HoloLens data is lower compared to using pre-processed SfM data, indicating potential inaccuracies in HoloLens camera poses.
Densified point cloud extraction from Gaussian centers provides a promising avenue for refilling sparse input point clouds, but requires further post-processing to address limitations like floater artifacts and non-uniform point density on low-textured surfaces. |
The accuracy of HoloGS is limited by the precision of HoloLens camera poses, leading to blurriness and artifacts in rendering and point cloud extraction.
Extracting the densified point cloud solely from Gaussian centers has limitations, such as floater artifacts and non-uniform point density on homogeneous surfaces, necessitating further post-processing and refinement. |
3d gaussian splatting, microsoft hololens 2, depth sensor, point cloud, 3d reconstruction |
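The two evaluation metrics named in the HoloGS summary, Peak Signal-to-Noise Ratio and Chamfer Distance, can be written in a few lines of NumPy. This is a minimal sketch under common conventions: images scaled to [0, `max_val`], and a symmetric Chamfer distance built from mean unsquared nearest-neighbour distances. Other works use squared distances or different normalisation, and the paper's exact variant is not given here.

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    mse = np.mean((np.asarray(img, dtype=float) - np.asarray(ref, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    Brute-force O(N * M) pairwise distances; fine for small clouds, while a
    KD-tree would be the usual choice for large ones.
    """
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)   # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Higher PSNR indicates renderings closer to the reference images, and a lower Chamfer distance indicates that the extracted Gaussian centers lie closer to the reference point cloud.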
2405.01825
Report |
Improving Concept Alignment in Vision-Language Concept Bottleneck Models |
Nithish Muthuchamy Selvaraj, Xiaobao Guo, Bingquan Shen, Adams Wai-Kin Kong, Alex Kot |
Concept Bottleneck Models (CBM) map the input image to a high-level
human-understandable concept space and then make class predictions based on
these concepts. Recent approaches automate the construction of CBM by prompting
Large Language Models (LLM) to generate text concepts and then use Vision
Language Models (VLM) to obtain concept scores to train a CBM. However, it is
desired to build CBMs with concepts defined by human experts instead of LLM
generated concepts to make them more trustworthy. In this work, we take a
closer inspection on the faithfulness of VLM concept scores for such
expert-defined concepts in domains like fine-grain bird species classification
and animal classification. Our investigations reveal that frozen VLMs, like
CLIP, struggle to correctly associate a concept to the corresponding visual
input despite achieving a high classification performance. To address this, we
propose a novel Contrastive Semi-Supervised (CSS) learning method which uses a
few labeled concept examples to improve concept alignment (activate truthful
visual concepts) in CLIP model. Extensive experiments on three benchmark
datasets show that our approach substantially increases the concept accuracy
and classification accuracy, yet requires only a fraction of the
human-annotated concept labels. To further improve the classification
performance, we also introduce a new class-level intervention procedure for
fine-grain classification problems that identifies the confounding classes and
intervenes their concept space to reduce errors. |
This paper investigates the faithfulness of Vision-Language Concept Bottleneck Models (VL-CBM) for expert-defined concepts and proposes a Contrastive Semi-Supervised (CSS) learning approach to improve their concept alignment using a limited number of concept labels. |
While VL-CBMs offer interpretability by leveraging human-understandable concepts, existing methods often exhibit poor concept alignment, hindering their reliability and faithfulness. This work addresses this issue to ensure that VL-CBMs accurately associate visual concepts with the corresponding image regions. |
The authors introduce a CSS learning method that combines contrastive learning in the concept space with semi-supervised learning from a few labeled concept examples. This encourages consistent concept scores within classes and discriminates between classes while aligning predictions with ground truth. |
CSS substantially increases concept accuracy on CUB (+39.1%), RIVAL (+18.63%), and AwA2 (+31.11%) datasets using only a small percentage of human-annotated concept labels.
CSS enhances classification accuracy, surpassing black-box models on CUB and approaching their performance on other datasets.
A proposed class-level intervention procedure effectively reduces errors for confounding classes in fine-grain classification, further improving overall performance. |
VL-CBMs may struggle with ineffable concepts that are difficult to express in language.
The assumption that all salient concepts are known beforehand may not hold for all tasks, limiting their applicability in such cases. |
concept bottleneck model, interpretability, semi-supervised learning, vision-language models, concept alignment |
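The two-stage pipeline described in the record above (image to concept scores, then concept scores to class prediction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for VLM features, the concept names in the comments are made up for illustration, and the helper names `concept_scores` and `predict_class` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concept_scores(image_emb, concept_embs):
    """Score each concept by the similarity of its text embedding to the image embedding."""
    return [cosine(image_emb, c) for c in concept_embs]

def predict_class(scores, class_weights):
    """Linear layer on top of the concept bottleneck; returns the argmax class index."""
    logits = [sum(w * s for w, s in zip(row, scores)) for row in class_weights]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy stand-ins for VLM embeddings (assumed values, not real CLIP outputs).
concept_embs = [[1.0, 0.0, 0.0],   # hypothetical concept 0, e.g. "red beak"
                [0.0, 1.0, 0.0]]   # hypothetical concept 1, e.g. "striped wing"
image_emb = [0.9, 0.1, 0.0]
class_weights = [[1.0, 0.0],       # class 0 relies on concept 0
                 [0.0, 1.0]]       # class 1 relies on concept 1

scores = concept_scores(image_emb, concept_embs)
label = predict_class(scores, class_weights)
```

Because the class prediction depends only on `scores`, a wrong concept score can still yield a correct label, which is the faithfulness gap the record describes.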
2405.01536
Report |
Customizing Text-to-Image Models with a Single Image Pair |
Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu |
Art reinterpretation is the practice of creating a variation of a reference
work, making a paired artwork that exhibits a distinct artistic style. We ask
if such an image pair can be used to customize a generative model to capture
the demonstrated stylistic difference. We propose Pair Customization, a new
customization method that learns stylistic difference from a single image pair
and then applies the acquired style to the generation process. Unlike existing
methods that learn to mimic a single concept from a collection of images, our
method captures the stylistic difference between paired images. This allows us
to apply a stylistic change without overfitting to the specific image content
in the examples. To address this new task, we employ a joint optimization
method that explicitly separates the style and content into distinct LoRA
weight spaces. We optimize these style and content weights to reproduce the
style and content images while encouraging their orthogonality. During
inference, we modify the diffusion process via a new style guidance based on
our learned weights. Both qualitative and quantitative experiments show that
our method can effectively learn style while avoiding overfitting to image
content, highlighting the potential of modeling such stylistic differences from
a single image pair. |
This paper presents Pair Customization, a method for customizing text-to-image models using a single image pair to learn stylistic differences. |
Existing model customization methods struggle to disentangle style from content when trained on single images, often leading to overfitting. |
The method employs joint optimization of separate style and content LoRA weights, enforcing orthogonality for better disentanglement. It also introduces style guidance during inference for enhanced stylization control and content preservation. |
Pair Customization successfully learns stylistic differences from a single image pair and applies them to new content while preserving structure.
Quantitative evaluation demonstrates superior performance in style similarity and structure preservation compared to baselines.
Human preference studies confirm user preference for Pair Customization over existing methods. |
The method may struggle to transfer styles to categories significantly different from the training pair.
Reliance on test-time optimization can be computationally demanding, suggesting future exploration of encoder-based approaches for efficiency. |
text-to-image synthesis, model customization, style transfer, diffusion models, low-rank adaptation |
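The record above mentions separate style and content LoRA weights that are encouraged to be orthogonal. The sketch below shows one common way to express such a constraint, assuming each LoRA update factors as `B @ A` and penalising overlap between the row spaces of the two down-projection matrices. The penalty form and all names (`orthogonality_penalty`, `lora_delta`) are assumptions for illustration; the paper may enforce orthogonality differently.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]

def orthogonality_penalty(a_content, a_style):
    """Squared Frobenius norm of A_content @ A_style^T.

    It is zero exactly when every row of A_content is orthogonal
    to every row of A_style.
    """
    prod = matmul(a_content, transpose(a_style))
    return sum(x * x for row in prod for x in row)

def lora_delta(b_up, a_down, scale=1.0):
    """Low-rank weight update: scale * (B @ A)."""
    return [[scale * x for x in row] for row in matmul(b_up, a_down)]

# Rank-1 toy factors (assumed values).
a_content = [[1.0, 0.0, 0.0, 0.0]]
a_style_orthogonal = [[0.0, 1.0, 0.0, 0.0]]
a_style_overlapping = [[1.0, 1.0, 0.0, 0.0]]

penalty_ok = orthogonality_penalty(a_content, a_style_orthogonal)
penalty_bad = orthogonality_penalty(a_content, a_style_overlapping)
```

In a training loop this penalty would be added to the reconstruction losses for the style and content images, so that the style update carries as little content-specific information as possible.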
2405.01533
Report |
OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning |
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez |
The advances in multimodal large language models (MLLMs) have led to growing
interests in LLM-based autonomous driving agents to leverage their strong
reasoning capabilities. However, capitalizing on MLLMs' strong reasoning
capabilities for improved planning behavior is challenging since planning
requires full 3D situational awareness beyond 2D reasoning. To address this
challenge, our work proposes a holistic framework for strong alignment between
agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM
architecture that uses sparse queries to lift and compress visual
representations into 3D before feeding them into an LLM. This query-based
representation allows us to jointly encode dynamic objects and static map
elements (e.g., traffic lanes), providing a condensed world model for
perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new
visual question-answering dataset challenging the true 3D situational awareness
of a model with comprehensive visual question-answering (VQA) tasks, including
scene description, traffic regulation, 3D grounding, counterfactual reasoning,
decision making and planning. Extensive studies show the effectiveness of the
proposed architecture as well as the importance of the VQA tasks for reasoning
and planning in complex 3D scenes. |
This paper presents OmniDrive, a holistic framework for autonomous driving that leverages large language models (LLMs) for enhanced 3D perception, reasoning, and planning. |
Existing LLM-based driving systems struggle with 3D spatial understanding and often rely on limited open-loop benchmarks. OmniDrive addresses these limitations, aiming to improve decision-making and planning in complex driving scenarios. |
The authors introduce a novel 3D MLLM architecture (OmniDrive-Agent) that uses sparse queries to process high-resolution multi-view video input, enabling efficient 3D perception. They also develop a comprehensive benchmark (OmniDrive-nuScenes) with visual question-answering tasks for evaluating 3D reasoning, counterfactual reasoning, and planning. |
OmniDrive-Agent exhibits strong 3D reasoning capabilities, surpassing previous methods in counterfactual reasoning and open-loop planning tasks.
The use of sparse queries and a Q-Former-styled design allows for efficient processing of multi-view video data, addressing limitations of prior LLM architectures.
The proposed OmniDrive-nuScenes benchmark offers valuable insights into the capabilities and limitations of LLM-based autonomous driving systems. |
The method's effectiveness needs further validation on larger datasets like nuPlan.
Current counterfactual reasoning simulations don't incorporate reactions from other agents, requiring a more sophisticated closed-loop setup. |
autonomous driving, large language models, 3d perception, counterfactual reasoning, planning |
2405.01496
Report |
LocInv: Localization-aware Inversion for Text-Guided Image Editing |
Chuanming Tang, Kai Wang, Fei Yang, Joost van de Weijer |
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant
generation capabilities based on textual prompts. Based on the T2I diffusion
models, text-guided image editing research aims to empower users to manipulate
generated images by altering the text prompts. However, existing image editing
techniques are prone to editing over unintentional regions that are beyond the
intended target area, primarily due to inaccuracies in cross-attention maps. To
address this problem, we propose Localization-aware Inversion (LocInv), which
exploits segmentation maps or bounding boxes as extra localization priors to
refine the cross-attention maps in the denoising phases of the diffusion
process. Through the dynamic updating of tokens corresponding to noun words in
the textual input, we are compelling the cross-attention maps to closely align
with the correct noun and adjective words in the text prompt. Based on this
technique, we achieve fine-grained image editing over particular objects while
preventing undesired changes to other regions. Our method LocInv, based on the
publicly available Stable Diffusion, is extensively evaluated on a subset of
the COCO dataset, and consistently obtains superior results both quantitatively
and qualitatively. The code will be released at
https://github.com/wangkai930418/DPL |
This paper introduces Localization-aware Inversion (LocInv), a method enhancing text-guided image editing in text-to-image diffusion models by refining cross-attention maps using segmentation maps or bounding boxes as localization priors. |
Existing text-guided image editing methods often struggle with unintended modifications outside the target area due to inaccuracies in cross-attention maps, especially in complex multi-object images. This method addresses this 'cross-attention leakage' issue. |
LocInv utilizes dynamic prompt learning, updating token representations of objects at each denoising step. It optimizes similarity and overlapping losses to align cross-attention maps with provided localization priors. Additionally, it introduces an adjective binding loss to improve attribute editing by strengthening the connection between adjectives and their corresponding nouns. |
LocInv significantly improves the accuracy of cross-attention maps, leading to more precise and controlled image editing.
The method excels in local editing tasks, including Word-Swap (replacing an object) and Attribute-Edit (modifying object attributes), outperforming existing methods in both qualitative and quantitative evaluations.
It effectively preserves the background and maintains semantic similarity between the original and edited objects, particularly in complex multi-object images. |
The method's reliance on the size of cross-attention maps (smaller maps offer better semantic information but limit fine-grained control) poses a limitation.
The frozen Stable Diffusion model’s inherent limitations in editing capabilities and high-frequency detail reconstruction might affect the quality of editing results. |
image editing, text-to-image synthesis, diffusion models, cross-attention, localization priors |
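The record above describes similarity and overlapping losses that align a cross-attention map with a localization prior. The exact loss definitions are not given in this summary, so the sketch below assumes two simple forms: one minus the cosine similarity between the flattened attention map and the binary mask, and one minus the fraction of attention mass that falls inside the mask. The function names are hypothetical.

```python
import math

def similarity_loss(attn, mask):
    """1 - cosine similarity between a flattened attention map and a prior mask."""
    dot = sum(a * m for a, m in zip(attn, mask))
    norm_a = math.sqrt(sum(a * a for a in attn))
    norm_m = math.sqrt(sum(m * m for m in mask))
    return 1.0 - dot / (norm_a * norm_m)

def overlap_loss(attn, mask):
    """1 - fraction of total attention mass that lies inside the mask."""
    inside = sum(a * m for a, m in zip(attn, mask))
    total = sum(attn)
    return 1.0 - inside / total

# Flattened 2x2 toy maps (assumed values).
mask = [1.0, 1.0, 0.0, 0.0]              # object occupies the top row
attn_good = [0.5, 0.5, 0.0, 0.0]         # attention stays on the object
attn_leaky = [0.25, 0.25, 0.25, 0.25]    # attention leaks onto the background

good = (similarity_loss(attn_good, mask), overlap_loss(attn_good, mask))
leaky = (similarity_loss(attn_leaky, mask), overlap_loss(attn_leaky, mask))
```

Both terms are near zero for the well-localized map and grow as attention leaks outside the prior, which is the behavior an optimizer updating the noun tokens would exploit.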
2405.01413
Report |
MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors |
Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Yixue Hao, Long Hu, Min Chen |
Large 2D vision-language models (2D-LLMs) have gained significant attention
by bridging Large Language Models (LLMs) with images using a simple projector.
Inspired by their success, large 3D point cloud-language models (3D-LLMs) also
integrate point clouds into LLMs. However, directly aligning point clouds with
LLM requires expensive training costs, typically in hundreds of GPU-hours on
A100, which hinders the development of 3D-LLMs. In this paper, we introduce
MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA
results while training for only 27 hours on one RTX 3090. Specifically, we
propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which
can leverage the similarity between 2D and 3D visual information. We introduce
a novel four-stage training strategy for modality alignment in a cascaded way,
and a mixture of query experts module to adaptively aggregate features with
high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods
LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which
is up to 260x fewer than existing methods. Extensive experiments show that
MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with
significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12
increase on GPT-4 evaluation score for the challenging object captioning task
compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800.
We are the first to explore the efficient 3D-LLM, offering new insights to the
community. Code and weights are available at
https://github.com/TangYuan96/MiniGPT-3D. |
Presents MiniGPT-3D, an efficient 3D-LLM that leverages 2D priors from 2D-LLMs to align 3D point clouds with LLMs, achieving state-of-the-art results with significantly reduced training costs. |
Training 3D-LLMs is computationally expensive, hindering research and applications. This work introduces an efficient approach using 2D-LLMs as priors to bridge the modality gap between 3D point clouds and LLMs. |
Introduces a four-stage training strategy: (1) Align point cloud encoder with 2D-LLM, (2) Transfer 2D knowledge to 3D, (3) Enhance 3D-language understanding with challenging tasks, (4) Utilize Mixture of Query Experts for adaptive feature aggregation. Employs parameter-efficient fine-tuning methods and an efficient LLM backbone. |
Achieves state-of-the-art performance on generative 3D object classification, outperforming baselines by significant margins on ModelNet40 and Objaverse datasets.
Sets new state-of-the-art in 3D object captioning, demonstrating superior detail comprehension and accuracy compared to existing methods.
Exhibits strong generalization ability, robustness to prompt variations, and comprehensive understanding of 3D objects, enabling detailed captioning and open-ended dialogue. |
Limited to object-level understanding, not applicable to large-scale point clouds.
Focuses on static 3D objects, lacking the ability to recognize actions in dynamic scenarios. |
multimodal large language models, 3d point cloud understanding, efficiently multimedia alignment, mixture of experts, 2d priors |
2405.01356
Report |
Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance |
Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang |
In subject-driven text-to-image synthesis, the synthesis process tends to be
heavily influenced by the reference images provided by users, often overlooking
crucial attributes detailed in the text prompt. In this work, we propose
Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the
problem. We show that through constructing a subject-agnostic condition and
applying our proposed dual classifier-free guidance, one could obtain outputs
consistent with both the given subject and input text prompts. We validate the
efficacy of our approach through both optimization-based and encoder-based
methods. Additionally, we demonstrate its applicability in second-order
customization methods, where an encoder-based model is fine-tuned with
DreamBooth. Our approach is conceptually simple and requires only minimal code
modifications, but leads to substantial quality improvements, as evidenced by
our evaluations and user studies. |
This paper introduces Subject-Agnostic Guidance (SAG), a simple yet effective method to address the "content ignorance" issue in subject-driven text-to-image synthesis, where crucial text prompt attributes are often overlooked due to the strong influence of reference subject images. |
The dominance of subject information in existing methods hinders the generation of diverse and text-aligned outputs, limiting the flexibility and creative potential of subject-driven synthesis. |
SAG constructs a subject-agnostic embedding from user inputs and employs a dual classifier-free guidance (DCFG) strategy. This approach leverages the subject-agnostic embedding, especially in early generation stages, to prioritize text-guided content and structure before incorporating subject details. |
SAG significantly improves text alignment without sacrificing subject fidelity, as demonstrated through qualitative and quantitative comparisons with existing methods like DreamBooth, Textual Inversion, and ELITE.
The effectiveness of SAG is validated across different subject-driven synthesis approaches, including optimization-based, encoder-based, and second-order customization methods.
User studies consistently show a strong preference for SAG-generated outputs, indicating its ability to achieve a desirable balance between content and subject consistency. |
The quality of outputs generated with SAG is inherently limited by the capabilities of the underlying text-to-image generation model, which might struggle with uncommon content.
Future work could explore incorporating a more robust synthesis network to further enhance the quality and diversity of outputs. |
text-to-image synthesis, subject-driven generation, content ignorance, classifier-free guidance, subject-agnostic embedding |
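The dual classifier-free guidance described in the record above can be illustrated with a two-term extension of standard classifier-free guidance. The summary does not give the exact formula, so the combination below is an assumption: one term pushes from the unconditional prediction toward the subject-agnostic condition, and a second term pushes from the subject-agnostic condition toward the subject-specific one. The names `dual_cfg`, `w_text`, and `w_subject` are hypothetical.

```python
def dual_cfg(eps_uncond, eps_agnostic, eps_subject, w_text, w_subject):
    """Combine three noise predictions with two guidance weights.

    eps_uncond   : prediction with an empty prompt
    eps_agnostic : prediction with the subject-agnostic condition
    eps_subject  : prediction with the full subject-specific condition
    """
    return [
        u + w_text * (a - u) + w_subject * (s - a)
        for u, a, s in zip(eps_uncond, eps_agnostic, eps_subject)
    ]

# Toy one-dimensional predictions (assumed values).
eps_u, eps_a, eps_s = [0.0], [1.0], [3.0]

# With both weights at 1 the result collapses to the subject-conditioned prediction.
full = dual_cfg(eps_u, eps_a, eps_s, w_text=1.0, w_subject=1.0)

# With w_subject = 0 only the subject-agnostic (text-driven) guidance remains.
text_only = dual_cfg(eps_u, eps_a, eps_s, w_text=2.0, w_subject=0.0)
```

Keeping `w_subject` small in early denoising steps and raising it later would match the record's description of prioritizing text-guided structure before subject details, although the actual schedule is not specified here.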
2405.01008
Report |
On Mechanistic Knowledge Localization in Text-to-Image Generative Models |
Samyadeep Basu, Keivan Rezaei, Priyatham Kattakinda, Ryan Rossi, Cherry Zhao, Vlad Morariu, Varun Manjunatha, Soheil Feizi |
Identifying layers within text-to-image models which control visual
attributes can facilitate efficient model editing through closed-form updates.
Recent work leveraging causal tracing shows that early Stable-Diffusion
variants confine knowledge primarily to the first layer of the CLIP
text-encoder, while it diffuses throughout the UNet. Extending this framework,
we observe that for recent models (e.g., SD-XL, DeepFloyd), causal tracing
fails in pinpointing localized knowledge, highlighting challenges in model
editing. To address this issue, we introduce the concept of Mechanistic
Localization in text-to-image models, where knowledge about various visual
attributes (e.g., "style", "objects", "facts") can be mechanistically localized
to a small fraction of layers in the UNet, thus facilitating efficient model
editing. We localize knowledge using our method LocoGen which measures the
direct effect of intermediate layers to output generation by performing
interventions in the cross-attention layers of the UNet. We then employ
LocoEdit, a fast closed-form editing method across popular open-source
text-to-image models (including the latest SD-XL) and explore the possibilities
of neuron-level model editing. Using Mechanistic Localization, our work offers
a better view of successes and failures in localization-based text-to-image
model editing. Code will be available at
https://github.com/samyadeepbasu/LocoGen. |
This paper introduces LocoGen, a method for identifying localized control regions for visual attributes in text-to-image models, and explores efficient model editing using LocoEdit. |
Existing methods, like causal tracing, are not generalizable to newer text-to-image models, limiting the ability to interpret and edit these models effectively. |
LocoGen identifies controlling layers in the UNet by measuring the effect of altered text embeddings on specific visual attributes. LocoEdit performs closed-form weight updates in identified layers for model editing. |
LocoGen successfully identifies unique control points for visual attributes across various text-to-image models.
LocoEdit enables efficient and interpretable model editing by updating specific layers in the UNet.
The paper demonstrates the potential for neuron-level model editing by selectively dropping out neurons in identified layers. |
Closed-form edits using LocoEdit are not effective for DeepFloyd, likely due to the use of a bi-directional T5 text-encoder.
Neuron-level editing, while promising, requires further investigation to address the trade-off between style removal and image quality. |
text-to-image generation, model interpretability, model editing, knowledge localization, cross-attention |
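To make the idea of a closed-form weight update concrete, the sketch below solves a generic ridge-regularized objective for a single source/target embedding pair: find W' minimizing ||W' c_src - W c_tgt||^2 + lam * ||W' - W||_F^2. Setting the gradient to zero gives the rank-one solution W' = W + (W c_tgt - W c_src) c_src^T / (lam + c_src^T c_src). This objective is an assumption chosen for illustration and is not necessarily the one the paper uses; `closed_form_edit` is a hypothetical name.

```python
def matvec(w, x):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def closed_form_edit(w, c_src, c_tgt, lam):
    """Rank-one closed-form update of a projection matrix.

    After the edit, the source embedding c_src is mapped (approximately,
    depending on lam) to where the original matrix mapped c_tgt.
    """
    target_out = matvec(w, c_tgt)   # desired output for c_src
    current_out = matvec(w, c_src)  # what c_src produces today
    denom = lam + sum(x * x for x in c_src)
    return [
        [w[i][j] + (target_out[i] - current_out[i]) * c_src[j] / denom
         for j in range(len(c_src))]
        for i in range(len(w))
    ]

# Toy 2x2 projection and embeddings (assumed values).
w = [[1.0, 0.0], [0.0, 1.0]]
c_src = [1.0, 0.0]   # e.g. embedding of the concept to be edited
c_tgt = [0.0, 1.0]   # e.g. embedding of the replacement concept

edited = closed_form_edit(w, c_src, c_tgt, lam=0.0)
barely_edited = closed_form_edit(w, c_src, c_tgt, lam=1e6)
```

With `lam=0.0` the source embedding is mapped exactly onto the target's output, while inputs orthogonal to `c_src` are left untouched; a large `lam` keeps the edited matrix close to the original.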
2405.00998
Report |
Part-aware Shape Generation with Latent 3D Diffusion of Neural Voxel Fields |
Yuhang Huang, Shilong Zou, Xinwang Liu, Kai Xu |
This paper presents a novel latent 3D diffusion model for the generation of
neural voxel fields, aiming to achieve accurate part-aware structures. Compared
to existing methods, there are two key designs to ensure high-quality and
accurate part-aware generation. On one hand, we introduce a latent 3D diffusion
process for neural voxel fields, enabling generation at significantly higher
resolutions that can accurately capture rich textural and geometric details. On
the other hand, a part-aware shape decoder is introduced to integrate the part
codes into the neural voxel fields, guiding the accurate part decomposition and
producing high-quality rendering results. Through extensive experimentation and
comparisons with state-of-the-art methods, we evaluate our approach across four
different classes of data. The results demonstrate the superior generative
capabilities of our proposed method in part-aware shape generation,
outperforming existing state-of-the-art methods. |
This paper introduces a novel latent 3D diffusion model for generating neural voxel fields with accurate part-aware structures. |
Generating part-aware 3D shapes is important for downstream tasks such as editing, mix-and-match modeling, and segmentation learning. Existing methods are often part-oblivious or have limitations in generative ability and rendering quality. |
The method uses a latent 3D diffusion process on a compressed latent space for high-resolution generation and a part-aware shape decoder that integrates part codes into the neural voxel field to guide accurate part decomposition. |
Achieves higher resolution (96^3) than previous diffusion-based methods for neural fields, capturing richer details.
Outperforms state-of-the-art methods in terms of FID metric across four different object classes (Chair, Table, Airplane, Car).
Exhibits superior qualitative results, demonstrating accurate part-aware generation and high-quality rendering. |
Collecting 2D semantic part maps for supervision can be challenging.
Future work includes exploring pseudo-label based part-aware generation to reduce reliance on labeled data. |
shape generation, diffusion model, part-aware generation, neural voxel fields, 3d deep learning |
2405.00954
Report |
X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation |
Yiwei Ma, Zhekai Lin, Jiayi Ji, Yijun Fan, Xiaoshuai Sun, Rongrong Ji |
Recent advancements in automatic 3D avatar generation guided by text have
made significant progress. However, existing methods have limitations such as
oversaturation and low-quality output. To address these challenges, we propose
X-Oscar, a progressive framework for generating high-quality animatable avatars
from text prompts. It follows a sequential Geometry->Texture->Animation
paradigm, simplifying optimization through step-by-step generation. To tackle
oversaturation, we introduce Adaptive Variational Parameter (AVP), representing
avatars as an adaptive distribution during training. Additionally, we present
Avatar-aware Score Distillation Sampling (ASDS), a novel technique that
incorporates avatar-aware noise into rendered images for improved generation
quality during optimization. Extensive evaluations confirm the superiority of
X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous
project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/. |
This paper presents X-Oscar, a novel progressive framework for generating high-quality, animatable 3D avatars from text prompts. |
Existing methods for text-guided 3D avatar generation often suffer from limitations like oversaturation and low-quality output, hindering their applicability in various domains like gaming and animation. |
X-Oscar leverages the SMPL-X body model and adopts a sequential "Geometry→Texture→Animation" optimization strategy. It introduces two novel components: (1) Adaptive Variational Parameter (AVP), which represents avatars as adaptive distributions during training to mitigate oversaturation, and (2) Avatar-aware Score Distillation Sampling (ASDS), which incorporates avatar-aware noise into rendered images for improved quality. |
X-Oscar effectively addresses oversaturation in avatar generation, resulting in visually appealing and realistic outputs.
The progressive modeling paradigm with separate optimization stages for geometry, texture, and animation significantly enhances the quality of generated avatars.
Extensive evaluations, including user studies and comparisons with state-of-the-art methods, demonstrate X-Oscar's superiority in generating high-quality, animatable avatars consistent with text prompts. |
The reliance on the SMPL-X model might limit the diversity of generatable body shapes.
Exploring higher-resolution textures and more complex animation sequences could further enhance avatar realism. |
3d avatar generation, text-guided synthesis, score distillation sampling, oversaturation mitigation, progressive modeling |
2405.00942
Report |
LLaVA Finds Free Lunch: Teaching Human Behavior Improves Content Understanding Abilities Of LLMs |
Somesh Singh, Harini S I, Yaman K Singla, Veeky Baths, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy |
Communication is defined as "Who says what to whom with what effect." A
message from a communicator generates downstream receiver effects, also known
as behavior. Receiver behavior, being a downstream effect of the message,
carries rich signals about it. Even after carrying signals about the message,
the behavior data is often ignored while training large language models. We
show that training LLMs on receiver behavior can actually help improve their
content-understanding abilities. Specifically, we show that training LLMs to
predict the receiver behavior of likes and comments improves the LLM's
performance on a wide variety of downstream content understanding tasks. We
show this performance increase over 40 video and image understanding tasks over
23 benchmark datasets across both 0-shot and fine-tuning settings,
outperforming many supervised baselines. Moreover, since receiver behavior,
such as likes and comments, is collected by default on the internet and does
not need any human annotations to be useful, the performance improvement we get
after training on this data is essentially free-lunch. We release the receiver
behavior cleaned comments and likes of 750k images and videos collected from
multiple platforms along with our instruction-tuning data. |
This paper investigates whether training large language models (LLMs) on receiver behavior (e.g., likes, comments) can enhance their content understanding abilities. |
Behavior data, though often discarded, implicitly carries rich signals about the content it interacts with. Leveraging this readily available resource could lead to significant improvements in content understanding tasks across various domains. |
The authors collected a large-scale dataset (BLIFT) of images and videos from Reddit and YouTube, along with their corresponding comments and likes. They then fine-tuned LLaMA-Vid, a large vision and language model, on BLIFT to predict user behavior given the content. Ablation studies were conducted to compare the impact of different behavioral data types (perception vs. action) and sources. |
Training on receiver behavior (Behavior-LLaVA) consistently outperformed the base LLaMA-Vid and a data-augmented variant (Ad-LLaVA) across 40 tasks and 23 benchmark datasets.
The improvements were particularly pronounced for high-level understanding tasks like emotion recognition and persuasion strategy classification.
Action-level behavior (comments, likes) proved more effective than perception-level behavior (saliency) for enhancing content understanding, likely due to its availability at scale. |
The study primarily focused on comments and likes as behavioral signals, limiting the exploration of other rich action-level data.
Future work could delve deeper into the relationship between specific behavioral patterns and different aspects of content understanding. |
large language models, content understanding, behavior modeling, digital analytics, vision and language |
2405.00915
Report |
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion |
Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam |
We present EchoScene, an interactive and controllable generative model that
generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch
diffusion model that dynamically adapts to scene graphs. Existing methods
struggle to handle scene graphs due to varying numbers of nodes, multiple edge
combinations, and manipulator-induced node-edge operations. EchoScene overcomes
this by associating each node with a denoising process and enables
collaborative information exchange, enhancing controllable and consistent
generation aware of global constraints. This is achieved through an information
echo scheme in both shape and layout branches. At every denoising step, all
processes share their denoising data with an information exchange unit that
combines these updates using graph convolution. The scheme ensures that the
denoising processes are influenced by a holistic understanding of the scene
graph, facilitating the generation of globally coherent scenes. The resulting
scenes can be manipulated during inference by editing the input scene graph and
sampling the noise in the diffusion model. Extensive experiments validate our
approach, which maintains scene controllability and surpasses previous methods
in generation fidelity. Moreover, the generated scenes are of high quality and
thus directly compatible with off-the-shelf texture generation. Code and
trained models are open-sourced. |
This paper presents EchoScene, an interactive and controllable generative model that synthesizes 3D indoor scenes from scene graphs using a dual-branch diffusion model. |
Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. |
EchoScene employs a dual-branch diffusion model with an information echo scheme. It associates each node with a denoising process in both shape and layout branches, enabling collaborative information exchange through an information exchange unit using graph convolution. |
EchoScene outperforms previous methods in generation fidelity, achieving lower FID, FID CLIP, and KID scores.
It demonstrates superior robustness in handling graph manipulation, accurately reflecting changes in node addition and relation adjustments.
The method effectively maintains inter-object consistency, generating shapes and layouts that adhere to global scene graph constraints. |
The model's reliance on a limited dataset may restrict the diversity of generated scenes.
Exploration of alternative information exchange mechanisms within the echo scheme could further enhance generation quality. |
scene graph, diffusion model, 3d scene generation, controllable generation, information exchange |
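The information echo scheme in the record above has every per-node denoising process share its current state with an exchange unit that combines the updates over the scene graph. The sketch below stands in for that exchange with one round of unweighted mean aggregation over each node and its neighbours. This is a deliberate simplification: a real graph convolution would use learned weights and edge features, and `echo_step` is a hypothetical name.

```python
def echo_step(node_feats, edges):
    """One exchange round: each node averages its own state with its neighbours'.

    node_feats : list of per-node feature vectors (one per denoising process)
    edges      : list of undirected (i, j) index pairs from the scene graph
    """
    n = len(node_feats)
    neighbours = {i: [i] for i in range(n)}  # include a self-loop for every node
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = []
    for i in range(n):
        group = [node_feats[j] for j in neighbours[i]]
        updated.append([sum(vals) / len(group) for vals in zip(*group)])
    return updated

# Three objects in a chain: 0 -- 1 -- 2 (assumed toy graph and features).
feats = [[0.0], [2.0], [4.0]]
edges = [(0, 1), (1, 2)]
shared = echo_step(feats, edges)
```

Running such a step at every denoising iteration lets each node's update depend on the rest of the graph, and because the neighbour lists are rebuilt from `edges` on each call, adding or removing nodes and relations between calls is handled naturally.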
2405.00900
Report |
LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes |
Shanlin Sun, Bingbing Zhuang, Ziyu Jiang, Buyu Liu, Xiaohui Xie, Manmohan Chandraker |
Photorealistic simulation plays a crucial role in applications such as
autonomous driving, where advances in neural radiance fields (NeRFs) may allow
better scalability through the automatic creation of digital 3D assets.
However, reconstruction quality suffers on street scenes due to largely
collinear camera motions and sparser samplings at higher speeds. On the other
hand, the application often demands rendering from camera views that deviate
from the inputs to accurately simulate behaviors like lane changes. In this
paper, we propose several insights that allow a better utilization of Lidar
data to improve NeRF quality on street scenes. First, our framework learns a
geometric scene representation from Lidar, which is fused with the implicit
grid-based representation for radiance decoding, thereby supplying stronger
geometric information offered by explicit point cloud. Second, we put forth a
robust occlusion-aware depth supervision scheme, which allows utilizing
densified Lidar points by accumulation. Third, we generate augmented training
views from Lidar points for further improvement. Our insights translate to
largely improved novel view synthesis under real driving scenes. |
This paper presents a novel framework leveraging Lidar data to enhance the quality of Neural Radiance Fields (NeRFs) for street scenes, particularly addressing challenges posed by sparse and collinear camera trajectories in autonomous driving scenarios. |
Photorealistic simulation for applications like autonomous driving requires high-quality NeRFs, but street scenes with limited camera viewpoints and low-texture environments pose significant difficulties. Existing methods struggle to produce satisfactory results, necessitating improved techniques. |
The proposed framework fuses Lidar-derived geometric features with the implicit grid-based representation of NeRFs. It introduces a robust occlusion-aware depth supervision scheme using densified Lidar points and generates augmented training views from Lidar projections to address view sparsity. |
The method achieves state-of-the-art performance on the Pandaset benchmark, outperforming existing NeRF techniques in terms of visual fidelity and accuracy.
The robust depth supervision scheme effectively utilizes dense Lidar data while mitigating errors caused by occlusions, leading to improved geometry reconstruction.
Lidar encoding and augmented view supervision further enhance the rendering of fine details and improve performance in extrapolation scenarios, particularly for regions sparsely captured in the original data. |
The current framework focuses on static backgrounds and does not handle dynamic objects.
Future work could explore extending the insights of Lidar integration to model dynamic elements in street scenes. |
neural radiance fields, lidar, autonomous driving, novel view synthesis, depth supervision |
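The idea of projecting accumulated Lidar points into a camera view can be illustrated with a minimal sketch. It assumes a pinhole camera and points already expressed in the camera frame; the function name and parameters are hypothetical, and the per-pixel nearest-point rule below is plain z-buffering used as a stand-in, not the paper's occlusion-aware supervision scheme.

```python
def project_to_depth_map(points_cam, fx, fy, cx, cy, width, height):
    """Project camera-frame 3D points into a sparse depth map.

    When several points fall on the same pixel, only the nearest one is kept
    (simple z-buffering). Pixels hit by no point stay None.
    """
    depth = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point is behind the camera
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth


# Toy example: two points on the same ray, one behind the camera, one off-axis.
points = [(0.0, 0.0, 5.0), (0.0, 0.0, 2.0), (0.0, 0.0, -1.0), (1.0, 0.0, 1.0)]
depth_map = project_to_depth_map(points, fx=1.0, fy=1.0, cx=1.0, cy=1.0, width=3, height=3)
```

Densified point clouds make occlusion conflicts more frequent, which is presumably why a more robust rule than this one is needed in practice.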
2405.00794
Report |
Coherent 3D Portrait Video Reconstruction via Triplane Fusion |
Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Josef Spjut, Henry Fuchs, Shalini De Mello, Koki Nagano |
Recent breakthroughs in single-image 3D portrait reconstruction have enabled
telepresence systems to stream 3D portrait videos from a single camera in
real-time, potentially democratizing telepresence. However, per-frame 3D
reconstruction exhibits temporal inconsistency and forgets the user's
appearance. On the other hand, self-reenactment methods can render coherent 3D
portraits by driving a personalized 3D prior, but fail to faithfully
reconstruct the user's per-frame appearance (e.g., facial expressions and
lighting). In this work, we recognize the need to maintain both coherent
identity and dynamic per-frame appearance to enable the best possible realism.
To this end, we propose a new fusion-based method that fuses a personalized 3D
subject prior with per-frame information, producing temporally stable 3D videos
with faithful reconstruction of the user's per-frame appearances. Trained only
using synthetic data produced by an expression-conditioned 3D GAN, our
encoder-based method achieves both state-of-the-art 3D reconstruction accuracy
and temporal consistency on in-studio and in-the-wild datasets. |
This paper presents a novel triplane fusion method for reconstructing coherent and high-fidelity 3D portrait videos from monocular RGB videos, aiming to improve the realism of 3D telepresence systems. |
Existing single-image 3D reconstruction methods suffer from temporal inconsistency and struggle to maintain stable identity across frames. On the other hand, 3D self-reenactment methods, while temporally consistent, fail to faithfully reconstruct the dynamic appearance of users in real-time, such as expressions and lighting. |
The proposed method leverages a pre-trained LP3D model to construct a personal triplane prior from a frontal reference image. For each input frame, a raw triplane is extracted using LP3D and then fused with the prior. This fusion process involves a Triplane Undistorter to remove view-dependent distortions and a Triplane Fuser to combine the undistorted triplane with the prior while preserving dynamic appearances. |
The method successfully captures authentic dynamic appearances (e.g., facial expressions, lighting) while producing temporally consistent 3D videos.
Trained solely on synthetic data generated from an expression-conditioned 3D GAN, the approach achieves state-of-the-art 3D reconstruction accuracy and temporal consistency on both in-studio and in-the-wild datasets.
A new multi-view evaluation protocol is introduced to assess a method's robustness to input viewpoint variations and consistency across generated novel views. |
Fusing side views with significantly different expressions compared to the reference view can result in blurry reconstructions due to triplane alignment ambiguity.
The current implementation relies on a single reference image; incorporating multiple reference images with varying expressions and head poses could further enhance performance. |
3d portrait video reconstruction, neural rendering, triplane representation, temporal consistency, single-view reconstruction |
2405.00791
Report |
Obtaining Favorable Layouts for Multiple Object Generation |
Barak Battash, Amit Rozner, Lior Wolf, Ofir Lindenbaum |
Large-scale text-to-image models that can generate high-quality and diverse
images based on textual prompts have shown remarkable success. These models aim
ultimately to create complex scenes, and addressing the challenge of
multi-subject generation is a critical step towards this goal. However, the
existing state-of-the-art diffusion models face difficulty when generating
images that involve multiple subjects. When presented with a prompt containing
more than one subject, these models may omit some subjects or merge them
together. To address this challenge, we propose a novel approach based on a
guiding principle. We allow the diffusion model to initially propose a layout,
and then we rearrange the layout grid. This is achieved by enforcing
cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels
from latent maps to new locations determined by us. We introduce new loss terms
aimed at reducing XAM entropy for clearer spatial definition of subjects,
reducing the overlap between XAMs, and ensuring that XAMs align with their
respective masks. We contrast our approach with several alternative methods and
show that it more faithfully captures the desired concepts across a variety of
text prompts. |
This paper proposes a novel approach to address the difficulty of existing diffusion models in generating images with multiple distinct subjects, focusing on preventing subject omission and merging in complex scene generation. |
Generating images with multiple subjects is a critical challenge for text-to-image models as it's essential for creating complex and realistic scenes based on user prompts. |
The proposed method uses a three-phase approach: 1) **Excite and distinguish:** Encourages distinct spatial representation for each subject's token in early diffusion steps. 2) **Rearrange the generation grid:** Extracts and optimizes subject masks to minimize overlap and rearranges the latent space accordingly. 3) **Follow the masks:** Guides subsequent diffusion steps to adhere to the optimized subject masks. |
The method significantly outperforms existing state-of-the-art models in generating images with multiple subjects, showing reduced subject omission and blending.
Quantitative evaluations using Llava1.5, Qwen-VL-Chat, and BLIP2 demonstrate substantial improvements across various metrics, especially with increasing subject numbers.
The approach effectively combines with attribute binding techniques, further enhancing the overall quality and correctness of generated images. |
The method increases inference time due to the multi-step optimization process.
Forcing a specific layout can sometimes result in unnatural object arrangements or slightly reduced image quality, necessitating further improvements in mask generation and optimization strategies. |
text-to-image synthesis, diffusion models, multi-subject generation, cross-attention maps, layout optimization |
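Two of the quantities the loss terms act on, XAM entropy and XAM overlap, can be sketched as follows. This is a rough illustration under stated assumptions: each cross-attention map is a small non-negative 2D grid with a positive sum, entropy is Shannon entropy of the normalized map, and overlap is measured as the sum of element-wise minima. The paper's exact loss definitions may differ, and all function names are hypothetical.

```python
import math


def normalize(xam):
    """Scale a non-negative 2D map so its entries sum to 1 (assumes a positive sum)."""
    total = sum(sum(row) for row in xam)
    return [[v / total for v in row] for row in xam]


def entropy(xam):
    """Shannon entropy of the normalized map; lower means a more concentrated subject."""
    p = normalize(xam)
    return -sum(v * math.log(v) for row in p for v in row if v > 0)


def overlap(xam_a, xam_b):
    """Sum of element-wise minima of two normalized maps (0 = disjoint, 1 = identical)."""
    pa, pb = normalize(xam_a), normalize(xam_b)
    return sum(min(a, b) for ra, rb in zip(pa, pb) for a, b in zip(ra, rb))


peaked_a = [[1.0, 0.0], [0.0, 0.0]]   # subject A concentrated in one cell
peaked_b = [[0.0, 0.0], [0.0, 1.0]]   # subject B concentrated elsewhere
uniform = [[1.0, 1.0], [1.0, 1.0]]    # attention spread everywhere
```

In this toy setting a peaked map has zero entropy and two disjoint peaked maps have zero overlap, which is the regime the losses push towards.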
2405.00760
Report |
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models |
Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, Hongsheng Li |
Optimizing a text-to-image diffusion model with a given reward function is an
important but underexplored research area. In this study, we propose Deep
Reward Tuning (DRTune), an algorithm that directly supervises the final output
image of a text-to-image diffusion model and back-propagates through the
iterative sampling process to the input noise. We find that training earlier
steps in the sampling process is crucial for low-level rewards, and deep
supervision can be achieved efficiently and effectively by stopping the
gradient of the denoising network input. DRTune is extensively evaluated on
various reward models. It consistently outperforms other algorithms,
particularly for low-level control signals, where all shallow supervision
methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0)
model via DRTune to optimize Human Preference Score v2.1, resulting in the
Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances
image quality compared to SDXL 1.0 and reaches comparable quality compared with
Midjourney v5.2. |
The paper presents Deep Reward Tuning (DRTune), an algorithm for efficiently and effectively optimizing text-to-image diffusion models using differentiable rewards, particularly focusing on deep supervision for low-level rewards like symmetry. |
Optimizing diffusion models with rewards is crucial for controlling image generation beyond traditional training datasets, but existing methods struggle with deep supervision of the iterative sampling process. |
DRTune employs two key strategies: 1) stopping gradients of the denoising network input to alleviate gradient explosion and accelerate convergence, and 2) training on a subset of equally spaced sampling steps to improve efficiency. |
DRTune consistently outperforms baselines like ReFL, DRaFT, and AlignProp on various rewards, including aesthetic score, CLIPScore, and human preference.
It successfully optimizes low-level rewards like symmetry, which other methods fail to achieve due to limitations in deep supervision.
Fine-tuning Stable Diffusion XL 1.0 with DRTune and HPS v2.1 results in Favorable Diffusion XL 1.0 (FDXL 1.0), exhibiting superior visual quality compared to the base model and comparable quality to Midjourney v5.2. |
Reward hacking is a potential issue, necessitating strategies like regularization to prevent image quality degradation while optimizing for specific metrics.
The paper acknowledges the potential negative social impact of advanced generative models, including the risk of generating misleading content and perpetuating biases. |
diffusion models, text-to-image generation, reward learning, deep supervision, stable diffusion |
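The second strategy, training on a subset of equally spaced sampling steps, can be sketched in a few lines. The function and parameter names are hypothetical, and the exact selection rule used in the paper (for example how the offset is chosen) is an assumption here. The first strategy, stopping the gradient at the denoising network input, needs an autodiff framework and is not shown.

```python
def equally_spaced_steps(num_sampling_steps, num_train_steps, offset=0):
    """Pick num_train_steps indices, evenly spaced, out of num_sampling_steps.

    offset shifts the whole pattern and must be smaller than the stride,
    so every returned index stays inside [0, num_sampling_steps).
    """
    if not 0 < num_train_steps <= num_sampling_steps:
        raise ValueError("num_train_steps must be in (0, num_sampling_steps]")
    stride = num_sampling_steps // num_train_steps
    if not 0 <= offset < stride:
        raise ValueError("offset must be in [0, stride)")
    return [offset + i * stride for i in range(num_train_steps)]


steps = equally_spaced_steps(num_sampling_steps=50, num_train_steps=5)
shifted = equally_spaced_steps(num_sampling_steps=50, num_train_steps=5, offset=3)
```

Only the selected steps would receive reward gradients; the remaining steps would still run in the forward pass so that the final image is produced.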
2405.00676
Report |
Spectrally Pruned Gaussian Fields with Neural Compensation |
Runyi Yang, Zhenxin Zhu, Zhou Jiang, Baijun Ye, Xiaoxue Chen, Yifei Zhang, Yuantao Chen, Jian Zhao, Hao Zhao |
Recently, 3D Gaussian Splatting, as a novel 3D representation, has garnered
attention for its fast rendering speed and high rendering quality. However,
this comes with high memory consumption, e.g., a well-trained Gaussian field
may utilize three million Gaussian primitives and over 700 MB of memory. We
credit this high memory footprint to the lack of consideration for the
relationship between primitives. In this paper, we propose a memory-efficient
Gaussian field named SUNDAE with spectral pruning and neural compensation. On
one hand, we construct a graph on the set of Gaussian primitives to model their
relationship and design a spectral down-sampling module to prune out primitives
while preserving desired signals. On the other hand, to compensate for the
quality loss of pruning Gaussians, we exploit a lightweight neural network head
to mix splatted features, which effectively compensates for quality losses
while capturing the relationship between primitives in its weights. We
demonstrate the performance of SUNDAE with extensive results. For example,
SUNDAE can achieve 26.80 PSNR at 145 FPS using 104 MB memory while the vanilla
Gaussian splatting algorithm achieves 25.60 PSNR at 160 FPS using 523 MB
memory, on the Mip-NeRF360 dataset. Codes are publicly available at
https://runyiyang.github.io/projects/SUNDAE/. |
This paper introduces SUNDAE, a memory-efficient 3D Gaussian Splatting method that leverages spectral pruning on a primitive graph and a neural compensation head to reduce storage requirements while maintaining rendering speed and quality. |
3D Gaussian Splatting (3DGS) suffers from high memory consumption due to the independence of its primitives. SUNDAE addresses this by modeling the relationship between primitives, enabling significant storage reduction. |
The method constructs a graph based on Gaussian primitives and uses spectral graph pruning to remove redundant ones. A neural compensation head then mitigates the quality loss by integrating information from remaining primitives in the 2D feature domain. |
SUNDAE achieves competitive rendering quality with significantly lower memory footprint compared to 3DGS and other state-of-the-art methods.
Spectral pruning effectively retains essential scene information by balancing high-frequency details and low-frequency background.
The neural compensation module successfully mitigates the quality loss caused by pruning, demonstrating the benefits of modeling primitive relationships. |
Continuous pruning, explored as an alternative, shows potential for lower peak memory but less control over final memory footprint.
Future work could explore more sophisticated graph construction methods and alternative neural compensation architectures. |
3d gaussian splatting, graph signal processing, neural rendering, memory efficient, primitive pruning |
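The memory figures quoted in the abstract can be checked with some back-of-the-envelope arithmetic. The per-primitive layout below (59 float32 values: position, scale, rotation quaternion, opacity, and degree-3 spherical-harmonic colour) is an assumption about a typical 3D Gaussian Splatting field, not something stated in the abstract; the Mip-NeRF360 numbers are taken from the abstract.

```python
# Assumed per-primitive layout: 3 position + 3 scale + 4 rotation + 1 opacity + 48 SH coefficients.
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48
BYTES_PER_FLOAT = 4  # float32


def field_size_mb(num_gaussians):
    """Approximate storage of a Gaussian field in megabytes (10**6 bytes)."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT / 1e6


three_million_mb = field_size_mb(3_000_000)  # 708.0, consistent with "over 700 MB"

# Mip-NeRF360 figures quoted in the abstract.
vanilla_mb, sundae_mb = 523, 104
memory_ratio = vanilla_mb / sundae_mb  # about 5.03x smaller
fps_ratio = 145 / 160                  # about 0.91 of the vanilla frame rate
psnr_gain = 26.80 - 25.60              # about +1.2 dB
```

Under these numbers SUNDAE trades roughly 9% of the rendering speed for a fivefold memory reduction and a higher PSNR.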
2405.00672
Report |
TexSliders: Diffusion-Based Texture Editing in CLIP Space |
Julia Guerrero-Viu, Milos Hasan, Arthur Roullier, Midhun Harikumar, Yiwei Hu, Paul Guerrero, Diego Gutierrez, Belen Masia, Valentin Deschaintre |
Generative models have enabled intuitive image creation and manipulation
using natural language. In particular, diffusion models have recently shown
remarkable results for natural image editing. In this work, we propose to apply
diffusion techniques to edit textures, a specific class of images that are an
essential part of 3D content creation pipelines. We analyze existing editing
methods and show that they are not directly applicable to textures, since their
common underlying approach, manipulating attention maps, is unsuitable for the
texture domain. To address this, we propose a novel approach that instead
manipulates CLIP image embeddings to condition the diffusion generation. We
define editing directions using simple text prompts (e.g., "aged wood" to "new
wood") and map these to CLIP image embedding space using a texture prior, with
a sampling-based approach that gives us identity-preserving directions in CLIP
space. To further improve identity preservation, we project these directions to
a CLIP subspace that minimizes identity variations resulting from entangled
texture attributes. Our editing pipeline facilitates the creation of arbitrary
sliders using natural language prompts only, with no ground-truth annotated
data necessary. |
Introduces TexSliders, a novel diffusion-based method for editing textures using natural language prompts, by manipulating CLIP image embeddings. |
Existing diffusion-based image editing methods, relying on attention maps, are not effective for textures due to the lack of distinct semantic regions in textures. |
Defines editing directions in CLIP space using pairs of text prompts, leverages a texture diffusion prior, and prunes irrelevant dimensions to improve identity preservation. |
Enables intuitive slider-based texture editing using natural language.
Demonstrates superior performance compared to state-of-the-art image editing methods on textures.
Generalizes to real photographs and allows combinations of multiple editing directions. |
Performance depends on the quality of CLIP embeddings and the diffusion model's sensitivity to specific concepts.
Formal definition of texture identity in the context of diffusion models requires further investigation. |
texture editing, diffusion models, clip, image embedding, generative models |
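The slider mechanism can be pictured as vector arithmetic in embedding space. The sketch below is a toy illustration with 3-dimensional stand-ins for CLIP image embeddings: the direction is the difference between the mean target and mean source embeddings, and pruning simply zeroes the dimensions not listed in `keep_dims`. How the paper samples embeddings and chooses the subspace is not reproduced here; every name in the code is hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def editing_direction(source_embeddings, target_embeddings):
    """Direction from the mean source embedding to the mean target embedding."""
    src = mean_vector(source_embeddings)
    tgt = mean_vector(target_embeddings)
    return [t - s for s, t in zip(src, tgt)]


def prune_direction(direction, keep_dims):
    """Zero every dimension not in keep_dims (assumed to be the edit-relevant ones)."""
    keep = set(keep_dims)
    return [d if i in keep else 0.0 for i, d in enumerate(direction)]


def apply_slider(embedding, direction, strength):
    """Move an embedding along the direction; strength plays the role of the slider."""
    return [e + strength * d for e, d in zip(embedding, direction)]


aged = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]   # toy samples for the source prompt
new = [[3.0, 1.0, 0.0], [3.0, 3.0, 0.0]]    # toy samples for the target prompt
direction = editing_direction(aged, new)
pruned = prune_direction(direction, keep_dims=[0])
edited = apply_slider([1.0, 1.0, 1.0], pruned, strength=0.5)
```

The edited embedding would then condition the diffusion model; varying `strength` continuously gives the slider behaviour.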
2405.00630
Report |
Depth Priors in Removal Neural Radiance Fields |
Zhihao Guo, Peng Wang |
Neural Radiance Fields have achieved impressive results in 3D reconstruction
and novel view generation. A significant challenge within NeRF involves editing
reconstructed 3D scenes, such as object removal, which demands consistency
across multiple views and the synthesis of high-quality perspectives. Previous
studies have integrated depth priors, typically sourced from LiDAR or sparse
depth estimates from COLMAP, to enhance NeRF's performance in object removal.
However, these methods are either expensive or time-consuming. This paper
proposes a new pipeline that leverages SpinNeRF and monocular depth estimation
models like ZoeDepth to enhance NeRF's performance in complex object removal
with improved efficiency. A thorough evaluation of COLMAP's dense depth
reconstruction on the KITTI dataset is conducted to demonstrate that COLMAP can
be viewed as a cost-effective and scalable alternative for acquiring depth
ground truth compared to traditional methods like LiDAR. This serves as the
basis for evaluating the performance of monocular depth estimation models to
determine the best one for generating depth priors for SpinNeRF. The new
pipeline is tested in various scenarios involving 3D reconstruction and object
removal, and the results indicate that our pipeline significantly reduces the
time required for depth prior acquisition for object removal and enhances the
fidelity of the synthesized views, suggesting substantial potential for
building high-fidelity digital twin systems with increased efficiency in the
future. |
This paper presents a novel object removal pipeline for Neural Radiance Fields (NeRF) that integrates SpinNeRF with monocular depth estimation models like ZoeDepth. |
Enhancing NeRF's object removal capabilities is crucial for applications like robot navigation in human-robot collaborative environments, but existing methods using LiDAR or COLMAP depth priors are either costly or time-consuming. |
The authors evaluate the accuracy of COLMAP's dense depth reconstruction on the KITTI dataset to establish its viability as a ground truth depth source. They then compare various monocular depth estimation models against COLMAP-generated depth on the SpinNeRF dataset, identifying ZoeDepth as the optimal choice. Finally, they integrate ZoeDepth with SpinNeRF to create the proposed pipeline. |
COLMAP's dense depth reconstruction exhibits high accuracy, making it a viable alternative to expensive ground truth depth acquisition methods.
ZoeDepth outperforms other monocular depth estimation models on the SpinNeRF dataset, delivering high-quality depth priors while minimizing computational overhead.
Integrating ZoeDepth with SpinNeRF significantly reduces depth prior acquisition time and improves the fidelity of synthesized views, particularly in object removal scenarios. |
The paper primarily focuses on the SpinNeRF model, potentially limiting the generalizability of findings to other NeRF architectures.
Future work could explore the integration of alternative monocular depth estimation models or the development of specialized depth estimation techniques tailored for NeRF object removal. |
neural radiance fields, monocular depth estimation, 3d editing, 3d reconstruction, object removal |
2405.00587
Report |
GraCo: Granularity-Controllable Interactive Segmentation |
Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen |
Interactive Segmentation (IS) segments specific objects or parts in the image
according to user input. Current IS pipelines fall into two categories:
single-granularity output and multi-granularity output. The latter aims to
alleviate the spatial ambiguity present in the former. However, the
multi-granularity output pipeline suffers from limited interaction flexibility
and produces redundant results. In this work, we introduce
Granularity-Controllable Interactive Segmentation (GraCo), a novel approach
that allows precise control of prediction granularity by introducing additional
parameters to input. This enhances the customization of the interactive system
and eliminates redundancy while resolving ambiguity. Nevertheless, the
exorbitant cost of annotating multi-granularity masks and the lack of available
datasets with granularity annotations make it difficult for models to acquire
the necessary guidance to control output granularity. To address this problem,
we design an any-granularity mask generator that exploits the semantic property
of the pre-trained IS model to automatically generate abundant mask-granularity
pairs without requiring additional manual annotation. Based on these pairs, we
propose a granularity-controllable learning strategy that efficiently imparts
the granularity controllability to the IS model. Extensive experiments on
intricate scenarios at object and part levels demonstrate that our GraCo has
significant advantages over previous methods. This highlights the potential of
GraCo to be a flexible annotation tool, capable of adapting to diverse
segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo. |
This paper introduces GraCo, a novel Granularity-Controllable Interactive Segmentation approach that allows users to precisely control the granularity of segmentation masks through an additional input parameter, resolving ambiguity without redundant outputs. |
Current interactive segmentation methods either provide single-granularity outputs, ignoring potential ambiguity in user intent, or offer multi-granularity outputs with limited scalability and redundancy. GraCo addresses these issues by enabling flexible and precise control over segmentation granularity. |
GraCo employs a two-stage approach: (1) an Any-Granularity mask Generator (AGG) automatically generates mask proposals of varying granularities and quantifies their granularity level, and (2) Granularity-Controllable Learning (GCL) leverages these mask-granularity pairs to fine-tune a pre-trained IS model, enabling it to understand and respond to user-specified granularity. |
GraCo significantly outperforms state-of-the-art single-granularity IS methods on both object and part-level benchmarks.
GraCo surpasses the multi-granularity IS approach SAM on all benchmarks except SA-1B, where the two achieve comparable performance.
Analysis of IoU-granularity curves confirms GraCo's ability to control segmentation granularity consistently with human cognition. |
The randomness in interaction signals generated by AGG can lead to semantically inconsistent parts or noisy boundaries, impacting granularity controllability.
The offline proposal generation in AGG creates a trade-off between storage space and granularity abundance. Exploring online fine-tuning for granularity controllability is a potential future direction. |
interactive segmentation, granularity control, ambiguity resolution, any-granularity mask generation, granularity-controllable learning |
2405.00466
Report |
Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable |
Haozhe Liu, Wentian Zhang, Bing Li, Bernard Ghanem, Jürgen Schmidhuber |
Foundational generative models should be traceable to protect their owners
and facilitate safety regulation. To achieve this, traditional approaches embed
identifiers based on supervisory trigger-response signals, which are commonly
known as backdoor watermarks. They are prone to failure when the model is
fine-tuned with nontrigger data. Our experiments show that this vulnerability
is due to energetic changes in only a few 'busy' layers during fine-tuning.
This yields a novel arbitrary-in-arbitrary-out (AIAO) strategy that makes
watermarks resilient to fine-tuning-based removal. The trigger-response pairs
of AIAO samples across various neural network depths can be used to construct
watermarked subpaths, employing Monte Carlo sampling to achieve stable
verification results. In addition, unlike the existing methods of designing a
backdoor for the input/output space of diffusion models, in our method, we
propose to embed the backdoor into the feature space of sampled subpaths, where
a mask-controlled trigger function is proposed to preserve the generation
performance and ensure the invisibility of the embedded backdoor. Our empirical
studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm
the robustness of AIAO; while the verification rates of other trigger-based
methods fall from ~90% to ~70% after fine-tuning, those of our method remain
consistently above 90%. |
This paper introduces AIAO, a novel backdoor-based method for traceable ownership protection of diffusion models, designed to be robust against fine-tuning on downstream tasks. |
With the increasing use of fine-tuned pre-trained diffusion models, it's crucial to develop methods for tracking their usage and protecting the intellectual property of the source model. |
AIAO embeds backdoor identifiers in the feature space of lazy layers (layers that undergo minimal change during fine-tuning) using a mask-controlled trigger function and Monte Carlo sampling of subpaths to minimize the impact of busy layers. |
AIAO maintains high response and verification success rates (over 90%) even after fine-tuning, significantly outperforming existing backdoor watermarking methods.
Embedding the backdoor in lazy layers significantly improves robustness against fine-tuning removal.
The mask-controlled trigger function effectively generates invisible triggers in the feature space, preserving generation performance. |
The verification pipeline currently relies on access to feature maps, limiting its applicability to open-source or semi-open-source scenarios.
Future work will focus on extending AIAO to black-box ownership protection where feature maps are inaccessible. |
trustworthy ai, intellectual property protection, backdoor watermark, diffusion model, fine-tuning |
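The Monte Carlo verification idea can be sketched loosely as follows: draw random subpaths (an entry layer and an exit layer), test each for the watermark response, and report the fraction that respond. Everything here is an assumption made for illustration, including the uniform sampling rule and the `responds` callable, which stands in for the actual trigger-response check on a real model.

```python
import random


def sample_subpath(num_layers, rng):
    """Draw an (in_layer, out_layer) pair with in_layer <= out_layer."""
    start = rng.randrange(num_layers)
    end = rng.randrange(start, num_layers)
    return start, end


def verification_rate(num_layers, responds, num_samples, seed=0):
    """Fraction of sampled subpaths for which responds(start, end) is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        start, end = sample_subpath(num_layers, rng)
        if responds(start, end):
            hits += 1
    return hits / num_samples


# Toy check: pretend layer 5 is a 'busy' layer and every subpath crossing it loses the watermark.
def toy_responds(start, end):
    return not (start <= 5 <= end)


rate = verification_rate(num_layers=12, responds=toy_responds, num_samples=1000)
```

Averaging over many subpaths is what makes the estimate stable even if a few layers change heavily during fine-tuning.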
2405.00448
Report |
MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation |
Xujie Zhang, Ente Lin, Xiu Li, Yuxuan Luo, Michael Kampffmeyer, Xin Dong, Xiaodan Liang |
This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON
(VITON) framework, which can generate high-quality compositional try-on results
by taking as inputs a text instruction and multiple garment images. Our MMTryon
mainly addresses two problems overlooked in prior literature: 1) Support of
multiple try-on items and dressing style. Existing methods are commonly designed
for single-item try-on tasks (e.g., upper/lower garments, dresses) and fall
short on customizing dressing styles (e.g., zipped/unzipped, tuck-in/tuck-out,
etc.). 2) Segmentation dependency. They further heavily rely on
category-specific segmentation models to identify the replacement regions, with
segmentation errors directly leading to significant artifacts in the try-on
results. For the first issue, our MMTryon introduces a novel multi-modality and
multi-reference attention mechanism to combine the garment information from
reference images and dressing-style information from text instructions.
Besides, to remove the segmentation dependency, MMTryon uses a parsing-free
garment encoder and leverages a novel scalable data generation pipeline to
convert existing VITON datasets to a form that allows MMTryon to be trained
without requiring any explicit segmentation. Extensive experiments on
high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's
superiority over existing SOTA methods both qualitatively and quantitatively.
Besides, MMTryon's impressive performance on multi-items and style-controllable
virtual try-on scenarios and its ability to try on any outfit in a large
variety of scenarios from any source image, opens up a new avenue for future
investigation in the fashion community. |
Introduces MMTryon, a multi-modal multi-reference virtual try-on framework generating high-quality compositional try-on results from text instructions and multiple garment images. |
Addresses limitations in existing VITON methods like single-item try-on, lack of dressing style customization, and dependence on segmentation models leading to artifacts. |
Leverages a multi-modality and multi-reference attention mechanism combining garment and dressing style information, employs a parsing-free garment encoder, and uses a scalable data generation pipeline to train without explicit segmentation. |
Outperforms SOTA methods qualitatively and quantitatively on high-resolution benchmarks and in-the-wild tests.
Demonstrates superior performance in multi-item, style-controllable try-on scenarios.
Offers flexibility in trying on outfits from diverse sources and scenarios. |
Data generation process limited by pretrained models, posing challenges for fine-grained details like cuffs and collars.
Future work may focus on fine-tuning large models to construct more detailed datasets for enhanced generation. |
virtual try-on, viton, multi-modal learning, compositional try-on, diffusion models |
2405.00313
Report |
Streamlining Image Editing with Layered Diffusion Brushes |
Peyman Gholami, Robert Xiao |
Denoising diffusion models have recently gained prominence as powerful tools
for a variety of image generation and manipulation tasks. Building on this, we
propose a novel tool for real-time editing of images that provides users with
fine-grained region-targeted supervision in addition to existing prompt-based
controls. Our novel editing technique, termed Layered Diffusion Brushes,
leverages prompt-guided and region-targeted alteration of intermediate
denoising steps, enabling precise modifications while maintaining the integrity
and context of the input image. We provide an editor based on Layered Diffusion
Brushes modifications, which incorporates well-known image editing concepts
such as layer masks, visibility toggles, and independent manipulation of
layers, regardless of their order. Our system renders a single edit on a
512x512 image within 140 ms using a high-end consumer GPU, enabling real-time
feedback and rapid exploration of candidate edits. We validated our method and
editing system through a user study involving both natural images (using
inversion) and generated images, showcasing its usability and effectiveness
compared to existing techniques such as InstructPix2Pix and Stable Diffusion
Inpainting for refining images. Our approach demonstrates efficacy across a
range of tasks, including object attribute adjustments, error correction, and
sequential prompt-based object placement and manipulation, demonstrating its
versatility and potential for enhancing creative workflows. |
This paper introduces Layered Diffusion Brushes, a novel real-time image editing tool for refining AI-generated images by making localized adjustments to specific regions defined by user-drawn masks. |
Existing AI image editing tools often lack the speed and precision for real-time, localized adjustments. This tool aims to fill this gap, providing artists and users with greater control over image manipulation. |
The method leverages Latent Diffusion Models (LDMs) by introducing targeted random noise patterns into the latent space during the reverse diffusion process. Users control the edits through masks, text prompts, and adjustable parameters like brush strength and the number of editing steps. A layering system allows for non-destructive, independent edits on different parts of the image. |
Layered Diffusion Brushes achieved significantly faster editing times compared to manual editing and other AI-based methods, enabling real-time feedback.
A user study indicated that Layered Diffusion Brushes was perceived as more usable and intuitive compared to InstructPix2Pix and SD-Inpainting.
The tool was found to be effective for tasks like object addition/removal, attribute modification, style mixing, and error correction, demonstrating its versatility in refining AI-generated images. |
Some aspects of the user interface, such as layer management and blend options, could be further improved based on user feedback.
Future work could explore incorporating advanced features like semantic guidance and integration with 3D models for even greater control and realism. |
diffusion models, image editing, artistic control, real-time editing, user interface |
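The mask-restricted editing described in the Layered Diffusion Brushes entry above can be illustrated with a simple blend. This is a minimal sketch, assuming the edit is applied as a per-element interpolation between an original and an edited latent; the name `blend_latents`, the `strength` scaling, and the nested-list latents are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]


def blend_latents(original: Grid, edited: Grid, mask: Grid,
                  strength: float = 1.0) -> Grid:
    """Keep `original` outside the mask and move toward `edited` inside it.

    `mask` holds values in [0, 1]; `strength` (a hypothetical brush
    parameter) scales how strongly the masked region is replaced.
    """
    blended = []
    for o_row, e_row, m_row in zip(original, edited, mask):
        row = []
        for o, e, m in zip(o_row, e_row, m_row):
            w = max(0.0, min(1.0, m * strength))  # clamp blend weight to [0, 1]
            row.append((1.0 - w) * o + w * e)
        blended.append(row)
    return blended
```

Because the unmasked region is copied through unchanged, stacking several such layers leaves each layer's edit confined to its own mask.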
2405.00293
Report |
MoPEFT: A Mixture-of-PEFTs for the Segment Anything Model |
Rajat Sahay, Andreas Savakis |
The emergence of foundation models, such as the Segment Anything Model (SAM),
has sparked interest in Parameter-Efficient Fine-Tuning (PEFT) methods that
tailor these large models to application domains outside their training data.
However, different PEFT techniques modify the representation of a model
differently, making it a non-trivial task to select the most appropriate method
for the domain of interest. We propose a new framework, Mixture-of-PEFTs
methods (MoPEFT), that is inspired by traditional Mixture-of-Experts (MoE)
methodologies and is utilized for fine-tuning SAM. Our MoPEFT framework
incorporates three different PEFT techniques as submodules and dynamically
learns to activate the ones that are best suited for a given data-task setup.
We test our method on the Segment Anything Model and show that MoPEFT
consistently outperforms other fine-tuning methods on the MESS benchmark. |
This paper introduces MoPEFT, a new framework inspired by Mixture-of-Experts, which dynamically activates specific Parameter-Efficient Fine-Tuning (PEFT) techniques based on the data and task. |
Fine-tuning large foundation models like SAM is computationally expensive. PEFT methods offer efficiency but their effectiveness varies. MoPEFT addresses this by selectively leveraging the strengths of different PEFT techniques. |
MoPEFT integrates LoRA, Prefix Tuning, and Adapters as submodules. A gating mechanism learns to favor the most suitable PEFT method for a given task, dynamically switching between them. |
MoPEFT consistently outperforms individual PEFT methods (LoRA, Prefix Tuning, Adapters) on the MESS benchmark across multiple domains.
The gating mechanism effectively learns to prefer different PEFT techniques for different datasets, demonstrating its adaptive capability.
Combining multiple PEFT techniques in MoPEFT often leads to better performance than the best-performing individual technique, suggesting synergistic effects. |
For brevity, the paper focuses primarily on three major domains from the MESS benchmark.
Further investigation is needed to fully understand the compounding effects observed when combining different PEFT methods. |
parameter-efficient fine-tuning, foundation models, segment anything model, mixture-of-experts, semantic segmentation |
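The gating idea in the MoPEFT entry above can be sketched as a softmax-weighted mixture over the outputs of the three PEFT submodules. This is a minimal sketch under assumptions: the summary does not specify the gate's exact form, so the softmax gate, the function names, and the plain-list feature vectors here are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_peft_outputs(outputs: Sequence[Sequence[float]],
                     gate_logits: Sequence[float]) -> List[float]:
    """Combine per-method outputs (e.g. LoRA, Prefix Tuning, Adapters).

    `outputs[k]` is the feature vector produced by the k-th PEFT submodule
    and `gate_logits[k]` is its learned (here: given) gate score.
    """
    weights = softmax(gate_logits)
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(dim)]
```

With equal logits the mixture is a plain average; as one logit grows, the output approaches that submodule's output alone, which is the sense in which a gate can "favor" one method.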
2405.00256
Report |
ASAM: Boosting Segment Anything Model with Adversarial Tuning |
Bo Li, Haoke Xiao, Lv Tang |
In the evolving landscape of computer vision, foundation models have emerged
as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks.
Among these, the Segment Anything Model (SAM) by Meta AI has distinguished
itself in image segmentation. However, SAM, like its counterparts, encounters
limitations in specific niche applications, prompting a quest for enhancement
strategies that do not compromise its inherent capabilities. This paper
introduces ASAM, a novel methodology that amplifies SAM's performance through
adversarial tuning. We harness the potential of natural adversarial examples,
inspired by their successful implementation in natural language processing. By
utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B
dataset, generating adversarial instances that are more representative of
natural variations rather than conventional imperceptible perturbations. Our
approach maintains the photorealism of adversarial examples and ensures
alignment with original mask annotations, thereby preserving the integrity of
the segmentation task. The fine-tuned ASAM demonstrates significant
improvements across a diverse range of segmentation tasks without necessitating
additional data or architectural modifications. The results of our extensive
evaluations confirm that ASAM establishes new benchmarks in segmentation tasks,
thereby contributing to the advancement of foundational models in computer
vision. Our project page is at https://asam2024.github.io/. |
Introduces ASAM, a method enhancing SAM's performance using adversarial tuning inspired by natural adversarial examples in NLP. |
To boost SAM's generalization ability without using extra data, changing its architecture, or hurting its zero-shot capabilities. |
Projects natural images onto a low-dimensional manifold via a generative model, optimizes the latent representation, and fine-tunes SAM with the generated adversarial examples. |
ASAM outperforms other SAM tuning methods on 14 diverse segmentation datasets.
ASAM maintains high image quality comparable to original images.
ASAM framework successfully enhances performance of another large vision foundation model, EfficientSAM. |
Lack of direct theoretical proof for the method's efficacy.
Exploration of ASAM's application to other vision tasks beyond segmentation. |
image segmentation, foundation models, adversarial tuning, stable diffusion, segment anything model (sam) |
2405.00196
Report |
Synthetic Image Verification in the Era of Generative AI: What Works and What Isn't There Yet |
Diangarti Tariang, Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, Luisa Verdoliva |
In this work we present an overview of approaches for the detection and
attribution of synthetic images and highlight their strengths and weaknesses.
We also point out and discuss hot topics in this field and outline promising
directions for future research. |
This paper presents an overview of methods for detecting and attributing synthetic images, highlighting their strengths, weaknesses, and future research directions. |
The rise of generative AI, which makes hyperrealistic synthetic images easy to create, raises a significant risk of disinformation and propaganda. Automated tools for detecting and attributing such images are crucial for societal protection. |
The paper reviews various data-driven methods, including those leveraging CNNs, transformers, and vision-language models. It also discusses techniques exploiting forensic cues, like low-level artifacts in the frequency domain and high-level semantic inconsistencies. |
Diffusion model-generated images are harder to detect than those from GANs.
Generalization remains a challenge, especially when there's a mismatch between training and test data.
While attribution in closed-set scenarios is reliable, open-set attribution requires further research. |
Most research treats detection and attribution as separate problems, while a joint approach could be more effective.
Calibration of detectors for real-world scenarios, where a fixed threshold may not be suitable, needs more attention. |
synthetic image detection, image attribution, generative ai, deep learning, digital forensics |
2404.19760
Report |
Lightplane: Highly-Scalable Components for Neural 3D Fields |
Ang Cao, Justin Johnson, Andrea Vedaldi, David Novotny |
Contemporary 3D research, particularly in reconstruction and generation,
heavily relies on 2D images for inputs or supervision. However, current designs
for these 2D-3D mappings are memory-intensive, posing a significant bottleneck
for existing methods and hindering new applications. In response, we propose a
pair of highly scalable components for 3D neural fields: Lightplane Render and
Splatter, which significantly reduce memory usage in 2D-3D mapping. These
innovations enable the processing of vastly more and higher resolution images
with small memory and computational costs. We demonstrate their utility in
various applications, from benefiting single-scene optimization with
image-level losses to realizing a versatile pipeline for dramatically scaling
3D reconstruction and generation. Code:
https://github.com/facebookresearch/lightplane. |
This paper introduces Lightplane, a framework with two highly scalable components, Renderer and Splatter, for efficiently mapping information between 2D images and neural 3D fields using hashed 3D representations like voxel grids and triplanes. |
Existing methods for 2D-3D mapping in neural 3D fields are memory-intensive, limiting the use of image-level losses, the number of input views, and the scalability of 3D models. Lightplane addresses this bottleneck by significantly reducing memory usage. |
Lightplane leverages a hybrid 3D representation combining hashed structures (e.g., voxel grids, triplanes) and MLPs. It fuses operations along rays instead of processing individual 3D points, recomputes intermediate values during backpropagation, and leverages the GPU memory hierarchy for speed. |
Lightplane achieves up to four orders of magnitude reduction in memory consumption compared to autograd methods while maintaining comparable speed.
It enables the use of image-level losses on high-resolution renders for single-scene optimization.
Lightplane significantly boosts the scalability of 3D reconstruction and generation models, demonstrated by improvements in LRM and a novel viewset diffusion model for CO3Dv2. |
Current implementation shows a performance gap between different 3D hash representations (voxel grids and triplanes).
Rendering and splatting a large number of images is still time-consuming, despite being comparable in speed to existing methods. |
neural 3d fields, 3d reconstruction, 3d generation, memory efficiency, scalability |
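The Lightplane entry above mentions fusing operations along rays rather than processing individual 3D points. A minimal sketch of why that helps, using standard emission-absorption volume rendering (a textbook formulation, not the Lightplane kernels themselves): the loop below only carries two running scalars per ray, so nothing per-sample needs to be stored. The function name and the scalar "color" are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def render_ray(densities: Sequence[float], colors: Sequence[float],
               step: float) -> Tuple[float, float]:
    """Accumulate color along one ray with O(1) running state.

    Returns (pixel_value, remaining_transmittance). `densities[i]` and
    `colors[i]` are the field values at the i-th sample along the ray.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = 0.0
    for sigma, color in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)   # opacity of this segment
        pixel += transmittance * alpha * color  # contribution seen by camera
        transmittance *= 1.0 - alpha            # light left for later samples
    return pixel, transmittance
```

In a memory-saving design of the kind the entry describes, per-sample values such as `alpha` would be recomputed during the backward pass instead of being cached.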
2404.19759
Report |
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model |
Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, Yansong Tang |
This work introduces MotionLCM, extending controllable motion generation to a
real-time level. Existing methods for spatial control in text-conditioned
motion generation suffer from significant runtime inefficiency. To address this
issue, we first propose the motion latent consistency model (MotionLCM) for
motion generation, building upon the latent diffusion model (MLD). By employing
one-step (or few-step) inference, we further improve the runtime efficiency of
the motion latent diffusion model for motion generation. To ensure effective
controllability, we incorporate a motion ControlNet within the latent space of
MotionLCM and enable explicit control signals (e.g., pelvis trajectory) in the
vanilla motion space to control the generation process directly, similar to
controlling other latent-free diffusion models for motion generation. By
employing these techniques, our approach can generate human motions with text
and control signals in real-time. Experimental results demonstrate the
remarkable generation and controlling capabilities of MotionLCM while
maintaining real-time runtime efficiency. |
This work introduces MotionLCM, a novel model that enables real-time controllable motion generation by combining latent consistency distillation and a motion ControlNet. |
Existing methods for controllable text-to-motion generation suffer from significant runtime inefficiency, hindering their applicability in real-time scenarios. MotionLCM addresses this issue by significantly accelerating motion generation without compromising quality. |
The methodology consists of two key components: 1) Motion Latent Consistency Distillation: A consistency model is distilled from a pre-trained motion latent diffusion model to achieve efficient one-step or few-step motion generation. 2) Controllable Motion Generation in Latent Space: A motion ControlNet is incorporated into MotionLCM to enable control over motion generation using spatial signals like pelvis trajectory. Explicit control supervision is applied in the motion space to enhance controllability. |
MotionLCM achieves real-time inference speed (~30ms per motion sequence), outperforming prior diffusion-based methods by a significant margin.
Despite using only one-step inference, MotionLCM achieves comparable or even superior performance compared to existing state-of-the-art methods.
The introduction of motion ControlNet and control supervision in the motion space allows MotionLCM to achieve high-quality controllable motion generation. |
While MotionLCM excels in terms of speed and quality trade-off, methods using guided diffusion still outperform it in motion control performance, suggesting room for improvement.
The paper acknowledges the issue of potential physical implausibility in generated motions and limitations in handling noisy or anomalous data, leaving these as future research directions. |
motion generation, text-to-motion, motion control, latent consistency models, controlnet |
2404.19758
Report |
Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting |
Paul Engstler, Andrea Vedaldi, Iro Laina, Christian Rupprecht |
3D scene generation has quickly become a challenging new research direction,
fueled by consistent improvements of 2D generative diffusion models. Most prior
work in this area generates scenes by iteratively stitching newly generated
frames with existing geometry. These works often depend on pre-trained
monocular depth estimators to lift the generated images into 3D, fusing them
with the existing scene representation. These approaches are then often
evaluated via a text metric, measuring the similarity between the generated
images and a given text prompt. In this work, we make two fundamental
contributions to the field of 3D scene generation. First, we note that lifting
images to 3D with a monocular depth estimation model is suboptimal as it
ignores the geometry of the existing scene. We thus introduce a novel depth
completion model, trained via teacher distillation and self-training to learn
the 3D fusion process, resulting in improved geometric coherence of the scene.
Second, we introduce a new benchmarking scheme for scene generation methods
that is based on ground truth geometry, and thus measures the quality of the
structure of the scene. |
This paper presents a novel depth completion model for 3D scene generation and a new benchmark for evaluating the geometric quality of generated scenes. |
Current 3D scene generation methods often produce geometrically inconsistent scenes and rely on image-based metrics for evaluation, neglecting the underlying geometry. |
The authors propose a depth completion model trained via teacher distillation and self-training to learn the 3D fusion process. They also introduce a benchmark based on ground truth geometry to evaluate the depth accuracy of generated scenes. |
The proposed depth completion model significantly reduces geometric artifacts compared to existing methods.
The new benchmark effectively uncovers geometric inconsistencies in existing scene generation approaches.
The authors demonstrate their approach in a 360-degree scene generation pipeline, showcasing its ability to create immersive and geometrically consistent scenes. |
The training dataset for the depth completion model is limited to specific scene types.
The evaluation benchmark relies on the availability of ground truth depth data. |
scene generation, novel view synthesis, 3d geometry, depth completion, benchmarking |
2404.19753
Report |
DOCCI: Descriptions of Connected and Contrasting Images |
Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge |
Vision-language datasets are vital for both text-to-image (T2I) and
image-to-text (I2T) research. However, current datasets lack descriptions with
fine-grained detail that would allow for richer associations to be learned by
models. To fill the gap, we introduce Descriptions of Connected and Contrasting
Images (DOCCI), a dataset with long, human-annotated English descriptions for
15k images that were taken, curated and donated by a single researcher intent
on capturing key challenges such as spatial relations, counting, text
rendering, world knowledge, and more. We instruct human annotators to create
comprehensive descriptions for each image; these average 136 words in length
and are crafted to clearly distinguish each image from those that are related
or similar. Each description is highly compositional and typically encompasses
multiple challenges. Through both quantitative and qualitative analyses, we
demonstrate that DOCCI serves as an effective training resource for
image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or
superior results compared to highly-performant larger models like LLaVA-1.5 7B
and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for
text-to-image generation, highlighting the limitations of current text-to-image
models in capturing long descriptions and fine details. |
Introduces DOCCI, a vision-language dataset with 15k images and detailed human-annotated descriptions (avg. 136 words), focusing on fine-grained details and challenging aspects for T2I models. |
Addresses limitations of existing datasets that lack descriptions with fine-grained detail needed for models to learn richer associations, hindering research on T2I models and their real-world applications. |
Images curated to include contrastive sets and test specific T2I challenges. Three-stage annotation process ensures detailed and high-quality descriptions, with rigorous quality control. |
DOCCI serves as an effective training resource for I2T generation, as demonstrated by improved performance of a PaLI 5B model finetuned on DOCCI.
DOCCI highlights limitations of current T2I models, particularly in handling long descriptions, fine details, and challenges like spatial relationships, counting, and text rendering.
Reveals discrepancies between automatic metrics (e.g., FID, CLIPScore) and human evaluation for long descriptions, emphasizing the need for better metrics. |
DOCCI images are sourced from a single photographer, potentially introducing bias.
Lack of reliable automatic metrics for evaluating long, detailed image descriptions. |
vision-language, text-to-image generation, image-to-text generation, dataset, evaluation |
2404.19752
Report |
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation |
Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui |
Existing automatic captioning methods for visual content face challenges such
as lack of detail, content hallucination, and poor instruction following. In
this work, we propose VisualFactChecker (VFC), a flexible training-free
pipeline that generates high-fidelity and detailed captions for both 2D images
and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text
captioning models propose multiple initial captions; 2) verification, where a
large language model (LLM) utilizes tools such as object detection and VQA
models to fact-check proposed captions; 3) captioning, where an LLM generates
the final caption by summarizing caption proposals and the fact check
verification results. In this step, VFC can flexibly generate captions in
various styles following complex instructions. We conduct comprehensive
captioning evaluations using four metrics: 1) CLIP-Score for image-text
similarity; 2) CLIP-Image-Score for measuring the image-image similarity
between the original and the reconstructed image generated by a text-to-image
model using the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V
for fine-grained evaluation. Evaluation results show that VFC outperforms
state-of-the-art open-sourced captioning methods for 2D images on the COCO
dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by
combining open-source models into a pipeline, we can attain captioning
capability comparable to proprietary models such as GPT-4V, despite being over
10x smaller in model size. |
This paper introduces VisualFactChecker (VFC), a training-free pipeline for generating detailed and accurate captions for both 2D images and 3D objects. VFC addresses limitations of existing captioning methods, such as hallucination and lack of detail, by combining multiple models and a fact-checking step. |
Accurate and detailed image captioning is crucial for various applications, including image retrieval, accessibility, and understanding visual content. Existing methods often produce captions that are either too short or hallucinate details not present in the image. VFC aims to bridge this gap by ensuring both detail and accuracy in generated captions. |
VFC uses a three-step process: 1) Proposal: Multiple captioning models generate initial captions. 2) Verification: An LLM employs object detection and VQA models to verify the proposed captions, reducing hallucinations. 3) Captioning: The LLM synthesizes the verified information into a final detailed and accurate caption. For 3D objects, VFC generates captions for multiple views and combines them. |
VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset, as measured by CLIP-Score and human evaluation.
The paper proposes a novel caption evaluation metric called CLIP-Image-Score, which compares the input image with a reconstructed image generated from the caption using a text-to-image model. This helps assess caption fidelity and detect hallucinations.
The study demonstrates that combining open-source models in a pipeline with an LLM can achieve captioning performance comparable to proprietary models like GPT-4V. |
One limitation is the reliance on multiple models, which could increase computational cost and complexity.
The current fact-checking process still has room for improvement, particularly in automatically determining which components to use for optimal results. |
image captioning, hallucination mitigation, large language models, multimodal learning, 3d object captioning |
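The CLIP-Image-Score described in the VisualFactChecker entry above compares the original image with an image regenerated from the caption. A minimal sketch, assuming the score is the cosine similarity of the two image embeddings (the usual choice for CLIP-based scores); producing the embeddings requires a CLIP image encoder and a text-to-image model, which are outside this sketch, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_image_score(original_embedding: Sequence[float],
                     reconstructed_embedding: Sequence[float]) -> float:
    """Similarity between the input image and the caption-regenerated image.

    Both arguments are assumed to be embeddings from the same image encoder.
    """
    return cosine_similarity(original_embedding, reconstructed_embedding)
```

A caption that hallucinates objects should yield a regenerated image that drifts from the original, and therefore a lower score.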
2404.19702
Report |
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting |
Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, Zexiang Xu |
We propose GS-LRM, a scalable large reconstruction model that can predict
high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23
seconds on a single A100 GPU. Our model features a very simple transformer-based
architecture; we patchify input posed images, pass the concatenated multi-view
image tokens through a sequence of transformer blocks, and decode final
per-pixel Gaussian parameters directly from these tokens for differentiable
rendering. In contrast to previous LRMs that can only reconstruct objects, by
predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large
variations in scale and complexity. We show that our model can work on both
object and scene captures by training it on Objaverse and RealEstate10K
respectively. In both scenarios, the models outperform state-of-the-art
baselines by a wide margin. We also demonstrate applications of our model in
downstream 3D generation tasks. Our project webpage is available at:
https://sai-bi.github.io/project/gs-lrm/ . |
This paper proposes GS-LRM, a scalable transformer-based Large Reconstruction Model (LRM) that predicts 3D Gaussian primitives from sparse posed images, enabling fast and high-quality 3D reconstruction for both objects and scenes. |
Existing LRMs rely on triplane NeRF representation, which suffers from limitations in resolution, rendering speed, and scalability to large scenes. GS-LRM overcomes these limitations by directly predicting per-pixel Gaussians, leading to improved quality, speed, and scalability. |
GS-LRM uses a simple transformer architecture: input posed images are patchified, processed by transformer blocks, and decoded into per-pixel Gaussian parameters. It is trained on Objaverse and RealEstate10K datasets for object and scene reconstruction, respectively. |
GS-LRM achieves state-of-the-art reconstruction quality, outperforming previous methods by a large margin (4dB PSNR improvement for objects, 2.2dB for scenes).
The model is fast, reconstructing a scene in ~0.23 seconds on a single A100 GPU.
GS-LRM demonstrates strong performance in downstream 3D generation tasks, such as text-to-3D and image-to-3D. |
The current model has a limited working resolution of 512x904 and requires known camera parameters.
Future work will focus on increasing the resolution, handling unknown camera poses, and improving the reconstruction of unseen regions. |
large reconstruction models, 3d reconstruction, gaussian splatting, transformers, sparse-view reconstruction |
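To put the PSNR gains quoted in the GS-LRM entry above in context, the standard definition PSNR = 10 log10(MAX^2 / MSE) can be inverted to see what a gain in dB means for mean squared error. A minimal sketch; the helper names are illustrative and the MSE values are made-up inputs, not results from the paper.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0.0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a PSNR improvement of `gain_db`."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition a 4 dB gain corresponds to roughly 2.5x lower MSE, and a 2.2 dB gain to roughly 1.66x lower MSE.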
2404.19696
Report |
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners |
Chun Feng, Joy Hsu, Weiyu Liu, Jiajun Wu |
3D visual grounding is a challenging task that often requires direct and
dense supervision, notably the semantic label for each object in the scene. In
this paper, we instead study the naturally supervised setting that learns from
only 3D scene and QA pairs, where prior works underperform. We propose the
Language-Regularized Concept Learner (LARC), which uses constraints from
language as regularization to significantly improve the accuracy of
neuro-symbolic concept learners in the naturally supervised setting. Our
approach is based on two core insights: the first is that language constraints
(e.g., a word's relation to another) can serve as effective regularization for
structured representations in neuro-symbolic models; the second is that we can
query large language models to distill such constraints from language
properties. We show that LARC improves performance of prior works in naturally
supervised 3D visual grounding, and demonstrates a wide range of 3D visual
reasoning capabilities, from zero-shot composition to data efficiency and
transferability. Our method represents a promising step towards regularizing
structured visual reasoning frameworks with language-based priors, for learning
in settings without dense supervision. |
Proposes Language-Regularized Concept Learner (LARC), a neuro-symbolic model that uses language constraints as regularization for 3D visual grounding in naturally supervised settings. |
Addresses the limitations of current 3D visual grounding models that rely on dense supervision (e.g., object labels) which is expensive and difficult to obtain. |
LARC leverages LLMs to distill language constraints (symmetry, exclusivity, synonymity) and applies these constraints as regularization losses and data augmentation during training. |
LARC significantly outperforms prior neuro-symbolic methods and achieves comparable performance to end-to-end methods in naturally supervised 3D referring expression comprehension.
LARC demonstrates strong zero-shot generalization to unseen concepts via language composition rules.
LARC exhibits superior data efficiency and transferability to new datasets compared to previous approaches. |
Reliance on object detectors like VoteNet introduces noise in bounding box predictions.
Exploiting a wider range of language priors beyond the three explored could further enhance performance. |
3d visual grounding, neuro-symbolic learning, natural language supervision, language constraints, referring expression comprehension |
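The language constraints named in the Language-Regularized Concept Learner entry above (such as symmetry and exclusivity) can be turned into simple penalty terms. This is a minimal sketch with hypothetical loss forms chosen for illustration; the summary does not give the actual formulas, and the function names and probability-table inputs are assumptions.

```python
from typing import Sequence


def symmetry_loss(relation: Sequence[Sequence[float]]) -> float:
    """Penalize a symmetric relation (e.g. "near") scoring (i, j) != (j, i).

    `relation[i][j]` is the predicted probability that the relation holds
    from object i to object j.
    """
    n = len(relation)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (relation[i][j] - relation[j][i]) ** 2
            pairs += 1
    return total / pairs


def exclusivity_loss(prob_a: Sequence[float], prob_b: Sequence[float]) -> float:
    """Penalize two mutually exclusive concepts firing on the same object."""
    if not prob_a:
        return 0.0
    return sum(a * b for a, b in zip(prob_a, prob_b)) / len(prob_a)
```

Terms like these would be added to the main grounding loss with small weights, so that they regularize the learned concepts rather than dominate training.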
2404.19567
Report |
Causal Perception Inspired Representation Learning for Trustworthy Image Quality Assessment |
Lei Wang, Desen Yuan |
Despite great success in modeling visual perception, deep neural network
based image quality assessment (IQA) still remains unreliable in real-world
applications due to its vulnerability to adversarial perturbations and the
inexplicit black-box structure. In this paper, we propose to build a
trustworthy IQA model via Causal Perception inspired Representation Learning
(CPRL), and a score reflection attack method for IQA models. More specifically,
we assume that each image is composed of Causal Perception Representation (CPR)
and non-causal perception representation (N-CPR). CPR serves as the causation
of the subjective quality label, which is invariant to the imperceptible
adversarial perturbations. Inversely, N-CPR presents spurious associations with
the subjective quality label, which may significantly change with the
adversarial perturbations. To extract the CPR from each input image, we develop
a soft ranking based channel-wise activation function to mediate the causally
sufficient (beneficial for high prediction accuracy) and necessary (beneficial
for high robustness) deep features, and employ an intervention-based minimax
game for optimization. Experiments on four benchmark databases show that the
proposed CPRL method outperforms many state-of-the-art adversarial defense
methods and provides explicit model interpretation. |
This paper proposes Causal Perception inspired Representation Learning (CPRL) to enhance the trustworthiness and adversarial robustness of image quality assessment (IQA) models. |
Existing deep learning-based IQA models are vulnerable to adversarial perturbations, highlighting their unreliability in real-world applications. This work addresses this limitation by focusing on the causal relationship between image features and perceived quality. |
The proposed CPRL method introduces a novel channel-wise activation function within a causal framework. This function, based on soft ranking and a minimax game training strategy, aims to extract causal perception representations (CPR) from images while mitigating the influence of non-causal features. |
CPRL significantly improves the robustness of IQA models against adversarial attacks like FGSM and PGD, as demonstrated by higher SRCC and PLCC values compared to existing methods.
The learned representations exhibit greater stability in channel activations for adversarial examples, indicating the effectiveness of CPRL in capturing causal features.
CPRL also achieves competitive performance on clean images, suggesting its capability to improve both robustness and accuracy in IQA. |
The training process of CPRL requires additional optimization steps, leading to higher computational overhead compared to conventional IQA models.
The intervention method based on prediction might not be perfectly accurate and has room for further improvement in future work. |
image quality assessment, adversarial robustness, causal inference, representation learning, trustworthy ai |
2404.19525
Report |
MicroDreamer: Zero-shot 3D Generation in $\sim$20 Seconds by Score-based Iterative Reconstruction |
Luxi Chen, Zhengyi Wang, Chongxuan Li, Tingting Gao, Hang Su, Jun Zhu |
Optimization-based approaches, such as score distillation sampling (SDS),
show promise in zero-shot 3D generation but suffer from low efficiency,
primarily due to the high number of function evaluations (NFEs) required for
each sample. In this paper, we introduce score-based iterative reconstruction
(SIR), an efficient and general algorithm for 3D generation with a multi-view
score-based diffusion model. Given the images produced by the diffusion model,
SIR reduces NFEs by repeatedly optimizing 3D parameters, unlike the single
optimization in SDS, mimicking the 3D reconstruction process. With other
improvements including optimization in the pixel space, we present an efficient
approach called MicroDreamer that generally applies to various 3D
representations and 3D generation tasks. In particular, retaining a comparable
performance, MicroDreamer is 5-20 times faster than SDS in generating neural
radiance fields and takes about 20 seconds to generate meshes from 3D Gaussian
splatting on a single A100 GPU, halving the time of the fastest zero-shot
baseline, DreamGaussian. Our code is available at
https://github.com/ML-GSAI/MicroDreamer. |
This paper proposes score-based iterative reconstruction (SIR), an efficient and general algorithm for zero-shot 3D generation using multi-view diffusion models. |
Existing optimization-based 3D generation methods, while promising, suffer from low efficiency due to high function evaluation counts and optimization within the latent space. |
SIR mimics the 3D reconstruction process by repeatedly optimizing 3D parameters given diffusion model outputs, reducing function evaluations. It also enables optimization directly in pixel space for further efficiency gains. |
SIR achieves a 5-20 times speedup for NeRF generation compared to score distillation sampling.
The proposed MicroDreamer system generates high-quality meshes from 3D Gaussian splatting in about 20 seconds.
MicroDreamer matches the speed of feed-forward methods while remaining zero-shot, achieving competitive generation quality. |
The quality of generated objects is limited by the quality of the multi-view diffusion model outputs.
Further efficiency improvements may be possible with alternative sampling models or consistency models. |
3d generation, diffusion model, zero-shot learning, score distillation sampling, multi-view diffusion |
2404.19475
Report |
TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image Generation with Diffusion Models |
Teng Zhou, Yongchuan Tang |
Diffusion models have emerged as effective tools for generating diverse and
high-quality content. However, their capability in high-resolution image
generation, particularly for panoramic images, still faces challenges such as
visible seams and incoherent transitions. In this paper, we propose
TwinDiffusion, an optimized framework designed to address these challenges
through two key innovations: Crop Fusion for quality enhancement and Cross
Sampling for efficiency optimization. We introduce a training-free optimizing
stage to refine the similarity of the adjacent image areas, as well as an
interleaving sampling strategy to yield dynamic patches during the cropping
process. A comprehensive evaluation is conducted to compare TwinDiffusion with
the existing methods, considering factors including coherence, fidelity,
compatibility, and efficiency. The results demonstrate the superior performance
of our approach in generating seamless and coherent panoramas, setting a new
standard in quality and efficiency for panoramic image generation. |
The paper proposes TwinDiffusion, an optimized framework for generating high-resolution panoramic images with diffusion models, enhancing coherence and efficiency. |
Existing methods struggle to generate seamless and coherent panoramic images, often exhibiting visible seams and incoherent transitions, especially at high resolution. |
TwinDiffusion introduces two key innovations: (1) Crop Fusion: a training-free optimization stage to refine the similarity of adjacent image areas, ensuring smoother transitions. (2) Cross Sampling: an interleaving sampling strategy using dynamic strides during cropping, maintaining quality while improving efficiency. |
TwinDiffusion generates significantly more coherent panoramic images with fewer visible seams compared to baselines.
Quantitative evaluation shows superior performance across various metrics, including LPIPS, DISTS, FID, IS, CLIP, and CLIP-aesthetic, without compromising efficiency.
The paper analyzes the impact of key factors like optimization timestep, adjacent control, view stride, and cross stride on the quality-efficiency trade-off. |
The method might struggle to maintain spatial logic in the overall panorama layout while focusing on local coherence.
Future work includes extending the framework to video synthesis and virtual reality applications. |
panorama generation, diffusion models, image coherence, efficient sampling, high-resolution |
2404.19417
Report |
Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World |
Wen Yin, Jian Lou, Pan Zhou, Yulai Xie, Dan Feng, Yuhua Sun, Tailai Zhang, Lichao Sun |
Backdoor attacks have been well-studied in visible light object detection
(VLOD) in recent years. However, VLOD cannot work effectively in dark and
temperature-sensitive scenarios. Instead, thermal infrared object detection
(TIOD) is the most accessible and practical in such environments. In this
paper, our team is the first to investigate the security vulnerabilities
associated with TIOD in the context of backdoor attacks, spanning both the
digital and physical realms. We introduce two novel types of backdoor attacks
on TIOD, each offering unique capabilities: Object-affecting Attack and
Range-affecting Attack. We conduct a comprehensive analysis of key factors
influencing trigger design, which include temperature, size, material, and
concealment. These factors, especially temperature, significantly impact the
efficacy of backdoor attacks on TIOD. A thorough understanding of these factors
will serve as a foundation for designing physical triggers and temperature
controlling experiments. Our study includes extensive experiments conducted in
both digital and physical environments. In the digital realm, we evaluate our
approach using benchmark datasets for TIOD, achieving an Attack Success Rate
(ASR) of up to 98.21%. In the physical realm, we test our approach in two
real-world settings: a traffic intersection and a parking lot, using a thermal
infrared camera. Here, we attain an ASR of up to 98.38%. |
This paper presents the first study on backdoor attacks against Thermal Infrared Object Detection (TIOD), highlighting vulnerabilities in both digital and physical environments. |
TIOD is increasingly critical in various applications, including security monitoring and autonomous driving, making its security crucial. |
The authors propose two novel backdoor attacks: Object-affecting Attack (OAA) and Range-affecting Attack (RAA), both leveraging temperature manipulation in trigger design. |
Digital experiments demonstrate up to 98.21% attack success rate (ASR) across different parameters.
Physical world tests in traffic intersection and parking lot scenarios achieve up to 98.38% ASR.
Evaluations of potential countermeasures (pruning, fine-pruning, Neural Cleanse) show limited effectiveness. |
The study primarily focuses on attacking cars; future work could explore vulnerabilities in other object classes.
Further investigation into more robust defense mechanisms specifically designed for TIOD backdoor attacks is needed. |
backdoor attacks, thermal infrared object detection, security vulnerability, temperature modulated triggering, physical world attacks |
2404.19227
Report |
Espresso: Robust Concept Filtering in Text-to-Image Models |
Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan |
Diffusion-based text-to-image (T2I) models generate high-fidelity images for
given textual prompts. They are trained on large datasets scraped from the
Internet, potentially containing unacceptable concepts (e.g., copyright
infringing or unsafe). Retraining T2I models after filtering out unacceptable
concepts in the training data is inefficient and degrades utility. Hence, there
is a need for concept removal techniques (CRTs) which are effective in removing
unacceptable concepts, utility-preserving on acceptable concepts, and robust
against evasion with adversarial prompts. None of the prior filtering and
fine-tuning CRTs satisfy all these requirements simultaneously.
We introduce Espresso, the first robust concept filter based on Contrastive
Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by
projecting the generated image's embedding onto the vector connecting
unacceptable and acceptable concepts in the joint text-image embedding space.
This ensures robustness by restricting the adversary to adding noise only along
this vector, in the direction of the acceptable concept. Further fine-tuning
Espresso to separate embeddings of acceptable and unacceptable concepts, while
preserving their pairing with image embeddings, ensures both effectiveness and
utility. We evaluate Espresso on eleven concepts to show that it is effective
(~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93%
normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on
adversarial prompts for unacceptable concepts). Finally, we present theoretical
bounds for the certified robustness of Espresso against adversarial prompts,
and an empirical analysis. |
Espresso is a robust content filter for text-to-image (T2I) models that leverages CLIP embeddings of both unacceptable and acceptable concepts to identify and remove undesirable content from generated images. |
T2I models, trained on vast unfiltered internet data, often memorize and generate images containing unacceptable concepts (e.g., copyright infringement, inappropriate content). Existing concept removal techniques are either ineffective, negatively impact utility, or lack robustness against adversarial prompts. |
Espresso utilizes a CLIP-based classifier that projects the image embedding onto the vector connecting text embeddings of acceptable and unacceptable concepts. This restricts adversaries to manipulating prompts only along this vector. Further fine-tuning enhances effectiveness and utility by maximizing separation between text embeddings while preserving their pairing with image embeddings. |
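The projection step described above can be illustrated with a small sketch. It assumes the embeddings are plain lists of floats (real CLIP embeddings would come from an image and text encoder), that the two concept embeddings are distinct, and that a zero threshold separates the classes. The function names and toy vectors are hypothetical and are not Espresso's actual implementation.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def projection_score(image_emb, unacceptable_emb, acceptable_emb):
    """Signed position of the image embedding along the axis that runs from the
    acceptable concept to the unacceptable concept, measured from the midpoint.
    Positive values lie on the unacceptable side."""
    img = unit(image_emb)
    u = unit(unacceptable_emb)
    a = unit(acceptable_emb)
    axis = unit([ui - ai for ui, ai in zip(u, a)])  # assumes u != a
    midpoint = [(ui + ai) / 2.0 for ui, ai in zip(u, a)]
    return dot([i - m for i, m in zip(img, midpoint)], axis)


def is_unacceptable(image_emb, unacceptable_emb, acceptable_emb, threshold=0.0):
    """Flag the image when its projection falls on the unacceptable side."""
    return projection_score(image_emb, unacceptable_emb, acceptable_emb) > threshold


# Toy 3-D embeddings standing in for CLIP vectors.
unacceptable = [1.0, 0.0, 0.0]
acceptable = [0.0, 1.0, 0.0]
flagged = is_unacceptable([0.9, 0.1, 0.2], unacceptable, acceptable)
passed = is_unacceptable([0.1, 0.9, 0.2], unacceptable, acceptable)
```

For unit-norm embeddings this score equals the difference between the image's cosine similarity to the unacceptable concept and to the acceptable concept, divided by the length of the connecting vector, so only movement along that one axis can change the decision.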
Espresso achieves high effectiveness with low CLIP accuracy (~5%) on unacceptable concepts.
It generally preserves utility with high normalized CLIP score (~93%) on acceptable concepts.
It demonstrates robustness against various attacks with low CLIP accuracy (~4%) on adversarial prompts. |
The certified robustness bound, while providing some guarantees, is loose and can be improved.
Exploring the design of new attacks specifically targeting Espresso and utilizing adversarial training to further enhance its robustness is crucial. |
text-to-image, concept removal, robustness, clip, adversarial prompts |
2404.19204
Report |
NeRF-Insert: 3D Local Editing with Multimodal Control Signals |
Benet Oriol Sabat, Alessandro Achille, Matthew Trager, Stefano Soatto |
We propose NeRF-Insert, a NeRF editing framework that allows users to make
high-quality local edits with a flexible level of control. Unlike previous work
that relied on image-to-image models, we cast scene editing as an in-painting
problem, which encourages the global structure of the scene to be preserved.
Moreover, while most existing methods use only textual prompts to condition
edits, our framework accepts a combination of inputs of different modalities as
reference. More precisely, a user may provide a combination of textual and
visual inputs including images, CAD models, and binary image masks for
specifying a 3D region. We use generic image generation models to in-paint the
scene from multiple viewpoints, and lift the local edits to a 3D-consistent
NeRF edit. Compared to previous methods, our results show better visual quality
and also maintain stronger consistency with the original NeRF. |
Presents NeRF-Insert, a framework for making local edits to NeRFs with flexible control using textual prompts, reference images, and 3D region specification (via masks or CAD models). |
Addresses limitations of existing NeRF editing methods that struggle with local edits, often impacting the global scene structure and offering limited control over the editing process. |
Utilizes a visual hull for 3D region definition, employs text-guided or image-guided inpainting (Stable Diffusion, PaintByExample), and introduces a novel loss term to constrain edits within the specified region. |
Enables high-quality local edits with various control levels, including object insertion and scene modification.
Demonstrates superior performance compared to previous methods (e.g., Instruct-NeRF2NeRF) in terms of edit quality and local consistency.
Shows that image-guided inpainting often surpasses text-guided inpainting for complex prompts. |
Suffers from artifacts similar to early SDS-based text-to-3D models (e.g., noise, inconsistency).
Manual mask drawing can be challenging without a dedicated interface, and mesh/CAD models may not always be available. |
3d editing, nerf, inpainting, diffusion models, visual hull |
2404.19149
Report |
SAGS: Structure-Aware 3D Gaussian Splatting |
Evangelos Ververas, Rolandos Alexandros Potamias, Jifei Song, Jiankang Deng, Stefanos Zafeiriou |
Following the advent of NeRFs, 3D Gaussian Splatting (3D-GS) has paved the
way to real-time neural rendering overcoming the computational burden of
volumetric methods. Following the pioneering work of 3D-GS, several methods
have attempted to achieve compressible and high-fidelity performance
alternatives. However, by employing a geometry-agnostic optimization scheme,
these methods neglect the inherent 3D structure of the scene, thereby
restricting the expressivity and the quality of the representation, resulting
in various floating points and artifacts. In this work, we propose a
structure-aware Gaussian Splatting method (SAGS) that implicitly encodes the
geometry of the scene, which translates to state-of-the-art rendering performance
and reduced storage requirements on benchmark novel-view synthesis datasets.
SAGS is founded on a local-global graph representation that facilitates the
learning of complex scenes and enforces meaningful point displacements that
preserve the scene's geometry. Additionally, we introduce a lightweight version
of SAGS, using a simple yet effective mid-point interpolation scheme, which
showcases a compact representation of the scene with up to 24$\times$ size
reduction without the reliance on any compression strategies. Extensive
experiments across multiple benchmark datasets demonstrate the superiority of
SAGS compared to state-of-the-art 3D-GS methods under both rendering quality
and model size. Besides, we demonstrate that our structure-aware method can
effectively mitigate floating artifacts and irregular distortions of previous
methods while obtaining precise depth maps. Project page
https://eververas.github.io/SAGS/. |
This paper introduces SAGS, a structure-aware 3D Gaussian Splatting method for novel view synthesis, that leverages local and global structural information of the scene to improve rendering quality and reduce storage requirements. |
Current 3D Gaussian Splatting methods optimize Gaussian attributes independently, neglecting inherent 3D structure, leading to reduced quality and increased storage requirements. SAGS addresses this by incorporating structural inductive biases. |
SAGS utilizes a curvature-aware densification step to augment the point cloud, followed by a structure-aware encoder based on graph neural networks to learn local-global features for each point. These features are then decoded into Gaussian attributes, including point displacements, ensuring structure preservation during optimization. |
SAGS outperforms state-of-the-art 3D-GS methods in terms of rendering quality on benchmark datasets.
SAGS effectively mitigates floating artifacts and preserves scene geometry, resulting in more accurate depth maps.
SAGS significantly reduces storage requirements (up to 24x with SAGS-Lite) without sacrificing rendering speed. |
SAGS-Lite, while compact, may lack some sharp details compared to the full SAGS model.
Further exploration of alternative graph neural network architectures or point cloud processing techniques could further enhance performance. |
novel view synthesis, 3d gaussian splatting, graph neural networks, structure-aware, point cloud processing |
2404.19110
Report |
EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars |
Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, Maja Pantic |
Head avatars animated by visual signals have gained popularity, particularly
in cross-driving synthesis where the driver differs from the animated
character, a challenging but highly practical approach. The recently presented
MegaPortraits model has demonstrated state-of-the-art results in this domain.
We conduct a deep examination and evaluation of this model, with a particular
focus on its latent space for facial expression descriptors, and uncover
several limitations with its ability to express intense face motions. To
address these limitations, we propose substantial changes in both training
pipeline and model architecture, to introduce our EMOPortraits model, where we:
Enhance the model's capability to faithfully support intense, asymmetric face
expressions, setting a new state-of-the-art result in the emotion transfer
task, surpassing previous methods in both metrics and quality.
Incorporate speech-driven mode to our model, achieving top-tier performance
in audio-driven facial animation, making it possible to drive source identity
through diverse modalities, including visual signal, audio, or a blend of both.
We propose a novel multi-view video dataset featuring a wide range of intense
and asymmetric facial expressions, filling the gap left by the absence of such data in
existing datasets. |
This paper introduces EMOPortraits, an enhanced one-shot head avatar model capable of transferring intense facial expressions, and incorporating speech-driven animation. |
Accurately transferring intense and asymmetric facial expressions, especially in cross-driving synthesis, remains challenging. Additionally, few methods excel in high-quality talking heads with natural head movements and multimodal input options. |
This work builds on the MegaPortraits model, enhancing its expression transfer through analysis and improvement of latent expression space. This includes reducing its dimensionality, introducing novel self-supervised losses (canonical volume loss and source-driver mismatch loss), and using a new multi-view video dataset (FEED) featuring intense and asymmetric expressions. For speech-driven animation, the authors disentangle expression and head pose in the latent space and introduce a novel PCA mouth loss to enhance lip synchronization. |
EMOPortraits achieves state-of-the-art results in cross-driving emotion translation, outperforming existing models in user preference and FID scores.
The proposed speech-driven mode demonstrates top-tier performance in audio-driven animation, comparable to leading methods in realism and facial dynamics.
The authors introduce FEED, a novel multi-view video dataset capturing a wide range of intense and asymmetric facial expressions, addressing the limitations of existing datasets. |
The model currently does not generate the avatar's body or shoulders.
There are occasional struggles with accurate expression translation, especially with extensive head rotations. |
one-shot head avatars, emotion transfer, speech-driven animation, facial expression dataset, cross-driving synthesis |
2404.18929
Report |
DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing |
Minghao Chen, Iro Laina, Andrea Vedaldi |
We consider the problem of editing 3D objects and scenes based on open-ended
language instructions. The established paradigm to solve this problem is to use
a 2D image generator or editor to guide the 3D editing process. However, this
is often slow as it requires updating a computationally expensive 3D
representation such as a neural radiance field, and doing so by using
contradictory guidance from a 2D model which is inherently not multi-view
consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that
addresses these issues in two ways. First, we modify a given high-quality image
editor like InstructPix2Pix to be multi-view consistent. We do so by utilizing
a training-free approach which integrates cues from the underlying 3D geometry
of the scene. Second, given a multi-view consistent edited sequence of images
of the object, we directly and efficiently optimize the 3D object
representation, which is based on 3D Gaussian Splatting. Because it does not
require applying edits incrementally and iteratively, DGE is significantly more
efficient than existing approaches, and comes with other perks such as allowing
selective editing of parts of the scene. |
Introduces Direct Gaussian Editor (DGE), a method for fast and efficient text-guided 3D object and scene editing using multi-view consistent image editing and direct optimization of 3D Gaussian Splatting representations. |
Existing methods relying on 2D image generators or editors for 3D editing are slow due to iterative updates and struggle with multi-view consistency. |
1. Modifies a 2D image editor (InstructPix2Pix) to be multi-view consistent using spatio-temporal attention and epipolar constraints. 2. Directly optimizes a 3D Gaussian Splatting representation based on the multi-view consistent edited images. |
Significantly faster than previous iterative methods (approximately 4 minutes for a single edit).
Achieves higher fidelity edits due to multi-view consistent editing in the image space.
Allows for selective editing of specific regions within the 3D scene. |
Limited ability to handle substantial geometric transformations due to reliance on the underlying image editor's capabilities.
Performance can be affected by the quality and consistency of the initial 3D Gaussian Splatting reconstruction. |
3d object editing, text-guided editing, gaussian splatting, multi-view consistency, diffusion models |
2404.18928
Report |
Stylus: Automatic Adapter Selection for Diffusion Models |
Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica |
Beyond scaling base models with more data or parameters, fine-tuned adapters
provide an alternative way to generate high fidelity, custom images at reduced
costs. As such, adapters have been widely adopted by open-source communities,
accumulating a database of over 100K adapters, most of which are highly
customized with insufficient descriptions. This paper explores the problem of
matching the prompt to a set of relevant adapters, building on recent work that
highlights the performance gains of composing adapters. We introduce Stylus,
which efficiently selects and automatically composes task-specific adapters
based on a prompt's keywords. Stylus outlines a three-stage approach that first
summarizes adapters with improved descriptions and embeddings, retrieves
relevant adapters, and then further assembles adapters based on prompts'
keywords by checking how well they fit the prompt. To evaluate Stylus, we
developed StylusDocs, a curated dataset featuring 75K adapters with
pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion
checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as
preferred, with humans and multimodal models as evaluators, over the base
model. See stylus-diffusion.github.io for more. |
This paper introduces Stylus, a novel algorithm that automatically selects and composes adapters for diffusion models to enhance image generation quality, guided by user prompts. |
Fine-tuned adapters offer a cost-effective way to customize image generation, but manually selecting from the growing number of adapters is challenging. Stylus automates this process, enabling users to easily leverage the power of adapter composition for high-fidelity and diverse images. |
Stylus employs a three-stage approach: 1) Refiner: Generates textual descriptions and embeddings for adapters using a vision-language model and text encoder. 2) Retriever: Fetches relevant adapters by comparing embeddings with the user prompt. 3) Composer: Segments the prompt into tasks and assigns adapters to each, leveraging a long-context LLM and a binary masking scheme for diversity. |
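The retrieval stage described above, which compares adapter embeddings with the prompt, can be sketched as a cosine-similarity top-k lookup. This is a simplified illustration only: the adapter names, the 2-D toy embeddings, and the function `retrieve_adapters` are hypothetical, and the real system would obtain embeddings from a text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_adapters(prompt_emb, adapter_embs, k=2):
    """Return the names of the k adapters most similar to the prompt, best first."""
    scored = [(cosine(prompt_emb, emb), name) for name, emb in adapter_embs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings standing in for pre-computed adapter embeddings.
adapters = {
    "watercolor_style": [1.0, 0.0],
    "robot_character": [0.0, 1.0],
    "watercolor_robot": [1.0, 1.0],
}
prompt = [1.0, 0.1]  # a prompt that is mostly about the watercolor style
selected = retrieve_adapters(prompt, adapters, k=2)
```

In the full pipeline the retrieved candidates would then be passed to the composer stage, which decides which of them actually fit each part of the prompt.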
Stylus improves image quality and textual alignment, achieving higher CLIP and FID scores compared to base Stable Diffusion and other retrieval methods.
Human evaluations demonstrate a strong preference (2:1) for images generated with Stylus over those from baseline checkpoints.
By using a combination of masking and LLM temperature, Stylus generates highly diverse sets of images from a single prompt. |
The composer component, while efficient, can sometimes misinterpret prompts or select low-quality adapters, leading to errors in image generation.
While Stylus improves diversity across prompts, it doesn't completely solve the issue of reduced diversity within a specific task when using an adapter. |
image generation, diffusion models, adapter selection, retrieval-augmented generation, vision-language models |
2404.18861
Report |
A Survey on Vision Mamba: Models, Applications and Challenges |
Rui Xu, Shu Yang, Yihui Wang, Bo Du, Hao Chen |
Mamba, a recent selective structured state space model, performs excellently
on long sequence modeling tasks. Mamba mitigates the modeling constraints of
convolutional neural networks and offers advanced modeling capabilities similar
to those of Transformers, through global receptive fields and dynamic
weighting. Crucially, it achieves this without incurring the quadratic
computational complexity typically associated with Transformers. Due to its
advantages over the former two mainstream foundation models, Mamba exhibits
great potential to be a visual foundation model. Researchers are actively
applying Mamba to various computer vision tasks, leading to numerous emerging
works. To help keep pace with the rapid advancements in computer vision, this
paper aims to provide a comprehensive review of visual Mamba approaches. This
paper begins by delineating the formulation of the original Mamba model.
Subsequently, our review of visual Mamba delves into several representative
backbone networks to elucidate the core insights of the visual Mamba. We then
categorize related works using different modalities, including image, video,
point cloud, multi-modal, and others. Specifically, for image applications, we
further organize them into distinct tasks to facilitate a more structured
discussion. Finally, we discuss the challenges and future research directions
for visual Mamba, providing insights for future research in this quickly
evolving area. A comprehensive list of visual Mamba models reviewed in this
work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models. |
This paper presents a comprehensive survey of Vision Mamba models, examining their applications and challenges in computer vision. |
Mamba, a novel selective structured state space model, shows significant promise as a foundation model for computer vision tasks due to its linear scalability and strong modeling capabilities, rivaling Transformers. |
The paper provides an in-depth explanation of the Mamba model and reviews various visual Mamba adaptations, categorizing them by their backbone architecture, scanning techniques, and applications across different visual data modalities, including images, videos, multi-modal data, and point clouds. |
Visual Mamba models achieve competitive results on benchmarks for image classification, object detection, instance segmentation, and semantic segmentation.
They have been successfully applied to various image-level tasks, including generation, restoration, and medical image analysis.
Mamba's efficiency and capacity for long-range modeling prove beneficial in video and multi-modal tasks, such as action recognition, video object segmentation, and visual question answering. |
Challenges remain in addressing the inherent causality assumptions of Mamba for non-causal visual data and in scaling its performance to large datasets and networks.
Future research directions include developing more efficient scanning techniques, fusion strategies with other model architectures like CNNs, and improving computational efficiency for real-world applications. |
mamba, state space model, computer vision, vision transformer, sequence modeling |
2404.18669
Report |
Bootstrap 3D Reconstructed Scenes from 3D Gaussian Splatting |
Yifei Gao, Jie Ou, Lei Wang, Jun Cheng |
Recent developments in neural rendering techniques have greatly enhanced the
rendering of photo-realistic 3D scenes across both academic and commercial
fields. The latest method, known as 3D Gaussian Splatting (3D-GS), has set new
benchmarks for rendering quality and speed. Nevertheless, the limitations of
3D-GS become pronounced in synthesizing new viewpoints, especially for views
that greatly deviate from those seen during training. Additionally, issues such
as dilation and aliasing arise when zooming in or out. These challenges can all
be traced back to a single underlying issue: insufficient sampling. In our
paper, we present a bootstrapping method that significantly addresses this
problem. This approach employs a diffusion model to enhance the rendering of
novel views using trained 3D-GS, thereby streamlining the training process. Our
results indicate that bootstrapping effectively reduces artifacts and yields
clear improvements on the evaluation metrics. Furthermore, we show that our
method is versatile and can be easily integrated, allowing various 3D
reconstruction projects to benefit from our approach. |
This paper proposes a bootstrapping method using diffusion models to enhance the rendering of novel views in 3D Gaussian Splatting (3D-GS), improving the handling of unseen views and reducing artifacts. |
3D-GS, while fast and high-quality, struggles with novel view synthesis, particularly those significantly different from training views. This limitation stems from insufficient sampling, leading to artifacts like distortion and aliasing. |
The method uses a trained 3D-GS model to render novel views, then refines these renderings using a diffusion model. These refined images are incorporated back into the training process, guiding the 3D-GS model to learn better representations. |
Bootstrapping effectively reduces artifacts in novel views, especially in challenging scenes with texture-less surfaces or limited observations.
Significant improvement in quantitative metrics (PSNR, SSIM, LPIPS) compared to the original 3D-GS and other state-of-the-art methods.
Demonstrated versatility and plug-and-play capability, showing promising results on multi-scale datasets. |
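Of the metrics listed above, PSNR is simple enough to state directly: it is 10 * log10(MAX^2 / MSE) between a reference image and a rendering. The sketch below assumes images are flattened into lists of pixel intensities in [0, 1]; the function name and the sample values are illustrative and do not come from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4x4 image, flattened: a uniform error of 0.1 per pixel gives about 20 dB.
reference = [0.0] * 16
rendered = [0.1] * 16
score = psnr(reference, rendered)
```

Higher PSNR means the rendered novel view is closer to the held-out ground-truth image; SSIM and LPIPS capture structural and perceptual similarity that PSNR does not.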
Increased time consumption due to the diffusion process.
Challenges in rendering specific views and generating high-frequency details consistently. |
3d gaussian splatting, diffusion models, novel view synthesis, artifact reduction, neural rendering |
2404.18630
Report |
4D-DRESS: A 4D Dataset of Real-world Human Clothing with Semantic Annotations |
Wenbo Wang, Hsuan-I Ho, Chen Guo, Boxiang Rong, Artur Grigorev, Jie Song, Juan Jose Zarate, Otmar Hilliges |
The studies of human clothing for digital avatars have predominantly relied
on synthetic datasets. While easy to collect, synthetic data often fall short
in realism and fail to capture authentic clothing dynamics. Addressing this
gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human
clothing research with its high-quality 4D textured scans and garment meshes.
4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k
textured scans. Creating a real-world clothing dataset is challenging,
particularly in annotating and segmenting the extensive and complex 4D human
scans. To address this, we develop a semi-automatic 4D human parsing pipeline.
We efficiently combine a human-in-the-loop process with automation to
accurately label 4D scans in diverse garments and body movements. Leveraging
precise annotations and high-quality garment meshes, we establish several
benchmarks for clothing simulation and reconstruction. 4D-DRESS offers
realistic and challenging data that complements synthetic sources, paving the
way for advancements in research of lifelike human clothing. Website:
https://ait.ethz.ch/4d-dress. |
The paper introduces 4D-DRESS, the first real-world 4D dataset of human clothing, containing high-quality 4D textured scans, vertex-level semantic labels, garment meshes, and registered SMPL/SMPL-X body models. |
Existing human clothing research relies heavily on synthetic datasets, which lack realism and fail to capture authentic clothing dynamics. Real-world datasets are needed to bridge this gap. |
The authors captured 520 motion sequences featuring 64 distinct outfits using a multi-view volumetric capture system. They developed a semi-automatic 4D human parsing pipeline to efficiently annotate the 78k frames with semantic labels. |
The semi-automatic pipeline achieved accurate vertex label assignment without manual intervention in 96.8% of frames.
Evaluation benchmarks for clothing simulation showed that 4D-DRESS poses a realistic challenge for existing algorithms.
Benchmarks for clothed human reconstruction highlighted the difficulty of current methods in accurately reconstructing real-world clothing, especially loose garments. |
The pipeline's computational cost and the manual effort required for rectification limit the dataset's scalability.
Future work includes expanding the dataset with more diverse subjects and clothing, and developing real-time 4D annotation and rectification tools. |
4d human clothing, dataset, semantic segmentation, clothing simulation, human reconstruction |
2404.18620
Report |
FlexiFilm: Long Video Generation with Flexible Conditions |
Yichen Ouyang, Jianhao Yuan, Hao Zhao, Gaoang Wang, Bo Zhao |
Generating long and consistent videos has emerged as a significant yet
challenging problem. While most existing diffusion-based video generation
models, derived from image generation models, demonstrate promising performance
in generating short videos, their simple conditioning mechanism and sampling
strategy, originally designed for image generation, cause severe performance
degradation when adapted to long video generation. This results in prominent
temporal inconsistency and overexposure. Thus, in this work, we introduce
FlexiFilm, a new diffusion model tailored for long video generation. Our
framework incorporates a temporal conditioner to establish a more consistent
relationship between generation and multi-modal conditions, and a resampling
strategy to tackle overexposure. Empirical results demonstrate FlexiFilm
generates long and consistent videos, each over 30 seconds in length,
outperforming competitors in qualitative and quantitative analyses. Project
page: https://y-ichen.github.io/FlexiFilm-Page/ |
This paper introduces FlexiFilm, a latent video diffusion model designed to generate long and consistent videos, addressing the limitations of existing methods on long-duration sequences. |
Existing diffusion-based video generation models struggle with long videos, exhibiting temporal inconsistency and overexposure due to insufficient conditioning mechanisms and sampling strategies. |
FlexiFilm introduces a temporal conditioner to enhance consistency by establishing relationships between generated frames and multi-modal conditions. It also employs a resampling strategy to mitigate overexposure and a co-training method for improved temporal coherence. |
FlexiFilm generates high-quality videos of over 30 seconds, outperforming baselines in terms of length and consistency.
Quantitative evaluations demonstrate FlexiFilm's superiority in visual quality (FVD) and inter-frame consistency.
Ablation studies highlight the importance of the temporal conditioner, co-training, and resampling strategy in achieving long and consistent video generation. |
The reliance on a large-scale driving dataset may limit generalization to other domains, necessitating further exploration with diverse datasets.
The computational cost associated with long video generation remains a consideration for practical applications. |
long video generation, conditional video generation, diffusion model, temporal consistency, resampling strategy |
2404.18598
Report |
Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting |
Tianyidan Xie, Rui Ma, Qian Wang, Xiaoqian Ye, Feixuan Liu, Ying Tai, Zhenyu Zhang, Zili Yi |
Recent advancements in image inpainting, particularly through diffusion
modeling, have yielded promising outcomes. However, when tested in scenarios
involving the completion of images based on the foreground objects, current
methods that aim to inpaint an image in an end-to-end manner encounter
challenges such as "over-imagination", inconsistency between foreground and
background, and limited diversity. In response, we introduce Anywhere, a
pioneering multi-agent framework designed to address these issues. Anywhere
utilizes a sophisticated pipeline framework comprising various agents such as
Visual Language Model (VLM), Large Language Model (LLM), and image generation
models. This framework consists of three principal components: the prompt
generation module, the image generation module, and the outcome analyzer. The
prompt generation module conducts a semantic analysis of the input foreground
image, leveraging VLM to predict relevant language descriptions and LLM to
recommend optimal language prompts. In the image generation module, we employ a
text-guided canny-to-image generation model to create a template image based on
the edge map of the foreground image and language prompts, and an image refiner
to produce the outcome by blending the input foreground and the template image.
The outcome analyzer employs VLM to evaluate image content rationality,
aesthetic score, and foreground-background relevance, triggering prompt and
image regeneration as needed. Extensive experiments demonstrate that our
Anywhere framework excels in foreground-conditioned image inpainting,
mitigating "over-imagination", resolving foreground-background discrepancies,
and enhancing diversity. It successfully elevates foreground-conditioned image
inpainting to produce more reliable and diverse results. |
Introduces "Anywhere", a multi-agent framework for foreground-conditioned image inpainting that addresses issues like "over-imagination", inconsistencies, and limited diversity in existing end-to-end models. |
Current image inpainting methods struggle to generate reliable and diverse results for foreground-conditioned image completion, often producing illogical or repetitive outputs. |
Utilizes a pipeline with VLM, LLM, and image generation agents. A prompt generation module analyzes the foreground to create descriptive prompts. An image generation module generates a background template, refines it, and blends it with the foreground. An outcome analyzer evaluates the result and triggers prompt regeneration for improved quality. |
Significantly reduces "over-imagination" by intelligently inpainting irrelevant content around the foreground.
Generates more diverse and contextually relevant backgrounds compared to existing methods.
Achieves a lower bad case rate and higher aesthetic scores than both open-source and commercial inpainting tools. |
Faces challenges with transparent or semi-transparent foreground objects.
The outcome analyzer struggles to accurately assess image rationality in terms of lighting and shadowing. |
image inpainting, multi-agent systems, vision-language models, large language models, diffusion models |
2404.18454
Report |
3D Gaussian Splatting with Deferred Reflection |
Keyang Ye, Qiming Hou, Kun Zhou |
The advent of neural and Gaussian-based radiance field methods has achieved
great success in the field of novel view synthesis. However, specular
reflection remains non-trivial, as the high frequency radiance field is
notoriously difficult to fit stably and accurately. We present a deferred
shading method to effectively render specular reflection with Gaussian
splatting. The key challenge comes from the environment map reflection model,
which requires accurate surface normals while simultaneously bottlenecking normal
estimation with discontinuous gradients. We leverage the per-pixel reflection
gradients generated by deferred shading to bridge the optimization process of
neighboring Gaussians, allowing nearly correct normal estimations to gradually
propagate and eventually spread over all reflective objects. Our method
significantly outperforms state-of-the-art techniques and concurrent work in
synthesizing high-quality specular reflection effects, demonstrating a
consistent improvement of peak signal-to-noise ratio (PSNR) for both synthetic
and real-world scenes, while running at a frame rate almost identical to
vanilla Gaussian splatting. |
This paper presents a deferred shading method to render high-quality specular reflection with Gaussian splatting for novel view synthesis, addressing the challenge of modeling high-frequency specular reflection. |
Specular reflection is a challenging aspect of novel view synthesis, and existing Gaussian splatting methods struggle to model it accurately, often resulting in poor visual quality and compromised geometry. |
The method employs a two-pass rendering pipeline: a Gaussian splatting pass to generate screen-space maps of base color, normal, and reflection strength, followed by a deferred reflection pass using an environment map for specular reflection. A novel training algorithm featuring normal propagation is introduced to address the challenge of accurate normal estimation. |
Significantly outperforms state-of-the-art methods and concurrent work in synthesizing high-quality specular reflection effects, demonstrating a consistent PSNR improvement for both synthetic and real-world scenes.
Achieves real-time frame rates almost identical to vanilla Gaussian splatting, thanks to its efficient deferred shading pipeline and reduced reliance on splitting Gaussians.
Produces accurate normal and environment map estimations due to the pixel-level reflection computation and effective normal propagation during training. |
Limited to handling one layer of reflective materials per pixel due to the inherent limitation of traditional deferred shading.
Normal propagation is less efficient on concave scenes, leading to slower convergence during training. |
novel view synthesis, deferred shading, gaussian splatting, specular reflection, real-time rendering |
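The two-pass pipeline described in this report can be illustrated for a single pixel. The sketch below assumes the first pass has already produced a base color, a normal, and a reflection strength for the pixel. It then computes the mirror reflection direction r = d - 2(d.n)n and mixes in an environment color. The linear blend in `shade_pixel`, the function names, and the tuple representation are assumptions for illustration, not the authors' implementation.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / length for c in v)


def reflect(view_dir, normal):
    """Mirror the incoming view direction about the surface normal.

    Uses the standard formula r = d - 2 (d . n) n with unit d and n.
    """
    d = normalize(view_dir)
    n = normalize(normal)
    k = 2.0 * sum(a * b for a, b in zip(d, n))
    return tuple(a - k * b for a, b in zip(d, n))


def shade_pixel(base_color, reflection_strength, env_color):
    """Blend base color with the environment color for one pixel.

    Assumption: a linear blend weighted by reflection strength in [0, 1].
    """
    s = reflection_strength
    return tuple((1.0 - s) * b + s * e for b, e in zip(base_color, env_color))


if __name__ == "__main__":
    # Camera looks down -z at a surface whose normal points toward +z.
    r = reflect((0.0, 0.0, -1.0), (0.0, 0.0, 1.0))
    # A real renderer would look up the environment map along r;
    # here a constant blue stands in for that lookup.
    print(r, shade_pixel((1.0, 0.0, 0.0), 0.25, (0.0, 0.0, 1.0)))
```

Because the reflection direction depends directly on the per-pixel normal, a small error in the normal changes which part of the environment map is sampled, which is consistent with the report's emphasis on accurate normal estimation.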
2404.18409
Report |
PKU-AIGIQA-4K: A Perceptual Quality Assessment Database for Both Text-to-Image and Image-to-Image AI-Generated Images |
Jiquan Yuan, Fanyi Yang, Jihe Li, Xinyan Cao, Jinming Che, Jinlong Lin, Xixin Cao |
In recent years, image generation technology has rapidly advanced, resulting
in the creation of a vast array of AI-generated images (AIGIs). However, the
quality of these AIGIs is highly inconsistent, with low-quality AIGIs severely
impairing the visual experience of users. Due to the widespread application of
AIGIs, the AI-generated image quality assessment (AIGIQA), aimed at evaluating
the quality of AIGIs from the perspective of human perception, has garnered
increasing interest among scholars. Nonetheless, current research has not yet
fully explored this field. We have observed that existing databases are limited
to images generated from single scenario settings. Databases such as AGIQA-1K,
AGIQA-3K, and AIGCIQA2023, for example, only include images generated by
text-to-image generative models. This oversight highlights a critical gap in
the current research landscape, underscoring the need for dedicated databases
catering to image-to-image scenarios, as well as more comprehensive databases
that encompass a broader range of AI-generated image scenarios. Addressing
these issues, we have established a large scale perceptual quality assessment
database for both text-to-image and image-to-image AIGIs, named PKU-AIGIQA-4K.
We then conduct a well-organized subjective experiment to collect quality
labels for AIGIs and perform a comprehensive analysis of the PKU-AIGIQA-4K
database. Regarding the use of image prompts during the training process, we
propose three image quality assessment (IQA) methods based on pre-trained
models that include a no-reference method NR-AIGCIQA, a full-reference method
FR-AIGCIQA, and a partial-reference method PR-AIGCIQA. Finally, leveraging the
PKU-AIGIQA-4K database, we conduct extensive benchmark experiments and compare
the performance of the proposed methods and the current IQA methods. |
This paper introduces PKU-AIGIQA-4K, the first perceptual quality assessment database to include both text-to-image and image-to-image AI-generated images, addressing the lack of comprehensive datasets for evaluating AI-generated images. |
Evaluating the quality of AI-generated images is crucial as their applications expand. However, existing databases focus only on single-scenario settings, which limits the scope of assessment. |
The authors collect images generated by three popular models (Midjourney, Stable Diffusion, DALLE3) using both text and image prompts. They conduct subjective experiments to obtain quality labels and propose three IQA methods: NR-AIGCIQA (no-reference), FR-AIGCIQA (full-reference), and PR-AIGCIQA (partial-reference). |
The PKU-AIGIQA-4K database demonstrates diverse perceptual scores across different generation methods and image types.
The proposed PR-AIGCIQA method, leveraging image prompts, often outperforms the NR-AIGCIQA method, indicating the importance of using reference images.
The study reveals that current IQA methods, including the proposed ones, still need improvement to better align with human perception of AIGIs. |
The performance of different IQA methods varies significantly depending on the visual backbone network used.
The current IQA methods and proposed methods require further refinement to improve their alignment with human preferences for AIGIs. |
ai-generated images, image quality assessment, perceptual quality, text-to-image generation, image-to-image generation |
2404.18343
Report |
G-Refine: A General Quality Refiner for Text-to-Image Generation |
Chunyi Li, Haoning Wu, Hongkun Hao, Zicheng Zhang, Tengchaun Kou, Chaofeng Chen, Lei Bai, Xiaohong Liu, Weisi Lin, Guangtao Zhai |
With the evolution of Text-to-Image (T2I) models, the quality defects of
AI-Generated Images (AIGIs) pose a significant barrier to their widespread
adoption. In terms of both perception and alignment, existing models cannot
always guarantee high-quality results. To mitigate this limitation, we
introduce G-Refine, a general image quality refiner designed to enhance
low-quality images without compromising the integrity of high-quality ones. The
model is composed of three interconnected modules: a perception quality
indicator, an alignment quality indicator, and a general quality enhancement
module. Based on the mechanisms of the Human Visual System (HVS) and syntax
trees, the first two indicators can respectively identify the perception and
alignment deficiencies, and the last module can apply targeted quality
enhancement accordingly. Extensive experimentation reveals that when compared
to alternative optimization methods, AIGIs after G-Refine outperform in 10+
quality metrics across 4 databases. This improvement significantly contributes
to the practical application of contemporary T2I models, paving the way for
their broader adoption. The code will be released on
https://github.com/Q-Future/Q-Refine. |
The paper proposes G-Refine, a general image quality refiner for text-to-image generation designed to enhance low-quality images without degrading high-quality ones. |
Existing text-to-image models often produce inconsistent results with varying quality, hindering their widespread adoption. Current optimization methods either lack text guidance or struggle to balance refinement between low and high-quality regions. |
G-Refine uses three modules: a perception quality indicator (PQ-Map), an alignment quality indicator (AQ-Map), and a general quality enhancement module. PQ-Map leverages modified CLIP encoders and quality-related factors to identify perceptual deficiencies. AQ-Map analyzes prompt semantics through syntax trees to locate alignment issues. The enhancement module then uses these maps to guide a multi-stage denoising process. |
G-Refine outperforms competing methods in over 90% of cases across 13 quality indicators and 4 datasets.
G-Refine exhibits minimal negative optimization compared to other methods, indicating its ability to selectively enhance low-quality regions.
Both PQ-Map and AQ-Map, when used independently, demonstrate strong performance in quality assessment tasks, even surpassing some state-of-the-art methods. |
The alignment optimization effectiveness is less prominent on models with inherently high generation quality.
Future work involves exploring further optimization for advanced text-to-image models, particularly in improving alignment quality. |
text-to-image generation, image quality assessment, text-to-image alignment, image refinement, ai-generated content |
2404.18284
Report |
S3-SLAM: Sparse Tri-plane Encoding for Neural Implicit SLAM |
Zhiyao Zhang, Yunzhou Zhang, Yanmin Wu, Bin Zhao, Xingshuo Wang, Rui Tian |
With the emergence of Neural Radiance Fields (NeRF), neural implicit
representations have gained widespread applications across various domains,
including simultaneous localization and mapping. However, current neural
implicit SLAM faces a challenging trade-off problem between performance and the
number of parameters. To address this problem, we propose sparse tri-plane
encoding, which efficiently achieves scene reconstruction at resolutions up to
512 using only 2~4% of the commonly used tri-plane parameters (reduced from
100MB to 2~4MB). On this basis, we design S3-SLAM to achieve rapid and
high-quality tracking and mapping through sparsifying plane parameters and
integrating orthogonal features of tri-plane. Furthermore, we develop
hierarchical bundle adjustment to achieve globally consistent geometric
structures and reconstruct high-resolution appearance. Experimental results
demonstrate that our approach achieves competitive tracking and scene
reconstruction with minimal parameters on three datasets. Source code will soon
be available. |
This paper proposes S3-SLAM, a neural implicit SLAM leveraging a novel sparse tri-plane encoding for rapid iteration and parameter sparsity in high-fidelity scene reconstruction. |
Existing neural implicit SLAM methods struggle to balance performance with a manageable number of parameters, especially at high resolutions. |
The sparse tri-plane encoding represents scenes compactly via 2D hash-grid plane features. S3-SLAM integrates this with multi-resolution encoding and a hierarchical bundle adjustment (HBA) for globally consistent geometry and high-resolution appearance reconstruction. |
S3-SLAM achieves competitive tracking accuracy and high-fidelity scene reconstruction with minimal parameters on Replica, ScanNet, and TUM RGB-D datasets.
The sparse tri-plane encoding reduces memory consumption to 2-4% of regular tri-plane encoding at 512 resolution.
HBA improves local appearance details while maintaining global consistency. |
The current approach lacks genuine local updates, potentially leading to forgetting issues.
Future work will address this by implementing local update mechanisms. |
neural implicit slam, sparse tri-plane encoding, neural rendering, 3d reconstruction, hierarchical bundle adjustment |
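The memory figures quoted in this report (roughly 100MB for a regular tri-plane at resolution 512, reduced to 2~4%) can be sanity-checked with simple arithmetic. The sketch below assumes three dense planes of 512 x 512 cells, 32 feature channels per cell, and 4-byte floats. The channel count and data type are assumptions chosen only because they land near the quoted 100MB; the report does not state the actual configuration.

```python
def triplane_bytes(resolution, channels, bytes_per_value=4):
    """Size of a dense tri-plane: three resolution x resolution feature planes."""
    return 3 * resolution * resolution * channels * bytes_per_value


def to_megabytes(num_bytes):
    """Convert bytes to decimal megabytes (10**6 bytes)."""
    return num_bytes / 1_000_000


if __name__ == "__main__":
    # Assumed configuration: 32 channels, float32 values.
    dense = triplane_bytes(resolution=512, channels=32)
    print(f"dense tri-plane: {to_megabytes(dense):.1f} MB")
    # The report quotes 2~4% of the dense parameter budget.
    for fraction in (0.02, 0.04):
        print(f"{fraction:.0%} of dense: {to_megabytes(dense * fraction):.1f} MB")
```

Under these assumptions the dense representation comes to about 100.7 MB, so 2% and 4% correspond to roughly 2 MB and 4 MB, in line with the range stated in the abstract.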
2404.18212
Report |
Paint by Inpaint: Learning to Add Image Objects by Removing Them First |
Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel |
Image editing has advanced significantly with the introduction of
text-conditioned diffusion models. Despite this progress, seamlessly adding
objects to images based on textual instructions without requiring user-provided
input masks remains a challenge. We address this by leveraging the insight that
removing objects (Inpaint) is significantly simpler than its inverse process of
adding them (Paint), attributed to the utilization of segmentation mask
datasets alongside inpainting models that inpaint within these masks.
Capitalizing on this realization, by implementing an automated and extensive
pipeline, we curate a filtered large-scale image dataset containing pairs of
images and their corresponding object-removed versions. Using these pairs, we
train a diffusion model to inverse the inpainting process, effectively adding
objects into images. Unlike other editing datasets, ours features natural
target images instead of synthetic ones; moreover, it maintains consistency
between source and target by construction. Additionally, we utilize a large
Vision-Language Model to provide detailed descriptions of the removed objects
and a Large Language Model to convert these descriptions into diverse,
natural-language instructions. We show that the trained model surpasses
existing ones both qualitatively and quantitatively, and release the
large-scale dataset alongside the trained models for the community. |
Introduces Paint by Inpaint, a framework for image object addition that leverages the inverse relationship between object addition and removal. This framework is used to create PIPE, a large-scale object addition dataset, and train a diffusion model that achieves state-of-the-art performance on this task. |
Existing text-guided, mask-free object addition methods struggle with consistency and rely on synthetic datasets. This paper addresses these limitations by creating a large-scale dataset with real image targets and inherent consistency. |
PIPE is constructed by leveraging segmentation datasets and an inpainting model to remove objects from images, resulting in source-target pairs. Natural language instructions are generated using class names, VLM-LLM pipelines, and object reference datasets. A diffusion model is then trained on PIPE to perform object addition. |
The trained model outperforms existing methods on object addition benchmarks, demonstrating superior fidelity to instructions and consistency.
Human evaluation confirms the model's ability to produce higher-quality edits aligned with user instructions.
Combining PIPE with general editing datasets enhances performance on broader editing tasks, indicating its potential beyond object addition. |
The data curation pipeline, while robust, is not entirely error-free, potentially impacting dataset quality.
The effectiveness of instruction generation relies on the capabilities of VLMs and LLMs, which can still exhibit limitations in producing human-like instructions. |
image editing, object addition, diffusion models, vision-language models, dataset creation |
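The dataset construction described in this report reverses an inpainting step: the object-removed image becomes the source and the original image becomes the target, paired with an instruction to add the object. The sketch below shows that pairing for the class-name instruction variant only. The `AdditionExample` record, the file paths, and the "Add a ..." template are hypothetical, and the VLM-LLM instruction pipeline mentioned in the report is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdditionExample:
    """One training pair for object addition (hypothetical record layout)."""
    source_image: str  # image with the object removed by inpainting
    target_image: str  # original, natural image that contains the object
    instruction: str   # text telling the model what to add


def make_addition_example(original_path, inpainted_path, class_name):
    """Build a source/target pair from a removal result.

    The removal direction (original -> inpainted) is swapped so that the
    model learns the inverse task: adding the object back.
    """
    if not class_name:
        raise ValueError("class_name must be non-empty")
    # Simple article choice for the assumed instruction template.
    article = "an" if class_name[0].lower() in "aeiou" else "a"
    return AdditionExample(
        source_image=inpainted_path,
        target_image=original_path,
        instruction=f"Add {article} {class_name}",
    )


if __name__ == "__main__":
    example = make_addition_example("img.jpg", "img_removed.jpg", "umbrella")
    print(example)
```

Since the target is always the untouched original image, source and target agree everywhere outside the removed object, which is the consistency-by-construction property the abstract highlights.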
2404.18136
Report |
SafePaint: Anti-forensic Image Inpainting with Domain Adaptation |
Dunyun Chen, Xin Liao, Xiaoshuai Wu, Shiwei Chen |
Existing image inpainting methods have achieved remarkable accomplishments in
generating visually appealing results, often accompanied by a trend toward
creating more intricate structural textures. However, while these models excel
at creating more realistic image content, they often leave noticeable traces of
tampering, posing a significant threat to security. In this work, we take the
anti-forensic capabilities into consideration, firstly proposing an end-to-end
training framework for anti-forensic image inpainting named SafePaint.
Specifically, we innovatively formulated image inpainting as two major tasks:
semantically plausible content completion and region-wise optimization. The
former is similar to current inpainting methods that aim to restore the missing
regions of corrupted images. The latter, through domain adaptation, endeavors
to reconcile the discrepancies between the inpainted region and the unaltered
area to achieve anti-forensic goals. Through comprehensive theoretical
analysis, we validate the effectiveness of domain adaptation for anti-forensic
performance. Furthermore, we meticulously crafted a region-wise separated
attention (RWSA) module, which not only aligns with our objective of
anti-forensics but also enhances the performance of the model. Extensive
qualitative and quantitative evaluations show our approach achieves comparable
results to existing image inpainting methods while offering anti-forensic
capabilities not available in other methods. |
This paper proposes SafePaint, an end-to-end training framework for anti-forensic image inpainting, enhancing image security and reliability by incorporating anti-forensic capabilities as an evaluation metric for inpainting quality. |
Existing image inpainting methods excel in visual realism but often leave detectable tampering traces, posing security risks. This work addresses the need for inpainted images that can resist forensic analysis, aligning with human perception and improving inpainting quality assessment. |
SafePaint decouples image inpainting into two stages: content completion and region-wise optimization. It leverages domain adaptation to minimize discrepancies between inpainted and unaltered regions, thereby enhancing anti-forensic performance. A novel region-wise separated attention (RWSA) module further improves anti-forensic capabilities and overall model performance. |
SafePaint significantly outperforms state-of-the-art methods in anti-forensic capabilities based on evaluations using multiple forgery and inpainting detectors.
The proposed method achieves comparable visual quality results to existing inpainting techniques, maintaining a balance between realism and security.
Ablation studies confirm the effectiveness of domain distance loss and the RWSA module in enhancing anti-forensic performance. |
The paper acknowledges a potential trade-off between visual quality and anti-forensic performance, suggesting further exploration to minimize this trade-off.
Future work could explore extending SafePaint's capabilities to address more complex image manipulation scenarios beyond inpainting. |
image inpainting, anti-forensics, domain adaptation, attention mechanism, image security |
2404.18065
Report |
Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model |
Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto |
In this paper, we propose an effective two-stage approach named
Grounded-Dreamer to generate 3D assets that can accurately follow complex,
compositional text prompts while achieving high fidelity by using a pre-trained
multi-view diffusion model. Multi-view diffusion models, such as MVDream, have
been shown to generate high-fidelity 3D assets using score distillation sampling
(SDS). However, applied naively, these methods often fail to comprehend
compositional text prompts, and may often entirely omit certain subjects or
parts. To address this issue, we first advocate leveraging text-guided 4-view
images as the bottleneck in the text-to-3D pipeline. We then introduce an
attention refocusing mechanism to encourage text-aligned 4-view image
generation, without the necessity to re-train the multi-view diffusion model or
craft a high-quality compositional 3D dataset. We further propose a hybrid
optimization strategy to encourage synergy between the SDS loss and the sparse
RGB reference images. Our method consistently outperforms previous
state-of-the-art (SOTA) methods in generating compositional 3D assets,
excelling in both quality and accuracy, and enabling diverse 3D from the same
text prompt. |
Presents Grounded-Dreamer, a two-stage approach for generating high-fidelity 3D assets from complex, compositional text prompts using a pre-trained multi-view diffusion model. |
Addresses the limitations of existing text-to-3D methods that struggle to accurately render compositional prompts and ensure diversity in generated objects. |
Employs an attention refocusing mechanism for generating compositionally accurate four-view images and a hybrid optimization strategy combining sparse-view NeRF with text-guided diffusion priors for detailed 3D reconstruction. |
Achieves superior text-image alignment compared to state-of-the-art baselines.
Generates diverse 3D assets from the same text prompt by varying the input four-view images.
Demonstrates high-fidelity 3D generation while preserving accurate compositional relationships. |
Reliance on diffusion-based Text-to-Image models can lead to limitations in color accuracy and foreground segmentation.
Future work includes exploring seamless 2D-to-3D transitions and enhancing model versatility. |
text-to-3d synthesis, multi-view diffusion models, compositional generation, attention refocusing, sparse-view nerf |
2404.18020
Report |
DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images |
Maria Mihaela Trusca, Tinne Tuytelaars, Marie-Francine Moens |
Text-based semantic image editing assumes the manipulation of an image using
a natural language instruction. Although recent works are capable of generating
creative and qualitative images, the problem is still mostly approached as a
black box sensitive to generating unexpected outputs. Therefore, we propose a
novel model to enhance the text-based control of an image editor by explicitly
reasoning about which parts of the image to alter or preserve. It relies on
word alignments between a description of the original source image and the
instruction that reflects the needed updates, and the input image. The proposed
Diffusion Masking with word Alignments (DM-Align) allows the editing of an
image in a transparent and explainable way. It is evaluated on a subset of the
Bison dataset and a self-defined dataset dubbed Dream. When compared to
state-of-the-art baselines, quantitative and qualitative results show that
DM-Align has superior performance in image editing conditioned on language
instructions, preserves the background of the image well, and copes better
with long text instructions. |
This paper introduces DM-Align, a novel model for text-based semantic image editing that uses word alignments between source and target text instructions to identify and manipulate specific image regions. |
Existing text-based image editing methods struggle to maintain background consistency and effectively handle long, complex instructions. DM-Align addresses these limitations by explicitly reasoning about which image parts to alter or preserve. |
DM-Align aligns words in the source and target instructions, segments the image based on aligned words, generates a global diffusion mask, refines it using segmented regions, and finally inpaints the masked areas using Stable Diffusion. |
DM-Align outperforms baselines in image-based metrics (FID, LPIPS, PWMSE), demonstrating superior editing quality, particularly with longer instructions.
It excels at background preservation, as evidenced by significantly lower background FID, LPIPS, and PWMSE scores compared to baselines.
Human evaluation confirms DM-Align's effectiveness, achieving higher scores for editing quality, background preservation, and overall image quality. |
DM-Align currently focuses on editing objects and their attributes, with future work exploring action editing.
The model relies on accurate object detection and segmentation, which might be limited by the capabilities of the employed models (Grounded-SAM). |
image editing, semantic editing, text-guided image manipulation, diffusion models, word alignments |
2404.17993
Report |
MinBackProp -- Backpropagating through Minimal Solvers |
Diana Sungatullina, Tomas Pajdla |
We present an approach to backpropagating through minimal problem solvers in
end-to-end neural network training. Traditional methods relying on manually
constructed formulas, finite differences, and autograd are laborious,
approximate, and unstable for complex minimal problem solvers. We show that
using the Implicit function theorem to calculate derivatives to backpropagate
through the solution of a minimal problem solver is simple, fast, and stable.
We compare our approach to (i) using the standard autograd on minimal problem
solvers and relate it to existing backpropagation formulas through SVD-based
and Eig-based solvers and (ii) implementing the backprop with an existing
PyTorch Deep Declarative Networks (DDN) framework. We demonstrate our technique
on a toy example of training outlier-rejection weights for 3D point
registration and on a real application of training an outlier-rejection and
RANSAC sampling network in image matching. Our method provides $100\%$
stability and is 10 times faster compared to autograd, which is unstable and
slow, and compared to DDN, which is stable but also slow. |
The paper proposes MinBackProp, a new approach to backpropagating through minimal problem solvers in end-to-end neural network training using the Implicit Function Theorem (IFT) and Deep Declarative Networks (DDN). |
Current methods for backpropagating through minimal problem solvers, such as manual differentiation, finite differences, and autograd, are often laborious, approximate, unstable, or inefficient for complex solvers. |
The paper leverages the IFT to directly compute derivatives of the minimal problem solver's output, leading to stable and efficient backpropagation. It also presents an alternative implementation using the DDN framework for simpler implementation and potential use in quick prototyping. |
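As a rough illustration of differentiating a solver's output with the Implicit Function Theorem, the sketch below uses a hypothetical two-equation system, f(x, θ) = (x0² − θ0, x0·x1 − θ1) = 0, chosen only because it has a closed-form solution. It is not one of the paper's minimal problems. The Jacobian follows dx/dθ = −(∂f/∂x)⁻¹ ∂f/∂θ and is checked against central finite differences.

```python
import math


def solve(theta):
    """Closed-form solver for f(x, theta) = (x0^2 - theta0, x0*x1 - theta1) = 0."""
    x0 = math.sqrt(theta[0])
    return (x0, theta[1] / x0)


def ift_jacobian(x):
    """dx/dtheta = -(df/dx)^{-1} (df/dtheta), evaluated at the solution x."""
    # df/dx = [[2*x0, 0], [x1, x0]] and df/dtheta = -I for this toy system.
    a, b, c, d = 2.0 * x[0], 0.0, x[1], x[0]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    # -(inv) @ (-I) equals inv, so the inverse is the Jacobian itself.
    return inv


def fd_jacobian(theta, eps=1e-6):
    """Central finite differences through the solver, for comparison only."""
    jac = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        hi = list(theta)
        lo = list(theta)
        hi[j] += eps
        lo[j] -= eps
        x_hi, x_lo = solve(hi), solve(lo)
        for i in range(2):
            jac[i][j] = (x_hi[i] - x_lo[i]) / (2.0 * eps)
    return jac


theta = (4.0, 6.0)
x = solve(theta)
J = ift_jacobian(x)
J_fd = fd_jacobian(theta)
print(J)
```

The IFT route needs only the solution and the partial derivatives of the constraint equations, so the backward pass never has to trace the internals of the solver.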
MinBackProp with IFT demonstrates 100% stability in training an outlier rejection network for essential matrix estimation, compared to a 20-30% success rate for autograd-based methods.
MinBackProp with IFT achieves a 10 times speedup in backward pass computation compared to autograd and DDN-based approaches.
Both IFT and DDN implementations achieve comparable performance to the baseline method (∇-RANSAC) in terms of outlier rejection accuracy. |
The current implementation focuses on minimal problems with closed-form solutions, potentially limiting its applicability to a broader range of solvers.
The paper explores the use of DDN for minimal problem backpropagation, but further investigation into its limitations and potential advantages is needed. |
minimal problem solvers, backpropagation, implicit function theorem, deep declarative networks, end-to-end learning |
2404.17876
Report |
DF-SLAM: Neural Feature Rendering Based on Dictionary Factors Representation for High-Fidelity Dense Visual SLAM System |
Weifeng Wei, Jie Wang |
We introduce a high-fidelity neural implicit dense visual Simultaneous
Localization and Mapping (SLAM) system, termed DF-SLAM. In our work, we employ
dictionary factors for scene representation, encoding the geometry and
appearance information of the scene as a combination of basis and coefficient
factors. Compared to neural implicit SLAM methods that directly encode scene
information as features, our method exhibits superior scene detail
reconstruction capabilities and more efficient memory usage, while our model
size is insensitive to the size of the scene map, making our method more
suitable for large-scale scenes. Additionally, we employ feature integration
rendering to accelerate color rendering speed while ensuring color rendering
quality, further enhancing the real-time performance of our neural SLAM method.
Extensive experiments on synthetic and real-world datasets demonstrate that our
method is competitive with existing state-of-the-art neural implicit SLAM
methods in terms of real-time performance, localization accuracy, and scene
reconstruction quality. Our source code is available at
https://github.com/funcdecl/DF-SLAM. |
DF-SLAM, a high-fidelity neural implicit dense visual SLAM system, uses dictionary factors for scene representation and feature integration rendering for efficient and high-quality reconstruction. |
Existing neural implicit SLAM methods struggle to balance accuracy, memory efficiency, and real-time performance, particularly in large-scale scenes. |
The scene is represented using separate basis and coefficient factor grids for geometry and appearance. Feature integration rendering accelerates color rendering by approximating the appearance feature of the entire ray. |
DF-SLAM achieves superior scene detail reconstruction compared to baseline methods on Replica and ScanNet datasets.
It demonstrates robust tracking performance with reduced drift on Replica, ScanNet, and TUM-RGBD datasets.
The method exhibits efficient memory usage, remaining unaffected by map size, unlike memory-intensive alternatives like NICE-SLAM and ESLAM. |
Feature integration rendering may lead to artifacts in color rendering with extreme motion blur.
Future work will address this by incorporating a deblurring module. |
dense visual slam, dictionary factors, feature integration rendering, neural implicit representations, real-time performance |
2404.17774
Report |
High-quality Surface Reconstruction using Gaussian Surfels |
Pinxuan Dai, Jiamin Xu, Wenxiang Xie, Xinguo Liu, Huamin Wang, Weiwei Xu |
We propose a novel point-based representation, Gaussian surfels, to combine
the advantages of the flexible optimization procedure in 3D Gaussian points and
the surface alignment property of surfels. This is achieved by directly setting
the z-scale of 3D Gaussian points to 0, effectively flattening the original 3D
ellipsoid into a 2D ellipse. Such a design provides clear guidance to the
optimizer. By treating the local z-axis as the normal direction, it greatly
improves optimization stability and surface alignment. While the derivatives to
the local z-axis computed from the covariance matrix are zero in this setting,
we design a self-supervised normal-depth consistency loss to remedy this issue.
Monocular normal priors and foreground masks are incorporated to enhance the
quality of the reconstruction, mitigating issues related to highlights and
background. We propose a volumetric cutting method to aggregate the information
of Gaussian surfels so as to remove erroneous points in depth maps generated by
alpha blending. Finally, we apply the screened Poisson reconstruction method to the
fused depth maps to extract the surface mesh. Experimental results show that
our method demonstrates superior performance in surface reconstruction compared
to state-of-the-art neural volume rendering and point-based rendering methods. |
This paper proposes Gaussian surfels, a novel point-based representation for high-quality surface reconstruction, combining the advantages of 3D Gaussian points' flexible optimization and surfels' surface alignment. |
3D Gaussian Splatting (3DGS), while efficient for 3D scene reconstruction and rendering, struggles to generate high-quality geometry due to limitations like non-zero thickness and normal direction ambiguity. Gaussian surfels address these issues, improving surface alignment and reconstruction quality. |
The method flattens 3D Gaussian points into 2D ellipses by setting the z-scale to 0, directly representing surface normals. It introduces a self-supervised normal-depth consistency loss to address gradient vanishing issues and utilizes volumetric cutting to refine depth maps before meshing. |
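The flattening step can be sketched numerically. Assuming the usual Gaussian covariance factorization Σ = R S Sᵀ Rᵀ with S = diag(s_x, s_y, 0), the third column of the rotation R is the surfel normal and Σ has no extent along it. The helper names and the example rotation below are hypothetical.

```python
import math


def matmul(A, B):
    """Multiply two small dense matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def surfel_covariance(R, sx, sy):
    """Covariance of a Gaussian whose local z-scale is set to 0."""
    S2 = [[sx * sx, 0.0, 0.0],
          [0.0, sy * sy, 0.0],
          [0.0, 0.0, 0.0]]  # squared scales; the z entry is zero
    return matmul(matmul(R, S2), transpose(R))


def surfel_normal(R):
    """The local z-axis (third column of R) serves as the normal."""
    return [R[0][2], R[1][2], R[2][2]]


# Example: rotate the local frame by 30 degrees about the x-axis.
angle = math.radians(30.0)
c, s = math.cos(angle), math.sin(angle)
R = [[1.0, 0.0, 0.0],
     [0.0, c, -s],
     [0.0, s, c]]

cov = surfel_covariance(R, sx=0.5, sy=0.2)
normal = surfel_normal(R)
# Variance along the normal direction: Sigma @ n should vanish.
cov_times_normal = [sum(cov[i][k] * normal[k] for k in range(3)) for i in range(3)]
print(cov_times_normal)
```

Because the covariance carries no variance along the normal, it gives no gradient signal for that axis, which is consistent with the abstract's remark that a separate normal-depth consistency loss is needed.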
Significantly outperforms 3DGS and SuGaR in surface reconstruction quality on DTU and BlendedMVS datasets.
Achieves a good balance between reconstruction quality and speed, comparable to INSR and NeuS2 while reconstructing finer details compared to NeuS.
Demonstrates superior generality over 3DGS in sparse view rendering, producing higher quality results. |
Struggles with accurate reconstruction in areas with strong specular reflections despite using monocular normal priors.
Reconstructed surfaces may exhibit global shifts compared to ground truth for objects with weak textures. |
3d surface reconstruction, gaussian surfels, depth-normal consistency, point-based rendering, neural rendering |
2404.17762
Report |
Large Multi-modality Model Assisted AI-Generated Image Quality Assessment |
Puyi Wang, Wei Sun, Zicheng Zhang, Jun Jia, Yanwei Jiang, Zhichao Zhang, Xiongkuo Min, Guangtao Zhai |
Traditional deep neural network (DNN)-based image quality assessment (IQA)
models leverage convolutional neural networks (CNN) or Transformer to learn the
quality-aware feature representation, achieving commendable performance on
natural scene images. However, when applied to AI-Generated images (AGIs),
these DNN-based IQA models exhibit subpar performance. This situation is
largely due to the semantic inaccuracies inherent in certain AGIs caused by
the uncontrollable nature of the generation process. Thus, the capability to
discern semantic content becomes crucial for assessing the quality of AGIs.
Traditional DNN-based IQA models, constrained by limited parameter complexity
and training data, struggle to capture complex fine-grained semantic features,
making it challenging to grasp the existence and coherence of semantic content
of the entire image. To address the shortfall in semantic content perception of
current IQA models, we introduce a large Multi-modality model Assisted
AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes
semantically informed guidance to sense semantic information and extract
semantic vectors through carefully designed text prompts. Moreover, it employs
a mixture of experts (MoE) structure to dynamically integrate the semantic
information with the quality-aware features extracted by traditional DNN-based
IQA models. Comprehensive experiments conducted on two AI-generated content
datasets, AIGCQA-20k and AGIQA-3k, show that MA-AGIQA achieves state-of-the-art
performance and demonstrate its superior generalization capabilities in
assessing the quality of AGIs. Code is available at
https://github.com/wangpuyi/MA-AGIQA. |
This paper introduces MA-AGIQA, a novel framework for assessing the quality of AI-generated images by integrating Large Multi-modality Models (LMMs) with traditional deep neural networks (DNNs) to address the limitation of DNNs in capturing semantic content. |
Existing DNN-based image quality assessment models, trained primarily on natural scene images, often fail to accurately evaluate the quality of AI-generated images, particularly in terms of semantic coherence and meaningfulness. |
MA-AGIQA leverages MANIQA as a quality-aware feature extractor and mPLUG-Owl2, an LMM, as a fine-grained semantic feature extractor guided by meticulously crafted text prompts. An adaptive fusion module, employing a mixture of experts structure, dynamically integrates these features to generate a comprehensive quality score. |
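The idea of dynamically weighting feature sources can be sketched with a softmax gate. This is a simplified stand-in rather than the MA-AGIQA fusion module: the feature vectors and gate logits are hypothetical, and in a trained model the logits would come from a learned gating network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fuse(features, gate_logits):
    """Fuse equally sized feature vectors using softmax gating weights."""
    weights = softmax(gate_logits)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


quality_feat = [0.2, 0.8]    # hypothetical quality-aware feature
semantic_feat = [0.6, 0.4]   # hypothetical semantic feature
fused, weights = moe_fuse([quality_feat, semantic_feat], gate_logits=[0.0, 0.0])
print(fused, weights)
```

With equal logits the gate reduces to a plain average. Unequal logits shift the fused vector toward whichever feature source the gate favors for a given image.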
MA-AGIQA achieves state-of-the-art performance, surpassing existing methods on two AI-generated image datasets: AIGCQA-20k and AGIQA-3k.
The integration of fine-grained semantic features extracted by the LMM significantly improves assessment accuracy, demonstrating a closer alignment with human perception.
MA-AGIQA exhibits superior cross-dataset performance, highlighting its robust generalization capabilities. |
The current implementation of MA-AGIQA primarily focuses on semantic aspects and may benefit from incorporating additional features for a more holistic assessment.
The computational cost associated with employing LMMs, even with fixed parameters, remains a consideration for future optimization. |
image quality assessment, ai-generated images, large multi-modality models, semantic content, mixture of experts |
2404.17753
Report |
Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification |
Chao Yi, Lu Ren, De-Chuan Zhan, Han-Jia Ye |
CLIP showcases exceptional cross-modal matching capabilities due to its
training on image-text contrastive learning tasks. However, without specific
optimization for unimodal scenarios, its performance in single-modality feature
extraction might be suboptimal. Despite this, some studies have directly used
CLIP's image encoder for tasks like few-shot classification, introducing a
misalignment between its pre-training objectives and feature extraction
methods. This inconsistency can diminish the quality of the image's feature
representation, adversely affecting CLIP's effectiveness in target tasks. In
this paper, we view text features as precise neighbors of image features in
CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER)
based on the distance structure between images and their neighbor texts. This
feature extraction method aligns better with CLIP's pre-training objectives,
thereby fully leveraging CLIP's robust cross-modal capabilities. The key to
constructing a high-quality CODER lies in how to create a vast amount of
high-quality and diverse texts to match with images. We introduce the Auto Text
Generator (ATG) to automatically generate the required texts in a data-free and
training-free manner. We apply CODER to CLIP's zero-shot and few-shot image
classification tasks. Experimental results across various datasets and models
confirm CODER's effectiveness. Code is available at
https://github.com/YCaigogogo/CVPR24-CODER. |
This paper introduces Cross-modal Neighbor Representation (CODER), a novel image representation method for CLIP that leverages cross-modal distances between images and neighboring texts in CLIP's feature space. This approach improves CLIP's performance in single-modality image feature extraction tasks. |
Directly using CLIP's image encoder for tasks like few-shot classification can be suboptimal due to a misalignment between its pre-training objectives (cross-modal matching) and feature extraction methods (unimodal). CODER addresses this by aligning feature extraction with CLIP's pre-training, thus improving its effectiveness in downstream tasks. |
CODER represents images based on their distances to neighboring texts in CLIP's feature space. To ensure diverse and dense text sampling, the authors introduce Auto Text Generator (ATG) which leverages LLMs like ChatGPT to automatically generate various high-quality, class-specific texts. CODER is applied to zero-shot and few-shot image classification using a two-stage approach for zero-shot and a similarity-based method for few-shot. |
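A minimal sketch of the neighbor-representation idea follows. It re-describes an image by its similarity to each of a set of text embeddings. Cosine similarity is assumed here as the distance measure, and the toy 2-D embeddings are placeholders for real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coder_representation(image_emb, text_embs):
    """Represent an image by its similarity to each neighbor text."""
    return [cosine(image_emb, t) for t in text_embs]


image_emb = [1.0, 0.0]                              # hypothetical image embedding
text_embs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]    # hypothetical text embeddings
rep = coder_representation(image_emb, text_embs)
print(rep)
```

The resulting vector has one entry per text, which matches the limitation noted in this record that the representation's dimensionality grows with the number of classes and generated texts.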
CODER consistently improves CLIP's zero-shot image classification accuracy across diverse datasets and model architectures.
The two-stage zero-shot classification method further enhances performance by using general CODER for preliminary classification and one-to-one specific CODER for reranking.
CODER-Adapter, applying CODER to few-shot classification, outperforms existing training-free CLIP-based methods on most datasets. |
Generating texts with ATG using LLMs can be computationally expensive, especially with many classes.
CODER's dimensionality, directly proportional to the number of classes, can be problematic for datasets with very few or many classes. |
cross-modal learning, clip, image representation, few-shot learning, zero-shot learning |
2404.17672
Report |
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models |
Ian Huang, Guandao Yang, Leonidas Guibas |
Graphics design is important for various applications, including movie
production and game design. To create a high-quality scene, designers usually
need to spend hours in software like Blender, in which they might need to
interleave and repeat operations, such as connecting material nodes, hundreds
of times. Moreover, slightly different design goals may require completely
different sequences, making automation difficult. In this paper, we propose a
system that leverages Vision-Language Models (VLMs), like GPT-4V, to
intelligently search the design action space to arrive at an answer that can
satisfy a user's intent. Specifically, we design a vision-based edit generator
and state evaluator to work together to find the correct sequence of actions to
achieve the goal. Inspired by the role of visual imagination in the human
design process, we supplement the visual reasoning capabilities of VLMs with
"imagined" reference images from image-generation models, providing visual
grounding of abstract language descriptions. In this paper, we provide
empirical evidence suggesting our system can produce simple but tedious Blender
editing sequences for tasks such as editing procedural materials from text
and/or reference images, as well as adjusting lighting configurations for
product renderings in complex scenes. |
Presents BlenderAlchemy, a system leveraging Vision-Language Models (VLMs) like GPT-4V to automate 3D graphics editing in Blender based on text and image inputs. |
Automating tedious 3D design tasks, like material and lighting design, in software like Blender can boost artist productivity and impact various industries. |
Uses a visual program search approach with a vision-aware edit generator and a visual state evaluator to iteratively refine Blender programs based on user intent. Employs 'visual imagination' using image-generation models to enhance VLM understanding when only text input is provided. |
Successfully edits procedural materials from text descriptions and reference images, outperforming prior work like BlenderGPT.
Demonstrates applicability to lighting design by adjusting lighting configurations based on user intent.
Shows the importance of key components: visual state evaluator, visual edit generator, edit reversion mechanism, and visual imagination module. |
Currently limited to material and lighting editing, with future work exploring animation, modeling, and other design workflows.
Relies on expensive and high-latency VLMs, requiring future optimization or advancements in VLM efficiency. |
vision-language models, 3d graphics editing, blender, procedural material editing, lighting design |
2404.17571
Report |
Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos |
Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao |
Video try-on is a challenging task and has not been well tackled in previous
works. The main obstacle lies in preserving the details of the clothing and
modeling the coherent motions simultaneously. Faced with those difficulties, we
address video try-on by proposing a diffusion-based framework named "Tunnel
Try-on." The core idea is excavating a "focus tunnel" in the input video that
gives close-up shots around the clothing regions. We zoom in on the region in
the tunnel to better preserve the fine details of the clothing. To generate
coherent motions, we first leverage the Kalman filter to construct smooth crops
in the focus tunnel and inject the position embedding of the tunnel into
attention layers to improve the continuity of the generated videos. In
addition, we develop an environment encoder to extract the context information
outside the tunnels as supplementary cues. Equipped with these techniques,
Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and
smooth videos. Demonstrating significant advancements, Tunnel Try-on could be
regarded as the first attempt toward the commercial-level application of
virtual try-on in videos. |
This paper proposes Tunnel Try-on, the first diffusion-based video virtual try-on model demonstrating state-of-the-art performance in complex, real-world scenarios. |
Video virtual try-on offers a more comprehensive and realistic clothing try-on experience than image-based try-on but faces challenges in preserving clothing details and generating coherent motions. Existing methods struggle to handle complex scenarios with diverse clothing, backgrounds, and movements. |
Tunnel Try-on introduces a "focus tunnel" to zoom in on the clothing region, enhancing detail preservation. It leverages a Kalman filter to smooth tunnel movements, injects tunnel embeddings into attention layers for motion consistency, and employs an environment encoder for capturing background context. |
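To show how a Kalman filter can steady a sequence of crop positions, the sketch below applies a scalar filter with a constant-position model to one crop coordinate. The noise variances and the jittery input sequence are illustrative assumptions. The paper's actual state model and parameters are not given in this record.

```python
def kalman_smooth(measurements, process_var=1e-2, meas_var=1.0):
    """Scalar Kalman filter with a constant-position motion model."""
    x = measurements[0]   # state estimate, initialized from the first frame
    p = 1.0               # estimate variance
    out = [x]
    for z in measurements[1:]:
        p = p + process_var          # predict: uncertainty grows
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (z - x)          # update toward the new measurement
        p = (1.0 - k) * p            # posterior variance
        out.append(x)
    return out


def total_variation(xs):
    """Sum of frame-to-frame jumps; lower means smoother motion."""
    return sum(abs(b - a) for a, b in zip(xs, xs[1:]))


raw = [0.0, 10.0] * 5   # hypothetical jittery crop-center x-coordinates
smooth = kalman_smooth(raw)
print(total_variation(raw), total_variation(smooth))
```

Running the same filter independently on each coordinate of the crop box (center and size) would yield a tunnel whose framing moves gradually rather than jumping between frames.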
Tunnel Try-on significantly outperforms existing video try-on methods on standard benchmarks and a newly collected dataset.
It effectively handles various camera movements, human motions, and clothing types, generating high-fidelity try-on results.
Ablation studies demonstrate the contribution of each proposed component to the model's performance. |
The current implementation relies on a pre-trained pose estimator, which might limit its generalization ability to unseen poses.
Further research can explore incorporating user preferences and interactive controls to enhance the personalization and controllability of virtual try-on. |
virtual try-on, video generation, diffusion models, computer vision, fashion technology |
2404.17528
Report |
Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields |
Tianqi Liu, Xinyi Ye, Min Shi, Zihao Huang, Zhiyu Pan, Zhan Peng, Zhiguo Cao |
Generalizable NeRF aims to synthesize novel views for unseen scenes. Common
practices involve constructing variance-based cost volumes for geometry
reconstruction and encoding 3D descriptors for decoding novel views. However,
existing methods show limited generalization ability in challenging conditions
due to inaccurate geometry, sub-optimal descriptors, and decoding strategies.
We address these issues point by point. First, we find the variance-based cost
volume exhibits failure patterns as the features of pixels corresponding to the
same point can be inconsistent across different views due to occlusions or
reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to
amplify the contribution of consistent pixel pairs and suppress inconsistent
ones. Unlike previous methods that solely fuse 2D features into descriptors,
our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D
context into descriptors through spatial and inter-view interaction. When
decoding the descriptors, we observe the two existing decoding strategies excel
in different areas, which are complementary. A Consistency-Aware Fusion (CAF)
strategy is proposed to leverage the advantages of both. We incorporate the
above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware
Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains
state-of-the-art performance across multiple datasets. Code is available at
https://github.com/TQTQliu/GeFu . |
This paper introduces GeFu, a novel generalizable NeRF framework for novel view synthesis in unseen scenes, improving geometry reconstruction, descriptor encoding, and rendering strategies. |
Existing generalizable NeRF methods struggle to achieve satisfactory results in challenging conditions, particularly within occluded areas, due to limitations in geometry accuracy, descriptor quality, and decoding strategies. |
GeFu incorporates Adaptive Cost Aggregation (ACA) for robust geometry estimation, Spatial-View Aggregator (SVA) for 3D context-aware descriptors, and Consistency-Aware Fusion (CAF) for integrating different rendering strategies. |
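The intuition behind down-weighting inconsistent views in a variance-based cost can be sketched as a weighted variance across views. This is not the paper's ACA module, whose weights are learned. The per-view weights and feature values below are hypothetical.

```python
def weighted_variance_cost(view_feats, weights=None):
    """Matching cost for one 3D sample: weighted feature variance across views.

    view_feats: one feature vector per source view.
    weights: optional per-view weights; uniform weights give the plain variance.
    """
    n = len(view_feats)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    dim = len(view_feats[0])
    cost = 0.0
    for d in range(dim):
        mean = sum(w * f[d] for w, f in zip(weights, view_feats)) / total
        cost += sum(w * (f[d] - mean) ** 2 for w, f in zip(weights, view_feats)) / total
    return cost / dim


# Two consistent views and one view corrupted by, e.g., an occlusion.
feats = [[1.0], [1.0], [5.0]]
uniform_cost = weighted_variance_cost(feats)
adaptive_cost = weighted_variance_cost(feats, weights=[1.0, 1.0, 0.1])
print(uniform_cost, adaptive_cost)
```

With uniform weights the single inconsistent view inflates the cost even at the correct depth. Reducing its weight lets the two agreeing views dominate.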
GeFu achieves state-of-the-art performance on DTU, Real Forward-facing, and NeRF Synthetic datasets without per-scene fine-tuning.
After fine-tuning, GeFu surpasses previous generalizable NeRFs and achieves comparable or superior results to NeRF.
GeFu exhibits high accuracy in depth map generation, outperforming other generalizable NeRF methods and even surpassing some MVS methods. |
GeFu is designed for static scenes and may not be directly applicable to dynamic scenes.
The fine-tuning and rendering processes remain computationally demanding for NeRF-based methods, including GeFu. |
novel view synthesis, generalizable nerf, neural radiance fields, multi-view stereo, 3d reconstruction |
2404.17486
Report |
TextGaze: Gaze-Controllable Face Generation with Natural Language |
Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang |
Generating face images with specific gaze information has attracted
considerable attention. Existing approaches typically input gaze values
directly for face generation, which is unnatural and requires annotated gaze
datasets for training, thereby limiting their application. In this paper, we
present a novel gaze-controllable face generation task. Our approach inputs
textual descriptions that describe human gaze and head behavior and generates
corresponding face images. Our work first introduces a text-of-gaze dataset
containing over 90k text descriptions spanning a dense distribution of gaze and
head poses. We further propose a gaze-controllable text-to-face method. Our
method contains a sketch-conditioned face diffusion module and a model-based
sketch diffusion module. We define a face sketch based on facial landmarks and
an eye segmentation map. The face diffusion module generates face images from the
face sketch, and the sketch diffusion module employs a 3D face model to
generate a face sketch from a text description. Experiments on the FFHQ dataset
show the effectiveness of our method. We will release our dataset and code for
future research. |
Introduces TextGaze, a novel gaze-controllable face generation method that uses textual descriptions of human gaze and head behavior instead of numerical gaze values. |
Existing gaze-controllable face generation methods rely on numerical gaze values, which is unnatural and requires annotated datasets, limiting their application. Text descriptions are more intuitive and user-friendly. |
A two-stage method: 1) Text-to-Gaze Generation: Extracts gaze and head pose from text descriptions using CLIP embeddings and a text attention module, then generates a face sketch using a 3D face model. 2) Gaze-Controllable Face Generation: Generates face images from the face sketches using a conditional diffusion model. |
Introduces ToG, the first text-to-gaze dataset with over 90k descriptions, leveraging LLMs for accurate and diverse annotations.
Generates more accurate gaze-controllable face images than baseline methods based on user study.
Achieves comparable or better image quality (IS, FID, KID) compared to baseline text-to-image generation methods. |
Limited variability in low-precision descriptions within the ToG dataset.
Reliance on pre-trained pose estimators for evaluation and comparison with baseline models. |
text-to-image generation, diffusion model, gaze-controllable, face generation, large language models |
2404.17419
Report |
Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation |
Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang |
Using images as prompts for 3D generation demonstrates particularly strong
performance compared to using text prompts alone, since images provide more
intuitive guidance for the 3D generation process. In this work, we delve into
the potential of using multiple image prompts, instead of a single image
prompt, for 3D generation. Specifically, we build on ImageDream, a novel
image-prompt multi-view diffusion model, to support multi-view images as the
input prompt. Our method, dubbed MultiImageDream, reveals that transitioning
from a single-image prompt to multiple-image prompts enhances the performance
of multi-view and 3D object generation according to various quantitative
evaluation metrics and qualitative assessments. This advancement is achieved
without the necessity of fine-tuning the pre-trained ImageDream multi-view
diffusion model. |
This paper introduces MultiImageDream, a novel approach for 3D object generation that leverages multiple image prompts to enhance the quality and consistency of generated 3D models. |
Existing image-to-3D generation methods, while promising, often struggle to maintain consistency in detail, texture, and lighting across different viewpoints. Using multiple image prompts can address these limitations by providing richer guidance during the generation process. |
The authors extend ImageDream, a state-of-the-art image-to-3D method, to support multiple image inputs. They achieve this by modifying the local and pixel controllers of ImageDream to handle multiple images, enabling the model to incorporate information from various viewpoints without requiring fine-tuning. |
Quantitative evaluation metrics, including IS and CLIP scores, demonstrate that MultiImageDream outperforms the baseline ImageDream in multi-view image generation.
MultiImageDream also exhibits competitive performance in 3D generation, though the improvements are less pronounced.
Qualitative assessments reveal that using multiple image prompts effectively reduces artifacts like excessive whitening and lack of detail in the generated 3D models, particularly at viewpoints not covered by the primary image prompt. |
The quantitative evaluation is limited by the small number (39) of prompts used, potentially impacting the generalizability of the findings.
Future work could explore fine-tuning the model specifically for multi-image prompts and investigate methods to explicitly leverage the cross-view relationships between the input images. |
3d generation, image-to-3d, multi-view diffusion, imagedream, multi-image prompts |
2404.17364
Report |
MV-VTON: Multi-View Virtual Try-On with Diffusion Models |
Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo |
The goal of image-based virtual try-on is to generate an image of the target
person naturally wearing the given clothing. However, most existing methods
solely focus on the frontal try-on using the frontal clothing. When the views
of the clothing and person are significantly inconsistent, particularly when
the person's view is non-frontal, the results are unsatisfactory. To address
this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to
reconstruct the dressing results of a person from multiple views using the
given clothes. On the one hand, given that single-view clothes provide
insufficient information for MV-VTON, we instead employ two images, i.e., the
frontal and back views of the clothing, to encompass the complete view as much
as possible. On the other hand, the diffusion models that have demonstrated
superior abilities are adopted to perform our MV-VTON. In particular, we
propose a view-adaptive selection method where hard-selection and
soft-selection are applied to the global and local clothing feature extraction,
respectively. This ensures that the clothing features are roughly fit to the
person's view. Subsequently, we suggest a joint attention block to align and
fuse clothing features with person features. Additionally, we collect an MV-VTON
dataset, i.e., Multi-View Garment (MVG), in which each person has multiple
photos with diverse views and poses. Experiments show that the proposed method
not only achieves state-of-the-art results on the MV-VTON task using our MVG
dataset, but also performs better on the frontal-view virtual try-on task using
the VITON-HD and DressCode datasets. Code and datasets will be publicly released
at https://github.com/hywang2002/MV-VTON . |
Introduces MV-VTON, a novel task aiming to generate realistic multi-view dressed person images using frontal and back clothing views, and proposes a diffusion-based method with a view-adaptive selection mechanism and joint attention block to address it. |
Addresses limitations of existing virtual try-on methods that primarily focus on frontal views and struggle with inconsistent clothing-person poses, particularly in multi-view scenarios. |
Utilizes a diffusion model with a view-adaptive selection mechanism (hard-selection for global features and soft-selection for local features) based on person-clothing pose similarity, and a joint attention block to align and fuse global and local clothing features with person features for detail preservation. |
Achieves state-of-the-art performance on both multi-view (MVG dataset) and frontal-view (VITON-HD and DressCode datasets) virtual try-on tasks, quantitatively and qualitatively.
Effectively handles inconsistencies between clothing and person poses in multi-view scenarios, resulting in more natural and realistic try-on results.
Exhibits superior performance in preserving high-frequency clothing details, such as texts, patterns, and shapes, compared to existing methods. |
Struggles to fully preserve smaller or more complex clothing details, potentially due to information loss during inpainting in latent space.
Limited to using two views (frontal and back) of clothing, which may not fully capture the complexity of certain garments. |
virtual try-on, multi-view, diffusion models, view-adaptive selection, joint attention |
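The view-adaptive selection described above (hard-selection for global features, soft-selection for local features, both driven by person-clothing pose similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the pose vectors, the cosine similarity measure, the softmax weighting, and the function names are hypothetical and are not taken from the MV-VTON code.

```python
from math import exp, sqrt


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def hard_select(person_pose, views):
    """Pick the feature of the single clothing view most similar to the person's pose.

    `views` is a list of (pose_vector, feature_vector) pairs, e.g. front and back.
    """
    return max(views, key=lambda v: cosine(person_pose, v[0]))[1]


def soft_select(person_pose, views, temperature=1.0):
    """Blend all clothing-view features with softmax weights over pose similarity."""
    sims = [cosine(person_pose, pose) / temperature for pose, _ in views]
    top = max(sims)
    weights = [exp(s - top) for s in sims]  # subtract max for numerical stability
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(views[0][1])
    return [sum(w * feat[d] for w, (_, feat) in zip(weights, views)) for d in range(dim)]


# Hypothetical toy data: the person faces the camera, so the frontal view matches.
person = [1.0, 0.0]
front = ([1.0, 0.0], [1.0, 0.0])
back = ([-1.0, 0.0], [0.0, 1.0])

global_feature = hard_select(person, [front, back])
local_feature = soft_select(person, [front, back])
```

In this toy setup the hard selection returns the frontal feature unchanged, while the soft selection returns a blend that is dominated by the frontal view but still keeps some of the back view.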
2404.17255
Report |
SDFD: Building a Versatile Synthetic Face Image Dataset with Diverse Attributes |
Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos |
AI systems rely on extensive training on large datasets to address various
tasks. However, image-based systems, particularly those used for demographic
attribute prediction, face significant challenges. Many current face image
datasets primarily focus on demographic factors such as age, gender, and skin
tone, overlooking other crucial facial attributes like hairstyle and
accessories. This narrow focus limits the diversity of the data and
consequently the robustness of AI systems trained on them. This work aims to
address this limitation by proposing a methodology for generating synthetic
face image datasets that capture a broader spectrum of facial diversity.
Specifically, our approach integrates a systematic prompt formulation strategy,
encompassing not only demographics and biometrics but also non-permanent traits
like make-up, hairstyle, and accessories. These prompts guide a
state-of-the-art text-to-image model in generating a comprehensive dataset of
high-quality realistic images and can be used as an evaluation set in face
analysis systems. Compared to existing datasets, our proposed dataset proves
equally or more challenging in image classification tasks while being much
smaller in size. |
This paper proposes a methodology for generating synthetic face image datasets that are more diverse and inclusive than existing ones, capturing a wider range of facial attributes. |
Existing face image datasets often lack diversity in facial attributes beyond basic demographics, limiting the robustness and fairness of AI systems trained on them. |
The methodology employs a systematic prompt formulation strategy for text-to-image generation, incorporating attributes like hairstyle, accessories, and facial expressions, and utilizes a denoising diffusion probabilistic model (Stable Diffusion 2.1) for generating high-quality realistic images. |
The generated dataset (SDFD) captures a wide variety of facial attributes despite its small size (1000 images).
SDFD proves equally or more challenging in image classification tasks compared to larger datasets like FairFace and LFW.
Visualization of the datasets reveals that SDFD exhibits good spatial dispersion, suggesting a higher degree of facial attribute variety. |
Certain attributes and their combinations were challenging to apply effectively during the image generation process, highlighting limitations in the training data of the generative model.
Stereotypical representations emerged in some generated images, indicating the need for further investigation and mitigation of biases. |
synthetic data generation, face image datasets, diversity and inclusion, text-to-image synthesis, diffusion models |
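As a rough illustration of a systematic prompt formulation strategy, the sketch below enumerates every combination of a few attribute values and fills a fixed template. The attribute lists, the template wording, and the function name are hypothetical placeholders, not the prompts used to build SDFD.

```python
from itertools import product

# Hypothetical attribute values; the real SDFD attribute set is not given here.
ATTRIBUTES = {
    "age": ["young", "middle-aged"],
    "gender": ["woman", "man"],
    "hairstyle": ["short hair", "long hair"],
    "accessory": ["glasses", "a hat"],
}


def build_prompts(attributes):
    """Return one text prompt per combination of attribute values."""
    keys = list(attributes)
    prompts = []
    for combo in product(*(attributes[k] for k in keys)):
        a = dict(zip(keys, combo))
        prompts.append(
            f"a photo of a {a['age']} {a['gender']} with {a['hairstyle']}, "
            f"wearing {a['accessory']}"
        )
    return prompts


prompts = build_prompts(ATTRIBUTES)
```

With two values for each of four attributes this yields 16 distinct prompts, each of which would then be passed to a text-to-image model.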
2404.17254
Report |
Trinity Detector: text-assisted and attention mechanisms based spectral fusion for diffusion generation image detection |
Jiawei Song, Dengpan Ye, Yunming Zhang |
Artificial Intelligence Generated Content (AIGC) techniques, represented by
text-to-image generation, have led to a malicious use of deep forgeries,
raising concerns about the trustworthiness of multimedia content. Adapting
traditional forgery detection methods to diffusion models proves challenging.
Thus, this paper proposes a forgery detection method explicitly designed for
diffusion models called Trinity Detector. Trinity Detector incorporates
coarse-grained text features through a CLIP encoder, coherently integrating
them with fine-grained artifacts in the pixel domain for comprehensive
multimodal detection. To heighten sensitivity to diffusion-generated image
features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed,
extracting spectral inconsistencies through adaptive fusion of diverse
frequency bands and further integrating spatial co-occurrence of the two
modalities. Extensive experimentation validates that our Trinity Detector
method outperforms several state-of-the-art methods: its performance is
competitive across all datasets, with up to 17.6% improvement in transferability
on the diffusion datasets. |
This paper proposes Trinity Detector, a novel method for detecting images generated by diffusion models by leveraging multi-spectral channel attention and integrating text-based and image-based features. |
The rise of diffusion models in AI-generated content (AIGC) necessitates new forgery detection methods specifically designed for this technology due to its unique characteristics compared to traditional generation techniques. |
The Trinity Detector uses a Multi-spectral Channel Attention Fusion Unit (MCAF) to analyze spectral inconsistencies in the frequency domain and combines it with text features extracted using a CLIP encoder, providing a comprehensive multimodal detection approach. |
Trinity Detector outperforms state-of-the-art detectors, especially on diffusion-generated images.
The method shows strong generalization ability, effectively detecting forgeries from untrained diffusion models.
Trinity Detector exhibits robust performance even with image perturbations like Gaussian blur and JPEG compression. |
The paper acknowledges the need for evaluating the method on a wider range of diffusion models beyond Stable Diffusion and GLIDE.
Future work could explore alternative frequency domain analysis techniques or incorporate additional modalities for further performance improvement. |
diffusion models, forgery detection, deepfakes, multimodal learning, frequency domain analysis |
2404.17230
Report |
ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion |
Ziyue Zhang, Mingbao Lin, Rongrong Ji |
We introduce ObjectAdd, a training-free diffusion modification method to add
user-expected objects into a user-specified area. The motive of ObjectAdd stems
from: first, describing everything in one prompt can be difficult, and second,
users often need to add objects into the generated image. To suit real-world
use, ObjectAdd maintains accurate image consistency after adding
objects with technical innovations in: (1) embedding-level concatenation to
ensure text embeddings coalesce correctly; (2) object-driven layout control with
latent and attention injection to ensure objects land in the user-specified area;
(3) prompted image inpainting in an attention refocusing & object expansion
fashion to ensure the rest of the image stays the same. With a text-prompted image,
ObjectAdd allows users to specify a box and an object, and achieves: (1)
adding the object inside the box area; (2) exact content outside the box area; (3)
flawless fusion between the two areas. |
Introduces ObjectAdd, a training-free method for adding user-specified objects into pre-existing images generated by diffusion models while preserving the rest of the image content. |
Addresses limitations of text-to-image models that struggle to convey spatial relationships and necessitate tedious multi-step modifications to achieve desired results. |
Combines embedding-level concatenation for accurate text prompts, object-driven layout control with latent and attention injection for precise object placement, and prompted image inpainting with attention refocusing and object expansion for seamless integration and background consistency. |
Successfully adds objects into user-defined areas while maintaining image consistency.
Outperforms existing methods like DALL-E 3, P2P, and SD-v1-4 qualitatively and quantitatively.
Demonstrates versatility by accurately adding diverse objects and handling complex object-background interactions. |
Usability limitations for non-experts due to the method's complexity and hyperparameter tuning.
Performance dependency on the pre-trained SD-v1-4 model, potentially limiting effectiveness in complex scenarios. |
diffusion model, training-free, text to image, image editing, object insertion |
2404.16994
Report |
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng |
Vision-language pre-training has significantly elevated performance across a
wide range of image-language applications. Yet, the pre-training process for
video-related tasks demands exceptionally large computational and data
resources, which hinders the progress of video-language models. This paper
investigates a straightforward, highly efficient, and resource-light approach
to adapting an existing image-language pre-trained model for dense video
understanding. Our preliminary experiments reveal that directly fine-tuning
pre-trained image-language models with multiple frames as inputs on video
datasets leads to performance saturation or even a drop. Our further
investigation reveals that it is largely attributed to the bias of learned
high-norm visual features. Motivated by this finding, we propose a simple but
effective pooling strategy to smooth the feature distribution along the
temporal dimension and thus reduce the dominant impacts from the extreme
features. The new model is termed Pooling LLaVA, or PLLaVA for short. PLLaVA
achieves new state-of-the-art performance on modern benchmark datasets for both
video question-answering and captioning tasks. Notably, on the recently popular
VideoChatGPT benchmark, PLLaVA achieves an average score of 3.48 out of 5 across
five evaluated dimensions, exceeding the previous SOTA results from GPT4V
(IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves
58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V
(IG-VLM). Code is available at https://pllava.github.io/ |
This paper proposes Pooling LLaVA (PLLaVA), a simple yet effective method for adapting pre-trained image-language models to video understanding by introducing a pooling strategy to smooth feature distribution along the temporal dimension and reduce the impact of extreme features. |
Adapting existing image-language models for video understanding is crucial for efficient and resource-light model development, but directly fine-tuning these models on videos can lead to performance saturation and vulnerability to prompt changes. |
PLLaVA encodes video frames using an image encoder, applies an average pooling operation to the features along the spatial dimension, and feeds the pooled features to a pre-trained LLM with LoRA for adaptation. Post-training optimization is also used to merge the weights of the original image LLM and the video-trained LLM. |
PLLaVA achieves state-of-the-art performance on various video understanding benchmarks, including VideoQA and MVBench.
PLLaVA exhibits strong performance in generating detailed video captions, outperforming previous methods in aspects like correctness of information, detail orientation, and context understanding.
The paper provides analysis on the impact of pooling strategies and the influence of LoRA weight fusion, offering insights into adapting image-language models for video tasks. |
PLLaVA's performance on tasks requiring strong reasoning ability and imagination, such as counterfactual inference, can be further improved.
Exploring the use of specialized video encoders and more advanced temporal modeling techniques could further enhance PLLaVA’s capabilities. |
video understanding, multimodal learning, large language models, pooling methods, vision-language models |
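The smoothing effect of average pooling on extreme features can be seen in a small sketch. It pools non-overlapping groups of feature vectors along one axis of a sequence; the toy values, the window size, and the function name are assumptions for illustration, and the real model pools high-dimensional visual features rather than 2-element lists.

```python
def avg_pool(vectors, window):
    """Average non-overlapping groups of `window` consecutive feature vectors."""
    if len(vectors) % window != 0:
        raise ValueError("sequence length must be divisible by window")
    pooled = []
    for start in range(0, len(vectors), window):
        group = vectors[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / window for d in range(dim)])
    return pooled


# Hypothetical per-frame features; the third one is a high-norm outlier.
frame_features = [[1.0, 1.0], [1.0, 1.0], [9.0, 9.0], [1.0, 1.0]]
pooled = avg_pool(frame_features, window=2)

max_before = max(max(v) for v in frame_features)
max_after = max(max(v) for v in pooled)
```

Here the outlier value 9.0 is averaged with its neighbour, so the largest pooled value drops to 5.0 while the number of feature vectors is halved.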
2404.16829
Report |
Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials |
Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, Dahua Lin |
Physically realistic materials are pivotal in augmenting the realism of 3D
assets across various applications and lighting conditions. However, existing
3D assets and generative models often lack authentic material properties.
Manual assignment of materials using graphic software is a tedious and
time-consuming task. In this paper, we exploit advancements in Multimodal Large
Language Models (MLLMs), particularly GPT-4V, to present a novel approach,
Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and
describe materials, allowing the construction of a detailed material library.
2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V
precisely identifies and aligns materials with the corresponding components of
3D objects. 3) The correctly matched materials are then meticulously applied as
reference for the new SVBRDF material generation according to the original
albedo map, significantly enhancing their visual authenticity. Make-it-Real
offers a streamlined integration into the 3D content creation workflow,
showcasing its utility as an essential tool for developers of 3D assets. |
Make-it-Real is a novel framework that leverages Multimodal Large Language Models (MLLMs), specifically GPT-4V, to automatically assign and generate physically realistic materials for 3D objects with only albedo maps. |
Many existing 3D assets and generative models lack realistic material properties. Manual material assignment is tedious and time-consuming. Make-it-Real automates this process, enhancing realism and streamlining 3D content creation. |
The pipeline involves rendering and segmenting the 3D mesh, retrieving matching materials from a meticulously annotated material library using GPT-4V and hierarchical text prompts, and generating spatially varying BRDF maps (including roughness, metallic, specular, normal, displacement, height) by referencing the original albedo map. |
Make-it-Real enhances the realism of 3D assets, generating high-fidelity, photorealistic textures with diverse reflective effects under different lighting conditions.
The framework ensures part-specific material matching, accurately identifying and applying different materials to various components of a 3D object.
It generates comprehensive material maps compatible with downstream rendering engines, streamlining the integration of refined assets into existing workflows. |
The method currently lacks support for reverse transformation from shaded texture maps to albedo maps.
The quality of the base 3D object significantly impacts the accuracy of material assignment, particularly when ground truth text descriptions are unavailable. |
3d material generation, multimodal large language models, gpt-4v, texture synthesis, physically based rendering |
2404.16821
Report |
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites |
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang |
In this report, we introduce InternVL 1.5, an open-source multimodal large
language model (MLLM) to bridge the capability gap between open-source and
proprietary commercial models in multimodal understanding. We introduce three
simple improvements: (1) Strong Vision Encoder: we explored a continuous
learning strategy for the large-scale vision foundation model -- InternViT-6B,
boosting its visual understanding capabilities, and allowing it to be
transferred and reused in different LLMs. (2) Dynamic High-Resolution: we
divide images into tiles ranging from 1 to 40 of 448×448 pixels
according to the aspect ratio and resolution of the input images, which
supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we
carefully collected a high-quality bilingual dataset that covers common scenes
and document images, and annotated it with English and Chinese question-answer
pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
We evaluate InternVL 1.5 through a series of benchmarks and comparative
studies. Compared to both open-source and proprietary models, InternVL 1.5
shows competitive performance, achieving state-of-the-art results in 8 of 18
benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL. |
InternVL 1.5, an open-source multimodal large language model (MLLM), is introduced to bridge the capability gap between open-source and proprietary models in multimodal understanding. |
There is a noticeable divide between the capabilities of open-source and proprietary commercial MLLMs, particularly in parameter scale, image resolution handling, and multilingual capabilities. |
The paper introduces three primary improvements: (1) Continuous learning for a large-scale vision foundation model (InternViT-6B) to boost visual understanding. (2) Dynamic high-resolution strategy using image tiling (up to 4K) for detailed scene and document understanding. (3) Creation of a high-quality bilingual dataset covering diverse scenes, documents, and conversations in English and Chinese. |
InternVL 1.5 achieves state-of-the-art results in 8 out of 18 multimodal benchmarks, surpassing some leading proprietary models.
The model shows competitive performance in OCR-related tasks, exceeding proprietary models on benchmarks like ChartQA and OCRBench.
InternVL 1.5 demonstrates strong bilingual proficiency, particularly excelling in Chinese-related tasks compared to other open-source and proprietary models. |
Despite improvements, InternVL 1.5 still lags behind top proprietary models in multi-turn conversations, suggesting a direction for future research.
The model's performance on certain tasks slightly declined compared to its predecessor due to the smaller language model used. |
multimodal large language model, vision-language understanding, open-source, dynamic high-resolution, bilingual |
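The dynamic high-resolution idea (1 to 40 tiles of 448×448 pixels, chosen from the input's aspect ratio and resolution) can be sketched as a small grid search. Only the tile size and the tile-count limit come from the abstract; the selection rule below (closest aspect ratio, ties broken by closest pixel area) and the function names are simplifying assumptions, not the InternVL 1.5 implementation.

```python
TILE = 448       # tile side in pixels (from the abstract)
MAX_TILES = 40   # maximum number of tiles (from the abstract)


def choose_grid(width, height, max_tiles=MAX_TILES, tile=TILE):
    """Return (cols, rows) with cols * rows <= max_tiles.

    Assumed rule: minimise the aspect-ratio difference first, then the
    difference between total tile area and the image's pixel count.
    """
    target = width / height
    pixels = width * height
    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            key = (abs(cols / rows - target),
                   abs(cols * rows * tile * tile - pixels))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


def tile_boxes(cols, rows, tile=TILE):
    """Crop boxes (left, top, right, bottom) on the image resized to (cols*tile, rows*tile)."""
    return [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            for r in range(rows) for c in range(cols)]


grid_4k = choose_grid(3840, 2160)
boxes_4k = tile_boxes(*grid_4k)
```

Under this assumed rule a 3840×2160 input maps to a 7×4 grid, i.e. 28 tiles, which stays within the 40-tile limit.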
2404.16771
Report |
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving |
Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang |
Diffusion-based technologies have made significant strides, particularly in
personalized and customized facial generation. However, existing methods face
challenges in achieving high-fidelity and detailed identity (ID) consistency,
primarily due to insufficient fine-grained control over facial areas and the
lack of a comprehensive strategy for ID preservation by fully considering
intricate facial details and the overall face. To address these limitations, we
introduce ConsistentID, an innovative method crafted for
diverse identity-preserving portrait generation under fine-grained multimodal
facial prompts, utilizing only a single reference image. ConsistentID comprises
two key components: a multimodal facial prompt generator that combines facial
features, corresponding facial descriptions and the overall facial context to
enhance precision in facial details, and an ID-preservation network optimized
through the facial attention localization strategy, aimed at preserving ID
consistency in facial regions. Together, these components significantly enhance
the accuracy of ID preservation by introducing fine-grained multimodal ID
information from facial regions. To facilitate training of ConsistentID, we
present a fine-grained portrait dataset, FGID, with over 500,000 facial images,
offering greater diversity and comprehensiveness than existing public facial
datasets. Experimental results
substantiate that our ConsistentID achieves exceptional precision and diversity
in personalized facial generation, surpassing existing methods in the MyStyle
dataset. Furthermore, while ConsistentID introduces more multimodal ID
information, it maintains a fast inference speed during generation. |
This paper introduces ConsistentID, a novel method for generating high-fidelity, diverse, and identity-preserving portraits using a single reference image and multimodal fine-grained prompts. |
Existing methods for personalized portrait generation struggle to maintain accurate identity consistency and high-fidelity details, particularly in fine-grained facial features. |
ConsistentID leverages a multimodal facial prompt generator to combine facial features, descriptions, and overall context. It also utilizes an ID-preservation network with facial attention localization to ensure consistent identity across facial regions. Additionally, a new fine-grained dataset (FGID) is introduced for training and evaluation. |
ConsistentID outperforms state-of-the-art methods in identity consistency, diversity, and fidelity, as demonstrated by both quantitative metrics and qualitative comparisons.
The proposed facial attention localization strategy effectively prevents the blending of identities between facial regions, leading to improved ID preservation in generated images.
The introduction of the FGID dataset and a new fine-grained identity consistency metric provide a valuable resource for advancing research in facial generation. |
The use of MLLM in ConsistentID may introduce limitations in handling pose and expression variations.
Further research is needed to address potential ethical concerns related to privacy and misinformation. |
portrait generation, identity preservation, multimodal learning, fine-grained control, diffusion models |
2404.16752
Report |
TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation |
Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, Michael J. Black |
We address the problem of regressing 3D human pose and shape from a single
image, with a focus on 3D accuracy. The current best methods leverage large
datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust
performance. With such methods, we observe a paradoxical decline in 3D pose
accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and
the use of an approximate camera projection model. We quantify the error
induced by current camera models and show that fitting 2D keypoints and p-GT
accurately causes incorrect 3D poses. Our analysis defines the invalid
distances within which minimizing 2D and p-GT losses is detrimental. We use
this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that
penalizes gross 2D and p-GT losses but not smaller ones. With such a loss,
there are many 3D poses that could equally explain the 2D evidence. To reduce
this ambiguity we need a prior over valid human poses but such priors can
introduce unwanted bias. To address this, we exploit a tokenized representation
of human pose and reformulate the problem as token prediction. This restricts
the estimated poses to the space of valid poses, effectively providing a
uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that
our reformulated keypoint loss and tokenization allows us to train on
in-the-wild data while improving 3D accuracy over the state-of-the-art. Our
models and code are available for research at https://tokenhmr.is.tue.mpg.de. |
Introduces TokenHMR, a novel 3D human pose and shape regression method that leverages a token-based pose representation and a new loss function, TALS, to improve 3D accuracy. |
Addresses the trade-off between 2D and 3D accuracy in current HPS regression methods caused by approximate camera models, leading to biased pose estimations. |
Combines a tokenized pose representation using VQ-VAE to learn a prior over valid poses and introduces TALS, a loss function that reduces reliance on noisy 2D and pseudo-ground truth data. |
Achieves state-of-the-art 3D accuracy on EMDB and 3DPW datasets.
Demonstrates robustness to image truncation and ambiguous poses.
Shows that tokenization leads to more accurate and robust pose estimations compared to continuous regression. |
2D alignment can be inaccurate under severe perspective distortion due to the use of a weak-perspective camera model.
Global orientation estimation can be ambiguous in cases where body cues are limited. |
human pose and shape estimation, 3d human pose, tokenization, vq-vae, camera bias |
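The abstract describes TALS as a loss that penalizes gross 2D and p-GT errors but not smaller ones. A minimal sketch of that idea is a thresholded scaling of per-element errors. The exact formulation, the threshold value, the scaling factor, and the function name below are assumptions for illustration and are not taken from the TokenHMR code.

```python
def threshold_scaled_loss(errors, threshold, low_scale=0.0):
    """Mean of per-element errors, down-weighting those at or below `threshold`.

    Errors above the threshold count in full; smaller ones are multiplied by
    `low_scale` (0.0 ignores them entirely). This is an assumed form only.
    """
    scaled = [e if e > threshold else low_scale * e for e in errors]
    return sum(scaled) / len(scaled)


# Hypothetical keypoint errors: two small residuals and one gross error.
errors = [0.01, 0.02, 0.5]
loss = threshold_scaled_loss(errors, threshold=0.05)
plain_mean = sum(errors) / len(errors)
```

With these toy values only the gross error contributes, so the thresholded loss is lower than the plain mean and exerts no pressure to fit the two small residuals any further.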
2404.16748
Report |
TELA: Text to Layer-wise 3D Clothed Human Generation |
Junting Dong, Qi Fang, Zehuan Huang, Xudong Xu, Jingbo Wang, Sida Peng, Bo Dai |
This paper addresses the task of 3D clothed human generation from textual
descriptions. Previous works usually encode the human body and clothes as a
holistic model and generate the whole model in a single-stage optimization,
which makes them struggle for clothing editing and meanwhile lose fine-grained
control over the whole generation process. To solve this, we propose a
layer-wise clothed human representation combined with a progressive
optimization strategy, which produces clothing-disentangled 3D human models
while providing control capacity for the generation process. The basic idea is
to progressively generate a minimally clothed human body and then layer-wise clothes.
During clothing generation, a novel stratified compositional rendering method
is proposed to fuse multi-layer human models, and a new loss function is
utilized to help decouple the clothing model from the human body. The proposed
method achieves high-quality disentanglement, which thereby provides an
effective way for 3D garment generation. Extensive experiments demonstrate that
our approach achieves state-of-the-art 3D clothed human generation while also
supporting cloth editing applications such as virtual try-on. Project page:
http://jtdong.com/tela_layer/ |
This paper proposes a novel layer-wise representation and a progressive optimization strategy for generating 3D clothed humans from textual descriptions. |
Previous methods struggle with clothing editing and lack fine-grained control due to their holistic approach to modeling the body and clothing. |
The proposed method progressively generates a minimally-clothed body followed by layer-wise clothes using a stratified compositional rendering method for fusion and a new loss function to decouple clothing from the body. |
Achieves state-of-the-art 3D clothed human generation from text.
Provides high-quality disentanglement between clothing and body.
Enables cloth editing applications such as virtual try-on. |
Specific details about the novel loss function and its effectiveness are absent.
Quantitative evaluation of the disentanglement quality compared to previous works is missing. |
text-to-3d generation, clothed human generation, layer-wise representation, progressive optimization, virtual try-on |
2404.16687
Report |
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge |
Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, Jianquan Yang, Weigang Wang, Xi Fang, Xiaoxin Lv, Jun Yan, Tianwu Zhi, Yabin Zhang, Yaohui Li, Yang Li, Jingwen Xu, Jianzhao Liu, Yiting Liao, Junlin Li, Zihao Yu, Yiting Lu, Xin Li, Hossein Motamednia, S. Farhad Hosseini-Benvidi, Fengbin Guan, Ahmad Mahmoudi-Aznaveh, Azadeh Mansouri, Ganzorig Gankhuyag, Kihwan Yoon, Yifang Xu, Haotian Fan, Fangyuan Kong, Shiling Zhao, Weifeng Dong, Haibing Yin, Li Zhu, Zhiling Wang, Bingchen Huang, Avinab Saha, Sandeep Mishra, Shashank Gupta, Rajesh Sureddi, Oindrila Saha, Luigi Celona, Simone Bianco, Paolo Napoletano, Raimondo Schettini, Junfeng Yang, Jing Fu, Wei Zhang, Wenzhi Cao, Limei Liu, Han Peng, Weijun Yuan, Zhan Li, Yihang Cheng, Yifan Deng, Haohui Li, Bowen Qu, Yao Li, Shuqing Luo, Shunzhou Wang, Wei Gao, Zihao Lu, Marcos V. Conde, Xinrui Wang, Zhibo Chen, Ruling Liao, Yan Ye, Qiulin Wang, Bing Li, Zhaokun Zhou, Miao Geng, Rui Chen, Xin Tao, Xiaoyu Liang, Shangkun Sun, Xingyuan Ma, Jiaze Li, Mengduo Yang, Haoran Xu, Jie Zhou, Shiding Zhu, Bohan Yu, Pengfei Chen, Xinrui Xu, Jiabin Shen, Zhichao Duan, Erfan Asadi, Jiahe Liu, Qi Yan, Youran Qu, Xiaohui Zeng, Lele Wang, Renjie Liao |
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated
Content Challenge, which will be held in conjunction with the New Trends in
Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge
addresses a major problem in the field of image and video processing,
namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for
AI-Generated Content (AIGC). The challenge is divided into the image track and
the video track. The image track uses the AIGIQA-20K, which contains 20,000
AI-Generated Images (AIGIs) generated by 15 popular generative models. The
image track has a total of 318 registered participants. A total of 1,646
submissions are received in the development phase, and 221 submissions are
received in the test phase. Finally, 16 participating teams submitted their
models and fact sheets. The video track uses the T2VQA-DB, which contains
10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V)
models. A total of 196 participants have registered in the video track. A total
of 991 submissions are received in the development phase, and 185 submissions
are received in the test phase. Finally, 12 participating teams submitted their
models and fact sheets. Some methods have achieved better results than baseline
methods, and the winning methods in both tracks have demonstrated superior
prediction performance on AIGC. |
This paper summarizes the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, focusing on developing objective IQA and VQA methods for AI-generated images and videos. |
The challenge addresses the critical need for accurate quality assessment of AI-generated content, a growing presence in daily life, to guide the improvement of generative models and enhance user experience. |
The challenge, divided into image and video tracks, utilized AIGIQA-20K and T2VQA-DB datasets, respectively, with participants tasked to predict perceptual quality scores of AI-generated images/videos based on training data and corresponding MOSs. The evaluation used SRCC and PLCC to measure prediction accuracy. |
The challenge attracted 514 participants and yielded 28 valid submissions.
Most submitted methods outperformed baseline I/VQA models, indicating progress in aligning objective metrics with human perception.
Top-performing methods demonstrated strong correlation between predicted scores and MOSs, emphasizing their potential in guiding the generation of higher-quality content. |
Limited number of AIGV datasets compared to AIGI datasets.
Future research could explore more sophisticated methods for multi-dimensional quality assessment of AIGC. |
ai-generated content, image quality assessment, video quality assessment, generative models, perceptual quality |
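The two evaluation metrics named in this record can be illustrated with a small sketch. The pure-Python code below computes PLCC as the Pearson correlation between predicted scores and MOSs, and SRCC as the Pearson correlation of their average ranks. The function names and the toy scores are illustrative assumptions, not the challenge's official scoring script.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): predictions are monotonic in MOS but not linear.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
srcc = spearman(mos, predicted)
plcc = pearson(mos, predicted)
```

In this toy case the ranking is perfect, so SRCC is 1, while PLCC is slightly lower because the relationship is not linear. That difference is why the two metrics are typically reported together.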
2404.16612
Report |
MuseumMaker: Continual Style Customization without Catastrophic Forgetting |
Chenxi Liu, Gan Sun, Wenqi Liang, Jiahua Dong, Can Qin, Yang Cong |
Pre-trained large text-to-image (T2I) models with an appropriate text prompt
have attracted growing interest in the field of customized image generation.
However, the catastrophic forgetting issue makes it hard to continually
synthesize new user-provided styles while retaining satisfying results on
learned styles. In this paper, we propose MuseumMaker, a method that enables the
synthesis of images by following a set of customized styles in a never-ending
manner, and gradually accumulates these creative artistic works as a Museum.
When faced with a new customization style, we develop a style distillation
loss module to extract and learn the styles of the training data for new image
generation. It can minimize the learning biases caused by content of new
training images, and address the catastrophic overfitting issue induced by
few-shot images. To deal with catastrophic forgetting amongst past learned
styles, we devise a dual regularization for shared-LoRA module to optimize the
direction of model update, which could regularize the diffusion model from both
weight and feature aspects, respectively. Meanwhile, to further preserve
historical knowledge from past styles and address the limited representability
of LoRA, we consider a task-wise token learning module where a unique token
embedding is learned to denote a new style. As each new user-provided style
arrives, MuseumMaker can capture the nuances of the new style while
maintaining the details of learned styles. Experimental results on diverse
style datasets validate the effectiveness of our proposed MuseumMaker method,
showcasing its robustness and versatility across various scenarios. |
This paper presents MuseumMaker, a novel approach for continual style customization in text-to-image diffusion models that addresses catastrophic forgetting and overfitting. |
Enabling diffusion models to continuously learn new styles from user-provided images without forgetting previously learned ones is crucial for personalized and evolving image generation. |
MuseumMaker employs a style distillation loss to extract pure style representations, a dual regularization for shared-LoRA to preserve style knowledge during optimization, and task-wise token learning to capture distinct features of each style. |
MuseumMaker demonstrates superior performance compared to existing methods, showing significant improvements in style loss, FID, and CLIP score.
The ablation studies highlight the effectiveness of each proposed module in mitigating catastrophic forgetting and overfitting.
MuseumMaker proves to be efficient with minimal training parameters and competitive training time while achieving near-upper-bound performance. |
The current implementation focuses on a limited number of styles.
Exploring more sophisticated techniques for knowledge distillation and preservation in continual learning settings could further enhance the model's capabilities. |
text-to-image generation, style customization, continual learning, diffusion models, catastrophic forgetting |
2404.16510
Report |
Interactive3D: Create What You Want by Interactive 3D Generation |
Shaocong Dong, Lihe Ding, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu |
3D object generation has undergone significant advancements, yielding
high-quality results. However, current methods fall short of achieving precise user control,
often yielding results that do not align with user expectations, thus limiting
their applicability. User-envisioning 3D object generation faces significant
challenges in realizing its concepts using current generative models due to
limited interaction capabilities. Existing methods mainly offer two approaches:
(i) interpreting textual instructions with constrained controllability, or (ii)
reconstructing 3D objects from 2D images. Both of them limit customization to
the confines of the 2D reference and potentially introduce undesirable
artifacts during the 3D lifting process, restricting the scope for direct and
versatile 3D modifications. In this work, we introduce Interactive3D, an
innovative framework for interactive 3D generation that grants users precise
control over the generative process through extensive 3D interaction
capabilities. Interactive3D is constructed in two cascading stages, utilizing
distinct 3D representations. The first stage employs Gaussian Splatting for
direct user interaction, allowing modifications and guidance of the generative
direction at any intermediate step through (i) Adding and Removing components,
(ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv)
Semantic Editing. Subsequently, the Gaussian splats are transformed into
InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to
further add details and extract the geometry in the second stage. Our
experiments demonstrate that Interactive3D markedly improves the
controllability and quality of 3D generation. Our project webpage is available
at https://interactive-3d.github.io/. |
Introduces Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. |
Current 3D object generation methods lack precise user control, relying on text prompts or 2D images that limit controllability and quality. |
A two-stage process leveraging Gaussian Splatting for direct user interaction (adding/removing parts, dragging, transformations, semantic editing) followed by conversion to InstantNGP for refinement with an Interactive Hash Refinement module. |
Achieves high-quality and controllable 3D generation with demonstrated examples like modifying a human pose and creating a dragon.
Outperforms state-of-the-art methods in CLIP R-Precision, indicating improved controllability.
Enables efficient 3D generation due to fast Gaussian Splatting initialization and integration of interactions within the optimization process. |
Susceptible to failure under excessive or unreasonable user manipulation.
Inherits common challenges of current generative techniques, including color saturation issues. |
3d generation, interactive design, gaussian splatting, instantngp, user controllability |
2404.16375
Report |
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs |
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang |
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of
GPT-4V, by enabling the model to associate visual objects with tags inserted on
the image. These tags, marked with alphanumerics, can be indexed via text
tokens for easy reference. Despite the extraordinary performance from GPT-4V,
we observe that other Multimodal Large Language Models (MLLMs) struggle to
understand these visual tags. To promote the learning of SoM prompting for
open-source models, we propose a new learning paradigm: "list items one by
one," which asks the model to enumerate and describe all visual tags placed on
the image following the alphanumeric orders of tags. By integrating our curated
dataset with other visual instruction tuning datasets, we are able to equip
existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our
finetuned SoM models on five MLLM benchmarks. We find that this new dataset,
even in a relatively small size (10k-30k images with tags), significantly
enhances visual reasoning capabilities and reduces hallucinations for MLLMs.
Perhaps surprisingly, these improvements persist even when the visual tags are
omitted from input images during inference. This suggests the potential of
"list items one by one" as a new paradigm for training MLLMs, which strengthens
the object-text alignment through the use of visual tags in the training stage.
Finally, we conduct analyses by probing trained models to understand the
working mechanism of SoM. Our code and data are available at
https://github.com/zzxslp/SoM-LLaVA. |
This paper introduces "list items one by one," a novel learning paradigm and dataset for enhancing multimodal large language models (MLLMs) with Set-of-Mark (SoM) prompting capabilities. |
SoM prompting enables MLLMs to ground visual objects to tags on images, facilitating tasks like GUI navigation and robot interaction. However, this ability is predominantly observed in GPT-4V, limiting its wider adoption. |
The authors curate a dataset using Semantic-SAM to tag objects in MS-COCO images. GPT-4V then generates descriptions for these tags, training MLLMs to enumerate tagged items in alphanumeric order. |
With limited data (10k samples), MLLMs significantly improve in tag listing accuracy, even surpassing zero-shot GPT-4V.
Fine-tuned SoM MLLMs demonstrate enhanced performance on five MLLM benchmarks (POPE, MME, SEED-Bench, LLaVA-Bench, MM-Vet), indicating improved visual reasoning.
Surprisingly, models trained with SoM data exhibit superior performance even without tags during inference, highlighting the paradigm's potential for general MLLM training. |
The study primarily focuses on MS-COCO images, potentially limiting generalization to other datasets.
Future work can explore alternative tagging methods and data sources to further enhance SoM prompting. |
multimodal large language models, set-of-mark prompting, visual grounding, visual reasoning, instruction tuning |
2404.16323
Report |
DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction |
Jiamin Wu, Kenkun Liu, Han Gao, Xiaoke Jiang, Lei Zhang |
In this paper, we study the problem of 3D reconstruction from a single-view
RGB image and propose a novel approach called DIG3D for 3D object
reconstruction and novel view synthesis. Our method utilizes an encoder-decoder
framework which generates 3D Gaussians in decoder with the guidance of
depth-aware image features from encoder. In particular, we introduce the use of
deformable transformer, allowing efficient and effective decoding through 3D
reference point and multi-layer refinement adaptations. By harnessing the
benefits of 3D Gaussians, our approach offers an efficient and accurate
solution for 3D reconstruction from single-view images. We evaluate our method
on the ShapeNet SRN dataset, achieving PSNR of 24.21 and 24.98 on the car and
chair datasets, respectively. These results outperform the recent method by
around 2.25%, demonstrating the effectiveness of our method. |
Proposes DIG3D, a novel encoder-decoder framework leveraging deformable transformers and 3D Gaussian splatting for efficient single-view 3D object reconstruction and novel view synthesis. |
Addresses limitations of previous methods, such as incorrect geometry and reliance on shortcuts, while maintaining fast rendering speed. |
Combines pixel-aligned features from a UNet and depth-aware features from a pretrained DINOv2 model. Uses a deformable transformer decoder with 3D reference points and multi-layer refinement to predict 3D Gaussian parameters. |
Outperforms Splatter Image on the ShapeNet SRN dataset, particularly for views far from the input view.
Achieves high rendering quality, accurately capturing occlusions and producing realistic renderings.
Reconstructs meaningful 3D structures, evidenced by the visualization of Gaussian centers as point clouds. |
Training time is longer compared to Splatter Image.
Further improvements in geometry reconstruction are possible. |
3d reconstruction, novel view synthesis, single-view reconstruction, deformable transformer, 3d gaussian splatting |
2404.16306
Report |
TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models |
Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks |
Text-conditioned image-to-video generation (TI2V) aims to synthesize a
realistic video starting from a given image (e.g., a woman's photo) and a text
description (e.g., "a woman is drinking water."). Existing TI2V frameworks
often require costly training on video-text datasets and specific model designs
for text and image conditioning. In this paper, we propose TI2V-Zero, a
zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V)
diffusion model to be conditioned on a provided image, enabling TI2V generation
without any optimization, fine-tuning, or introducing external modules. Our
approach leverages a pretrained T2V diffusion foundation model as the
generative prior. To guide video generation with the additional image input, we
propose a "repeat-and-slide" strategy that modulates the reverse denoising
process, allowing the frozen diffusion model to synthesize a video
frame-by-frame starting from the provided image. To ensure temporal continuity,
we employ a DDPM inversion strategy to initialize Gaussian noise for each newly
synthesized frame and a resampling technique to help preserve visual details.
We conduct comprehensive experiments on both domain-specific and open-domain
datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V
model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks
such as video infilling and prediction when provided with more images. Its
autoregressive design also supports long video generation. |
This paper introduces TI2V-Zero, a zero-shot, tuning-free method that enables pretrained text-to-video (T2V) diffusion models to be conditioned on a provided image, facilitating text-conditioned image-to-video (TI2V) generation without any optimization or fine-tuning. |
Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning, limiting their flexibility and generalizability. TI2V-Zero overcomes these limitations by leveraging pretrained T2V models, making it efficient and widely applicable. |
The approach leverages a pretrained T2V diffusion model and introduces a "repeat-and-slide" strategy to guide video generation. It modulates the reverse denoising process, allowing the model to synthesize video frame-by-frame from the provided image. A DDPM inversion strategy initializes Gaussian noise for temporal consistency, and a resampling technique preserves visual details. |
TI2V-Zero consistently outperforms a recent open-domain TI2V model in experiments on domain-specific (MUG, UCF-101) and open-domain datasets.
The method effectively preserves identity and background details, resulting in more visually pleasing and temporally coherent videos.
TI2V-Zero demonstrates its versatility by extending to other video-related tasks, including video infilling, prediction, and long video generation. |
The generation quality is limited by the capabilities of the pretrained T2V model.
The generation speed is slow due to the need to run the entire diffusion process for each frame, and the generated video might have blurriness or flickering artifacts. |
text-to-video generation, image-to-video generation, diffusion models, zero-shot learning, video generation |
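The frame-by-frame generation described in this record can be sketched as a sliding window over recent frames. This is only a toy illustration under stated assumptions: the window is initialized by repeating the provided image, each new frame is produced by a callable that sees the current window, and the window then slides forward by one frame. The names `repeat_and_slide`, `next_frame_fn`, and `toy_next_frame` are hypothetical, frames are plain integers, and the stand-in callable replaces the frozen T2V diffusion model, DDPM inversion, and resampling steps, none of which are reproduced here.

```python
from collections import deque


def repeat_and_slide(first_frame, num_new_frames, window, next_frame_fn):
    """Generate frames one at a time from a sliding window of recent frames."""
    # "Repeat": fill the window with copies of the provided image.
    queue = deque([first_frame] * window, maxlen=window)
    video = [first_frame]
    for _ in range(num_new_frames):
        new_frame = next_frame_fn(list(queue))
        video.append(new_frame)
        # "Slide": appending to a full deque drops the oldest frame.
        queue.append(new_frame)
    return video


def toy_next_frame(recent_frames):
    # Hypothetical stand-in for the generative model: last frame plus one.
    return recent_frames[-1] + 1


video = repeat_and_slide(
    first_frame=0, num_new_frames=4, window=3, next_frame_fn=toy_next_frame
)
```

Because each step depends only on a fixed-size window, the loop can in principle run for any number of frames, which matches the record's remark that the autoregressive design supports long video generation.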
2404.16221
Report |
NeRF-XL: Scaling NeRFs with Multiple GPUs |
Ruilong Li, Sanja Fidler, Angjoo Kanazawa, Francis Williams |
We present NeRF-XL, a principled method for distributing Neural Radiance
Fields (NeRFs) across multiple GPUs, thus enabling the training and rendering
of NeRFs with an arbitrarily large capacity. We begin by revisiting existing
multi-GPU approaches, which decompose large scenes into multiple independently
trained NeRFs, and identify several fundamental issues with these methods that
hinder improvements in reconstruction quality as additional computational
resources (GPUs) are used in training. NeRF-XL remedies these issues and
enables the training and rendering of NeRFs with an arbitrary number of
parameters by simply using more hardware. At the core of our method lies a
novel distributed training and rendering formulation, which is mathematically
equivalent to the classic single-GPU case and minimizes communication between
GPUs. By unlocking NeRFs with arbitrarily large parameter counts, our approach
is the first to reveal multi-GPU scaling laws for NeRFs, showing improvements
in reconstruction quality with larger parameter counts and speed improvements
with more GPUs. We demonstrate the effectiveness of NeRF-XL on a wide variety
of datasets, including the largest open-source dataset to date, MatrixCity,
containing 258K images covering a 25km^2 city area. |
Presents NeRF-XL, a method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs to enable training and rendering of NeRFs with arbitrarily large capacity. |
Existing multi-GPU approaches for NeRFs suffer from redundancy and reduced visual quality as the number of GPUs increases, limiting their ability to handle large-scale, high-detail scenes. |
NeRF-XL partitions 3D space into non-overlapping tiles, assigns a NeRF to each tile, and jointly trains them across GPUs, minimizing communication overhead through a novel distributed training and rendering formulation. |
NeRF-XL achieves significant improvements in visual quality (PSNR) and rendering speed with more GPUs compared to existing independent training approaches.
The method effectively handles large-scale captures (up to 25km²), demonstrating robust scalability.
NeRF-XL enables exploring larger model capacities for NeRFs, which proves more beneficial than simply increasing the training batch size (as in PyTorch DDP). |
Multi-GPU synchronization, while minimized, remains a bottleneck for training and rendering speed.
While theoretically agnostic to NeRF representation, the method has only been tested with Instant-NGP and could be explored with other representations. |
neural radiance fields, nerf, multi-gpu, distributed training, novel view synthesis |
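One way to see how rendering over non-overlapping tiles can match the single-GPU result is through standard front-to-back volume rendering. In the sketch below, each tile reduces its own ray segment to a (color, transmittance) pair, and merging those pairs in order reproduces the value obtained by rendering the whole ray at once. This is an illustration of that general identity under simplifying assumptions (scalar colors, hand-picked samples, list slices standing in for tiles), not the paper's actual formulation or communication scheme.

```python
def render(samples):
    """Classic front-to-back volume rendering over (alpha, color) samples.

    Returns (accumulated color, remaining transmittance).
    """
    color, transmittance = 0.0, 1.0
    for alpha, c in samples:
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color, transmittance


def composite(partials):
    """Merge per-tile (color, transmittance) pairs ordered front to back."""
    color, transmittance = 0.0, 1.0
    for tile_color, tile_transmittance in partials:
        # Light from this tile is attenuated by everything in front of it.
        color += transmittance * tile_color
        transmittance *= tile_transmittance
    return color, transmittance


# Hypothetical samples along one ray, split across three tiles.
ray = [(0.1, 0.9), (0.5, 0.2), (0.3, 0.7), (0.8, 0.4), (0.2, 1.0)]
tiles = [ray[:2], ray[2:4], ray[4:]]

full = render(ray)
merged = composite([render(tile) for tile in tiles])
```

Each tile only needs to share two numbers per ray in this sketch, which gives some intuition for why a joint formulation can keep communication between GPUs small.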
2404.16030
Report |
MoDE: CLIP Data Experts via Clustering |
Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu |
The success of contrastive language-image pretraining (CLIP) relies on the
supervision from the pairing between images and captions, which tends to be
noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn
a system of CLIP data experts via clustering. Each data expert is trained on
one data cluster, being less sensitive to false negative noises in other
clusters. At inference time, we ensemble their outputs by applying weights
determined through the correlation between task metadata and cluster
conditions. To estimate the correlation precisely, the samples in one cluster
should be semantically similar, but the number of data experts should still be
reasonable for training and inference. As such, we consider the ontology in
human language and propose to use fine-grained cluster centers to represent
each data expert at a coarse-grained level. Experimental studies show that four
CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and
OpenCLIP on zero-shot image classification but with less (<35%) training
cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly
include new data experts. The code is available at
https://github.com/facebookresearch/MetaCLIP/tree/main/mode. |
Introduces Mixture of Data Experts (MoDE), a system of CLIP data experts learned via clustering to improve contrastive language-image pretraining by mitigating noise in web-crawled image-caption pairs. |
Noise in web-crawled data negatively impacts CLIP's performance, and scaling CLIP on large datasets presents training efficiency and computational challenges. |
The method clusters data to train specialized data experts, each focusing on a subset of data with coherent semantics. Inference involves ensembling expert outputs based on task metadata and cluster relevance. |
Significantly outperforms OpenCLIP and OpenAI CLIP on benchmarks.
Reduces training cost to less than 35% compared to baselines.
Enables asynchronous training of data experts and flexible inclusion of new experts. |
The number of clusters must balance semantic coherence with computational feasibility.
Future work includes adapting MoDE for generative models. |
contrastive learning, image-language pretraining, data clustering, noise reduction, ensemble learning |
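The inference-time ensembling described in this record can be sketched as a weighted sum of expert outputs, where the weights come from how closely the task matches each expert's cluster. The sketch below assumes a dot-product similarity between a task embedding and the cluster centers, followed by a softmax with a temperature. These choices, along with all names and numbers, are illustrative assumptions rather than the paper's exact weighting rule.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    shifted = [s / temperature for s in scores]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ensemble_logits(task_embedding, cluster_centers, expert_logits, temperature=0.1):
    """Weight each expert by the similarity of the task to its cluster center."""
    similarities = [dot(task_embedding, center) for center in cluster_centers]
    weights = softmax(similarities, temperature)
    num_classes = len(expert_logits[0])
    combined = [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(num_classes)
    ]
    return weights, combined


# Hypothetical two-expert setup: the task embedding is close to cluster 0.
task = [1.0, 0.0]
centers = [[0.9, 0.1], [0.0, 1.0]]
logits = [[2.0, 0.5], [0.1, 3.0]]
weights, combined = ensemble_logits(task, centers, logits)
```

Here the first expert dominates the ensemble because its cluster center is far more similar to the task, so the combined prediction follows that expert's preferred class.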
2404.16029
Report |
Editable Image Elements for Controllable Synthesis |
Jiteng Mu, Michaël Gharbi, Richard Zhang, Eli Shechtman, Nuno Vasconcelos, Xiaolong Wang, Taesung Park |
Diffusion models have made significant advances in text-guided synthesis
tasks. However, editing user-provided images remains challenging, as the high
dimensional noise input space of diffusion models is not naturally suited for
image inversion or spatial editing. In this work, we propose an image
representation that promotes spatial editing of input images using a diffusion
model. Concretely, we learn to encode an input into "image elements" that can
faithfully reconstruct an input image. These elements can be intuitively edited
by a user, and are decoded by a diffusion model into realistic images. We show
the effectiveness of our representation on various image editing tasks, such as
object resizing, rearrangement, dragging, de-occlusion, removal, variation, and
image composition. Project page:
https://jitengmu.github.io/Editable_Image_Elements/ |
This paper proposes "editable image elements," a novel image representation for controllable synthesis with diffusion models, allowing intuitive spatial editing of user-provided images. |
Existing diffusion models struggle with image editing as their noise-based input space isn't designed for spatial manipulations. This work addresses this limitation by providing an intuitive and effective way to edit images within the diffusion framework. |
The method encodes an input image into semantically meaningful "image elements" (superpixels) with learnable embeddings and editable spatial properties (position, size). A diffusion model, trained with element dropout for robustness, decodes edited elements into realistic images. |
The approach enables a range of edits: object resizing, rearrangement, dragging, de-occlusion, removal, variation, and composition.
The method outperforms baselines like Self-Guidance, Paint-by-Example, and InstructPix2Pix in user studies, demonstrating superior quality and edit fidelity.
Ablation studies confirm the importance of staged training, content encoder freezing, and random partition dropout during training for optimal performance. |
Editing high-resolution images remains challenging due to reconstruction quality limitations.
Exploring methods to edit the appearance of image elements beyond spatial manipulations is left for future work. |
image editing, disentangled representation, diffusion models, controllable synthesis, image elements |
2404.16022
Report |
PuLID: Pure and Lightning ID Customization via Contrastive Alignment |
Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Qian He |
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free
ID customization method for text-to-image generation. By incorporating a
Lightning T2I branch with a standard diffusion one, PuLID introduces both
contrastive alignment loss and accurate ID loss, minimizing disruption to the
original model and ensuring high ID fidelity. Experiments show that PuLID
achieves superior performance in both ID fidelity and editability. Another
attractive property of PuLID is that the image elements (e.g., background,
lighting, composition, and style) before and after the ID insertion are kept as
consistent as possible. Codes and models will be available at
https://github.com/ToTheBeginning/PuLID |
Proposes PuLID, a tuning-free identity (ID) customization method for text-to-image generation that maintains high ID fidelity while minimizing interference with the original model's behavior. |
Existing tuning-free ID customization methods struggle to achieve high ID fidelity without disrupting the original model's ability to follow prompts and maintain stylistic consistency. |
Introduces a Lightning T2I branch alongside the standard diffusion training branch. Employs contrastive alignment loss between images generated with and without ID insertion to minimize disruption. Leverages fast sampling to generate high-quality images for accurate ID loss calculation. |
Achieves superior ID fidelity compared to state-of-the-art methods like IPAdapter and InstantID.
Demonstrates better preservation of original image elements (background, lighting, composition, style) compared to existing methods.
Maintains respectable prompt editing capabilities for modifying ID attributes, orientations, and accessories. |
The prompt list used for contrastive alignment, while effective, could be further optimized.
Exploring more advanced alignment techniques beyond semantic and layout alignment could lead to further improvements. |
text-to-image generation, identity customization, diffusion models, contrastive learning, fast sampling |
2404.15956
Report |
A Survey on Visual Mamba |
Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Zi Ye |
State space models (SSMs) with selection mechanisms and hardware-aware
architectures, namely Mamba, have recently demonstrated significant promise in
long-sequence modeling. Since the self-attention mechanism in transformers has
quadratic complexity in image size, with correspondingly growing computational
demands, researchers are now exploring how to adapt Mamba for computer vision tasks.
This paper is the first comprehensive survey aiming to provide an in-depth
analysis of Mamba models in the field of computer vision. It begins by
exploring the foundational concepts contributing to Mamba's success, including
the state space model framework, selection mechanisms, and hardware-aware
design. Next, we review vision Mamba models by categorizing them into
foundational ones and those enhanced with techniques such as convolution,
recurrence, and attention to improve their sophistication. We further delve
into the widespread applications of Mamba in vision tasks, which include their
use as a backbone in various levels of vision processing. This encompasses
general visual tasks, Medical visual tasks (e.g., 2D / 3D segmentation,
classification, and image registration, etc.), and Remote Sensing visual tasks.
We specially introduce general visual tasks from two levels: High/Mid-level
vision (e.g., Object detection, Segmentation, Video classification, etc.) and
Low-level vision (e.g., Image super-resolution, Image restoration, Visual
generation, etc.). We hope this endeavor will spark additional interest within
the community to address current challenges and further apply Mamba models in
computer vision. |
This paper presents the first comprehensive survey of Mamba models in computer vision, examining their foundational concepts, architectural enhancements, and diverse applications across various vision tasks. |
Mamba models offer a promising alternative to Transformers in computer vision due to their ability to capture long-range dependencies with linear complexity, leading to improved efficiency and performance, especially for high-resolution image processing. |
The paper reviews different Mamba block designs like ViM and VSS and analyzes their integration with techniques like convolution, recurrence, and attention. It categorizes existing works based on their applications in general vision (high/mid and low-level), medical imaging, and remote sensing. |
Mamba-based models demonstrate competitive performance compared to Transformers in various vision tasks such as image classification, object detection, segmentation, restoration, and generation.
Different scanning mechanisms are crucial for extending Mamba to multi-dimensional visual data, and the choice of mechanism depends on the specific task and data characteristics.
Combining Mamba with other architectures like convolution, recurrence, and attention further enhances its capabilities and performance in specific applications like medical image segmentation and video understanding. |
Most Mamba models are still in their early stages, lacking extensive pre-training on large-scale datasets like ImageNet, which limits their generalization ability.
Future work should focus on exploring more efficient scanning mechanisms, developing pre-trained Mamba models, and enhancing their interpretability and robustness for real-world deployment. |
mamba, computer vision, state space model, visual mamba, deep learning |
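The linear-complexity claim in this record follows from the recurrent form of a state space model, which visits each element of the sequence once. The sketch below runs a plain, time-invariant diagonal SSM over an image flattened in row-major order. It is a simplified illustration: it omits Mamba's input-dependent selection mechanism and hardware-aware scan, the parameter values are arbitrary, and row-major flattening is just one of the scanning orders the survey discusses.

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c . h_t over a 1D sequence.

    a, b, c are per-channel lists defining a diagonal state space model.
    The loop visits each input once, so cost grows linearly with length.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, state)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


def flatten_row_major(image):
    """Turn a 2D grid of pixel values into a 1D scan order."""
    return [pixel for row in image for pixel in row]


# Hypothetical 2x2 single-channel image and a one-dimensional state.
image = [[1.0, 0.0], [0.0, 2.0]]
sequence = flatten_row_major(image)
outputs = ssm_scan(sequence, a=[0.5], b=[1.0], c=[1.0])
```

The first pixel's contribution decays by half at each step but is still present in the final output, which is how the recurrence carries long-range information without comparing every pair of positions.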
2404.15955
Report |
Beyond Deepfake Images: Detecting AI-Generated Videos |
Danial Samadi Vahdati, Tai D. Nguyen, Aref Azizpour, Matthew C. Stamm |
Recent advances in generative AI have led to the development of techniques to
generate visually realistic synthetic video. While a number of techniques have
been developed to detect AI-generated synthetic images, in this paper we show
that synthetic image detectors are unable to detect synthetic videos. We
demonstrate that this is because synthetic video generators introduce
substantially different traces than those left by image generators. Despite
this, we show that synthetic video traces can be learned, and used to perform
reliable synthetic video detection or generator source attribution even after
H.264 re-compression. Furthermore, we demonstrate that while detecting videos
from new generators through zero-shot transferability is challenging, accurate
detection of videos from a new generator can be achieved through few-shot
learning. |
This paper investigates the effectiveness of synthetic image detectors in detecting synthetic videos, revealing that image detectors perform poorly on videos due to the distinct traces left by video generators. |
The emergence of realistic synthetic videos generated by AI poses a significant threat of misinformation and disinformation. |
The authors evaluate various synthetic image detectors on a dataset of real and synthetic videos. They analyze the low-level forensic traces left by both image and video generators and investigate the impact of H.264 compression and robust training. |
Synthetic image detectors fail to reliably detect AI-generated videos, even with robust training against H.264 compression.
Synthetic video generators leave unique traces that differ significantly from those found in synthetic images.
Training detectors specifically on synthetic video traces enables reliable detection and source attribution, even after H.264 re-compression. |
The study primarily focuses on a limited set of publicly available video generators.
Future work should explore the generalization of detectors to entirely new and unseen generation techniques. |
synthetic video detection, generative ai, misinformation detection, forensic traces, few-shot learning |
2404.15909
Report |
Learning Long-form Video Prior via Generative Pre-Training |
Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou |
Concepts involved in long-form videos such as people, objects, and their
interactions, can be viewed as following an implicit prior. They are notably
complex and continue to pose challenges to be comprehensively learned. In
recent years, generative pre-training (GPT) has exhibited versatile capacities
in modeling any kind of text content, and even visual locations. Can this approach
work for learning a long-form video prior? Instead of operating in pixel space,
it is efficient to employ visual locations like bounding boxes and keypoints to
represent key information in videos, which can be simply discretized and then
tokenized for consumption by GPT. Due to the scarcity of suitable data, we
create a new dataset called Storyboard20K from movies to serve as a
representative. It includes synopses, shot-by-shot keyframes, and fine-grained
annotations of film sets and characters with consistent IDs, bounding boxes,
and whole body keypoints. In this way, long-form videos can be represented by a
set of tokens and be learned via generative pre-training. Experimental results
validate that our approach has great potential for learning long-form video
prior. Code and data will be released at
https://github.com/showlab/Long-form-Video-Prior. |
This paper proposes learning the long-form video prior via generative pre-training by representing videos as sequences of tokens from bounding boxes, keypoints, and textual descriptions. |
Current video generation methods struggle with long-form videos due to their complexity and long-range dependencies. Learning the implicit prior of long-form videos can improve video generation in this domain. |
The authors create a new dataset, Storyboard20K, consisting of movie storyboards with annotations of character bounding boxes, keypoints, film set bounding boxes, and textual descriptions. They represent each storyboard as a sequence of tokens and train a GPT-2 model to predict the next token in the sequence. |
The proposed method outperforms GPT-3.5 in generating coherent and contextually relevant movie storyboards based on textual metrics.
The method also achieves superior performance in visual evaluation using FID compared to GPT-3.5, demonstrating its ability to model and generate visually plausible storyboards.
The model exhibits a high decoding success rate (92.5%) for converting generated token sequences back into movie storyboard format. |
The current work focuses on learning the prior of movie storyboards instead of pixel-level videos.
The work is limited by the computational resources, restricting the maximum number of tokens representing a storyboard. |
generative pre-training, long-form video prior, storyboard generation, video understanding, movie datasets |
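The abstract above describes discretizing visual locations such as bounding boxes so that a GPT-style model can consume them as tokens. A minimal sketch of that idea follows, assuming a uniform grid of 256 bins per axis and a `<bin_k>` token spelling; both choices, and the function names, are illustrative assumptions rather than the authors' actual scheme.

```python
def box_to_tokens(box, width, height, num_bins=256):
    """Quantize an (x_min, y_min, x_max, y_max) pixel box into integer bin indices."""
    x0, y0, x1, y1 = box

    def quantize(value, extent):
        # Clamp to [0, extent], normalize to [0, 1], then map onto bins 0..num_bins-1.
        ratio = min(max(value / extent, 0.0), 1.0)
        return min(int(ratio * num_bins), num_bins - 1)

    return [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]


def tokens_to_box(tokens, width, height, num_bins=256):
    """Map bin indices back to pixel coordinates at bin centers (a lossy inverse)."""
    extents = [width, height, width, height]
    return [(t + 0.5) / num_bins * e for t, e in zip(tokens, extents)]


# Hypothetical character box inside a 256 x 256 keyframe.
tokens = box_to_tokens((64, 32, 192, 224), width=256, height=256)
# Assumed token spelling; any fixed vocabulary of bin symbols would do.
token_string = " ".join(f"<bin_{t}>" for t in tokens)
recovered = tokens_to_box(tokens, width=256, height=256)
```

Decoding recovers each coordinate only to within half a bin, which is the price of a small, fixed vocabulary.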
2404.15891
Report |
OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation |
Lizhi Wang, Feng Zhou, Jianqin Yin |
Recent advancements in 3D reconstruction technologies have paved the way for
high-quality and real-time rendering of complex 3D scenes. Despite these
achievements, a notable challenge persists: it is difficult to precisely
reconstruct specific objects from large scenes. Current scene reconstruction
techniques frequently result in the loss of object detail textures and are
unable to reconstruct object portions that are occluded or unseen in views. To
address this challenge, we delve into the meticulous 3D reconstruction of
specific objects within large scenes and propose a framework termed OMEGAS:
Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation.
OMEGAS employs a multi-step approach, grounded in several excellent
off-the-shelf methodologies. Specifically, we first utilize the Segment
Anything Model (SAM) to guide the segmentation of 3D Gaussian Splatting (3DGS),
thereby creating a basic 3DGS model of the target object. Then, we leverage
large-scale diffusion priors to further refine the details of the 3DGS model,
especially aimed at addressing invisible or occluded object portions from the
original scene views. Subsequently, by re-rendering the 3DGS model onto the
scene views, we achieve accurate object segmentation and effectively remove the
background. Finally, these target-only images are used to improve the 3DGS
model further and extract the definitive 3D object mesh with the SuGaR model. In
various scenarios, our experiments demonstrate that OMEGAS significantly
surpasses existing scene reconstruction methods. Our project page is at:
https://github.com/CrystalWlz/OMEGAS |
Presents OMEGAS, a framework for extracting high-precision meshes of specified objects from multi-view scene images, even reconstructing occluded or unseen object parts. |
Existing methods struggle to reconstruct accurate 3D object meshes from large scenes due to compromised object quality and difficulties in reconstructing occluded or unseen object portions. |
OMEGAS leverages SAM for segmentation-guided 3DGS model creation, utilizes large-scale diffusion priors (Stable Diffusion) to refine details and address unseen parts, and employs SuGaR for final 3DGS optimization and mesh extraction. |
Achieves superior segmentation accuracy and efficiency compared to Gaussian Grouping.
Generates higher quality object meshes with finer details compared to SuGaR and DreamGaussian.
Successfully reconstructs occluded or unseen object portions, as demonstrated in ablation studies. |
The optimization process in SuGaR can be time-consuming, ranging from a few minutes to an hour.
Future work could explore optimizing the framework's efficiency for even faster mesh extraction. |
mesh reconstruction, 3d gaussian splatting, diffusion models, object segmentation, 3d reconstruction |
2404.15889
Report |
Sketch2Human: Deep Human Generation with Disentangled Geometry and Appearance Control |
Linzi Qu, Jiaxiang Shang, Hui Ye, Xiaoguang Han, Hongbo Fu |
Geometry- and appearance-controlled full-body human image generation is an
interesting but challenging task. Existing solutions are either unconditional
or dependent on coarse conditions (e.g., pose, text), thus lacking explicit
geometry and appearance control of body and garment. Sketching offers such
editing ability and has been adopted in various sketch-based face generation
and editing solutions. However, directly adapting sketch-based face generation
to full-body generation often fails to produce high-fidelity and diverse
results due to the high complexity and diversity in the pose, body shape, and
garment shape and texture. Recent geometrically controllable diffusion-based
methods mainly rely on prompts to generate appearance and it is hard to balance
the realism and the faithfulness of their results to the sketch when the input
is coarse. This work presents Sketch2Human, the first system for controllable
full-body human image generation guided by a semantic sketch (for geometry
control) and a reference image (for appearance control). Our solution is based
on the latent space of StyleGAN-Human with inverted geometry and appearance
latent codes as input. Specifically, we present a sketch encoder trained with a
large synthetic dataset sampled from StyleGAN-Human's latent space and directly
supervised by sketches rather than real images. Considering the entangled
information of partial geometry and texture in StyleGAN-Human and the absence
of disentangled datasets, we design a novel training scheme that creates
geometry-preserved and appearance-transferred training data to tune a generator
to achieve disentangled geometry and appearance control. Although our method is
trained with synthetic data, it can handle hand-drawn sketches as well.
Qualitative and quantitative evaluations demonstrate the superior performance
of our method to state-of-the-art methods. |
Sketch2Human is the first deep generative framework for synthesizing full-body human images from a semantic sketch for geometry control and a reference image for appearance control. |
Existing solutions for full-body human image generation lack explicit and flexible control over detailed geometry and appearance, limiting the ability to generate specific images of interest. |
Sketch2Human employs a two-stage generation framework: (1) Sketch Image Inversion: inverts the input sketch into a geometry latent code using a sketch encoder trained on a synthetic dataset sampled from StyleGAN-Human. (2) Body Generator Tuning: fine-tunes a pretrained StyleGAN-Human with appearance-transferred and geometry-preserved data synthesized via style mixing to achieve disentangled geometry and appearance control. |
Sketch2Human enables flexible and disentangled control of geometry and appearance for full-body human image generation.
Qualitative and quantitative evaluations demonstrate superior performance over state-of-the-art methods in terms of geometry preservation, appearance transfer, and visual quality.
The system exhibits robustness in handling sketches with varying levels of abstraction, from professional to amateur styles. |
The method's reliance on embedding sketches into StyleGAN-Human's latent space may prioritize reasonable results over perfectly replicating user intent in some cases.
The system's ability to transfer complex textures from real appearance images is limited by the accuracy of the image inversion method and the generative power of the underlying StyleGAN-Human model. |
full-body image generation, style-based generator, style mixing, sketch-based generation, disentangled geometry and appearance control |
2404.15789
Report |
MotionMaster: Training-free Camera Motion Transfer For Video Generation |
Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, Lizhuang Ma |
The emergence of diffusion models has greatly propelled the progress in image
and video generation. Recently, some efforts have been made in controllable
video generation, including text-to-video generation and video motion control,
among which camera motion control is an important topic. However, existing
camera motion control methods rely on training a temporal camera module, and
necessitate substantial computation resources due to the large amount of
parameters in video generation models. Moreover, existing methods pre-define
camera motion types during training, which limits their flexibility in camera
control. Therefore, to reduce training costs and achieve flexible camera
control, we propose COMD, a novel training-free video motion transfer model,
which disentangles camera motions and object motions in source videos and
transfers the extracted camera motions to new videos. We first propose a
one-shot camera motion disentanglement method to extract camera motion from a
single source video, which separates the moving objects from the background and
estimates the camera motion in the moving objects region based on the motion in
the background by solving a Poisson equation. Furthermore, we propose a
few-shot camera motion disentanglement method to extract the common camera
motion from multiple videos with similar camera motions, which employs a
window-based clustering technique to extract the common features in temporal
attention maps of multiple videos. Finally, we propose a motion combination
method to combine different types of camera motions, giving our
model more controllable and flexible camera control. Extensive experiments
demonstrate that our training-free approach can effectively decouple
camera-object motion and apply the decoupled camera motion to a wide range of
controllable video generation tasks, achieving flexible and diverse camera
motion control. |
This paper introduces MotionMaster, a training-free model for transferring camera motion in videos, disentangling camera motion from object motion. |
Existing camera motion control methods in video generation require extensive training, limiting their flexibility and computational efficiency. |
MotionMaster leverages temporal attention maps in diffusion models to represent video motion. It disentangles camera and object motion using two methods: 1) One-shot: separating moving objects from the background and estimating camera motion in the foreground by solving a Poisson equation. 2) Few-shot: extracting common camera motion from multiple videos with similar camera movements through a window-based clustering technique. It further enables combining different camera motions for complex controls. |
MotionMaster effectively disentangles camera motion from object motion in single or multiple videos.
It enables flexible camera control by combining different camera motions and applying them to specific regions.
Extensive experiments demonstrate superior performance in camera motion transfer, generation quality, and diversity compared to existing methods. |
The accuracy of camera motion extraction might be affected by complex or rapid object movements.
Future work could explore transferring more intricate camera motions, like those found in professional filmmaking. |
video generation, camera motion transfer, motion disentanglement, training-free, temporal attention |
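The one-shot method above estimates camera motion inside the moving-object region from the surrounding background by solving a Poisson equation. The sketch below shows the general technique on a small scalar grid: unknown cells are filled by Gauss-Seidel relaxation of the discrete Laplace equation (a Poisson equation with zero right-hand side), with the known background values as boundary conditions. The zero source term, the scalar grid, and the function name are assumptions for illustration; the paper operates on temporal attention maps and may use a different formulation.

```python
def fill_masked_region(field, mask, iterations=500):
    """Fill cells where mask is True by relaxing the discrete Laplace equation.

    field: 2D list of floats (values outside the mask are treated as known).
    mask:  2D list of bools, True marks unknown (moving-object) cells.
    Masked cells must not lie on the grid border in this sketch.
    """
    rows, cols = len(field), len(field[0])
    result = [row[:] for row in field]
    # Start unknown cells from zero; background values act as boundary conditions.
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                result[r][c] = 0.0
    for _ in range(iterations):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if mask[r][c]:
                    # Gauss-Seidel update: each unknown becomes the mean of its 4 neighbours.
                    result[r][c] = 0.25 * (result[r - 1][c] + result[r + 1][c]
                                           + result[r][c - 1] + result[r][c + 1])
    return result


# Toy "camera motion" field: a horizontal ramp, as a pan might produce.
size = 7
field = [[float(c) for c in range(size)] for _ in range(size)]
mask = [[2 <= r <= 4 and 2 <= c <= 4 for c in range(size)] for r in range(size)]
for r in range(size):
    for c in range(size):
        if mask[r][c]:
            field[r][c] = 99.0  # pretend object motion corrupted these cells
filled = fill_masked_region(field, mask)
```

Because the ramp is a harmonic function, the relaxation restores it inside the masked block, which makes the toy case easy to check by eye.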
2404.15677
Report |
CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models |
Qinghe Wang, Baolu Li, Xiaomin Li, Bing Cao, Liqian Ma, Huchuan Lu, Xu Jia |
Recent advances in text-to-image models have opened new frontiers in
human-centric generation. However, these models cannot be directly employed to
generate images with consistent newly coined identities. In this work, we
propose CharacterFactory, a framework that allows sampling new characters with
consistent identities in the latent space of GANs for diffusion models. More
specifically, we consider the word embeddings of celeb names as ground truths
for the identity-consistent generation task and train a GAN model to learn the
mapping from a latent space to the celeb embedding space. In addition, we
design a context-consistent loss to ensure that the generated identity
embeddings can produce identity-consistent images in various contexts.
Remarkably, the whole model only takes 10 minutes for training, and can sample
infinite characters end-to-end during inference. Extensive experiments
demonstrate excellent performance of the proposed CharacterFactory on character
creation in terms of identity consistency and editability. Furthermore, the
generated characters can be seamlessly combined with the off-the-shelf
image/video/3D diffusion models. We believe that the proposed CharacterFactory
is an important step for identity-consistent character generation. Project page
is available at: https://qinghew.github.io/CharacterFactory/. |
CharacterFactory: an end-to-end framework that allows sampling of new, consistent character identities in the latent space of GANs for use in diffusion models, enabling the generation of images featuring the same character in different contexts. |
Existing text-to-image models struggle to generate images with consistent characters across different contexts. Current subject-driven methods are computationally expensive, prone to overfitting, or require complex pipelines. |
CharacterFactory leverages an Identity-Embedding GAN (IDE-GAN) trained on celebrity name embeddings to learn a mapping from a latent space to the embedding space of character identities. A context-consistent loss ensures that generated identity embeddings produce consistent images across various contexts. |
CharacterFactory generates consistent, high-quality character images comparable to or exceeding existing methods.
The method is highly efficient, requiring only 10 minutes for training and 3 seconds for inference.
CharacterFactory exhibits strong generalization ability and integrates seamlessly with various image, video, and 3D diffusion models. |
Potential generation of unnatural images or artifacts due to the use of GANs.
Inherits limitations of the base diffusion model, such as hand anomalies in Stable Diffusion. |
gans, diffusion models, identity-consistent generation, character creation, text-to-image synthesis |
2404.15653
Report |
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data |
Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari |
Contrastive learning has emerged as a transformative method for learning
effective visual representations through the alignment of image and text
embeddings. However, pairwise similarity computation in contrastive loss
between image and text pairs poses computational challenges. This paper
presents a novel weakly supervised pre-training of vision models on web-scale
image-text data. The proposed method reframes pre-training on image-text data
as a classification task. Consequently, it eliminates the need for pairwise
similarity computations in contrastive loss, achieving a remarkable 2.7x
acceleration in training speed compared to contrastive learning on web-scale
data. Through extensive experiments spanning diverse vision tasks, including
detection and segmentation, we demonstrate that the proposed method maintains
high representation quality. Our source code along with pre-trained model
weights and training recipes is available at
https://github.com/apple/corenet. |
This paper introduces CatLIP, a novel weakly supervised pre-training method for vision models on web-scale image-text data that reframes pre-training as a classification task, leading to a 2.7x speedup compared to contrastive learning methods like CLIP while maintaining comparable downstream task performance. |
Contrastive learning on image-text pairs has shown great success in learning visual representations but suffers from computational challenges due to pairwise similarity computations. |
CatLIP treats image-text pre-training as a classification problem. It extracts nouns from text captions, maps them to WordNet synsets to generate multi-label classification targets, and trains the image encoder using a binary cross-entropy loss. |
CatLIP is 2.7x faster to pre-train than CLIP while maintaining comparable accuracy on downstream tasks.
CatLIP's performance scales effectively with both data and model size.
Transfer learning with CatLIP is more data-efficient, especially when leveraging the learned classification layer for initialization. |
The performance of CatLIP starts to saturate with very large models (ViT-H) on ImageNet.
The gains from CatLIP's classifier initialization are less pronounced for datasets where target labels are not a subset of the pre-training vocabulary. |
image-text pre-training, weakly supervised learning, contrastive learning, classification, transfer learning |
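The method summary above turns captions into multi-label targets (nouns mapped to WordNet synsets) and trains with binary cross-entropy. A minimal sketch of that data flow follows. The tiny `NOUN_TO_SYNSET` table is a hypothetical stand-in for a real part-of-speech tagger plus WordNet lookup, and the logits are made-up numbers in place of an image encoder's output.

```python
import math

# Hypothetical noun -> synset table standing in for a POS tagger plus WordNet lookup.
NOUN_TO_SYNSET = {
    "dog": "dog.n.01",
    "puppy": "dog.n.01",
    "ball": "ball.n.01",
    "grass": "grass.n.01",
}
VOCAB = sorted(set(NOUN_TO_SYNSET.values()))  # fixed label order


def caption_to_target(caption):
    """Build a multi-hot target vector over VOCAB from the nouns found in a caption."""
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    present = {NOUN_TO_SYNSET[w] for w in words if w in NOUN_TO_SYNSET}
    return [1.0 if label in present else 0.0 for label in VOCAB]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy, computed stably from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # max(z, 0) - z*y + log(1 + exp(-|z|)) is the standard overflow-safe form.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


target = caption_to_target("A puppy chases a ball.")
# Made-up logits for the three labels, in VOCAB order.
loss = bce_with_logits([2.0, 2.0, -2.0], target)
```

Each label is scored independently, so the loss needs no pairwise image-text similarity matrix. That is the property the summary credits for the speedup over contrastive training.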
2404.15506
Report |
Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation |
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen |
We introduce Metric3D v2, a geometric foundation model for zero-shot metric
depth and surface normal estimation from a single image, which is crucial for
metric 3D recovery. While depth and normal are geometrically related and highly
complementary, they present distinct challenges. SoTA monocular depth methods
achieve zero-shot generalization by learning affine-invariant depths, which
cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods
have limited zero-shot performance due to the lack of large-scale labeled data.
To tackle these issues, we propose solutions for both metric depth estimation
and surface normal estimation. For metric depth estimation, we show that the
key to a zero-shot single-view model lies in resolving the metric ambiguity
from various camera models and large-scale data training. We propose a
canonical camera space transformation module, which explicitly addresses the
ambiguity problem and can be effortlessly plugged into existing monocular
models. For surface normal estimation, we propose a joint depth-normal
optimization module to distill diverse data knowledge from metric depth,
enabling normal estimators to learn beyond normal labels. Equipped with these
modules, our depth-normal models can be stably trained with over 16 million
images from thousands of camera models with different types of annotations,
resulting in zero-shot generalization to in-the-wild images with unseen camera
settings. Our method enables the accurate recovery of metric 3D structures on
randomly collected internet images, paving the way for plausible single-image
metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2. |
Introduces Metric3D v2, a foundation model for zero-shot metric depth and surface normal estimation from single images, achieving state-of-the-art performance on over 16 benchmarks. |
Metric depth and surface normals are crucial 3D representations for applications like 3D reconstruction, rendering, and robotics, but existing methods suffer from metric ambiguity and limited zero-shot generalization due to data limitations. |
1. A canonical camera transformation module addresses metric ambiguity by transforming training data to a canonical camera space. 2. A random proposal normalization loss enhances depth accuracy by focusing on local geometry. 3. A joint depth-normal optimization module distills knowledge from large-scale depth datasets to improve normal estimation, particularly in outdoor scenes. |
Achieves state-of-the-art zero-shot performance on various metric depth, affine-invariant depth, and surface normal benchmarks.
Outperforms previous methods in challenging cases, including fine-grained structures, foreground/background distinction, and unseen camera models.
Enables accurate metric 3D reconstruction from single images, benefitting downstream tasks like SLAM and metrology. |
The accuracy of normal prediction relies on depth estimation quality.
Current normal prediction struggles with challenging cases such as reflections and thin structures.
Future work includes exploring new normal representations and refinement strategies for challenging cases. |
monocular depth estimation, surface normal estimation, zero-shot learning, 3d reconstruction, foundation models |
2404.15449
Report |
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning |
Weifeng Chen, Jiacheng Zhang, Jie Wu, Hefeng Wu, Xuefeng Xiao, Liang Lin |
The rapid development of diffusion models has triggered diverse applications.
Identity-preserving text-to-image generation (ID-T2I) particularly has received
significant attention due to its wide range of application scenarios like AI
portrait and advertising. While existing ID-T2I methods have demonstrated
impressive results, several key challenges remain: (1) It is hard to maintain
the identity characteristics of reference portraits accurately, (2) The
generated images lack aesthetic appeal especially while enforcing identity
retention, and (3) Existing methods are not compatible with
LoRA-based and Adapter-based methods simultaneously. To address these issues,
we present ID-Aligner, a general feedback learning framework to
enhance ID-T2I performance. To address the loss of identity features, we introduce
identity consistency reward fine-tuning to utilize the feedback from face
detection and recognition models to improve generated identity preservation.
Furthermore, we propose identity aesthetic reward fine-tuning leveraging
rewards from human-annotated preference data and automatically constructed
feedback on character structure generation to provide aesthetic tuning signals.
Thanks to its universal feedback fine-tuning framework, our method can be
readily applied to both LoRA and Adapter models, achieving consistent
performance gains. Extensive experiments on SD1.5 and SDXL diffusion models
validate the effectiveness of our approach. Project Page:
https://idaligner.github.io/ |
ID-Aligner, a novel reward feedback learning framework, enhances identity-preserving text-to-image generation by improving identity consistency and visual appeal. |
Existing ID-T2I methods struggle with accurate identity preservation, lack aesthetic appeal, and often lack compatibility with both LoRA and Adapter methods. |
ID-Aligner leverages face detection and recognition models for identity consistency reward fine-tuning. It also uses human-annotated preference data and character structure feedback for identity aesthetic reward fine-tuning. |
ID-Aligner significantly improves identity preservation compared to baseline models like IP-Adapter and FastComposer.
The method enhances visual appeal, particularly in character structure, leading to more aesthetically pleasing generations.
ID-Aligner demonstrates strong generalization across different base T2I models like Dreamshaper and RealVisXL. |
Improvements might be marginal when applied to already robust existing models.
Enhancing face similarity might sometimes compromise prompt consistency. |
text-to-image generation, diffusion model, feedback learning, identity preservation, reward learning |
2404.15406
Report |
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs |
Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara |
Multimodal LLMs are the natural evolution of LLMs, and enlarge their
capabilities so as to work beyond the pure textual modality. As research is
being carried out to design novel architectures and vision-and-language
adapters, in this paper we concentrate on endowing such models with the
capability of answering questions that require external knowledge. Our
approach, termed Wiki-LLaVA, aims at integrating an external knowledge source
of multimodal documents, which is accessed through a hierarchical retrieval
pipeline. Relevant passages, using this approach, are retrieved from the
external knowledge source and employed as additional context for the LLM,
augmenting the effectiveness and precision of generated dialogues. We conduct
extensive experiments on datasets tailored for visual question answering with
external data and demonstrate the appropriateness of our approach. |
Proposes Wiki-LLaVA, the first Multimodal Large Language Model (MLLM) augmented with a retrieval module to leverage external knowledge from a multimodal document database for answering complex questions. |
Standard MLLMs struggle to answer questions requiring specific or compositional reasoning due to limitations in their encoded knowledge and the scarcity of long-tail information in training data. Wiki-LLaVA addresses this by integrating external knowledge sources. |
Employs a hierarchical retrieval pipeline using CLIP and Contriever to identify relevant documents and passages from an external knowledge base, then feeds this information as additional context to an LLaVA-based MLLM. The model is fine-tuned with a mix of knowledge-requiring and standard question-answer pairs. |
Retrieving relevant passages from an external knowledge base significantly improves accuracy on knowledge-based visual question answering tasks, especially on the InfoSeek dataset.
Using multiple retrieved passages as context generally enhances accuracy, highlighting the importance of rich external information.
Employing oracle entities for retrieval considerably boosts accuracy, emphasizing the need for a robust entity retrieval model to minimize irrelevant content. |
Defining better embedding spaces for improved document retrieval from questions and images is crucial.
Developing efficient methods for selecting appropriate content from retrieved documents and enhancing the MLLM's ability to discern relevance are key areas for future work. |
multimodal large language models, knowledge integration, retrieval augmentation, visual question answering, external knowledge bases |
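The hierarchical retrieval pipeline described above first selects relevant documents and then selects passages within them. The sketch below illustrates that two-stage shape with hand-written 3-dimensional vectors and cosine similarity. In the paper the embeddings come from CLIP and Contriever; the vectors, the record layout, and the function names here are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def hierarchical_retrieve(image_vec, question_vec, documents, top_docs=1, top_passages=2):
    """Stage 1: rank documents against the image. Stage 2: rank their passages against the question."""
    ranked_docs = sorted(documents, key=lambda d: cosine(image_vec, d["doc_vec"]), reverse=True)
    candidates = [p for d in ranked_docs[:top_docs] for p in d["passages"]]
    ranked = sorted(candidates, key=lambda p: cosine(question_vec, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:top_passages]]


# Toy knowledge base with made-up embeddings.
documents = [
    {"title": "Eiffel Tower", "doc_vec": [0.9, 0.1, 0.0],
     "passages": [{"text": "It was completed in 1889.", "vec": [0.1, 0.9, 0.1]},
                  {"text": "It is located in Paris.", "vec": [0.8, 0.1, 0.3]},
                  {"text": "It is repainted periodically.", "vec": [0.2, 0.2, 0.9]}]},
    {"title": "Golden Gate Bridge", "doc_vec": [0.1, 0.9, 0.2],
     "passages": [{"text": "It opened in 1937.", "vec": [0.1, 0.95, 0.0]}]},
]
context = hierarchical_retrieve([1.0, 0.0, 0.1], [0.0, 1.0, 0.0], documents)
```

The Golden Gate passage is the closest match to the question vector taken alone, but stage 1 has already discarded its document. Filtering out such off-entity passages is the purpose of the hierarchy.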
2404.15349
Report |
A Survey on Multimodal Wearable Sensor-based Human Action Recognition |
Jianyuan Ni, Hao Tang, Syed Tousiful Haque, Yan Yan, Anne H. H. Ngu |
The combination of increased life expectancy and falling birth rates is
resulting in an aging population. Wearable Sensor-based Human Activity
Recognition (WSHAR) emerges as a promising assistive technology to support the
daily lives of older individuals, unlocking vast potential for human-centric
applications. However, recent surveys in WSHAR have been limited, focusing
either solely on deep learning approaches or on a single sensor modality. In
real life, humans interact with the world in a multi-sensory way, where
diverse information sources are intricately processed and interpreted to
accomplish a complex and unified sensing system. To give machines similar
intelligence, multimodal machine learning, which merges data from various
sources, has become a popular research area with recent advancements. In this
study, we present a comprehensive survey from a novel perspective on how to
apply multimodal learning to the WSHAR domain for newcomers and researchers. We
begin by presenting the recent sensor modalities as well as deep learning
approaches in HAR. Subsequently, we explore the techniques used in present
multimodal systems for WSHAR. This includes inter-multimodal systems which
utilize sensor modalities from both visual and non-visual systems and
intra-multimodal systems that simply take modalities from non-visual systems.
After that, we focus on current multimodal learning approaches that have
been applied to solve some of the challenges existing in WSHAR. Specifically, we
make extra efforts by connecting the existing multimodal literature from other
domains, such as computer vision and natural language processing, with the current
WSHAR area. Finally, we identify the corresponding challenges and potential
research directions in the current WSHAR area for further improvement. |
This paper presents a comprehensive survey on multimodal learning for wearable sensor-based human action recognition (WSHAR). |
WSHAR has vast potential for applications like assistive technology, but existing surveys are limited in scope, focusing either on deep learning only or on single sensor modalities. |
The survey covers recent sensor modalities, deep learning in HAR, inter- and intra-multimodal approaches, and multimodal solutions to WSHAR challenges. |
WSHAR datasets with IMU data are limited compared to other modalities.
Multimodal learning shows promise for addressing challenges like data scarcity and feature alignment in WSHAR.
Future research directions include future activity prediction, identifying unknown activities, and developing unified multimodal systems. |
The survey mainly focuses on the combination of IMU with other modalities, excluding some other potential modalities such as pressure sensors.
Discussion on security and ethical considerations for multimodal WSHAR systems is limited. |
multimodal learning, wearable sensors, human action recognition, deep learning, time series analysis |
2404.15276
Report |
SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation |
Xiangyu Xu, Lijuan Liu, Shuicheng Yan |
Existing Transformers for monocular 3D human shape and pose estimation
typically have a quadratic computation and memory complexity with respect to
the feature length, which hinders the exploitation of fine-grained information
in high-resolution features that is beneficial for accurate reconstruction. In
this work, we propose an SMPL-based Transformer framework (SMPLer) to address
this issue. SMPLer incorporates two key ingredients: a decoupled attention
operation and an SMPL-based target representation, which allow effective
utilization of high-resolution features in the Transformer. In addition, based
on these two designs, we also introduce several novel modules including a
multi-scale attention and a joint-aware attention to further boost the
reconstruction performance. Extensive experiments demonstrate the effectiveness
of SMPLer against existing 3D human shape and pose estimation methods both
quantitatively and qualitatively. Notably, the proposed algorithm achieves an
MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by
more than 10% with fewer than one-third of the parameters. Code and pretrained
models are available at https://github.com/xuxy09/SMPLer. |
This paper proposes SMPLer, a novel Transformer framework for monocular 3D human shape and pose estimation, enabling the efficient use of high-resolution image features for improved accuracy. |
Existing Transformers for this task struggle to utilize fine-grained information in high-resolution features due to quadratic computation and memory complexity, limiting their performance. |
SMPLer introduces two key innovations: 1) decoupled attention to reduce complexity to linear w.r.t. feature length and 2) an SMPL-based target representation for a more compact and efficient embedding. Further, it incorporates multi-scale attention and joint-aware attention modules to leverage both global and local image information. |
SMPLer significantly outperforms state-of-the-art methods on Human3.6M and 3DPW datasets, reaching an MPJPE of 45.2 mm on Human3.6M, more than 10% lower than Mesh Graphormer with fewer than one-third of the parameters.
The compact SMPL-based representation ensures smoother and more consistent 3D human meshes compared to vertex-based methods.
The explicit modeling of body part rotations in SMPLer allows for efficient and accurate control of virtual avatars. |
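The MPJPE figure quoted above is the mean Euclidean distance between predicted and ground-truth 3D joints. A minimal NumPy sketch of that metric follows; the function name and array layout are illustrative assumptions, and any root-joint or Procrustes alignment that a benchmark protocol may apply beforehand is deliberately left out.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (num_frames, num_joints, 3), in millimetres.
    Returns the mean Euclidean distance between matching joints.
    No alignment is applied here (assumption: inputs are already aligned).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data: every joint is shifted by the same offset along x.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[..., 0] += 45.2
error_mm = mpjpe(pred, gt)
```

Because every joint in the toy data is displaced by the same amount, the returned error equals that displacement.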
The current implementation still relies on a CNN backbone.
Exploring attention-based backbones within the SMPLer framework could be a future research direction. |
3d human shape and pose estimation, transformer, attention mechanism, multi-scale, smpl |
2404.15275
Report |
ID-Animator: Zero-Shot Identity-Preserving Human Video Generation |
Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang |
Generating high fidelity human video with specified identities has attracted
significant attention in the content generation community. However, existing
techniques struggle to strike a balance between training efficiency and
identity preservation, either requiring tedious case-by-case finetuning or
usually missing the identity details in the video generation process. In this
study, we present ID-Animator, a zero-shot human-video generation approach that
can perform personalized video generation given a single reference facial image
without further training. ID-Animator inherits existing diffusion-based video
generation backbones with a face adapter to encode the ID-relevant embeddings
from learnable facial latent queries. To facilitate the extraction of identity
information in video generation, we introduce an ID-oriented dataset
construction pipeline, which incorporates decoupled human attribute and action
captioning technique from a constructed facial image pool. Based on this
pipeline, a random face reference training method is further devised to
precisely capture the ID-relevant embeddings from reference images, thus
improving the fidelity and generalization capacity of our model for ID-specific
video generation. Extensive experiments demonstrate the superiority of
ID-Animator to generate personalized human videos over previous models.
Moreover, our method is highly compatible with popular pre-trained T2V models
like AnimateDiff and various community backbone models, showing high
extendability in real-world applications for video generation where identity
preservation is highly desired. Our codes and checkpoints will be released at
https://github.com/ID-Animator/ID-Animator. |
This paper proposes ID-Animator, a novel zero-shot framework for generating identity-specific human videos from a single facial image without further training. |
Generating high-fidelity, identity-specific human videos is crucial in various fields like the film industry, but existing methods struggle to balance training efficiency, identity preservation, and instruction following. |
ID-Animator combines a pretrained text-to-video diffusion model with a lightweight, trainable face adapter. It leverages an ID-oriented dataset with decoupled human attribute and action captions, and utilizes a random reference training strategy to enhance identity fidelity and instruction following. |
ID-Animator outperforms previous methods in generating personalized human videos with higher identity fidelity and motion quality.
The proposed framework allows for recontextualization of reference images by modifying attributes, backgrounds, and actions through text prompts.
ID-Animator exhibits strong generalization capabilities, effectively integrating with ControlNet and community-trained models. |
The current dataset primarily focuses on human subjects, limiting the generation to human-centric videos.
Exploring alternative architectures for the face adapter could potentially further enhance its performance. |
video generation, identity preservation, diffusion models, text-to-video synthesis, personalized content generation |
2404.15267
Report |
From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation |
Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng |
Recent advancements in controllable human image generation have led to
zero-shot generation using structural signals (e.g., pose, depth) or facial
appearance. Yet, generating human images conditioned on multiple parts of human
appearance remains challenging. Addressing this, we introduce Parts2Whole, a
novel framework designed for generating customized portraits from multiple
reference images, including pose images and various aspects of human
appearance. To achieve this, we first develop a semantic-aware appearance
encoder to retain details of different human parts, which processes each image
based on its textual label to a series of multi-scale feature maps rather than
one image token, preserving the image dimension. Second, our framework supports
multi-image conditioned generation through a shared self-attention mechanism
that operates across reference and target features during the diffusion
process. We enhance the vanilla attention mechanism by incorporating mask
information from the reference human images, allowing for the precise selection
of any part. Extensive experiments demonstrate the superiority of our approach
over existing alternatives, offering advanced capabilities for multi-part
controllable human image customization. See our project page at
https://huanngzh.github.io/Parts2Whole/. |
This paper introduces Parts2Whole, a novel framework that leverages multiple reference images (e.g., hair, face, clothes) and pose maps to generate customizable human portraits. |
Existing methods for controllable human image generation struggle to accurately synthesize images conditioned on multiple aspects of human appearance, limiting customization options for users. |
Parts2Whole utilizes a dual U-Net design, incorporating a semantic-aware appearance encoder to extract detailed features from each labeled reference image. It then employs a shared self-attention mechanism to inject these features into the generation process, guided by subject masks for precise control. |
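The shared self-attention with mask guidance described above can be pictured as target queries attending over the concatenation of target tokens and reference tokens, with reference tokens outside the subject mask excluded. The sketch below is a single-head NumPy illustration under that assumption; the function names, shapes, and the use of an additive negative-infinity mask are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_self_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, ref_mask):
    """Target queries attend over target tokens plus masked reference tokens.

    q_tgt, k_tgt, v_tgt: (n_tgt, d) target features.
    k_ref, v_ref:        (n_ref, d) reference features.
    ref_mask:            (n_ref,) bool, True where the reference token
                         lies on the selected part (assumed convention).
    """
    d = q_tgt.shape[-1]
    keys = np.concatenate([k_tgt, k_ref], axis=0)
    values = np.concatenate([v_tgt, v_ref], axis=0)
    logits = q_tgt @ keys.T / np.sqrt(d)
    # Target tokens are always visible; reference tokens only inside the mask.
    keep = np.concatenate([np.ones(len(k_tgt), dtype=bool), ref_mask])
    logits = np.where(keep[None, :], logits, -np.inf)
    return softmax(logits) @ values

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
k_ref, v_ref = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = shared_self_attention(q, k, v, k_ref, v_ref, np.ones(6, dtype=bool))
```

When the mask hides every reference token, the result reduces to ordinary self-attention over the target tokens, which is a convenient sanity check.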
Parts2Whole demonstrates superior quality and controllability compared to existing methods, accurately synthesizing human images with fine-grained details from multiple reference images.
The framework allows for flexible combinations of body parts, enabling generation from single or multiple reference images with varying aspects.
Evaluations using CLIP score, DINO score, DreamSim, and user studies confirm the effectiveness of Parts2Whole in generating high-quality and well-aligned human images. |
Current training resolution of 512x512 might introduce artifacts, suggesting higher resolution and larger diffusion models for future improvement.
Expanding the framework to achieve layer-wise clothing try-on is a promising avenue for future research. |
controllable image generation, human image synthesis, multi-reference image generation, diffusion models, appearance control |
2404.15264
Report |
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting |
Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu |
Radiance fields have demonstrated impressive performance in synthesizing
lifelike 3D talking heads. However, due to the difficulty in fitting steep
appearance changes, the prevailing paradigm that presents facial motions by
directly modifying point appearance may lead to distortions in dynamic regions.
To tackle this challenge, we introduce TalkingGaussian, a deformation-based
radiance fields framework for high-fidelity talking head synthesis. Leveraging
the point-based Gaussian Splatting, facial motions can be represented in our
method by applying smooth and continuous deformations to persistent Gaussian
primitives, without having to learn the difficult appearance changes that
previous methods must model. Due to this simplification, precise facial motions can be
synthesized while keeping a highly intact facial feature. Under such a
deformation paradigm, we further identify a face-mouth motion inconsistency
that would affect the learning of detailed speaking motions. To address this
conflict, we decompose the model into two branches separately for the face and
inside mouth areas, therefore simplifying the learning tasks to help
reconstruct more accurate motion and structure of the mouth region. Extensive
experiments demonstrate that our method renders high-quality lip-synchronized
talking head videos, with better facial fidelity and higher efficiency compared
with previous methods. |
This paper introduces TalkingGaussian, a deformation-based radiance fields framework using 3D Gaussian Splatting for high-fidelity 3D talking head synthesis. |
Existing NeRF-based methods struggle to synthesize accurate facial features due to difficulties in fitting abrupt appearance changes characteristic of facial movements. |
The method represents the talking head with Deformable Gaussian Fields, using Persistent Gaussian Fields for the static head structure and Grid-based Motion Fields to predict the deformations applied to Gaussian primitives, which represent facial movements. A Face-Mouth Decomposition module separates the face and inside-mouth regions to improve motion accuracy. An incremental sampling strategy using facial action priors smooths the deformation learning process. |
TalkingGaussian synthesizes high-quality, lip-synced talking head videos with superior facial fidelity compared to state-of-the-art methods.
The framework achieves high generalization ability, effectively handling cross-domain audio inputs.
TalkingGaussian demonstrates superior efficiency in both training and inference thanks to 3D Gaussian Splatting. |
Random noisy primitives can occur during 3DGS densification, impacting quality.
Alignment between face and inside-mouth branches relies solely on audio features, leading to potential misalignment in cross-domain scenarios. |
talking head synthesis, 3d gaussian splatting, deformation-based, radiance fields, facial fidelity |
2404.15263
Report |
Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization |
Lahav Lipson, Jia Deng |
We introduce a new system for Multi-Session SLAM, which tracks camera motion
across multiple disjoint videos under a single global reference. Our approach
couples the prediction of optical flow with solver layers to estimate camera
pose. The backbone is trained end-to-end using a novel differentiable solver
for wide-baseline two-view pose. The full system can connect disjoint
sequences, perform visual odometry, and global optimization. Compared to
existing approaches, our design is accurate and robust to catastrophic
failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose |
This paper introduces a new system for Multi-Session SLAM, which can track camera motion across multiple disjoint videos under a single global reference. |
Handling disjoint videos in SLAM is important for many applications in AR and robotics where video data often consists of multiple non-continuous sessions. |
The system couples optical flow prediction with differentiable solver layers to estimate camera pose. It utilizes a novel differentiable solver for wide-baseline two-view pose and is trained end-to-end. |
The system is more accurate than prior Multi-Session SLAM approaches on EuRoC-MAV and ETH3D datasets.
It is robust to catastrophic failures common in challenging scenarios.
The two-view pose estimation component is competitive with transformer-based matching networks on the ScanNet and MegaDepth datasets. |
The two-view pose method is less competitive in photo-tourism settings where high-volume matching is easier.
Future work could explore event cameras and inertial sensors to further improve robustness and accuracy. |
slam, multi-session slam, visual odometry, differentiable solvers, optical flow |
2404.15259
Report |
FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent |
Cameron Smith, David Charatan, Ayush Tewari, Vincent Sitzmann |
This paper introduces FlowMap, an end-to-end differentiable method that
solves for precise camera poses, camera intrinsics, and per-frame dense depth
of a video sequence. Our method performs per-video gradient-descent
minimization of a simple least-squares objective that compares the optical flow
induced by depth, intrinsics, and poses against correspondences obtained via
off-the-shelf optical flow and point tracking. Alongside the use of point
tracks to encourage long-term geometric consistency, we introduce
differentiable re-parameterizations of depth, intrinsics, and pose that are
amenable to first-order optimization. We empirically show that camera
parameters and dense depth recovered by our method enable photo-realistic novel
view synthesis on 360-degree trajectories using Gaussian Splatting. Our method
not only far outperforms prior gradient-descent based bundle adjustment
methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM
method, on the downstream task of 360-degree novel view synthesis (even though
our method is purely gradient-descent based, fully differentiable, and presents
a complete departure from conventional SfM). |
This paper introduces FlowMap, an end-to-end differentiable method that recovers accurate camera poses, intrinsics, and dense depth maps from video sequences. |
Enables novel view synthesis from unposed videos and paves the way for deep-learning based 3D reconstruction and scene understanding by being compatible with deep learning pipelines. |
Minimizes a least-squares objective comparing the optical flow induced by depth, intrinsics, and poses against correspondences obtained from off-the-shelf optical flow and point tracking. Introduces differentiable feed-forward estimations of depth (via a neural network), pose (as a solution to a least-squares problem), and intrinsics (using a differentiable selection based on optical flow consistency). |
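To make the objective above concrete, the sketch below computes the optical flow implied by a depth map, pinhole intrinsics, and a relative pose, then compares it with a target flow field using a mean squared error. It covers only the underlying geometry: the function names, the pinhole model, and the plain NumPy formulation are assumptions for illustration, and the point-tracking term and the paper's re-parameterizations are not modeled.

```python
import numpy as np

def induced_flow(depth, K, R, t):
    """Optical flow implied by depth, intrinsics, and relative pose.

    depth: (H, W) depth of frame i.
    K:     (3, 3) pinhole intrinsics (assumed shared by both frames).
    R, t:  rotation (3, 3) and translation (3,) mapping frame-i camera
           coordinates to frame-j camera coordinates.
    Returns an (H, W, 2) flow field in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                     # rays with z = 1
    pts_i = rays * depth[..., None]                     # 3D points in frame i
    pts_j = pts_i @ R.T + t                             # same points in frame j
    proj = pts_j @ K.T                                  # project into frame j
    uv_j = proj[..., :2] / proj[..., 2:3]
    return uv_j - pix[..., :2]

def flow_loss(depth, K, R, t, target_flow):
    """Mean squared difference between induced and target flow."""
    return float(np.mean((induced_flow(depth, K, R, t) - target_flow) ** 2))

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 5), 2.0)
flow = induced_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

For a pure sideways translation over a constant-depth plane, the induced flow is uniform and equal to focal length times translation divided by depth, which gives a quick check of the geometry.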
FlowMap enables photorealistic novel view synthesis up to full 360° trajectories using Gaussian Splatting.
Significantly outperforms prior gradient-descent based bundle adjustment methods.
Performs on par with COLMAP on the downstream task of 360° novel view synthesis. |
Less accurate and robust than COLMAP in terms of pose and intrinsics prediction.
Requires more GPU memory and slightly longer runtime compared to COLMAP. |
structure-from-motion, novel view synthesis, differentiable rendering, optical flow, point tracking |
2404.15228
Report |
Re-Thinking Inverse Graphics With Large Language Models |
Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Abrevaya, Michael J. Black |
Inverse graphics -- the task of inverting an image into physical variables
that, when rendered, enable reproduction of the observed scene -- is a
fundamental challenge in computer vision and graphics. Disentangling an image
into its constituent elements, such as the shape, color, and material
properties of the objects of the 3D scene that produced it, requires a
comprehensive understanding of the environment. This requirement limits the
ability of existing carefully engineered approaches to generalize across
domains. Inspired by the zero-shot ability of large language models (LLMs) to
generalize to novel contexts, we investigate the possibility of leveraging the
broad world knowledge encoded in such models in solving inverse-graphics
problems. To this end, we propose the Inverse-Graphics Large Language Model
(IG-LLM), an inverse-graphics framework centered around an LLM, that
autoregressively decodes a visual embedding into a structured, compositional
3D-scene representation. We incorporate a frozen pre-trained visual encoder and
a continuous numeric head to enable end-to-end training. Through our
investigation, we demonstrate the potential of LLMs to facilitate inverse
graphics through next-token prediction, without the use of image-space
supervision. Our analysis opens up new possibilities for precise spatial
reasoning about images that exploit the visual knowledge of LLMs. We will
release our code and data to ensure the reproducibility of our investigation
and to facilitate future research at https://ig-llm.is.tue.mpg.de/ |
This paper introduces IG-LLM, a novel framework leveraging Large Language Models (LLMs) for solving inverse graphics tasks, aiming to generate graphics programs from images for 3D scene reproduction. |
Existing inverse graphics methods struggle with generalizing to novel scenes or objects. This work explores the potential of LLMs, with their strong generalization abilities and world knowledge, to overcome these limitations. |
IG-LLM uses a pre-trained LLM enhanced with a visual encoder (CLIP) and a numeric head. It's trained on synthetic data (CLEVR and ShapeNet) with an instruction-tuning approach to predict graphics programs from images. |
IG-LLM exhibits strong compositional generalization, outperforming the baseline NS-VQA by 60% in shape recognition accuracy on out-of-distribution CLEVR data.
The integration of a numeric head enables IG-LLM to perform precise spatial reasoning, showing superior performance in 2D and SO(3) parameter space generalization tasks.
IG-LLM demonstrates promising results in 6-DoF pose estimation, scaling to multi-object scenes and exhibiting generalization ability in both single-object and scene-level settings. |
The expressiveness of IG-LLM is currently limited by the training data and code representation, potentially restricting its ability to handle complex real-world scenes.
Addressing scenes with significant occlusions or complex arrangements might require a balance between the current generic approach and task-specific inductive biases. |
inverse graphics, large language models, 3d scene understanding, compositional generalization, spatial reasoning |
2404.15141
Report |
CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method |
Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, Rongrong Ji |
Transforming large pre-trained low-resolution diffusion models to cater to
higher-resolution demands, i.e., diffusion extrapolation, significantly
improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at
simplifying and accelerating the diffusion extrapolation process, making it
more affordable and improving performance. CutDiffusion abides by the existing
patch-wise extrapolation but cuts a standard patch diffusion process into an
initial phase focused on comprehensive structure denoising and a subsequent
phase dedicated to specific detail refinement. Comprehensive experiments
highlight the numerous almighty advantages of CutDiffusion: (1) simple method
construction that enables a concise higher-resolution diffusion process without
third-party engagement; (2) fast inference speed achieved through a single-step
higher-resolution diffusion process, and fewer inference patches required; (3)
cheap GPU cost resulting from patch-wise inference and fewer patches during the
comprehensive structure denoising; (4) strong generation performance, stemming
from the emphasis on specific detail refinement. |
This paper introduces CutDiffusion, a tuning-free diffusion extrapolation method for generating high-resolution images from pre-trained low-resolution diffusion models. |
Training high-resolution diffusion models from scratch is computationally expensive and time-consuming. Diffusion extrapolation leverages pre-trained models to generate higher-resolution images efficiently. |
CutDiffusion divides the image generation process into two stages: (1) Comprehensive Structure Denoising: Randomly sampled non-overlapping patches undergo denoising with pixel interaction to ensure similar content across patches. (2) Specific Detail Refinement: Structurally-enhanced patches are reassembled into a higher-resolution latent, followed by denoising with overlapping patches to refine details. |
CutDiffusion is simple to implement, requiring only modification to the sub-patch sampling approach.
CutDiffusion achieves fast inference speeds, comparable to direct inference methods and significantly faster than existing patch-wise methods.
CutDiffusion maintains low GPU memory consumption, making it more accessible than methods demanding high GPU resources. |
The generated high-resolution image quality relies on the quality of the pretrained diffusion model.
The second stage, using overlapping patches, limits further speed improvements. |
image generation, high resolution, diffusion model, diffusion extrapolation, tuning-free |
2404.15100
Report |
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation |
Xun Wu, Shaohan Huang, Furu Wei |
Recent studies have demonstrated the exceptional potentials of leveraging
human preference datasets to refine text-to-image generative models, enhancing
the alignment between generated images and textual prompts. Despite these
advances, current human preference datasets are either prohibitively expensive
to construct or suffer from a lack of diversity in preference dimensions,
resulting in limited applicability for instruction tuning in open-source
text-to-image generative models and hindering further exploration. To address
these challenges and promote the alignment of generative models through
instruction tuning, we leverage multimodal large language models to create
VisionPrefer, a high-quality and fine-grained preference dataset that captures
multiple preference aspects. We aggregate feedback from AI annotators across
four aspects: prompt-following, aesthetic, fidelity, and harmlessness to
construct VisionPrefer. To validate the effectiveness of VisionPrefer, we train
a reward model VP-Score over VisionPrefer to guide the training of
text-to-image generative models and the preference prediction accuracy of
VP-Score is comparable to human annotators. Furthermore, we use two
reinforcement learning methods to supervised fine-tune generative models to
evaluate the performance of VisionPrefer, and extensive experimental results
demonstrate that VisionPrefer significantly improves text-image alignment in
compositional image generation across diverse aspects, e.g., aesthetic, and
generalizes better than previous human-preference metrics across various image
distributions. Moreover, VisionPrefer indicates that the integration of
AI-generated synthetic data as a supervisory signal is a promising avenue for
achieving improved alignment with human preferences in vision generative
models. |
This paper introduces VisionPrefer, a large-scale, high-quality, and diversified preference dataset for text-to-image generative alignment, constructed using feedback from multimodal large language models (MLLMs). |
Existing human preference datasets for aligning text-to-image generative models are expensive to construct and limited in scale and diversity, hindering the development of more aligned models. |
The authors leverage MLLMs to generate preferences for images generated by different text-to-image models based on a curated prompt set. The preferences cover four aspects: prompt-following, fidelity, aesthetic, and harmlessness, providing both numerical scores and textual explanations. |
The resulting dataset, VisionPrefer, is significantly larger and more fine-grained than existing human-annotated datasets.
The authors train a reward model, VP-Score, on VisionPrefer and show it achieves comparable performance to reward models trained on human preferences.
Fine-tuning generative models using VP-Score and VisionPrefer with PPO and DPO, respectively, leads to significant improvements in image quality and alignment with human preferences across various aspects. |
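A reward model trained on pairwise preferences is commonly fitted with a Bradley-Terry style loss and evaluated by how often it ranks the preferred image higher. The sketch below illustrates both quantities in NumPy. The summary does not state the exact loss used for VP-Score, so the loss form and function names here are assumptions.

```python
import numpy as np

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log sigmoid(s_w - s_l), averaged over pairs."""
    diff = (np.asarray(score_preferred, dtype=float)
            - np.asarray(score_rejected, dtype=float))
    # log(1 + exp(-diff)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -diff)))

def preference_accuracy(score_preferred, score_rejected):
    """Fraction of pairs where the preferred item receives the higher score."""
    return float(np.mean(np.asarray(score_preferred) > np.asarray(score_rejected)))

# Hypothetical reward scores for three (preferred, rejected) image pairs.
preferred = np.array([1.0, 2.0, 0.0])
rejected = np.array([0.0, 3.0, -1.0])
loss = pairwise_preference_loss(preferred, rejected)
acc = preference_accuracy(preferred, rejected)
```

The loss shrinks as the margin between preferred and rejected scores grows, while the accuracy only checks the ordering of each pair.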
The textual explanations in VisionPrefer are not fully utilized.
The issue of image distortion, while mitigated, is not completely solved and requires further research. |
text-to-image generation, preference learning, reinforcement learning from ai feedback, multimodal large language models, ai-synthesized data |
2404.15014
Report |
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving |
Guoqing Wang, Zhongdao Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma |
Existing solutions for 3D semantic occupancy prediction typically treat the
task as a one-shot 3D voxel-wise segmentation perception problem. These
discriminative methods focus on learning the mapping between the inputs and
occupancy map in a single step, lacking the ability to gradually refine the
occupancy map and the reasonable scene imaginative capacity to complete the
local regions somewhere. In this paper, we introduce OccGen, a simple yet
powerful generative perception model for the task of 3D semantic occupancy
prediction. OccGen adopts a "noise-to-occupancy" generative paradigm,
progressively inferring and refining the occupancy map by predicting and
eliminating noise originating from a random Gaussian distribution. OccGen
consists of two main components: a conditional encoder that is capable of
processing multi-modal inputs, and a progressive refinement decoder that
applies diffusion denoising using the multi-modal features as conditions. A key
insight of this generative pipeline is that the diffusion denoising process is
naturally able to model the coarse-to-fine refinement of the dense 3D occupancy
map, therefore producing more detailed predictions. Extensive experiments on
several occupancy benchmarks demonstrate the effectiveness of the proposed
method compared to the state-of-the-art methods. For instance, OccGen
relatively enhances the mIoU by 9.5%, 6.3%, and 13.3% on nuScenes-Occupancy
dataset under the multi-modal, LiDAR-only, and camera-only settings,
respectively. Moreover, as a generative perception model, OccGen exhibits
desirable properties that discriminative models cannot achieve, such as
providing uncertainty estimates alongside its multiple-step predictions. |
Introduces OccGen, a generative model for 3D semantic occupancy prediction, which progressively refines the occupancy map by predicting and eliminating noise, leading to better detail and scene completion. |
Addresses limitations of discriminative methods that lack gradual refinement and struggle with local scene completion in 3D occupancy prediction. |
Uses a 'noise-to-occupancy' paradigm with a conditional encoder for multi-modal inputs and a progressive refinement decoder applying diffusion denoising using these inputs. |
Outperforms state-of-the-art methods on nuScenes-Occupancy and SemanticKITTI benchmarks.
Offers flexible compute-accuracy trade-off through progressive inference.
Provides uncertainty estimates alongside predictions. |
Latency is currently comparable to existing methods; future work aims for a lightweight architecture.
Potential for bias in the model based on training data, needing careful consideration for real-world deployment. |
occupancy prediction, generative model, diffusion model, autonomous driving, multi-modal learning |
2404.14967
Report |
CoARF: Controllable 3D Artistic Style Transfer for Radiance Fields |
Deheng Zhang, Clara Fernandez-Labrador, Christopher Schroers |
Creating artistic 3D scenes can be time-consuming and requires specialized
knowledge. To address this, recent works such as ARF, use a radiance
field-based approach with style constraints to generate 3D scenes that resemble
a style image provided by the user. However, these methods lack fine-grained
control over the resulting scenes. In this paper, we introduce Controllable
Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene
stylization. CoARF enables style transfer for specified objects, compositional
3D style transfer and semantic-aware style transfer. We achieve controllability
using segmentation masks with different label-dependent loss functions. We also
propose a semantic-aware nearest neighbor matching algorithm to improve the
style transfer quality. Our extensive experiments demonstrate that CoARF
provides user-specified controllability of style transfer and superior style
transfer quality with more precise feature matching. |
Introduces CoARF, an algorithm for controllable 3D scene stylization using radiance fields, enabling object-specific, compositional, and semantic-aware style transfer. |
Addresses the limitations of existing 3D scene stylization methods by providing fine-grained control over the style transfer process for more precise and user-specified results. |
Utilizes a multi-view 2D mask-based optimization framework with label-dependent loss functions and a novel semantic-aware nearest neighbor matching (SANNFM) algorithm. |
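One way to picture semantic-aware nearest neighbor matching is to pick, for each rendered feature, the style feature that minimizes a combined distance over appearance features (e.g., VGG) and semantic features (e.g., LSeg). The sketch below assumes a weighted sum of cosine distances; the weight `alpha`, the additive combination, and all function names are hypothetical and may differ from the authors' formulation.

```python
import numpy as np

def _normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a (n, d) and b (m, d)."""
    return 1.0 - _normalize(a) @ _normalize(b).T

def semantic_aware_nn_match(vgg_render, vgg_style, sem_render, sem_style, alpha=1.0):
    """Index of the best style feature for each rendered feature.

    The combined distance is appearance distance plus alpha times
    semantic distance (assumed form, for illustration only).
    """
    dist = (cosine_distance(vgg_render, vgg_style)
            + alpha * cosine_distance(sem_render, sem_style))
    return dist.argmin(axis=1)

# Toy case: style feature 0 is closer in appearance, feature 1 in semantics.
vgg_render = np.array([[1.0, 0.0]])
vgg_style = np.array([[1.0, 0.1], [1.0, 0.3]])
sem_render = np.array([[0.0, 1.0]])
sem_style = np.array([[1.0, 0.0], [0.0, 1.0]])
match_appearance_only = semantic_aware_nn_match(
    vgg_render, vgg_style, sem_render, sem_style, alpha=0.0)
match_semantic_aware = semantic_aware_nn_match(
    vgg_render, vgg_style, sem_render, sem_style, alpha=1.0)
```

In the toy case the appearance-only match and the semantic-aware match select different style features, which is the behavior the semantic term is meant to provide.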
Enables users to selectively stylize specific objects within a scene while preserving the photorealism of other elements.
Allows for the application of different styles to different parts of the 3D scene through compositional style transfer.
Achieves superior style transfer quality, particularly in semantically sensitive scenarios, by leveraging both VGG and LSeg features for improved feature matching. |
Large scale differences between scene and style objects can lead to undesired stylization results.
Freezing the density field during optimization may limit the richness of the stylization outcomes. |
3d scene stylization, radiance fields, semantic-aware style transfer, controllable artistic style, neural rendering |
2404.14966
Report |
Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model |
Xu Han, Yuan Tang, Zhaoxuan Wang, Xianzhi Li |
Existing Transformer-based models for point cloud analysis suffer from
quadratic complexity, leading to compromised point cloud resolution and
information loss. In contrast, the newly proposed Mamba model, based on state
space models (SSM), outperforms Transformer in multiple areas with only linear
complexity. However, the straightforward adoption of Mamba does not achieve
satisfactory performance on point cloud tasks. In this work, we present
Mamba3D, a state space model tailored for point cloud learning to enhance local
feature extraction, achieving superior performance, high efficiency, and
scalability potential. Specifically, we propose a simple yet effective Local
Norm Pooling (LNP) block to extract local geometric features. Additionally, to
obtain better global features, we introduce a bidirectional SSM (bi-SSM) with
both a token forward SSM and a novel backward SSM that operates on the feature
channel. Extensive experimental results show that Mamba3D surpasses
Transformer-based counterparts and concurrent works in multiple tasks, with or
without pre-training. Notably, Mamba3D achieves multiple SoTA, including an
overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1%
(with single-modal pre-training) on the ModelNet40 classification task, with
only linear complexity. |
This paper proposes Mamba3D, a novel state space model (SSM) tailored for 3D point cloud learning that leverages Mamba's efficiency while addressing its limitations for unordered points and local feature extraction. |
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, limiting their scalability. Mamba, based on SSM, offers linear complexity but lacks effective adaptation for point clouds. |
The paper introduces two key components: (1) Local Norm Pooling (LNP) block for local feature extraction, using K-norm for propagation and K-pooling for aggregation. (2) Bidirectional-SSM (bi-SSM) with a token forward SSM and a novel feature reverse backward SSM (C-SSM) to capture global features while mitigating pseudo-order reliance. |
Mamba3D achieves state-of-the-art (SoTA) results on ScanObjectNN classification, outperforming previous methods even when trained from scratch.
It demonstrates superior performance in few-shot learning on ModelNet40, highlighting its ability to learn from limited data.
The model consistently outperforms Transformer-based counterparts across various tasks, including object classification and part segmentation, with reduced parameters and FLOPs. |
The pre-training benefits are not as significant as in Transformers, potentially due to limitations of masked point modeling for recurrent models like Mamba.
Future work will focus on exploring tailored pre-training strategies and scaling up the model to further exploit its linear complexity advantage. |
point cloud analysis, state space model, local feature, mamba, linear complexity |
2404.14768
Report |
Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion |
Hongyu Chen, Yiqi Gao, Min Zhou, Peng Wang, Xubin Li, Tiezheng Ge, Bo Zheng |
Recently, integrating visual controls into text-to-image (T2I) models, such
as the ControlNet method, has received significant attention for finer control
capabilities. While various training-free methods make efforts to enhance
prompt following in T2I models, the issue with visual control is still rarely
studied, especially in the scenario that visual controls are misaligned with
text prompts. In this paper, we address the challenge of "Prompt Following
With Visual Control" and propose a training-free approach named Mask-guided
Prompt Following (MGPF). Object masks are introduced to distinguish aligned and
misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed
Masked ControlNet, is designed to utilize these object masks for object
generation in the misaligned visual control region. Further, to improve
attribute matching, a simple yet efficient loss is designed to align the
attention maps of attributes with object regions constrained by ControlNet and
object masks. The efficacy and superiority of MGPF are validated through
comprehensive quantitative and qualitative experiments. |
This paper introduces Mask-guided Prompt Following (MGPF), a training-free approach for improving prompt following in text-to-image synthesis models that use visual controls (like ControlNet), specifically addressing misalignment between text prompts and visual cues. |
Existing text-to-image models with visual controls often struggle to accurately reflect text prompts, particularly when there's a misalignment between the prompt and the visual control. This leads to inaccuracies in generated images, such as missing objects or mismatched attributes, limiting the controllability and quality of image generation. |
MGPF uses object masks to separate aligned and misaligned portions of the visual control. It introduces Masked ControlNet, which utilizes these masks to focus on relevant visual features, and an Attribute-matching Loss to ensure attributes in the text prompt are correctly reflected in the generated image. |
MGPF outperforms existing training-free methods in aligning generated images with both text prompts and visual controls, as measured by text-image similarity, VQA-based metrics, and human evaluation.
The Masked ControlNet effectively addresses the 'object missing' problem by focusing on relevant visual features and allowing the model to generate objects based on the text prompt, even when misaligned with the visual control.
The Attribute-matching Loss successfully tackles the 'attribute mismatch' problem, ensuring that attributes in the text prompt are accurately reflected in the generated image without disrupting the visual control. |
The method faces challenges in complex scenarios involving attribute matching for multiple or small objects due to the limited resolution of attention maps.
Future work could explore incorporating cross-attention in higher-resolution layers to enhance localized attribute binding. |
text-to-image synthesis, visual control, prompt following, controlnet, attribute matching |
2404.14743
Report |
Gradient Guidance for Diffusion Models: An Optimization Perspective |
Yingqing Guo, Hui Yuan, Yukang Yang, Minshuo Chen, Mengdi Wang |
Diffusion models have demonstrated empirical successes in various
applications and can be adapted to task-specific needs via guidance. This paper
introduces a form of gradient guidance for adapting or fine-tuning diffusion
models towards user-specified optimization objectives. We study the theoretic
aspects of a guided score-based sampling process, linking the gradient-guided
diffusion model to first-order optimization. We show that adding gradient
guidance to the sampling process of a pre-trained diffusion model is
essentially equivalent to solving a regularized optimization problem, where the
regularization term acts as a prior determined by the pre-training data.
Diffusion models are able to learn data's latent subspace; however, explicitly
adding the gradient of an external objective function to the sampling process
would jeopardize the structure in generated samples. To remedy this issue, we
consider a modified form of gradient guidance based on a forward prediction
loss, which leverages the pre-trained score function to preserve the latent
structure in generated samples. We further consider an iteratively fine-tuned
version of gradient-guided diffusion where one can query gradients at newly
generated data points and update the score network using new samples. This
process mimics a first-order optimization iteration in expectation, for which
we prove an O(1/K) convergence rate to the global optimum when the objective
function is concave. |
The paper introduces a novel gradient-based guidance method for diffusion models, allowing them to be adapted for generating samples that optimize user-specified objectives while preserving the learned data structure. |
This is important because it bridges the gap between generative AI and optimization, enabling efficient optimization in complex design spaces (images, videos, proteins, etc.) where traditional methods struggle. |
The paper proposes a gradient guidance based on a forward prediction loss, which leverages the pre-trained score function to preserve the latent subspace structure of the data. The authors analyze two algorithms: one iteratively updates the guidance using newly queried gradients, and another additionally fine-tunes the score network with self-generated samples. |
Iteratively applying gradient guidance with a pre-trained score function generates samples whose expectation converges to a solution regularized with respect to the original data distribution.
The pre-trained score function acts as a prior, limiting the extent to which the model can be adapted away from the original data distribution.
Adaptively fine-tuning the score network using self-generated samples allows the model to converge to the global optimum of the objective function within the latent subspace, achieving a convergence rate comparable to classical convex optimization. |
Theoretical analysis focuses on the class of linear score functions, while experiments utilize a more complex U-Net architecture.
The paper primarily focuses on concave objective functions. Future work could explore extensions to non-concave objectives. |
diffusion models, generative ai, optimization, gradient guidance, score matching |
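The subspace argument in the gradient-guidance record above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's forward-prediction-loss guidance: the data subspace is assumed to be the span of an orthonormal basis `A`, the objective is an assumed concave quadratic, and the names `ascend` and `off_subspace_norm` are hypothetical. It only shows why adding a raw external gradient pulls samples off the subspace while a projected gradient does not.

```python
import numpy as np

def off_subspace_norm(x, A):
    """Norm of the component of x orthogonal to span(A); A has orthonormal columns."""
    return float(np.linalg.norm(x - A @ (A.T @ x)))

def ascend(x0, grad_f, A=None, step=0.1, iters=200):
    """Plain gradient ascent; if A is given, project each gradient onto span(A)."""
    x = x0.copy()
    for _ in range(iters):
        g = grad_f(x)
        if A is not None:
            g = A @ (A.T @ g)  # keep only the in-subspace part of the gradient
        x = x + step * g
    return x

rng = np.random.default_rng(0)
D, d = 8, 2                                    # ambient and latent dimensions (assumed)
A, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal basis of the toy subspace
target = rng.normal(size=D)                    # maximizer of f, generally off the subspace

def grad_f(x):
    """Gradient of the concave objective f(x) = -0.5 * ||x - target||^2."""
    return target - x

x0 = A @ rng.normal(size=d)                    # start on the subspace
x_naive = ascend(x0, grad_f)                   # drifts toward target, leaving the subspace
x_proj = ascend(x0, grad_f, A=A)               # stays on the subspace

print("naive off-subspace norm:    ", off_subspace_norm(x_naive, A))
print("projected off-subspace norm:", off_subspace_norm(x_proj, A))
```

In this toy setting the projected iterate converges to the best point inside the subspace, which loosely mirrors the record's claim about reaching the optimum within the latent subspace.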
2404.14676
Report |
DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance |
Linxuan Xin, Zheng Zhang, Jinfu Wei, Ge Li, Duan Gao |
Prior material creation methods had limitations in producing diverse results
mainly because reconstruction-based methods relied on real-world measurements
and generation-based methods were trained on relatively small material
datasets. To address these challenges, we propose DreamPBR, a novel
diffusion-based generative framework designed to create spatially-varying
appearance properties guided by text and multi-modal controls, providing high
controllability and diversity in material generation. Key to achieving diverse
and high-quality PBR material generation lies in integrating the capabilities
of recent large-scale vision-language models trained on billions of text-image
pairs, along with material priors derived from hundreds of PBR material
samples. We utilize a novel material Latent Diffusion Model (LDM) to establish
the mapping between albedo maps and the corresponding latent space. The latent
representation is then decoded into full SVBRDF parameter maps using a
rendering-aware PBR decoder. Our method supports tileable generation through
convolution with circular padding. Furthermore, we introduce a multi-modal
guidance module, which includes pixel-aligned guidance, style image guidance,
and 3D shape guidance, to enhance the control capabilities of the material LDM.
We demonstrate the effectiveness of DreamPBR in material creation, showcasing
its versatility and user-friendliness on a wide range of controllable
generation and editing applications. |
This paper presents DreamPBR, a novel diffusion-based generative framework for creating high-resolution spatially-varying bidirectional reflectance distribution functions (SVBRDFs) guided by text and multi-modal controls. |
Prior material creation methods were limited in producing diverse results due to relying on real-world measurements or training on small datasets. |
The method integrates pre-trained text-to-image diffusion models with material priors, using a two-stage material Latent Diffusion Model (LDM) and a rendering-aware PBR decoder. It also incorporates multi-modal guidance modules for pixel control, style control, and shape control. |
DreamPBR generates semantically correct and detailed materials based on various textual prompts, ranging from structured to imaginative.
The method supports tileable generation through convolution with circular padding.
DreamPBR enables a wide range of controllable generation and editing applications, showcasing its versatility and user-friendliness. |
The current implementation uses normal maps without displacement maps, so self-occlusion is ignored during rendering.
Generating detailed textures requires users to craft lengthy descriptions. |
physically-based rendering, svbrdf, diffusion models, text-to-image synthesis, multi-modal learning |
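The tileable-generation point in the DreamPBR record above rests on a general property of circular padding: the convolution treats the image as a torus, so the output has no seam at the borders. The following is a minimal NumPy sketch of that property, assuming an odd-sized kernel. The function name `conv2d_circular` is hypothetical, and this is not DreamPBR's network code.

```python
import numpy as np

def conv2d_circular(image, kernel):
    """Same-size 2D cross-correlation with wrap-around (circular) padding.

    Assumes a 2D image and an odd-sized 2D kernel.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # mode="wrap" copies the opposite border, so the image behaves like a torus
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="wrap")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 5))
kernel = rng.normal(size=(3, 3))

# Shifting the input cyclically just shifts the output cyclically,
# which is why tiling the result shows no border discontinuity.
shifted_then_conv = conv2d_circular(np.roll(image, (2, 3), axis=(0, 1)), kernel)
conv_then_shifted = np.roll(conv2d_circular(image, kernel), (2, 3), axis=(0, 1))
print(np.allclose(shifted_then_conv, conv_then_shifted))
```

With zero padding instead of wrap padding, the same check fails near the borders, which is where seams would appear in a tiled material.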
2404.14674
Report |
HOIN: High-Order Implicit Neural Representations |
Yang Chen, Ruituo Wu, Yipeng Liu, Ce Zhu |
Implicit neural representations (INR) suffer from worsening spectral bias,
which results in overly smooth solutions to the inverse problem. To deal with
this problem, we propose a universal framework for processing inverse problems
called High-Order Implicit Neural Representations (HOIN). By refining
the traditional cascade structure to foster high-order interactions among
features, HOIN enhances the model's expressive power and mitigates spectral
bias through its neural tangent kernel's (NTK) strong diagonal properties,
accelerating and optimizing inverse problem resolution. By analyzing the
model's expression space, high-order derivatives, and the NTK matrix, we
theoretically validate the feasibility of HOIN. HOIN realizes 1 to 3 dB
improvements in most inverse problems, establishing a new state-of-the-art
recovery quality and training efficiency, thus providing a new general paradigm
for INR and paving the way for it to solve the inverse problem. |
This paper introduces HOIN (High-Order Implicit Neural Representations), a novel framework designed to enhance the performance of Implicit Neural Representations (INRs) in tackling inverse problems. |
Traditional INRs struggle with spectral bias, resulting in overly smooth solutions lacking crucial high-frequency details. Existing mitigation strategies are often task-specific and fail to fully restore high-frequency details, highlighting the need for a universally applicable and effective solution. |
HOIN integrates high-order interaction blocks into INRs, expanding their functional space to capture richer, high-frequency information. This is achieved through a combination of suitable encoding layers (e.g., Hash Table, Position Encoding, Fourier Features) and a novel High-Order (HO) block architecture facilitating complex feature interactions. |
HOIN significantly improves image representation abilities, achieving higher PSNR values compared to baseline models, with HO-FFN demonstrating superior performance.
In image denoising, HO-Pos.Enc excels due to its moderate acceleration in high-frequency learning, outperforming models that aggressively mitigate spectral bias and blend noise with signal details.
HOIN consistently enhances performance in super-resolution, CT reconstruction, and image inpainting tasks, with HO-SIREN and HO-FFN consistently achieving superior results compared to other INR-based methods. |
While HOIN effectively mitigates spectral bias, careful consideration is needed regarding the degree of acceleration in high-frequency learning to avoid incorporating noise in specific inverse problems.
Future work could explore the adaptation of HOIN to other domains beyond image processing, such as audio or 3D model reconstruction, to further evaluate its generalizability and effectiveness. |
implicit neural representation, inverse problem, spectral bias, high-frequency information, neural tangent kernel |
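As a rough illustration of the ingredients named in the HOIN record above, the sketch below pairs a standard random Fourier feature encoding with a simple multiplicative interaction. The encoding follows the usual cos/sin form. The `second_order_block` is a hypothetical stand-in (an element-wise product of two linear maps) chosen only to show what a second-order feature interaction looks like; the record does not specify the paper's HO block, so this should not be read as its architecture.

```python
import numpy as np

def fourier_features(x, B):
    """Encode coordinates x of shape (N, d) as [cos(2*pi*x B^T), sin(2*pi*x B^T)]."""
    proj = 2.0 * np.pi * x @ B.T          # shape (N, m)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

def second_order_block(h, W1, W2):
    """Hypothetical interaction: element-wise product of two linear maps of h.

    Every output is a degree-2 polynomial in the entries of h.
    """
    return (h @ W1) * (h @ W2)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(4, 2))         # four 2D pixel coordinates in [0, 1)
B = rng.normal(scale=10.0, size=(16, 2))  # random frequencies (the scale is an assumption)
W1 = rng.normal(size=(32, 8))
W2 = rng.normal(size=(32, 8))

h = fourier_features(coords, B)           # shape (4, 32)
y = second_order_block(h, W1, W2)         # shape (4, 8)
print(h.shape, y.shape)
```

Doubling the input to `second_order_block` quadruples its output, which is one quick way to confirm the interaction is genuinely second order rather than linear.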
2404.14667
Report |
3DFlowRenderer: One-shot Face Re-enactment via Dense 3D Facial Flow Estimation |
Siddharth Nijhawan, Takuya Yashima, Tamaki Kojima |
Performing facial expression transfer under a one-shot setting has been
growing in popularity within the research community, with a focus on precise
control of expressions. Existing techniques showcase compelling results in
perceiving expressions, but they lack robustness with extreme head poses. They
also struggle to accurately reconstruct background details, thus hindering the
realism. In this paper, we propose a novel warping technology which integrates
the advantages of both 2D and 3D methods to achieve robust face re-enactment.
We generate dense 3D facial flow fields in feature space to warp an input image
based on target expressions without depth information. This enables explicit 3D
geometric control for re-enacting misaligned source and target faces. We
regularize the motion estimation capability of the 3D flow prediction network
through the proposed "Cyclic warp loss" by converting warped 3D features back into
2D RGB space. To ensure the generation of a finer facial region with a
natural background, our framework only renders the facial foreground region
first and learns to inpaint the blank area which needs to be filled due to
source face translation, thus reconstructing the detailed background without
any unwanted pixel motion. Extensive evaluation reveals that our method
outperforms state-of-the-art techniques in rendering artifact-free facial
images. |
This paper proposes 3DFlowRenderer, a novel one-shot face re-enactment framework that leverages dense 3D facial flow estimation to enhance robustness and realism, especially in extreme head pose variations. |
Existing methods struggle with extreme head poses and accurate background reconstruction, limiting the realism of face re-enactment. This work addresses these limitations by integrating the strengths of both 2D and 3D methods. |
The proposed 3DFlowRenderer employs a four-stage process: 1) Pre-processing: separates foreground and background, estimates 3DMM parameters for target motion; 2) 3D Warping: computes dense 3D facial flow fields to warp source foreground based on target expressions; 3) Image Refinement: refines the warped foreground using a TransUNet block; and 4) Image Inpainting: projects refined foreground onto the source background and inpaints the missing regions using another TransUNet block. |
Outperforms state-of-the-art methods in terms of realism (FID), noise reduction (PSNR), reconstruction quality (SSIM), identity preservation (CSIM), and motion transfer accuracy (AED, AKD, APD).
Demonstrates robustness to extreme head pose and expression variations.
Successfully renders finer facial details and preserves background information without leakage or unwanted motion. |
The accuracy of 3DMM parameter estimation can impact the overall performance.
Future work includes extending the framework for handling occlusions and incorporating temporal consistency for video re-enactment. |
face re-enactment, one-shot, 3d warping, image-to-image synthesis, 3dmm |
2404.14581
Report |
The Adversarial AI-Art: Understanding, Generation, Detection, and Benchmarking |
Yuying Li, Zeyan Liu, Junyi Zhao, Liangqin Ren, Fengjun Li, Jiebo Luo, Bo Luo |
Generative AI models can produce high-quality images based on text prompts.
The generated images often appear indistinguishable from images generated by
conventional optical photography devices or created by human artists (i.e.,
real images). While the outstanding performance of such generative models is
generally well received, security concerns arise. For instance, such image
generators could be used to facilitate fraud or scam schemes, generate and
spread misinformation, or produce fabricated artworks. In this paper, we
present a systematic attempt at understanding and detecting AI-generated images
(AI-art) in adversarial scenarios. First, we collect and share a dataset of
real images and their corresponding artificial counterparts generated by four
popular AI image generators. The dataset, named ARIA, contains over 140K images
in five categories: artworks (painting), social media images, news photos,
disaster scenes, and anime pictures. This dataset can be used as a foundation
to support future research on adversarial AI-art. Next, we present a user study
that employs the ARIA dataset to evaluate if real-world users can distinguish
AI-generated images from real ones, with or without reference images. In a benchmarking study, we further evaluate
if state-of-the-art open-source and commercial AI image detectors can
effectively identify the images in the ARIA dataset. Finally, we present a
ResNet-50 classifier and evaluate its accuracy and transferability on the ARIA
dataset. |
This paper presents ARIA, a comprehensive dataset of adversarial AI-generated art, and investigates the challenges in detecting such art by both humans and AI detectors. |
The rise of AI-generated art poses significant risks, including social media fraud, fake news, and art style imitation, necessitating a better understanding and reliable detection methods. |
The authors collected a large-scale dataset (ARIA) of real and AI-generated images across five categories. They conducted a user study to assess human detection ability and benchmarked various open-source and commercial AI image detectors. |
Human users struggle to distinguish real from AI-generated images, even with references.
Most open-source and commercial detectors exhibit unsatisfactory accuracy, especially for images generated with both text and image prompts.
Supervised classifiers trained on ARIA show promise, with models trained on Midjourney data demonstrating better generalizability. |
The dataset, while extensive, may not encompass the full spectrum of future AI models.
Budget limitations restricted the evaluation of some commercial detectors. |
aigc, ai-generated images, ai-art, adversarial attacks, image detection |
2404.14507
Report |
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models |
Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis |
Diffusion models (DMs) have established themselves as the state-of-the-art
generative modeling approach in the visual domain and beyond. A crucial
drawback of DMs is their slow sampling speed, relying on many sequential
function evaluations through large neural networks. Sampling from DMs can be
seen as solving a differential equation through a discretized set of noise
levels known as the sampling schedule. While past works primarily focused on
deriving efficient solvers, little attention has been given to finding optimal
sampling schedules, and the entire literature relies on hand-crafted
heuristics. In this work, for the first time, we propose a general and
principled approach to optimizing the sampling schedules of DMs for
high-quality outputs, called Align Your Steps. We leverage methods
from stochastic calculus and find optimal schedules specific to different
solvers, trained DMs and datasets. We evaluate our novel approach on several
image, video as well as 2D toy data synthesis benchmarks, using a variety of
different samplers, and observe that our optimized schedules outperform
previous hand-crafted schedules in almost all experiments. Our method
demonstrates the untapped potential of sampling schedule optimization,
especially in the few-step synthesis regime. |
A novel framework, named Align Your Steps (AYS), is introduced for optimizing sampling schedules in diffusion models, particularly beneficial for generating high-quality outputs in few-step synthesis. |
Diffusion models (DMs) are powerful but suffer from slow sampling speed due to sequential function evaluations. Optimizing sampling schedules, a previously overlooked aspect, can significantly enhance output quality and efficiency. |
The methodology leverages stochastic calculus to minimize the Kullback-Leibler divergence between the true generative SDE and a solver-specific linearized SDE. This is formulated as an optimization problem over the sampling schedule, solved iteratively using Monte Carlo integration with time-based importance sampling. |
Optimized schedules consistently outperform hand-crafted schedules across various datasets (2D toy data, CIFAR10, FFHQ, ImageNet), models (Stable Diffusion, SDXL, DeepFloyd-IF, Stable Video Diffusion), and solvers.
Significant quality improvements are observed in the low NFE (Number of Function Evaluations) regime, with optimized schedules sometimes achieving quality comparable to default schedules with 1.5x fewer steps.
Optimized schedules derived for one solver often generalize well to other solvers, both stochastic and deterministic. |
The optimization objective is an upper bound on the discretization error, necessitating an early stopping mechanism to avoid over-optimization.
Optimizing schedules for conditional diffusion models, where the optimal schedule might vary depending on the conditioning input, needs further exploration. |
diffusion models, sampling schedules, generative modeling, stochastic calculus, optimization |
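For context on the Align Your Steps record above, a sampling schedule is just a decreasing list of noise levels. The sketch below builds two common hand-crafted baselines of the kind the record contrasts with optimized schedules: the polynomial (rho) spacing popularized by the EDM sampler, and a geometric (log-uniform) spacing. The default values `sigma_min=0.002`, `sigma_max=80.0` and `rho=7.0` are the usual EDM settings and are assumptions here, not values from the record. This is not the AYS optimization procedure itself.

```python
import numpy as np

def polynomial_sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, uniform in sigma**(1/rho)."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_max = sigma_max ** (1.0 / rho)
    inv_min = sigma_min ** (1.0 / rho)
    return (inv_max + ramp * (inv_min - inv_max)) ** rho

def geometric_sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0):
    """Noise levels from sigma_max down to sigma_min, uniform in log(sigma)."""
    return np.geomspace(sigma_max, sigma_min, n_steps)

poly = polynomial_sigma_schedule(10)
geom = geometric_sigma_schedule(10)
print(np.round(poly, 3))
print(np.round(geom, 3))
```

Both functions fix the spacing rule in advance. An optimized schedule would instead treat the intermediate noise levels as free variables and tune them for a given solver, model and dataset, as the record describes.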
2404.14410
Report |
Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses |
Inhee Lee, Byungjun Kim, Hanbyul Joo |
In this paper, we present a method to reconstruct the world and multiple
dynamic humans in 3D from a monocular video input. As a key idea, we represent
both the world and multiple humans via the recently emerging 3D Gaussian
Splatting (3D-GS) representation, enabling us to conveniently and efficiently
compose and render them together. In particular, we address the scenarios with
severely limited and sparse observations in 3D human reconstruction, a common
challenge encountered in the real world. To tackle this challenge, we introduce
a novel approach to optimize the 3D-GS representation in a canonical space by
fusing the sparse cues in the common space, where we leverage a pre-trained 2D
diffusion model to synthesize unseen views while keeping the consistency with
the observed 2D appearances. We demonstrate our method can reconstruct
high-quality animatable 3D humans in various challenging examples, in the
presence of occlusion, image crops, few-shot, and extremely sparse
observations. After reconstruction, our method is capable of not only rendering
the scene from any novel view at arbitrary time instances, but also editing the
3D scene by removing individual humans or applying different motions for each
human. Through various experiments, we demonstrate the quality and efficiency
of our methods over alternative existing approaches. |
This paper proposes a novel method for reconstructing dynamic 3D scenes with multiple humans from monocular videos, addressing the challenges of sparse and limited observations. |
Reconstructing 4D scenes from monocular videos is crucial for various applications, but existing methods struggle with realistic human representation, especially under sparse observations. |
The method leverages 3D Gaussian Splatting to represent both the static world and dynamic humans, enabling efficient composing and rendering. It introduces a novel canonical space optimization approach that fuses sparse cues and utilizes a pre-trained 2D diffusion model with Texture Inversion to synthesize unseen human body parts, ensuring consistency with observed appearances. |
The method successfully reconstructs high-quality animatable 3D human avatars, even with severe occlusions and limited viewpoints.
It demonstrates superior performance compared to existing approaches on challenging datasets like Panoptic and Hi4D.
The proposed approach offers high computational efficiency, achieving real-time novel pose rendering speed. |
The method currently relies on provided SMPL fitting and primarily focuses on humans as dynamic objects.
Future work could explore integrating SMPL estimation within the pipeline and extending the approach to encompass various dynamic objects beyond humans. |
3d scene reconstruction, monocular video, dynamic humans, sparse observations, diffusion models |
2404.14409
Report |
CrossScore: Towards Multi-View Image Evaluation and Scoring |
Zirui Wang, Wenjing Bian, Omkar Parkhi, Yuheng Ren, Victor Adrian Prisacariu |
We introduce a novel cross-reference image quality assessment method that
effectively fills the gap in the image assessment landscape, complementing the
array of established evaluation schemes -- ranging from full-reference metrics
like SSIM and no-reference metrics such as NIQE to general-reference metrics
including FID and multi-modal-reference metrics, e.g., CLIPScore. Utilising a
neural network with the cross-attention mechanism and a unique data collection
pipeline from NVS optimisation, our method enables accurate image quality
assessment without requiring ground truth references. By comparing a query
image against multiple views of the same scene, our method addresses the
limitations of existing metrics in novel view synthesis (NVS) and similar tasks
where direct reference images are unavailable. Experimental results show that
our method is closely correlated to the full-reference metric SSIM, while not
requiring ground truth references. |
This paper introduces CrossScore, a novel cross-reference image quality assessment (CR-IQA) method for evaluating image quality using multiple unregistered reference views of the same scene. |
Existing IQA methods, relying on full-reference, no-reference, general-reference, or multi-modal-reference schemes, are inadequate for tasks like novel view synthesis (NVS) where ground truth references are unavailable for true novel views. |
The method utilizes a neural network with a cross-attention mechanism. It predicts a score map approximating the SSIM score by comparing a query image with a set of multi-view reference images. The model is trained using a self-supervised approach, leveraging NVS algorithms to generate distorted images and their corresponding SSIM maps. |
CrossScore exhibits a strong correlation with the full-reference SSIM score without requiring ground truth reference images.
Trained solely on the Map-free Relocalisation (MFR) dataset, CrossScore generalizes well to other datasets, demonstrating its versatility.
CrossScore effectively evaluates NVS renderings from true novel trajectories without ground truth, aligning with traditional SSIM-based evaluations. |
The score maps generated by CrossScore lack the sharpness of full-reference SSIM, potentially due to patch-wise encoding.
The method faces challenges in evaluating unconventional images, such as those from fish-eye lenses, leading to inaccurate predictions. |
image quality assessment, novel view synthesis, cross-reference, cross-attention, self-supervised learning |
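The cross-attention mechanism mentioned in the CrossScore record above can be sketched generically: tokens from the query image attend to tokens pooled from the reference views. This is a minimal single-head NumPy sketch of standard scaled dot-product cross-attention. The token counts, dimensions and weight names are assumptions for illustration, and it does not reproduce the CrossScore network or its score-map head.

```python
import numpy as np

def cross_attention(query_tokens, ref_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: queries from one image, keys/values from references.

    Returns the attended features and the attention weights.
    """
    Q = query_tokens @ Wq                              # (Nq, dk)
    K = ref_tokens @ Wk                                # (Nr, dk)
    V = ref_tokens @ Wv                                # (Nr, dv)
    logits = Q @ K.T / np.sqrt(Q.shape[-1])            # (Nq, Nr)
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(9, 16))      # e.g. 9 patches of the query image
ref_tokens = rng.normal(size=(3 * 9, 16))    # e.g. 3 reference views, 9 patches each
Wq = rng.normal(size=(16, 8))
Wk = rng.normal(size=(16, 8))
Wv = rng.normal(size=(16, 8))

attended, weights = cross_attention(query_tokens, ref_tokens, Wq, Wk, Wv)
print(attended.shape, weights.shape)
```

Each query patch receives a weighted mix of reference-view features, so no pixel-aligned ground truth image is needed. A per-patch regression head on top of `attended` could then predict a score map, which is left out here.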
2404.14403
Report |
GeoDiffuser: Geometry-Based Image Editing with Diffusion Models |
Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil Katyal, Srinath Sridhar |
The success of image generative models has enabled us to build methods that
can edit images based on text or other user input. However, these methods are
bespoke, imprecise, require additional information, or are limited to only 2D
image edits. We present GeoDiffuser, a zero-shot optimization-based method that
unifies common 2D and 3D image-based object editing capabilities into a single
method. Our key insight is to view image editing operations as geometric
transformations. We show that these transformations can be directly
incorporated into the attention layers in diffusion models to implicitly
perform editing operations. Our training-free optimization method uses an
objective function that seeks to preserve object style but generate plausible
images, for instance with accurate lighting and shadows. It also inpaints
disoccluded parts of the image where the object was originally located. Given a
natural image and user input, we segment the foreground object using SAM and
estimate a corresponding transform which is used by our optimization approach
for editing. GeoDiffuser can perform common 2D and 3D edits like object
translation, 3D rotation, and removal. We present quantitative results,
including a perceptual study, that show how our approach is better than
existing methods. Visit https://ivl.cs.brown.edu/research/geodiffuser.html for
more information. |
GeoDiffuser is a novel zero-shot optimization-based method for 2D and 3D image editing that leverages the power of pre-trained diffusion models. It unifies various image editing capabilities, such as object translation, rotation, scaling, and removal, into a single framework by treating these operations as geometric transformations directly incorporated into the attention layers of diffusion models. |
Existing image editing methods often require bespoke solutions, lack precision, demand additional information (e.g., text prompts, optical flow), or are limited to 2D edits. GeoDiffuser overcomes these limitations by providing a unified and flexible approach for realistic and style-preserving image editing in both 2D and 3D. |
GeoDiffuser employs a shared attention mechanism within a diffusion model's editing framework. First, it performs DDIM inversion on the input image to obtain a latent noise trajectory. Then, it applies user-specified geometric transformations to the query embeddings of the reference attention layer, guiding the edit diffusion process. An optimization procedure, incorporating losses for background preservation, object preservation, inpainting, and smoothness, refines the edited image while ensuring realism and style consistency. |
Qualitative results demonstrate GeoDiffuser's capability to perform a variety of realistic 2D and 3D edits, including object translation, rotation, scaling, and removal, while preserving object style, lighting, shadows, and reflections.
Quantitative evaluation, including a perceptual study, shows that users significantly prefer GeoDiffuser's editing results over existing methods like LaMa and Zero123-XL for realism, adherence to the desired edit, and inpainting quality.
Metrics such as Mean Distance and Warp Error confirm GeoDiffuser's superior performance in accurately transforming foreground objects and adhering to user-specified edits compared to baselines. |
GeoDiffuser currently struggles with foreground object disocclusions arising from significant 3D motions.
The method occasionally produces artifacts due to downsampled attention masks. |
image editing, diffusion models, geometric transformations, shared attention, zero-shot learning |
2404.14396
Report |
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation |
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan |
The rapid evolution of multimodal foundation models has demonstrated
significant progress in vision-language understanding and generation, e.g.,
our previous work SEED-LLaMA. However, there remains a gap between its
capability and the real-world applicability, primarily due to the model's
limited capacity to effectively respond to various user instructions and
interact with diverse visual data. In this work, we focus on bridging this gap
through integrating two enhanced features: (1) comprehending images of
arbitrary sizes and ratios, and (2) enabling multi-granularity image
generation. We present a unified and versatile foundation model, namely,
SEED-X, which is able to model multi-granularity visual semantics for
comprehension and generation tasks. Besides the competitive results on public
benchmarks, SEED-X demonstrates its effectiveness in handling real-world
applications across various domains after instruction tuning. We hope that our
work will inspire future research into what can be achieved by versatile
multimodal foundation models in real-world applications. The models, codes, and
datasets will be released at https://github.com/AILab-CVC/SEED-X. |
This paper presents SEED-X, a versatile multimodal foundation model that integrates comprehension of images of arbitrary sizes and multi-granularity image generation for real-world applications. |
Existing multimodal models struggle to effectively respond to user instructions and interact with diverse visual data in real-world scenarios. |
The authors incorporate a visual tokenizer for unified image comprehension and generation, dynamic resolution image encoding for arbitrary image size handling, and multi-stage training including pre-training on massive data and instruction tuning on domain-specific datasets. |
SEED-X achieves state-of-the-art image generation results on SEED-Bench-2, outperforming previous unified comprehension and generation models.
The model demonstrates strong performance in multimodal comprehension tasks, achieving competitive results on benchmarks like MMB and SEED-Bench-2.
Qualitative evaluations showcase SEED-X's capabilities as a multimodal AI assistant, excelling in tasks like image editing, text-rich comprehension, and creative image generation. |
The paper lacks an all-in-one instruction-tuned model, focusing on domain-specific fine-tuning instead.
The advantage of dynamic resolution encoding is not fully demonstrated due to limited data with unusual aspect ratios in existing benchmarks. |
multimodal foundation model, image comprehension, image generation, instruction tuning, real-world applications |
2404.14368
Report |
Graphic Design with Large Multimodal Model |
Yutao Cheng, Zhao Zhang, Maoke Yang, Hui Nie, Chunyuan Li, Xinglong Wu, Jie Shao |
In the field of graphic design, automating the integration of design elements
into a cohesive multi-layered artwork not only boosts productivity but also
paves the way for the democratization of graphic design. One existing practice
is Graphic Layout Generation (GLG), which aims to lay out sequential design
elements. It has been constrained by the necessity for a predefined correct
sequence of layers, thus limiting creative potential and increasing user
workload. In this paper, we present Hierarchical Layout Generation (HLG) as a
more flexible and pragmatic setup, which creates graphic composition from
unordered sets of design elements. To tackle the HLG task, we introduce
Graphist, the first layout generation model based on large multimodal models.
Graphist efficiently reframes HLG as a sequence generation problem: it takes
RGB-A images as input and outputs a JSON draft protocol indicating the
coordinates, size, and order of each element. We develop new evaluation metrics
for HLG. Graphist outperforms prior art and establishes a strong baseline for
this field. Project homepage: https://github.com/graphic-design-ai/graphist |
This paper introduces Hierarchical Layout Generation (HLG), a new task for creating graphic compositions from unordered design elements, and presents Graphist, the first large multimodal model (LMM) for this task. |
HLG overcomes the limitations of previous Graphic Layout Generation (GLG) methods by removing the need for predefined layer ordering, allowing for greater flexibility and practicality in AI-assisted graphic design. |
Graphist reframes HLG as a sequence generation problem, taking RGB-A images as input and outputting a JSON draft protocol specifying element positions, sizes, and order. |
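To make the output format concrete, here is a minimal Python sketch of parsing such a JSON draft. The field names (`layers`, `index`, `x`, `y`, `width`, `height`) and the convention that list order encodes bottom-to-top stacking are assumptions made for illustration; the summary above does not specify Graphist's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass
class LayerSpec:
    index: int   # hypothetical id of the input design element
    x: int       # left coordinate on the canvas
    y: int       # top coordinate on the canvas
    width: int
    height: int


def parse_draft(draft_json: str) -> list[LayerSpec]:
    """Parse a JSON draft; list order is assumed to be bottom-to-top layer order."""
    layers = json.loads(draft_json)["layers"]
    return [
        LayerSpec(item["index"], item["x"], item["y"], item["width"], item["height"])
        for item in layers
    ]


# Hypothetical draft: element 2 is the background, element 0 sits on top of it.
draft = (
    '{"layers": ['
    '{"index": 2, "x": 0, "y": 0, "width": 800, "height": 600}, '
    '{"index": 0, "x": 40, "y": 60, "width": 300, "height": 120}'
    ']}'
)
stack = parse_draft(draft)
```

Emitting position, size, and order in one structured sequence is what lets a language model treat layout as ordinary text generation.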
Graphist outperforms existing methods on GLG tasks and establishes a strong baseline for HLG.
New evaluation metrics for HLG are introduced: Inverse Order Pair Ratio (IOPR) for layer order accuracy and GPT-4V Eval for overall aesthetic quality.
Ablation studies demonstrate the importance of input sequence flexibility, LLM choice, visual token length, and the use of RGB-A over RGB images. |
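A layer-order metric of this kind can be sketched as the fraction of element pairs whose relative order is inverted with respect to a reference ordering. This is an illustrative assumption about how an inverse-order-pair ratio might be computed; the paper's exact definition of IOPR (for example, whether only overlapping elements are counted) may differ.

```python
from itertools import combinations


def inverse_order_pair_ratio(predicted: list[int], reference: list[int]) -> float:
    """Fraction of element pairs whose relative layer order is flipped."""
    # Position of every element in the predicted stacking order.
    rank = {element: position for position, element in enumerate(predicted)}
    # combinations() yields (a, b) with a below b in the reference order.
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 0.0
    inverted = sum(1 for a, b in pairs if rank[a] > rank[b])
    return inverted / len(pairs)
```

Under this definition, a lower value means the predicted stacking agrees more closely with the reference.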
Generating complete sets of high-quality design materials and aligning designs more closely with human aesthetics require further research.
Potential negative impacts include design homogeneity and the environmental cost of model training. |
graphic design, layout generation, lmm, mllm, hlg |
2404.14249
Report |
CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding |
Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, Kanglin Liu |
The recent 3D Gaussian Splatting (GS) exhibits high-quality and real-time
synthesis of novel views in 3D scenes. Currently, it primarily focuses on
geometry and appearance modeling, while lacking the semantic understanding of
scenes. To bridge this gap, we present CLIP-GS, which integrates semantics from
Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting to
efficiently comprehend 3D environments without annotated semantic data.
Specifically, rather than straightforwardly learning and rendering high-dimensional
semantic features of 3D Gaussians, which significantly diminishes the
efficiency, we propose a Semantic Attribute Compactness (SAC) approach. SAC
exploits the inherent unified semantics within objects to learn compact yet
effective semantic representations of 3D Gaussians, enabling highly efficient
rendering (>100 FPS). Additionally, to address the semantic ambiguity caused
by using view-inconsistent 2D CLIP semantics to supervise Gaussians, we
introduce a 3D Coherent Self-training (3DCS) strategy that draws on the
multi-view consistency inherent in the 3D model. 3DCS imposes cross-view
semantic consistency constraints by leveraging refined, self-predicted
pseudo-labels derived from the trained 3D Gaussian model, thereby producing
precise and view-consistent segmentation results. Extensive experiments
demonstrate that our method remarkably outperforms existing state-of-the-art
approaches, achieving improvements of 17.29% and 20.81% in mIoU metric on
Replica and ScanNet datasets, respectively, while maintaining real-time
rendering speed. Furthermore, our approach exhibits superior performance even
with sparse input data, verifying the robustness of our method. |
This paper introduces CLIP-GS, a novel method for real-time and accurate semantic understanding of 3D scenes using Gaussian Splatting. It leverages the inherent efficiency of Gaussian Splatting and incorporates semantic information from CLIP. |
Existing methods for 3D scene understanding either lack semantic comprehension or suffer from slow rendering speeds, hindering real-time applications like robotics and AR/VR. |
CLIP-GS addresses these limitations through two key innovations: 1) **Semantic Attribute Compactness (SAC):** Efficiently represents scene semantics by learning compact embeddings for 3D Gaussians. 2) **3D Coherent Self-training (3DCS):** Enhances semantic consistency across different views by leveraging cross-view self-predicted semantics. |
Significantly outperforms state-of-the-art methods in both semantic segmentation accuracy and rendering efficiency on Replica and ScanNet datasets.
Achieves over 17% and 20% improvement in mIoU over the second-best method on Replica and ScanNet datasets, respectively, while maintaining real-time rendering speed (>100 FPS).
Exhibits superior robustness compared to existing approaches, achieving high-quality reconstruction and segmentation even with sparse input data. |
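For reference, the mIoU metric quoted above averages per-class intersection-over-union. The sketch below works on flat label lists; skipping classes that appear in neither the prediction nor the ground truth is one common convention and is an assumption here, not necessarily the evaluation protocol used for Replica or ScanNet.

```python
def mean_iou(pred: list[int], target: list[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # ignore classes absent from both prediction and target
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```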
The current implementation primarily focuses on indoor scenes and could be extended to handle more complex outdoor environments.
Exploring the integration of temporal information for dynamic scene understanding presents a promising direction for future research. |
3d gaussian splatting, real-time, view-consistent, 3d scene semantic understanding, 3d scene reconstruction |
2404.14239
Report |
MultiBooth: Towards Generating All Your Concepts in an Image from Text |
Chenyang Zhu, Kai Li, Yue Ma, Chunming He, Li Xiu |
This paper introduces MultiBooth, a novel and efficient technique for
multi-concept customization in image generation from text. Despite the
significant advancements in customized generation methods, particularly with
the success of diffusion models, existing methods often struggle with
multi-concept scenarios due to low concept fidelity and high inference cost.
MultiBooth addresses these issues by dividing the multi-concept generation
process into two phases: a single-concept learning phase and a multi-concept
integration phase. During the single-concept learning phase, we employ a
multi-modal image encoder and an efficient concept encoding technique to learn
a concise and discriminative representation for each concept. In the
multi-concept integration phase, we use bounding boxes to define the generation
area for each concept within the cross-attention map. This method enables the
creation of individual concepts within their specified regions, thereby
facilitating the formation of multi-concept images. This strategy not only
improves concept fidelity but also reduces additional inference cost.
MultiBooth surpasses various baselines in both qualitative and quantitative
evaluations, showcasing its superior performance and computational efficiency.
Project Page: https://multibooth.github.io/ |
This paper introduces MultiBooth, a novel and efficient two-phase method for multi-concept customization in text-to-image generation, addressing the limitations of existing techniques in handling multiple customized subjects. |
Existing customized generation methods primarily focus on single-concept customization and struggle to generate high-fidelity images with multiple customized subjects while preserving text alignment. |
MultiBooth employs a two-phase approach: single-concept learning using a multi-modal encoder, adaptive concept normalization, and efficient concept encoding, followed by multi-concept integration using a regional customization module within the cross-attention layers of the U-Net. |
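The region-restricted integration step can be illustrated with a small sketch: a bounding box becomes a binary mask at the resolution of the attention map, and the concept-specific output replaces the base output only inside that mask. The function names, the normalized box format, and the simple blend are assumptions for illustration; MultiBooth's actual regional customization module is not specified in this summary.

```python
def box_mask(box: tuple[float, float, float, float], height: int, width: int) -> list[list[float]]:
    """Binary mask that is 1.0 where a cell center lies inside a normalized (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return [
        [
            1.0 if (x0 <= (col + 0.5) / width < x1 and y0 <= (row + 0.5) / height < y1) else 0.0
            for col in range(width)
        ]
        for row in range(height)
    ]


def regional_blend(base: list[list[float]], concept: list[list[float]],
                   mask: list[list[float]]) -> list[list[float]]:
    """Use the concept-specific values inside the mask and the base values elsewhere."""
    return [
        [m * c + (1.0 - m) * b for b, c, m in zip(base_row, concept_row, mask_row)]
        for base_row, concept_row, mask_row in zip(base, concept, mask)
    ]


# Hypothetical 2x4 attention-map resolution with one concept confined to the left half.
mask = box_mask((0.0, 0.0, 0.5, 1.0), height=2, width=4)
blended = regional_blend([[0.0] * 4] * 2, [[1.0] * 4] * 2, mask)
```

With several concepts, one such mask per bounding box keeps each concept from leaking into the others' regions.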
MultiBooth achieves superior image quality, faithfulness to intended concepts, and alignment with text prompts compared to state-of-the-art methods.
The method demonstrates high efficiency in both training and inference time due to its single-concept learning and regional customization module.
The framework exhibits flexibility and can be seamlessly integrated with other techniques like LoRA-based DreamBooth and ControlNet for enhanced customization. |
The current method still requires training for learning new concepts.
Future work will focus on exploring training-free multi-concept customization based on MultiBooth. |
text-to-image generation, personalized image generation, multi-concept customization, diffusion models, adaptive concept normalization |
2404.14199
Report |
Generalizable Neural Human Renderer |
Mana Masuda, Jinhyung Park, Shun Iwase, Rawal Khirodkar, Kris Kitani |
While recent advancements in animatable human rendering have achieved
remarkable results, they require test-time optimization for each subject which
can be a significant limitation for real-world applications. To address this,
we tackle the challenging task of learning a Generalizable Neural Human
Renderer (GNH), a novel method for rendering animatable humans from monocular
video without any test-time optimization. Our core method focuses on
transferring appearance information from the input video to the output image
plane by utilizing explicit body priors and multi-view geometry. To render the
subject in the intended pose, we utilize a straightforward CNN-based image
renderer, foregoing the more common ray-sampling or rasterizing-based rendering
modules. Our GNH achieves remarkably generalizable, photorealistic rendering
of unseen subjects through a three-stage process. We quantitatively and
qualitatively demonstrate that GNH significantly surpasses current
state-of-the-art methods, notably achieving a 31.3% improvement in LPIPS. |
The paper introduces GNH, a generalizable neural human renderer that generates animatable humans from monocular videos without test-time optimization. |
Existing animatable human rendering methods require time-consuming per-subject optimization, limiting their practical application. |
GNH uses a three-stage process: 1) appearance feature extraction from input video frames, 2) feature transformation to the target pose and projection to 2D, 3) multi-frame feature fusion and rendering using a CNN. |
GNH outperforms state-of-the-art generalizable human rendering methods, achieving a 31.3% improvement in LPIPS.
GNH demonstrates superior rendering quality compared to methods requiring test-time optimization or multi-view inputs.
The rendering speed of GNH is 2-7 times faster than baseline generalizable human NeRF methods. |
GNH relies on accurate pose and mask estimations for input views, which can impact performance.
The model does not account for dynamic lighting changes. |
neural rendering, novel view synthesis, human rendering, generalizable rendering, monocular video |
2404.14162
Report |
FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on |
Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, Hongming Shan |
Despite their impressive generative performance, latent diffusion model-based
virtual try-on (VTON) methods lack faithfulness to crucial details of the
clothes, such as style, pattern, and text. To alleviate these issues caused by
the stochastic nature of diffusion and latent supervision, we propose a novel
Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves
the conventional latent diffusion process in three major aspects. First, we
propose incorporating warped clothes as both the starting point and local
condition, supplying the model with faithful clothes priors. Second, we
introduce a novel clothes flattening network to constrain generated try-on
images, providing clothes-consistent faithful supervision. Third, we devise a
clothes-posterior sampling for faithful inference, further enhancing the model
performance over conventional clothes-agnostic Gaussian sampling. Extensive
experimental results on the benchmark VITON-HD and Dress Code datasets
demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is
able to generate photo-realistic try-on images with faithful clothing details. |
This paper proposes FLDM-VTON, a novel faithful latent diffusion model for virtual try-on that enhances the faithfulness of generated clothing details. |
Existing latent diffusion model-based virtual try-on methods often produce unfaithful clothing details due to the stochastic nature of diffusion models and latent supervision. |
FLDM-VTON incorporates warped clothes as priors, introduces a clothes flattening network for clothes-consistent supervision, and employs clothes-posterior sampling for faithful inference. |
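One way to picture a clothes-informed starting point is the standard forward-diffusion noising formula applied to the warped-clothes latent instead of drawing pure Gaussian noise. The sketch below is a hedged illustration of that idea only: the function name, the flat-list latent, and the `alpha_bar_T` value are assumptions, and FLDM-VTON's actual sampling procedure may differ.

```python
import math
import random


def clothes_informed_start(clothes_latent: list[float], alpha_bar_T: float,
                           rng: random.Random) -> list[float]:
    """Noise a latent with x_T = sqrt(alpha_bar_T) * x_0 + sqrt(1 - alpha_bar_T) * eps."""
    signal_scale = math.sqrt(alpha_bar_T)
    noise_scale = math.sqrt(1.0 - alpha_bar_T)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in clothes_latent]


# Hypothetical warped-clothes latent; a small alpha_bar_T keeps the start mostly noise
# while retaining a faint clothes-dependent component.
start = clothes_informed_start([0.5, -1.0, 2.0], alpha_bar_T=0.01, rng=random.Random(0))
```

Conventional clothes-agnostic sampling corresponds to `alpha_bar_T = 0`, where the starting point carries no information about the garment.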
FLDM-VTON outperforms state-of-the-art baselines on VITON-HD and Dress Code datasets, demonstrating superior performance in generating realistic try-on images with faithful clothing details.
The proposed method effectively preserves complex style, pattern, and text on clothes, addressing limitations of previous approaches.
Ablation studies validate the contribution of each proposed component to the overall performance. |
FLDM-VTON may struggle with preserving extremely small or complex logos and patterns due to information loss during the latent diffusion process.
Future work could explore diffusion in pixel space or utilize a more robust pre-trained LDM to address this limitation. |
virtual try-on, diffusion models, faithful image generation, clothes-consistent supervision, posterior sampling |
2404.14132
Report |
CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task |
Kangzhen Yang, Tao Hu, Kexin Dai, Genggeng Chen, Yu Cao, Wei Dong, Peng Wu, Yanning Zhang, Qingsen Yan |
In real-world scenarios, images captured often suffer from blurring, noise,
and other forms of image degradation, and due to sensor limitations, people
usually can only obtain low dynamic range images. To achieve high-quality
images, researchers have attempted various image restoration and enhancement
operations on photographs, including denoising, deblurring, and high dynamic
range imaging. However, merely performing a single type of image enhancement
still cannot yield satisfactory images. To deal with this challenge, we
propose the Composite Refinement Network (CRNet), which uses multiple
exposure images. By fully integrating
information-rich multiple exposure inputs, CRNet can perform unified image
restoration and enhancement. To improve the quality of image details, CRNet
explicitly separates and strengthens high and low-frequency information through
pooling layers, using specially designed Multi-Branch Blocks for effective
fusion of these frequencies. To increase the receptive field and fully
integrate input features, CRNet employs the High-Frequency Enhancement Module,
which includes large kernel convolutions and an inverted bottleneck ConvFFN.
Our model secured third place in the first track of the Bracketing Image
Restoration and Enhancement Challenge, surpassing previous SOTA models in both
testing metrics and visual quality. |
This paper proposes Composite Refinement Network (CRNet), a novel architecture for unified image restoration and enhancement using multiple exposure images, which effectively restores high-frequency details and outperforms previous state-of-the-art methods. |
Existing methods often focus on individual image restoration or enhancement tasks and fail to adequately enhance high-frequency details, leading to unsatisfactory results. CRNet addresses this gap by unifying these tasks and improving high-frequency detail restoration. |
CRNet aligns multiple exposure images using optical flow, separates high and low-frequency information using pooling layers, and employs Multi-Branch Blocks for effective fusion. It also utilizes a Convolutional Enhancement Block with large kernel convolutions and an inverted bottleneck ConvFFN to enhance feature fusion and increase the receptive field. |
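The pooling-based frequency separation can be sketched in one dimension: average pooling followed by upsampling gives the low-frequency part, and the residual is the high-frequency part. This is a simplified illustration under assumed choices (1D signal, nearest-neighbor upsampling, stride equal to the kernel size); CRNet's actual layers operate on feature maps and are not detailed in this summary.

```python
def split_frequencies(signal: list[float], kernel: int) -> tuple[list[float], list[float]]:
    """Split a 1D signal into low- and high-frequency parts via average pooling."""
    if len(signal) % kernel != 0:
        raise ValueError("signal length must be a multiple of kernel")
    # Average pooling with stride == kernel keeps only the coarse content.
    pooled = [sum(signal[i:i + kernel]) / kernel for i in range(0, len(signal), kernel)]
    # Nearest-neighbor upsampling back to the input length.
    low = [value for value in pooled for _ in range(kernel)]
    # The residual carries the fine detail that pooling removed.
    high = [s - l for s, l in zip(signal, low)]
    return low, high


low, high = split_frequencies([1.0, 3.0, 5.0, 7.0], kernel=2)
```

Because `low + high` reproduces the input exactly, the two branches can be processed separately and fused again without losing information.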
CRNet achieves state-of-the-art performance on the Bracketing Image Restoration and Enhancement Challenge dataset, surpassing previous methods in both visual quality and evaluation metrics.
Ablation studies demonstrate the effectiveness of each module in CRNet, highlighting the importance of frequency separation, Multi-Branch Blocks, and the Convolutional Enhancement Block.
CRNet secured third place in track 1 of the Bracketing Image Restoration and Enhancement Challenge, exhibiting significantly lower computational costs compared to other top-ranking models. |
The model's performance could be further investigated on a wider range of real-world datasets with diverse degradation types.
Exploring alternative frequency separation and fusion techniques may lead to further improvements in image quality. |
image restoration, image enhancement, high dynamic range (hdr) imaging, deep learning, multi-exposure fusion |
2404.14055
Report |
RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification |
Hai Ci, Pei Yang, Yiren Song, Mike Zheng Shou |
We revisit Tree-Ring Watermarking, a recent diffusion model watermarking
method that demonstrates great robustness to various attacks. We conduct an
in-depth study on it and reveal that the distribution shift unintentionally
introduced by the watermarking process, apart from watermark pattern matching,
contributes to its exceptional robustness. Our investigation further exposes
inherent flaws in its original design, particularly in its ability to identify
multiple distinct keys, where distribution shift offers no assistance. Based on
these findings and analysis, we present RingID for enhanced multi-key
identification. It consists of a novel multi-channel heterogeneous watermarking
approach designed to seamlessly amalgamate distinctive advantages from diverse
watermarks. Coupled with a series of suggested enhancements, RingID exhibits
substantial advancements in multi-key identification. Github Page:
https://github.com/showlab/RingID |
This paper revisits Tree-Ring Watermarking and identifies an overlooked factor contributing to its robustness: distribution shift introduced during watermark imprinting. The paper further reveals vulnerabilities in Tree-Ring's ability to identify multiple keys, particularly under image transformations like rotation and cropping/scaling, and proposes RingID, an enhanced watermarking method for improved multi-key identification. |
Identifying the source and authenticity of AI-generated images, especially with the rise of advanced diffusion models, is crucial for copyright protection and combating malicious uses. |
The authors analyze the impact of distribution shift on Tree-Ring's performance under different attacks. They propose RingID, which leverages a multi-channel heterogeneous watermarking framework, discretization, and lossless imprinting for enhanced distinguishability and robustness. |
Distribution shift, stemming from discarding the imaginary part during watermarking, significantly contributes to Tree-Ring's robustness in verification tasks, particularly against rotation and cropping/scaling.
Tree-Ring shows limited effectiveness in identifying multiple keys, particularly under attacks.
RingID significantly outperforms Tree-Ring in multi-key identification while maintaining comparable image generation quality. |
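The effect of discarding the imaginary part can be reproduced in a toy setting. The sketch below overwrites a few Fourier bins of real Gaussian noise with a key, transforms back, and keeps only the real part. Tree-Ring operates on 2D latent noise with ring-shaped patterns; the 1D signal, the naive DFT, the key value, and the bin choice are simplifying assumptions for illustration.

```python
import cmath
import random


def dft(values: list[complex], inverse: bool = False) -> list[complex]:
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def imprint(noise: list[float], key: complex, bins: list[int]) -> list[complex]:
    """Overwrite selected frequency bins with a key and return to the signal domain."""
    spectrum = dft([complex(v) for v in noise])
    for b in bins:
        spectrum[b] = key  # conjugate-symmetric partners are left untouched
    return dft(spectrum, inverse=True)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(64)]
key = complex(8.0, 0.0)
marked = imprint(noise, key, bins=[3, 4, 5])

# Keeping only the real part forces conjugate symmetry onto the spectrum,
# so the watermarked bins no longer hold the key exactly.
real_only = [v.real for v in marked]
recovered = dft([complex(v) for v in real_only])
```

In this toy case each watermarked bin of `recovered` ends up as the average of the key and the original noise coefficient, which is one concrete way the imprinted noise departs from both the clean key and the original noise distribution.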
Both Tree-Ring and RingID remain vulnerable to cropping and scaling attacks in multi-key identification scenarios.
Future work could explore different transform domains for enhanced robustness against cropping and scaling. |
diffusion models, tree-ring watermarking, multi-key identification, watermarking, copyright protection |
2404.14044
Report |
HashPoint: Accelerated Point Searching and Sampling for Neural Rendering |
Jiahao Ma, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen |
In this paper, we address the problem of efficient point searching and
sampling for volume neural rendering. Within this realm, two typical approaches
are employed: rasterization and ray tracing. The rasterization-based methods
enable real-time rendering at the cost of increased memory and lower fidelity.
In contrast, the ray-tracing-based methods yield superior quality but demand
longer rendering time. We address this trade-off with HashPoint, which combines
these two strategies, leveraging rasterization for efficient point searching
and sampling, and ray marching for rendering. Our method optimizes point
searching by rasterizing points within the camera's view, organizing them in a
hash table, and facilitating rapid searches. Notably, we accelerate the
rendering process by adaptive sampling on the primary surface encountered by
the ray. Our approach yields substantial speed-up for a range of
state-of-the-art ray-tracing-based methods, maintaining equivalent or superior
accuracy across synthetic and real test datasets. The code will be available at
https://jiahao-ma.github.io/hashpoint/. |
Presents HashPoint, a novel method that combines rasterization and ray tracing for efficient point searching and adaptive sampling in neural rendering. |
Addresses the limitations of existing point cloud rendering methods that are either fast but low-fidelity (rasterization-based) or high-quality but slow (ray-tracing-based). |
Transforms the 3D point cloud search to a 2D image plane for efficient hash table lookup and introduces adaptive primary surface sampling based on distance to the viewpoint and point cloud distribution. |
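The idea of moving the search onto the image plane can be sketched as follows: project each point with a pinhole camera, bucket it by pixel in a hash table, and answer a ray query with a single lookup. The camera model, the function names, and the depth-sorted buckets are assumptions for illustration and are not taken from the HashPoint implementation.

```python
import math
from collections import defaultdict


def build_pixel_hash(points: list[tuple[float, float, float]], focal: float,
                     width: int, height: int) -> dict[tuple[int, int], list[tuple[float, int]]]:
    """Bucket camera-space points by the pixel they project to, nearest first."""
    table: dict[tuple[int, int], list[tuple[float, int]]] = defaultdict(list)
    for index, (x, y, z) in enumerate(points):
        if z <= 0.0:
            continue  # behind the camera
        u = math.floor(focal * x / z + width / 2)
        v = math.floor(focal * y / z + height / 2)
        if 0 <= u < width and 0 <= v < height:
            table[(u, v)].append((z, index))
    for bucket in table.values():
        bucket.sort()  # ascending depth, so the first surface comes first
    return table


def points_on_ray(table: dict[tuple[int, int], list[tuple[float, int]]],
                  u: int, v: int) -> list[int]:
    """Indices of points hit by the ray through pixel (u, v), ordered by depth."""
    return [index for _, index in table.get((u, v), [])]


# Hypothetical camera-space points: two on the central ray, one off-screen, one behind.
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (10.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
table = build_pixel_hash(points, focal=1.0, width=4, height=4)
central = points_on_ray(table, 2, 2)
```

Keeping each bucket sorted by depth also makes it cheap to focus samples on the first surface a ray encounters.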
Achieves up to 80x speedup compared to existing ray-tracing methods like Point-NeRF while maintaining similar visual quality.
Outperforms traditional point cloud search methods (Uniform Grid, K-d tree, Octree) in efficiency for ray casting.
Demonstrates robust performance on various datasets (Synthetic-NeRF, Waymo, Replica, ShapeNet). |
Current implementation requires multi-surface sampling during initial optimization due to gradient propagation issues.
The β parameter, which controls the sampling scope, is fixed; future work could adjust it dynamically based on geometry noise and optimization progress. |
neural rendering, point cloud, ray tracing, rasterization, adaptive sampling |
2404.14037
Report |
GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting |
Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu |
Recent works on audio-driven talking head synthesis using Neural Radiance
Fields (NeRF) have achieved impressive results. However, due to inadequate pose
and expression control caused by NeRF implicit representation, these methods
still have some limitations, such as unsynchronized or unnatural lip movements,
and visual jitter and artifacts. In this paper, we propose GaussianTalker, a
novel method for audio-driven talking head synthesis based on 3D Gaussian
Splatting. With the explicit representation property of 3D Gaussians, intuitive
control of the facial motion is achieved by binding Gaussians to 3D facial
models. GaussianTalker consists of two modules, Speaker-specific Motion
Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator
achieves accurate lip movements specific to the target speaker through
universalized audio feature extraction and customized lip motion generation.
Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance
facial detail representation via a latent pose, delivering stable and realistic
rendered videos. Extensive experimental results suggest that GaussianTalker
outperforms existing state-of-the-art methods in talking head synthesis,
delivering precise lip synchronization and exceptional visual quality. Our
method achieves rendering speeds of 130 FPS on an NVIDIA RTX 4090 GPU,
significantly exceeding the threshold for real-time rendering performance, and
can potentially be deployed on other hardware platforms. |
This paper presents GaussianTalker, a novel audio-driven talking head synthesis framework that uses 3D Gaussian Splatting bound to the FLAME model to generate realistic videos with accurate lip synchronization. |
Existing methods struggle with unnatural lip movements, visual jitters, and artifacts due to limitations in pose and expression control with implicit representations like NeRF. |
GaussianTalker uses a Speaker-specific Motion Translator for natural lip movements by decoupling identity information and using personalized embeddings. It also employs a Dynamic Gaussian Renderer with Speaker-specific BlendShapes to refine facial details and enhance visual realism. |
Outperforms state-of-the-art methods in image quality (PSNR, SSIM, LPIPS, FID) and lip synchronization (LMD, LSE-C, LSE-D).
Achieves ultra-high rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, enabling real-time performance.
Demonstrates strong generalization capability across different speakers, languages, and audio inputs. |
The lack of teeth in the original FLAME model necessitates manual additions, which may not fully capture dental details.
Further exploration is needed to extend the approach beyond talking head synthesis, capturing a wider range of body movements and expressions. |
talking head synthesis, 3d gaussian splatting, speaker-specific, facial animation, real-time rendering |
2404.14007
Report |
Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting |
Weili Zeng, Yichao Yan, Qi Zhu, Zhuo Chen, Pengzhi Chu, Weiming Zhao, Xiaokang Yang |
Text-to-image (T2I) customization aims to create images that embody specific
visual concepts delineated in textual descriptions. However, existing works
still face a major challenge: concept overfitting. To tackle this challenge, we
first analyze overfitting, categorizing it into concept-agnostic overfitting,
which undermines non-customized concept knowledge, and concept-specific
overfitting, which confines customization to limited modalities, i.e.,
backgrounds, layouts, and styles. To evaluate the degree of overfitting, we further
introduce two metrics, i.e., Latent Fisher divergence and Wasserstein metric, to
measure the distribution changes of non-customized and customized concepts,
respectively. Drawing from the analysis, we propose Infusion, a T2I
customization method that enables the learning of target concepts to avoid
being constrained by limited training modalities, while preserving
non-customized knowledge. Notably, Infusion achieves this with
remarkable efficiency, requiring a mere 11 KB of trained parameters. Extensive
experiments also demonstrate that our approach outperforms state-of-the-art
methods in both single and multi-concept customized generation. |
This paper presents "Infusion," a text-to-image customization method that leverages the generative capabilities of foundational models while mitigating concept overfitting. |
Existing T2I customization methods struggle with concept overfitting, which limits their ability to generate diverse and imaginative images that incorporate specific visual concepts. |
Infusion decouples attention maps and value features in cross-attention modules. It preserves the foundational model's attention maps for layout and posture diversity, while learning residual value embeddings for customized concepts. |
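The decoupling described above can be sketched with a tiny cross-attention routine in which the attention weights depend only on queries and keys, while a learned residual is added to the values. The function names and the toy dimensions are assumptions for illustration; how Infusion parameterizes and trains its residual embeddings is not specified in this summary.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def cross_attention(queries, keys, values, value_residual=None):
    """Return (outputs, weights); the optional residual changes values only."""
    if value_residual is not None:
        values = [[v + r for v, r in zip(v_row, r_row)]
                  for v_row, r_row in zip(values, value_residual)]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


# One image query attending over two text tokens; token 1 plays the customized concept.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0]]
base_out, base_weights = cross_attention(queries, keys, values)
custom_out, custom_weights = cross_attention(queries, keys, values,
                                             value_residual=[[0.0], [1.0]])
```

Because the weights are identical in both calls, the layout implied by the attention map is preserved while the content injected for the concept token changes.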
Infusion demonstrates superior performance in generating imaginative and concept-faithful images compared to state-of-the-art methods.
It effectively mitigates both concept-agnostic and concept-specific overfitting.
Infusion offers a lightweight and plug-and-play solution for single- and multi-concept customization. |
Infusion might face limitations in preserving intricate textures when high fidelity is required.
Future work could explore training strategies that optimize the balance between diversity and fidelity for specific customization tasks. |
text-to-image generation, t2i customization, concept overfitting, diffusion models, cross-attention |
2404.13984
Report |
RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance |
Chengrui Wang, Pengfei Liu, Min Zhou, Ming Zeng, Xubin Li, Tiezheng Ge, Bo Zheng |
Although diffusion models can generate high-quality human images, their
applications are limited by the instability in generating hands with correct
structures. Some previous works mitigate the problem by considering hand
structure yet struggle to maintain style consistency between refined malformed
hands and other image regions. In this paper, we aim to solve the problem of
inconsistency regarding hand structure and style. We propose a conditional
diffusion-based framework RHanDS to refine the hand region with the help of
decoupled structure and style guidance. Specifically, the structure guidance is
the hand mesh reconstructed from the malformed hand, serving to correct the
hand structure. The style guidance is a hand image, e.g., the malformed hand
itself, and is employed to furnish the style reference for hand refining. In
order to suppress the structure leakage when referencing hand style and
effectively utilize hand data to improve the capability of the model, we build
a multi-style hand dataset and introduce a two-stage training strategy. In the
first stage, we use paired hand images for training to generate hands with the
same style as the reference. In the second stage, various hand images generated
based on the human mesh are used for training to enable the model to gain
control over the hand structure. We evaluate our method and counterparts on the
test dataset of the proposed multi-style hand dataset. The experimental results
show that RHanDS refines hands with correct structure and style more
effectively than previous methods. The code and datasets will be available soon. |
RHanDS, a novel diffusion-based framework that refines malformed hands in generated images by leveraging decoupled structure and style guidance. |
Existing diffusion models struggle to generate hands with correct structures while maintaining style consistency with the rest of the image. |
RHanDS uses a two-stage training strategy: first learning style guidance from paired hand images and then learning structure guidance from hand-mesh pairs. It utilizes a hand mesh reconstructed from the malformed hand for structure guidance and a separate hand image for style guidance. |
RHanDS effectively refines hands with correct structure and consistent style compared to previous methods.
A user study confirms that RHanDS produces more preferred results with better style consistency and structure quality.
The two-stage training strategy is crucial for achieving both accurate structure and style preservation. |
RHanDS may struggle with specific styles or complex hand configurations, such as hands wearing gloves or holding objects.
Automatic hand mesh reconstruction can fail in some cases, requiring manual intervention. |
malformed hand refining, diffusion models, conditional generation, hand structure, hand style |
2404.13944
Report |
Gorgeous: Create Your Desired Character Facial Makeup from Any Ideas |
Jia Wei Sii, Chee Seng Chan |
Contemporary makeup transfer methods primarily focus on replicating makeup
from one face to another, considerably limiting their use in creating diverse
and creative character makeup essential for visual storytelling. Such methods
typically fail to address the need for uniqueness and contextual relevance,
specifically aligning with character and story settings as they depend heavily
on existing facial makeup in reference images. This approach also presents a
significant challenge when attempting to source a perfectly matched facial
makeup style, further complicating the creation of makeup designs inspired by
various story elements, such as theme, background, and props that do not
necessarily feature faces. To address these limitations, we introduce
$Gorgeous$, a novel diffusion-based makeup application method that goes beyond
simple transfer by innovatively crafting unique and thematic facial makeup.
Unlike traditional methods, $Gorgeous$ does not require the presence of a face
in the reference images. Instead, it draws artistic inspiration from a minimal
set of three to five images, which can be of any type, and transforms these
elements into practical makeup applications directly on the face. Our
comprehensive experiments demonstrate that $Gorgeous$ can effectively generate
distinctive character facial makeup inspired by the chosen thematic reference
images. This approach opens up new possibilities for integrating broader story
elements into character makeup, thereby enhancing the narrative depth and
visual impact in storytelling. |
$Gorgeous$, a novel diffusion-based makeup application method that creates unique and thematic facial makeup from a minimal set of 3-5 reference images, regardless of whether the images contain faces. |
Existing makeup transfer methods are limited to replicating existing makeup looks from source faces, hindering creativity and diversity in character design for visual storytelling. |
Gorgeous uses three components: (i) MaFor Module: learns makeup knowledge and preserves facial identity using ControlNet; (ii) CSL Module: encodes artistic elements from reference images into text embeddings using textual inversion; (iii) MaIP Pipeline: combines MaFor and CSL to apply makeup seamlessly on the face using an inpainting-like approach. |
Gorgeous generates more unique and diverse character facial makeups compared to traditional makeup transfer methods.
Gorgeous can effectively adapt makeup styles from non-facial images, overcoming the limitations of existing methods relying on face parsing.
User study (N=100) showed a strong preference for makeups generated by Gorgeous, highlighting its ability to generate appealing and relevant character makeups. |
Current evaluation metrics for makeup assessment are limited, focusing on global style rather than makeup-specific nuances like color accuracy and texture fidelity.
Future work will focus on developing new metrics specifically designed to evaluate makeup style similarity, considering factors like color harmony, textural alignment, and contextual relevance. |
makeup generation, character design, diffusion models, textual inversion, image inpainting |
2404.13923
Report |
MaterialSeg3D: Segmenting Dense Materials from 2D Priors for 3D Assets |
Zeyu Li, Ruitong Gan, Chuanchen Luo, Yuxi Wang, Jiaheng Liu, Ziwei Zhu, Man Zhang, Qing Li, Xucheng Yin, Zhaoxiang Zhang, Junran Peng |
Driven by powerful image diffusion models, recent research has achieved the
automatic creation of 3D objects from textual or visual guidance. By performing
score distillation sampling (SDS) iteratively across different views, these
methods succeed in lifting 2D generative prior to the 3D space. However, such a
2D generative image prior bakes the effect of illumination and shadow into the
texture. As a result, material maps optimized by SDS inevitably involve
spurious correlated components. The absence of precise material definition
makes it infeasible to relight the generated assets reasonably in novel scenes,
which limits their application in downstream scenarios. In contrast, humans can
effortlessly circumvent this ambiguity by deducing the material of the object
from its appearance and semantics. Motivated by this insight, we propose
MaterialSeg3D, a 3D asset material generation framework to infer underlying
material from the 2D semantic prior. Based on such a prior model, we devise a
mechanism to parse material in 3D space. We maintain a UV stack, each map of
which is unprojected from a specific viewpoint. After traversing all
viewpoints, we fuse the stack through a weighted voting scheme and then employ
region unification to ensure the coherence of the object parts. To fuel the
learning of semantics prior, we collect a material dataset, named Materialized
Individual Objects (MIO), which features abundant images, diverse categories,
and accurate annotations. Extensive quantitative and qualitative experiments
demonstrate the effectiveness of our method. |
This paper introduces MaterialSeg3D, a novel workflow that leverages 2D material priors to generate accurate and realistic surface materials for 3D assets, addressing the limitations of existing methods that struggle with realistic material generation. |
High-quality PBR materials are crucial for 3D assets to appear realistic under various lighting conditions, but existing 3D asset generation methods often lack accurate material information or struggle to generate realistic materials. |
The method employs a multi-view rendering approach, generating images of the 3D asset from various angles. These renderings are then fed into a material segmentation model trained on a novel dataset called Materialized Individual Objects (MIO). This dataset contains single-object images with dense material semantic annotations. Finally, the predicted material labels from different views are projected back onto the UV map and fused using a weighted voting mechanism. |
MaterialSeg3D effectively generates accurate and realistic surface materials for 3D assets, outperforming existing methods.
The proposed MIO dataset, with its diverse camera angles and material annotations, proves valuable for training the material segmentation model.
The weighted voting mechanism effectively combines material predictions from different views, ensuring accurate material assignment on the 3D asset's surface. |
The current implementation relies on 3D assets with pre-existing Albedo UV maps, limiting its applicability to assets without such information.
The quality of the generated surface material is influenced by the quality of the input mesh; low-quality meshes can lead to less accurate results. |
3d asset generation, surface material generation, material segmentation, multi-view rendering, pbr materials |
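The weighted-voting fusion described in the MaterialSeg3D entry above can be illustrated with a minimal sketch. It assumes each view contributes one material label per UV texel (or `None` where the texel is not visible) and one scalar weight per view. The function name `fuse_uv_stack`, the labels, and the weights are hypothetical; the entry does not say how the paper computes its weights, and the region-unification step is omitted.

```python
from collections import defaultdict

def fuse_uv_stack(uv_stack, view_weights):
    """Fuse per-view material label maps into one UV map by weighted voting.

    uv_stack: list of 2D lists (one per view); each entry is a material
              label, or None where the texel is not visible from that view.
    view_weights: one non-negative weight per view (hypothetical values).
    """
    height, width = len(uv_stack[0]), len(uv_stack[0][0])
    fused = [[None] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            votes = defaultdict(float)
            for view_map, weight in zip(uv_stack, view_weights):
                label = view_map[v][u]
                if label is not None:
                    votes[label] += weight
            if votes:
                # Keep the label with the largest accumulated weight.
                fused[v][u] = max(votes, key=votes.get)
    return fused

# Three hypothetical views of a 2x2 UV map.
uv_stack = [
    [["metal", "wood"], [None, "wood"]],
    [["metal", "metal"], ["glass", None]],
    [["wood", "metal"], ["glass", "wood"]],
]
fused = fuse_uv_stack(uv_stack, view_weights=[1.0, 0.5, 0.75])
print(fused)
```

Texels that no view observes stay `None`, so a later pass (for example the region unification mentioned in the abstract) would still need to fill them.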
2404.13903
Report |
Accelerating Image Generation with Sub-path Linear Approximation Model |
Chen Xu, Tianhui Song, Weixin Feng, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang |
Diffusion models have significantly advanced the state of the art in image,
audio, and video generation tasks. However, their applications in practical
scenarios are hindered by slow inference speed. Drawing inspiration from the
approximation strategies utilized in consistency models, we propose the
Sub-path Linear Approximation Model (SLAM), which accelerates diffusion models
while maintaining high-quality image generation. SLAM treats the PF-ODE
trajectory as a series of PF-ODE sub-paths divided by sampled points, and
harnesses sub-path linear (SL) ODEs to form a progressive and continuous error
estimation along each individual PF-ODE sub-path. The optimization on such
SL-ODEs allows SLAM to construct denoising mappings with smaller cumulative
approximated errors. An efficient distillation method is also developed to
facilitate the incorporation of more advanced diffusion models, such as latent
diffusion models. Our extensive experimental results demonstrate that SLAM
achieves an efficient training regimen, requiring only 6 A100 GPU days to
produce a high-quality generative model capable of 2 to 4-step generation with
high performance. Comprehensive evaluations on LAION, MS COCO 2014, and MS COCO
2017 datasets also illustrate that SLAM surpasses existing acceleration methods
in few-step generation tasks, achieving state-of-the-art performance both on
FID and the quality of the generated images. |
This paper introduces SLAM (Sub-path Linear Approximation Model) which accelerates diffusion models while preserving high-quality image generation. |
Diffusion models, despite impressive results, suffer from slow inference speed, hindering practical use. SLAM addresses this by accelerating generation without compromising quality. |
SLAM divides the Probability Flow ODE trajectory into sub-paths and approximates them with linear ODEs. This allows for a more nuanced optimization of denoising mappings, reducing cumulative errors. |
SLAM outperforms existing acceleration methods in few-step generation on FID and image quality across LAION, MS COCO 2014, and MS COCO 2017 datasets.
The method exhibits efficient training, needing only 6 A100 GPU days for a high-quality generative model capable of 2 to 4-step generation.
SLAM consistently achieves smaller denoising mapping errors compared to methods like LCM, especially at larger timesteps, as evidenced by quantitative analysis. |
The paper primarily focuses on text-to-image generation, leaving exploration of other modalities for future work.
While SLAM mitigates the limitations of large skipping step sizes, further investigation into optimal step size selection strategies is warranted. |
diffusion models, accelerating diffusion models, diffusion model distillation, consistency models, image generation |
2404.13896
Report |
CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory |
Yunlong Ran, Yanxu Li, Qi Ye, Yuchi Huo, Zechun Bai, Jiahao Sun, Jiming Chen |
Neural radiance field (NeRF) has achieved impressive results in high-quality
3D scene reconstruction. However, NeRF heavily relies on precise camera poses.
While recent works like BARF have introduced camera pose optimization within
NeRF, their applicability is limited to simple trajectory scenes. Existing
methods struggle while tackling complex trajectories involving large rotations.
To address this limitation, we propose CT-NeRF, an incremental reconstruction
optimization pipeline using only RGB images without pose and depth input. In
this pipeline, we first propose a local-global bundle adjustment under a pose
graph connecting neighboring frames to enforce the consistency between poses to
escape the local minima caused by only pose consistency with the scene
structure. Further, we instantiate the consistency between poses as a
reprojected geometric image distance constraint resulting from pixel-level
correspondences between input image pairs. Through the incremental
reconstruction, CT-NeRF enables the recovery of both camera poses and scene
structure and is capable of handling scenes with complex trajectories. We
evaluate the performance of CT-NeRF on two real-world datasets, NeRFBuster and
Free-Dataset, which feature complex trajectories. Results show CT-NeRF
outperforms existing methods in novel view synthesis and pose estimation
accuracy. |
This paper proposes CT-NeRF, an incremental reconstruction optimization pipeline that jointly optimizes neural radiance fields and camera poses using only RGB images, particularly addressing challenges in scenes with complex trajectories involving large rotations. |
Existing NeRF-based methods often struggle with complex trajectories due to reliance on precise camera poses or limitations in handling large rotations. This work aims to address this gap and enable accurate 3D scene reconstruction in such challenging scenarios. |
The method introduces a local-global bundle adjustment with pose graphs connecting neighboring frames, enforcing pose consistency beyond just the scene structure. A reprojected geometric image distance constraint, derived from learned correspondences between image pairs, is used to robustly optimize poses and scene geometry. |
CT-NeRF significantly outperforms state-of-the-art methods in pose estimation accuracy on datasets with complex trajectories, as demonstrated by lower rotation and translation errors.
The method achieves high-quality novel view synthesis, even in challenging scenarios with arbitrary trajectory variations and reduced frame overlap.
Ablation studies validate the importance of each component, particularly the reprojection loss and the incremental optimization strategy, in achieving accurate and robust results. |
The current work explores simple pose graphs, and investigating more sophisticated graph optimization techniques could be beneficial for very long trajectories.
The paper highlights the need for dedicated evaluation datasets, protocols, and metrics specifically designed for complex camera trajectories to better assess reconstruction quality. |
neural radiance fields, pose estimation, structure from motion, incremental optimization, complex trajectories |
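The reprojected geometric image distance mentioned in the CT-NeRF entry above can be sketched as follows. This is a generic pinhole-camera reprojection error and not the authors' implementation. It assumes a matched pixel pair, a depth value for the first pixel (for example rendered from the radiance field), shared intrinsics, and a relative pose from frame i to frame j. All names and numbers are hypothetical.

```python
import math

def reprojection_distance(p_i, depth_i, p_j, K, R_ij, t_ij):
    """Pixel distance between p_j and the reprojection of p_i into frame j.

    p_i, p_j: (u, v) pixel coordinates of a matched pair.
    depth_i:  depth of p_i along the camera z-axis in frame i.
    K:        (fx, fy, cx, cy) pinhole intrinsics shared by both frames.
    R_ij, t_ij: rotation (3x3 nested list) and translation (3-list) mapping
              points from camera i coordinates to camera j coordinates.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in camera i coordinates.
    x = (p_i[0] - cx) / fx * depth_i
    y = (p_i[1] - cy) / fy * depth_i
    point_i = (x, y, depth_i)
    # Rigidly transform the point into camera j coordinates.
    point_j = [
        sum(R_ij[r][c] * point_i[c] for c in range(3)) + t_ij[r]
        for r in range(3)
    ]
    # Project into image j and measure the distance to the matched pixel.
    u = fx * point_j[0] / point_j[2] + cx
    v = fy * point_j[1] / point_j[2] + cy
    return math.hypot(u - p_j[0], v - p_j[1])

# Hypothetical setup: identity rotation and a small sideways translation.
K = (100.0, 100.0, 50.0, 50.0)
R_ij = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_ij = [0.2, 0.0, 0.0]
error = reprojection_distance((60.0, 50.0), 2.0, (73.0, 54.0), K, R_ij, t_ij)
print(error)
```

In a joint optimization, a term like this would be summed over correspondences between neighboring frames in the pose graph and minimized with respect to the poses.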
2404.13816
Report |
Neural Radiance Field in Autonomous Driving: A Survey |
Lei He, Leheng Li, Wenchao Sun, Zeyu Han, Yichen Liu, Sifa Zheng, Jianqiang Wang, Keqiang Li |
Neural Radiance Field (NeRF) has garnered significant attention from both
academia and industry due to its intrinsic advantages, particularly its
implicit representation and novel view synthesis capabilities. With the rapid
advancements in deep learning, a multitude of methods have emerged to explore
the potential applications of NeRF in the domain of Autonomous Driving (AD).
However, a conspicuous void is apparent within the current literature. To
bridge this gap, this paper conducts a comprehensive survey of NeRF's
applications in the context of AD. Our survey is structured to categorize
NeRF's applications in Autonomous Driving (AD), specifically encompassing
perception, 3D reconstruction, simultaneous localization and mapping (SLAM),
and simulation. We delve into in-depth analysis and summarize the findings for
each application category, and conclude by providing insights and discussions
on future directions in this field. We hope this paper serves as a
comprehensive reference for researchers in this domain. To the best of our
knowledge, this is the first survey specifically focused on the applications of
NeRF in the Autonomous Driving domain. |
This paper presents the first comprehensive survey of Neural Radiance Fields (NeRF) applications in autonomous driving, encompassing perception, 3D reconstruction, SLAM, and simulation. |
NeRF's implicit representation and novel view synthesis capabilities hold significant potential for enhancing autonomous driving technologies, prompting a surge of research in this area. |
The authors systematically categorize and analyze existing NeRF-based methods across various autonomous driving applications, summarizing key features and limitations. |
NeRF proves valuable for data augmentation in perception tasks, generating realistic training data and mitigating the sim-to-real gap.
In 3D reconstruction, NeRF facilitates dynamic scene reconstruction, surface reconstruction, and inverse rendering, enabling applications like relighting and object insertion.
NeRF-based SLAM methods demonstrate progress in pose estimation, scene representation, and handling depth uncertainty, with applications in localization and mapping. |
Current NeRF-based methods for autonomous driving often face computational challenges, particularly in high-dynamic scenarios and large-scale environments.
Further research is needed to address limitations in reconstructing non-rigid objects, handling severe light conditions, and ensuring real-time performance in complex driving scenarios. |
neural radiance fields, autonomous driving, perception, 3d reconstruction, slam, simulation |
2404.13784
Report |
Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images |
Ali Naseh, Katherine Thai, Mohit Iyyer, Amir Houmansadr |
With the digital imagery landscape rapidly evolving, image stocks and
AI-generated image marketplaces have become central to visual media.
Traditional stock images now exist alongside innovative platforms that trade in
prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3
and Midjourney. This paper studies the possibility of employing multi-modal
models with enhanced visual understanding to mimic the outputs of these
platforms, introducing an original attack strategy. Our method leverages
fine-tuned CLIP models, a multi-label classifier, and the descriptive
capabilities of GPT-4V to create prompts that generate images similar to those
available in marketplaces and from premium stock image providers, yet at a
markedly lower expense. In presenting this strategy, we aim to spotlight a new
class of economic and security considerations within the realm of digital
imagery. Our findings, supported by both automated metrics and human
assessment, reveal that comparable visual content can be produced for a
fraction of the prevailing market prices ($0.23 - $0.27 per image), emphasizing
the need for awareness and strategic discussions about the integrity of digital
media in an increasingly AI-integrated landscape. Our work also contributes to
the field by assembling a dataset consisting of approximately 19 million
prompt-image pairs generated by the popular Midjourney platform, which we plan
to release publicly. |
This paper introduces a novel attack strategy using multi-modal models to generate images similar to those in AI-generated image marketplaces and stock photo websites, at a fraction of the cost. |
This work exposes a vulnerability in the digital imagery landscape, highlighting the economic and security implications of AI-generated images and the potential for misuse. |
The proposed method utilizes a fine-tuned CLIP model, a multi-label classifier for extracting keywords and modifiers, and GPT-4V for generating refined prompts based on image analysis. |
The attack successfully generates comparable images for a significantly lower cost (\$0.23 - \$0.27 per image).
The method outperforms baseline models like BLIP2 and CLIP Interrogator in image similarity tests.
A large-scale dataset of 19 million prompt-image pairs from Midjourney was collected and will be publicly released. |
The success of the attack relies heavily on the performance of individual components (e.g., CLIP, GPT-4V), which can be unpredictable.
Future work can explore the refinement of each component and investigate the generalization of the attack to other text-to-image models. |
ai-generated images, text-to-image synthesis, prompt engineering, digital image integrity, multi-modal learning |
2404.13766
Report |
Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control |
Maria Mihaela Trusca, Wolf Nuyts, Jonathan Thomm, Robert Honig, Thomas Hofmann, Tinne Tuytelaars, Marie-Francine Moens |
Current diffusion models create photorealistic images given a text prompt as
input but struggle to correctly bind attributes mentioned in the text to the
right objects in the image. This is evidenced by our novel image-graph
alignment model called EPViT (Edge Prediction Vision Transformer) for the
evaluation of image-text alignment. To alleviate the above problem, we propose
focused cross-attention (FCA) that controls the visual attention maps by
syntactic constraints found in the input sentence. Additionally, the syntax
structure of the prompt helps to disentangle the multimodal CLIP embeddings
that are commonly used in T2I generation. The resulting DisCLIP embeddings and
FCA are easily integrated in state-of-the-art diffusion models without
additional training of these models. We show substantial improvements in T2I
generation and especially its attribute-object binding on several
datasets. Code and data will be made available upon acceptance. |
This paper proposes two novel training-free methods, focused cross-attention (FCA) and disentangled CLIP encoding (DisCLIP), for improving object-attribute binding in text-to-image synthesis by leveraging syntactic structure of text prompts. |
Existing diffusion models excel at generating photorealistic images but struggle to accurately bind attributes to objects in multi-object text prompts, leading to incorrect or nonsensical image generation. |
FCA utilizes syntactic dependencies to focus attribute attention within corresponding object regions during image generation. DisCLIP generates disentangled text prompt representations using a constituency tree encoding compositional information and object-attribute bindings. Both methods are seamlessly integrated into existing diffusion models without requiring retraining. |
FCA and DisCLIP effectively improve object-attribute binding and reduce attribute leakage as evidenced by improved performance on DAA-200, CC-500, and AE-276 benchmarks.
A novel evaluation metric, EPViT, based on a ViT model trained to predict image-graph alignment, outperforms CLIP in assessing object-attribute binding accuracy.
Integration of FCA and DisCLIP into various state-of-the-art diffusion models consistently enhances their performance without degrading image quality on general text prompts. |
Current EPViT training and FCA application focus solely on object-attribute binding, with potential for expansion to other syntactic relationships.
The effectiveness of the proposed methods depends on the accuracy and expressiveness of syntactic parsers, potentially limiting performance when dealing with complex linguistic structures or languages with limited parsing capabilities. |
text-to-image synthesis, diffusion models, object-attribute binding, syntactic structure, image-text alignment |
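The idea behind focused cross-attention in the entry above, keeping an attribute's attention inside the region of the object it modifies, can be illustrated with a small sketch. It assumes attention maps are given as flat lists over image patches and that attribute-object pairs come from a syntactic parse of the prompt. The thresholding rule, the function name `focus_attribute_attention`, and the example values are assumptions for illustration, not the paper's exact formulation.

```python
def focus_attribute_attention(attn, bindings, threshold=0.5):
    """Restrict each attribute token's attention to its object's region.

    attn: dict mapping token -> list of attention values over image patches.
    bindings: list of (attribute_token, object_token) pairs, e.g. taken
              from a dependency parse of the prompt.
    threshold: fraction of the object's peak attention that defines its
              region (a hypothetical choice for this sketch).
    """
    focused = {token: list(values) for token, values in attn.items()}
    for attribute, obj in bindings:
        peak = max(attn[obj])
        # Binary mask of patches where the object attends strongly.
        mask = [1.0 if a >= threshold * peak else 0.0 for a in attn[obj]]
        focused[attribute] = [a * m for a, m in zip(attn[attribute], mask)]
    return focused

# Hypothetical attention over four patches for "a red car and a blue bench".
attn = {
    "red": [0.4, 0.3, 0.2, 0.1],
    "car": [0.8, 0.6, 0.1, 0.1],
    "blue": [0.1, 0.2, 0.3, 0.4],
    "bench": [0.1, 0.1, 0.7, 0.9],
}
focused = focus_attribute_attention(attn, [("red", "car"), ("blue", "bench")])
print(focused["red"], focused["blue"])
```

Object tokens are left unchanged here; only attribute maps are masked, which is what limits attribute leakage onto other objects in this toy setup.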
2404.13706
Report |
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models |
Vitali Petsiuk, Kate Saenko |
Motivated by ethical and legal concerns, the scientific community is actively
developing methods to limit the misuse of Text-to-Image diffusion models for
reproducing copyrighted, violent, explicit, or personal information in the
generated images. Simultaneously, researchers put these newly developed safety
measures to the test by assuming the role of an adversary to find
vulnerabilities and backdoors in them. We use the compositional property of
diffusion models, which allows multiple prompts to be leveraged in a single
image generation. This property allows us to combine other concepts that should
not have been affected by the inhibition to reconstruct the vector responsible
for target concept generation, even though the direct computation of this
vector is no longer accessible. We provide theoretical and empirical evidence
of why the proposed attacks are possible and discuss the implications of these
findings for safe model deployment. We argue that it is essential to consider
all possible approaches to image generation with diffusion models that can be
employed by an adversary. Our work opens up the discussion about the
implications of concept arithmetics and compositional inference for safety
mechanisms in diffusion models.
Content Advisory: This paper contains discussions and model-generated content
that may be considered offensive. Reader discretion is advised.
Project page: https://cs-people.bu.edu/vpetsiuk/arc |
This paper presents ARC (ARithmetics in Concept space) attacks, a novel method to circumvent concept inhibition in text-to-image diffusion models by exploiting the models' compositional properties. |
Concept inhibition is crucial for preventing the misuse of diffusion models for generating harmful or copyrighted content. This work exposes vulnerabilities in existing inhibition techniques, highlighting the need for more robust solutions. |
The authors leverage the linearity of conditional guidance in diffusion models. They design attacks that use compositional inference with carefully crafted prompts to reconstruct the erased concept's guidance vector, effectively bypassing the inhibition. |
ARC attacks significantly increase the reproduction rates of inhibited concepts, even when tested against various state-of-the-art inhibition methods.
The attacks are straightforward to implement, requiring only black-box access to the model's compositional inference.
The findings demonstrate that local modifications to the model's weights are insufficient for robust concept inhibition. |
The work primarily focuses on demonstrating the existence and effectiveness of such attacks. Further research is needed to explore optimal attack strategies and defenses.
The study focuses on a limited set of concepts and inhibition methods. Evaluating the attacks on a broader range of concepts and models is important future work. |
diffusion models, concept inhibition, adversarial attacks, text-to-image generation, compositional inference |
2404.13696
Report |
Clio: Real-time Task-Driven Open-Set 3D Scene Graphs |
Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, Lukas Schmid, Luca Carlone |
Modern tools for class-agnostic image segmentation (e.g., SegmentAnything)
and open-set semantic understanding (e.g., CLIP) provide unprecedented
opportunities for robot perception and mapping. While traditional closed-set
metric-semantic maps were restricted to tens or hundreds of semantic classes,
we can now build maps with a plethora of objects and countless semantic
variations. This leaves us with a fundamental question: what is the right
granularity for the objects (and, more generally, for the semantic concepts)
the robot has to include in its map representation? While related work
implicitly chooses a level of granularity by tuning thresholds for object
detection, we argue that such a choice is intrinsically task-dependent. The
first contribution of this paper is to propose a task-driven 3D scene
understanding problem, where the robot is given a list of tasks in natural
language and has to select the granularity and the subset of objects and scene
structure to retain in its map that is sufficient to complete the tasks. We
show that this problem can be naturally formulated using the Information
Bottleneck (IB), an established information-theoretic framework. The second
contribution is an algorithm for task-driven 3D scene understanding based on an
Agglomerative IB approach, that is able to cluster 3D primitives in the
environment into task-relevant objects and regions and executes incrementally.
The third contribution is to integrate our task-driven clustering algorithm
into a real-time pipeline, named Clio, that constructs a hierarchical 3D scene
graph of the environment online using only onboard compute, as the robot
explores it. Our final contribution is an extensive experimental campaign
showing that Clio not only allows real-time construction of compact open-set 3D
scene graphs, but also improves the accuracy of task execution by limiting the
map to relevant semantic concepts. |
This paper presents Clio, a real-time system that builds task-driven 3D scene graphs with open-set semantics, clustering 3D primitives into task-relevant objects and regions using an Information Bottleneck approach. |
Current methods for building semantic maps are limited to a fixed set of concepts and don't consider the task-dependency of choosing relevant semantic concepts, which is crucial for robot perception. |
The paper leverages vision-language models (VLMs) like CLIP and task-agnostic segmentation (e.g., SegmentAnything) to cluster 3D primitives using an incremental Agglomerative Information Bottleneck algorithm, enabling real-time operation. |
Clio constructs more compact and useful scene representations compared to task-agnostic methods, retaining only task-relevant objects and regions.
It achieves comparable performance to state-of-the-art methods in closed-set object detection tasks, demonstrating its efficacy in both open and closed-set settings.
Clio enables real-time onboard mapping and supports mobile manipulation tasks on a Spot robot, showcasing its practicality for robotics applications. |
The approach inherits limitations from the foundation models used, such as vulnerability to prompt tuning.
Current implementation uses simple averaging to merge semantic descriptions of primitives, and extending it to handle more complex, multi-step tasks is desirable. |
3d scene understanding, robotics, information bottleneck, vision-language models, open-set recognition |
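The agglomerative Information Bottleneck clustering named in the Clio entry above can be sketched with the standard merge cost: the prior mass of the two clusters times the weighted Jensen-Shannon divergence between their task distributions. This is a generic sketch and not Clio's implementation. It assumes each 3D primitive already has a prior weight and a distribution over tasks (how those are derived, for example from CLIP similarities, is not shown), and all names and values are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def merge_cost(w_i, p_i, w_j, p_j):
    """Task information lost when clusters i and j are merged.

    w_i, w_j: prior probabilities of the two clusters.
    p_i, p_j: distributions over tasks given each cluster.
    """
    total = w_i + w_j
    pi_i, pi_j = w_i / total, w_j / total
    mix = [pi_i * a + pi_j * b for a, b in zip(p_i, p_j)]
    js = pi_i * kl(p_i, mix) + pi_j * kl(p_j, mix)
    return total * js

def best_merge(clusters):
    """Return (i, j, cost) for the pair of clusters that is cheapest to merge."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            cost = merge_cost(*clusters[i], *clusters[j])
            if best is None or cost < best[2]:
                best = (i, j, cost)
    return best

# Three hypothetical primitives as (prior weight, p(task | primitive)).
clusters = [
    (0.25, [0.9, 0.1]),
    (0.25, [0.85, 0.15]),
    (0.5, [0.1, 0.9]),
]
i, j, cost = best_merge(clusters)
print(i, j, cost)
```

A full agglomerative loop would repeatedly apply the cheapest merge and stop once the accumulated information loss crosses a chosen bound; that stopping rule is left out of this sketch.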
2404.13686
Report |
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis |
Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, Xuefeng Xiao |
Recently, a series of diffusion-aware distillation algorithms have emerged to
alleviate the computational overhead associated with the multi-step inference
process of Diffusion Models (DMs). Current distillation techniques often
dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii)
ODE Trajectory Reformulation. However, these approaches suffer from severe
performance degradation or domain shifts. To address these limitations, we
propose Hyper-SD, a novel framework that synergistically amalgamates the
advantages of ODE Trajectory Preservation and Reformulation, while maintaining
near-lossless performance during step compression. Firstly, we introduce
Trajectory Segmented Consistency Distillation to progressively perform
consistent distillation within pre-defined time-step segments, which
facilitates the preservation of the original ODE trajectory from a higher-order
perspective. Secondly, we incorporate human feedback learning to boost the
performance of the model in a low-step regime and mitigate the performance loss
incurred by the distillation process. Thirdly, we integrate score distillation
to further improve the low-step generation capability of the model and offer
the first attempt to leverage a unified LoRA to support the inference process
at all steps. Extensive experiments and user studies demonstrate that Hyper-SD
achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5.
For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and
+0.51 in Aes Score in the 1-step inference. |
Hyper-SD, a novel framework combining ODE Trajectory Preservation and Reformulation, accelerates diffusion models (SDXL and SD1.5) while maintaining near-lossless performance during step compression. |
Diffusion models, though powerful for Generative AI, suffer from high computational cost due to multi-step inference. Existing distillation methods for acceleration either compromise generation quality or introduce domain shifts. |
The framework leverages: (1) Trajectory Segmented Consistency Distillation for progressive, fine-grained distillation; (2) Human feedback learning to optimize the model for few-step inference; (3) Score distillation for enhanced one-step generation and a unified LoRA for all inference steps. |
Hyper-SD achieves SOTA performance for both SDXL and SD1.5 in low-step inference (1 to 8 steps) across quantitative metrics and user studies.
Hyper-SD maintains better image quality and text-image alignment than competing methods, especially for SD1.5 with limited model capacity.
Hyper-SD is compatible with ControlNet, various base models, and supports flexible inference with a unified LoRA. |
Current acceleration methods, including Hyper-SD, eliminate Classifier-Free Guidance, limiting control with negative prompts.
The use of generic reward models for human feedback can be further improved by customized ones for accelerated models. |
diffusion models, model acceleration, distillation, human feedback learning, generative ai |
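The time-step segmentation mentioned in the Hyper-SD entry above can be illustrated with a minimal sketch. It only shows how a diffusion time axis might be cut into segments, and how a timestep maps to the lower boundary of its segment, which is where a segment-wise consistency target would be enforced. The equal-width split, the function names, the 1000-step horizon, and the 8/4/2/1 schedule are assumptions for illustration, not details confirmed by this entry.

```python
def segment_boundaries(num_timesteps, num_segments):
    """Return the boundaries of equal-width segments over [0, num_timesteps].

    Assumption: equal-width segments; the paper's exact split may differ.
    """
    if num_timesteps % num_segments != 0:
        raise ValueError("num_timesteps must be divisible by num_segments")
    width = num_timesteps // num_segments
    return [i * width for i in range(num_segments + 1)]


def segment_start(t, num_timesteps, num_segments):
    """Return the lower boundary of the segment that contains timestep t.

    A timestep t in (s, s + width] maps to s, so consistency is only
    enforced inside one segment rather than across the whole trajectory.
    """
    if not 0 < t <= num_timesteps:
        raise ValueError("t must satisfy 0 < t <= num_timesteps")
    width = num_timesteps // num_segments
    return ((t - 1) // width) * width


if __name__ == "__main__":
    # Hypothetical progressive schedule: fewer, wider segments at each stage.
    for k in (8, 4, 2, 1):
        print(k, segment_boundaries(1000, k), segment_start(600, 1000, k))
```

With a single segment the mapping sends every timestep to 0, which corresponds to ordinary whole-trajectory consistency distillation.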
2404.13680
Report |
PoseAnimate: Zero-shot high fidelity pose controllable character animation |
Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, Guo-Jun Qi |
Image-to-video (I2V) generation aims to create a video sequence from a single
image, which requires high temporal coherence and visual fidelity with the
source image. However, existing approaches suffer from character appearance
inconsistency and poor preservation of fine details. Moreover, they require a
large amount of video data for training, which can be computationally
demanding. To address these limitations, we propose PoseAnimate, a novel
zero-shot I2V framework for character animation. PoseAnimate contains three key
components: 1) Pose-Aware Control Module (PACM) incorporates diverse pose
signals into conditional embeddings, to preserve character-independent content
and maintain precise alignment of actions. 2) Dual Consistency Attention Module
(DCAM) enhances temporal consistency, and retains character identity and
intricate background details. 3) Mask-Guided Decoupling Module (MGDM) refines
distinct feature perception, improving animation fidelity by decoupling the
character and background. We also propose a Pose Alignment Transition Algorithm
(PATA) to ensure smooth action transition. Extensive experimental results
demonstrate that our approach outperforms the state-of-the-art training-based
methods in terms of character consistency and detail fidelity. Moreover, it
maintains a high level of temporal coherence throughout the generated
animations. |
PoseAnimate: A zero-shot, reconstruction-based I2V framework for character animation that generates high-quality videos of arbitrary character images performing user-defined pose sequences. |
Existing I2V methods suffer from appearance inconsistency, poor detail preservation, and high computational cost due to training requirements. This work explores a training-free approach for efficient and high-fidelity character animation. |
The framework leverages a novel pose-aware control module (PACM) to optimize embeddings for pose alignment while maintaining scene consistency. It incorporates a dual consistency attention module (DCAM) for temporal coherence and identity preservation, further enhanced by a mask-guided decoupling module (MGDM) for refined detail perception. |
Outperforms state-of-the-art training-based methods in character consistency and detail fidelity.
Demonstrates superior preservation of complex fine-grained details and temporal coherence.
Achieves high-quality animation without requiring training, leading to lower computational overhead. |
Reliance on pre-trained models might limit generalization ability to unseen domains.
Further exploration of handling complex interactions between character and background. |
image animation, character animation, zero-shot learning, diffusion models, pose control |
2404.13679
Report |
GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal |
Yuxin Wang, Qianyi Wu, Guofeng Zhang, Dan Xu |
This paper tackles the intricate challenge of object removal to update the
radiance field using 3D Gaussian Splatting. The main challenges of this
task lie in the preservation of geometric consistency and the maintenance of
texture coherence given the highly discrete nature of
Gaussian primitives. We introduce a robust framework specifically designed to
overcome these obstacles. The key insight of our approach is the enhancement of
information exchange among visible and invisible areas, facilitating content
restoration in terms of both geometry and texture. Our methodology begins with
optimizing the positioning of Gaussian primitives to improve geometric
consistency across both removed and visible areas, guided by an online
registration process informed by monocular depth estimation. Following this, we
employ a novel feature propagation mechanism to bolster texture coherence,
leveraging a cross-attention design that bridges sampling Gaussians from both
uncertain and certain areas. This innovative approach significantly refines the
texture coherence within the final radiance field. Extensive experiments
validate that our method not only elevates the quality of novel view synthesis
for scenes undergoing object removal but also showcases notable efficiency
gains in training and rendering speeds. |
This paper presents GScream, a novel framework for efficient and effective 3D object removal from pre-captured scenes using 3D Gaussian Splatting (3DGS). |
Existing methods for 3D object removal based on Neural Radiance Fields (NeRF) suffer from slow training and rendering speeds, while standard 3DGS methods lack geometric accuracy and texture coherence needed for object removal. This work addresses these limitations. |
GScream leverages monocular depth estimation as extra supervision for improving geometric consistency and introduces a novel feature propagation mechanism based on cross-attention between Gaussians in visible and in-painted regions to enhance texture coherence. |
GScream achieves comparable or superior performance to state-of-the-art NeRF-based methods in terms of visual quality while achieving significantly faster training speeds (1.5x to 4x faster).
Monocular depth guidance is shown to significantly improve the geometric accuracy of 3DGS, leading to more realistic object removal.
The proposed cross-attention feature regularization effectively propagates texture information from visible to in-painted regions, resulting in enhanced texture coherence and natural-looking object removal. |
The reliance on 2D in-painting for the reference view might introduce limitations if the in-painting results are imperfect.
Future work could explore joint optimization of 2D in-painting and 3DGS for better overall consistency. |
3d object removal, 3d gaussian splatting, neural radiance fields, depth completion, texture propagation |
2404.13579
Report |
LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions |
Xiaoran Zhao, Tianhao Wu, Yu Lai, Zhiliang Tian, Zhen Huang, Yahui Liu, Zejiang He, Dongsheng Li |
Controllable text-to-image generation synthesizes visual text and objects in
images with certain conditions, which are frequently applied to emoji and
poster generation. Visual text rendering and layout-to-image generation tasks
have been popular in controllable text-to-image generation. However, each of
these tasks typically focuses on single modality generation or rendering,
leaving yet-to-be-bridged gaps between the approaches correspondingly designed
for each of the tasks. In this paper, we combine text rendering and
layout-to-image generation tasks into a single task: layout-controllable
text-object synthesis (LTOS) task, aiming at synthesizing images with object
and visual text based on predefined object layout and text contents. As
compliant datasets are not readily available for our LTOS task, we construct a
layout-aware text-object synthesis dataset, containing elaborate well-aligned
labels of visual text and object information. Based on the dataset, we propose
a layout-controllable text-object adaptive fusion (TOF) framework, which
generates images with clear, legible visual text and plausible objects. We
construct a visual-text rendering module to synthesize text and employ an
object-layout control module to generate objects while integrating the two
modules to harmoniously generate and integrate text content and objects in
images. To improve image-text integration, we propose a self-adaptive
cross-attention fusion module that helps the image generation to attend more to
important text information. Within such a fusion module, we use a self-adaptive
learnable factor to learn to flexibly control the influence of cross-attention
outputs on image generation. Experimental results show that our method
outperforms the state-of-the-art in LTOS, text rendering, and layout-to-image
tasks, enabling harmonious visual text rendering and object generation. |
This paper presents a novel framework, called TOF, for layout-controllable text-object synthesis (LTOS) which aims to generate images with user-controlled object placement and visual text. |
Existing text-to-image generation methods struggle to accurately control both object layout and visual text rendering simultaneously, creating a need for an integrated approach. |
The TOF framework consists of: (1) An object-layout control module for generating objects at specific locations, (2) a visual-text rendering module for synthesizing text with custom layouts, and (3) a text-object self-adaptive fusion module for balancing text and object generation using adaptive cross-attention. |
TOF significantly outperforms state-of-the-art methods in text rendering quality while maintaining object generation accuracy.
The proposed LTOS dataset, containing aligned object and visual text annotations, proves valuable for training and evaluating LTOS tasks.
Ablation studies confirm the contribution of each component, especially the self-adaptive fusion module. |
The current dataset focuses primarily on English text; expanding to other languages is a future goal.
Future work will explore incorporating more sophisticated text layouts and styles. |
diffusion model, text rendering, multi-modal generation, text-object synthesis, layout-to-image generation |
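The self-adaptive cross-attention fusion described in the LTOS entry above can be sketched as a residual update whose strength is set by a single learnable scalar. The sketch below is a toy version in pure Python. The tanh gate, the function names, and the list-of-lists tensors are assumptions made for illustration; the entry only states that a learnable factor controls the influence of the cross-attention output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from image queries to text keys/values."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def adaptive_fusion(hidden, attn_out, factor):
    """Add the attention output to the hidden states, scaled by a gate.

    Assumption: the gate is tanh(factor); factor = 0 leaves hidden unchanged.
    """
    gate = math.tanh(factor)
    return [[h + gate * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(hidden, attn_out)]


if __name__ == "__main__":
    hidden = [[1.0, 2.0]]
    text_keys = [[1.0, 0.0], [0.0, 1.0]]
    text_values = [[3.0, 3.0], [3.0, 3.0]]
    attn = cross_attention(hidden, text_keys, text_values)
    print(adaptive_fusion(hidden, attn, 0.5))
```

In a real model the factor would be a trainable parameter updated by backpropagation; here it is passed in as a plain float.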
2404.13573
Report |
Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap |
Bowen Qu, Xiaoyu Liang, Shangkun Sun, Wei Gao |
The recent advancements in Text-to-Video Artificial Intelligence Generated
Content (AIGC) have been remarkable. Compared with traditional videos, the
assessment of AIGC videos encounters various challenges: visual inconsistencies
that defy common sense, discrepancies between content and the textual prompt,
and the distribution gap between various generative models. Targeting these
challenges, in this work, we categorize the assessment of AIGC video quality
into three dimensions: visual harmony, video-text consistency, and domain
distribution gap. For each dimension, we design specific modules to provide a
comprehensive quality assessment of AIGC videos. Furthermore, our research
identifies significant variations in visual quality, fluidity, and style among
videos generated by different text-to-video models. Predicting the source
generative model can make the AIGC video features more discriminative, which
enhances the quality assessment performance. The proposed method was used in
the third-place winner of the NTIRE 2024 Quality Assessment for AI-Generated
Content - Track 2 Video, demonstrating its effectiveness. Code will be
available at https://github.com/Coobiw/TriVQA. |
This paper presents a novel framework for assessing AI-Generated Content (AIGC) video quality, addressing the unique challenges posed by this new type of video. |
Existing video quality assessment methods fall short in evaluating AIGC videos due to their unique characteristics, such as visual inconsistencies, discrepancies between content and textual prompts, and variations across generative models. |
The proposed framework decouples AIGC video quality assessment into three dimensions: visual harmony, video-text consistency, and domain distribution gap. It employs a dual-stream architecture with explicit prompt injection, implicit text guidance, caption similarity, and auxiliary inter-domain classification. |
The method outperforms state-of-the-art VQA methods on the NTIRE 2024 AIGC Video Quality Assessment dataset.
Explicit prompt injection, implicit text guidance, and auxiliary inter-domain classification are shown to significantly improve performance.
The proposed method secured the third-place position in the NTIRE 2024 Quality Assessment for AI-Generated Content - Track 2 Video Challenge. |
The reliance on a limited AIGC video dataset may not fully encompass the diversity of future generative models.
Future work could explore expanding the dataset with samples from a wider range of T2V models. |
aigc, video quality assessment, text-to-video, multimodal learning, domain gap |
2404.13445
Report |
DMesh: A Differentiable Mesh Representation |
Sanghyun Son, Matheus Gadelha, Yang Zhou, Zexiang Xu, Ming C. Lin, Yi Zhou |
We present a differentiable representation, DMesh, for general 3D triangular
meshes. DMesh considers both the geometry and connectivity information of a
mesh. In our design, we first get a set of convex tetrahedra that compactly
tessellates the domain based on Weighted Delaunay Triangulation (WDT), and
select triangular faces on the tetrahedra to define the final mesh. We
formulate the probability that faces exist on the actual surface in a
differentiable manner based on the WDT. This enables DMesh to represent meshes
of various topology in a differentiable way, and allows us to reconstruct the
mesh under various observations, such as point cloud and multi-view images
using gradient-based optimization. The source code and full paper are available
at: https://sonsang.github.io/dmesh-project. |
This paper presents DMesh, a differentiable representation for general 3D triangular meshes that considers both geometry and connectivity and enables gradient-based optimization of mesh topology and features. |
Existing differentiable mesh representations are limited by fixed topology or reliance on intermediate forms, leading to challenges in representing diverse geometries. |
The method uses a differentiable Weighted Delaunay Triangulation (WDT) to divide a convex domain into tetrahedra, selects a subset of their triangular faces to define the final mesh, and formulates the probability of faces existing on the actual surface in a differentiable manner. |
DMesh is versatile and can represent meshes of various topologies, including non-convex polyhedra, non-orientable geometries, and complex structures.
A computationally efficient approach to differentiable WDT is proposed, running in approximately linear time compared to the exponential cost of previous methods.
DMesh allows for efficient reconstruction of surfaces from point clouds and multi-view images, resulting in compact and accurate meshes. |
Current DMesh resolution is limited by computational cost, particularly due to WDT construction.
While DMesh generalizes well to various mesh connectivities, it can exhibit non-manifold errors, requiring further research to guarantee manifoldness. |
differentiable mesh representation, weighted delaunay triangulation, mesh reconstruction, point cloud reconstruction, multi-view reconstruction |
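For readers unfamiliar with the Weighted Delaunay Triangulation used in the DMesh entry above, the standard textbook definition is given below in terms of the power diagram. The symbols (points p_i, weights w_i, cells C_i) are notation chosen here for illustration and are not taken from the entry.

```latex
% Power distance from a location x to a weighted point (p_i, w_i):
\pi_i(x) = \lVert x - p_i \rVert^2 - w_i

% Power cell of point i: locations whose power distance to i is minimal.
C_i = \{\, x \;:\; \pi_i(x) \le \pi_j(x) \ \text{for all } j \,\}

% The WDT is the dual of the power diagram: in general position, points
% p_{i_0}, \dots, p_{i_k} form a simplex of the WDT exactly when
% C_{i_0} \cap \dots \cap C_{i_k} \neq \emptyset.
```

When all weights are equal, the power diagram reduces to the Voronoi diagram and the WDT to the ordinary Delaunay triangulation, so the weights act as extra continuous parameters that change which simplices exist.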
2404.13400
Report |
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding |
Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu |
Visual grounding, which aims to ground a visual region via natural language,
is a task that heavily relies on cross-modal alignment. Existing works utilized
uni-modal pre-trained models to transfer visual/linguistic knowledge separately
while ignoring the multimodal corresponding information. Motivated by recent
advancements in contrastive language-image pre-training and low-rank adaptation
(LoRA) methods, we aim to solve the grounding task based on multimodal
pre-training. However, there exist significant task gaps between pre-training
and grounding. Therefore, to address these gaps, we propose a concise and
efficient hierarchical multimodal fine-grained modulation framework, namely
HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge
and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The
cross-modal bridge can address the inconsistency between visual features and
those required for grounding, and establish a connection between multi-level
visual and text features. Hi LoRA prevents the accumulation of perceptual
errors by adapting the cross-modal features from shallow to deep layers in a
hierarchical manner. Experimental results on five datasets demonstrate the
effectiveness of our approach and showcase the significant grounding
capabilities as well as promising energy efficiency advantages. The project
page: https://github.com/linhuixiao/HiVG. |
This paper presents HiVG, a hierarchical multimodal fine-grained modulation framework that effectively adapts a pre-trained CLIP model for visual grounding. |
Existing visual grounding methods suffer from task gaps between pre-training and grounding, particularly data bias and differences in learning objectives. This work aims to address these gaps by leveraging the power of multimodal pre-training. |
HiVG consists of two main components: (1) a multi-layer adaptive cross-modal bridge to align visual and textual features and (2) a hierarchical low-rank adaptation (Hi LoRA) paradigm for efficient fine-tuning of the pre-trained model. |
HiVG achieves state-of-the-art performance on five benchmark datasets, outperforming both CLIP-based and detector-based methods.
The proposed Hi LoRA paradigm enables efficient adaptation with minimal trainable parameters while maintaining high performance.
HiVG exhibits strong semantic comprehension capabilities, achieving superior results on grounding tasks involving complex and lengthy text descriptions. |
The performance of HiVG with a BEiT-3 backbone, while improved by Hi LoRA, is still lower than that of CLIP, indicating potential limitations in generalizing to other pre-trained models.
Future work could investigate adaptive selection of layer groups and LoRA stages for enhanced hierarchical adaptation. |
visual grounding, referring expression comprehension, multimodal learning, low-rank adaptation, hierarchical learning |
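The low-rank adaptation that the HiVG entry above builds on can be sketched as follows. This shows only the basic LoRA weight update, W + (alpha / r) * B A, on small list-of-lists matrices. It does not reproduce the hierarchical staging of Hi LoRA, and the function names and toy values are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(frozen_w, lora_b, lora_a, alpha):
    """Return the effective weight frozen_w + (alpha / rank) * lora_b @ lora_a.

    frozen_w: d_out x d_in pretrained weight (kept frozen during fine-tuning)
    lora_b:   d_out x rank trainable matrix (commonly initialised to zeros)
    lora_a:   rank x d_in trainable matrix
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen_w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 2.0]]        # rank 1, d_in = 2
    b = [[1.0], [0.0]]      # d_out = 2, rank 1
    print(lora_weight(w, b, a, alpha=2.0))
```

Because only the two small matrices are trained, the number of trainable parameters scales with the rank rather than with the full weight size, which is the efficiency property the entry refers to.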
2404.13370
Report |
Movie101v2: Improved Movie Narration Benchmark |
Zihao Yue, Yepeng Zhang, Ziheng Wang, Qin Jin |
Automatic movie narration aims to create video-aligned plot descriptions
to assist visually impaired audiences. It differs from standard video
captioning in that it requires not only describing key visual details but also
inferring the plots developed across multiple movie shots, thus posing unique
and ongoing challenges. To advance the development of automatic movie narrating
systems, we first revisit the limitations of existing datasets and develop a
large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into
account the essential difficulties in achieving applicable movie narration, we
break the long-term goal into three progressive stages and tentatively focus on
the initial stages featuring understanding within individual clips. We also
introduce a new narration assessment to align with our staged task goals.
Third, using our new dataset, we baseline several leading large vision-language
models, including GPT-4V, and conduct in-depth investigations into the
challenges current models face for movie narration generation. Our findings
reveal that achieving applicable movie narration generation is a fascinating
goal that requires thorough research. |
This paper introduces Movie101v2, a large-scale, bilingual dataset for movie narration generation, building upon and improving the original Movie101 dataset. |
Automatic movie narration is crucial for visually impaired audiences but remains challenging due to the need to describe visual details and infer plots. Existing datasets have limitations such as small scale, single language, and short, simple clips. |
The authors collected 102 additional movies with narrations, used ASR and LLMs for text processing, and enhanced data quality by completing and correcting character names. They defined three progressive stages for movie narration: visual fact description (L1), plot reasoning and narration (L2), and applicable AD text generation (L3). A new evaluation framework using LLMs assesses L1 and L2 separately. |
Movie101v2 consists of 203 movies and 46K bilingual video-narration pairs, exceeding the scale of existing datasets.
Baseline models, including GPT-4V, show promising results but still struggle with L2-level plot reasoning and narration.
Analysis reveals challenges in visual perception, particularly character action/emotion recognition and face matching, and text generation due to the complexity of narration language. |
The current work focuses on L1 and L2, leaving the more complex L3 for future exploration.
The analysis mainly focuses on GPT-4V, limiting insights into other models' limitations and potential solutions. |
movie narration, video understanding, multi-modal, dataset, large vision-language models |
2404.13320
Report |
Pixel is a Barrier: Diffusion Models Are More Adversarially Robust Than We Think |
Haotian Xue, Yongxin Chen |
Adversarial examples for diffusion models are widely used as solutions for
safety concerns. By adding adversarial perturbations to personal images,
attackers cannot edit or imitate them easily. However, it is essential to note
that all these protections target latent diffusion models (LDMs), while
adversarial examples for diffusion models in the pixel space (PDMs) are largely
overlooked. This may mislead us to think that the diffusion models are
vulnerable to adversarial attacks like most deep models. In this paper, we show
novel findings that: even though gradient-based white-box attacks can be used
to attack the LDMs, they fail to attack PDMs. This finding is supported by
extensive experiments with a wide range of attack methods on various
PDMs and LDMs with different model structures, which means diffusion models are
indeed much more robust against adversarial attacks. We also find that PDMs can
be used as an off-the-shelf purifier to effectively remove the adversarial
patterns that were generated on LDMs to protect the images, which means that
most protection methods nowadays, to some extent, cannot protect our images
from malicious attacks. We hope that our insights will inspire the community to
rethink the adversarial samples for diffusion models as protection methods and
move forward to more effective protection. Code is available at
https://github.com/xavihart/PDM-Pure. |
This paper reveals that Pixel Diffusion Models (PDMs) are significantly more robust against adversarial attacks than commonly believed, contrary to the vulnerability observed in Latent Diffusion Models (LDMs). |
This finding challenges the existing assumption that diffusion models are easily fooled by adversarial attacks and has important implications for the security and protection of these models. |
The authors conduct extensive experiments on various LDMs and PDMs with different architectures, datasets, and resolutions. They test existing attack methods and evaluate the robustness of PDMs. |
Existing adversarial attack methods designed for LDMs fail to effectively attack PDMs.
PDMs exhibit strong robustness against adversarial perturbations, even with large perturbation budgets.
A new purification method, PDM-Pure, leverages the robustness of PDMs to effectively remove protective perturbations from images, bypassing existing protection methods. |
The study primarily focuses on image-based diffusion models, and further investigation is needed for other modalities.
The purification effectiveness of PDM-Pure may vary depending on the strength and type of adversarial perturbations. |
diffusion models, adversarial attacks, robustness, image protection, purification |
2404.13306
Report |
FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models |
Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, Weisi Lin |
Recently, fake images generated by artificial intelligence (AI) models have
become indistinguishable from the real, exerting new challenges for fake image
detection models. To this extent, simple binary judgments of real or fake seem
less convincing and credible due to the absence of human-understandable
explanations. Fortunately, Large Multimodal Models (LMMs) bring possibilities
to materialize the judgment process while their performance remains
undetermined. Therefore, we propose FakeBench, the first-of-a-kind benchmark
towards transparent defake, consisting of fake images with human language
descriptions of forgery signs. FakeBench explores two open questions about LMMs:
(1) can LMMs distinguish fake images generated by AI, and (2) how do LMMs
distinguish fake images? Specifically, we construct the FakeClass dataset with
6k diverse-sourced fake and real images, each equipped with a Question&Answer
pair concerning the authenticity of images, which are utilized to benchmark the
detection ability. To examine the reasoning and interpretation abilities of
LMMs, we present the FakeClue dataset, consisting of 15k pieces of descriptions
on the telltale clues revealing the falsification of fake images. Besides, we
construct the FakeQA to measure the LMMs' open-question answering ability on
fine-grained authenticity-relevant aspects. Our experimental results show
that current LMMs possess moderate identification ability, preliminary
interpretation and reasoning ability, and passable open-question answering
ability for image defake. The FakeBench will be made publicly available soon. |
This paper introduces FakeBench, the first benchmark for evaluating the 'transparent defake' abilities of Large Multimodal Models (LMMs), focusing on whether LMMs can not only detect fake images but also provide human-understandable explanations for their judgments. |
With the rise of highly realistic AI-generated fake images, simple binary judgments of 'real' or 'fake' are no longer sufficient. Transparent defake, with its emphasis on human-interpretable explanations, is crucial for building trust and understanding potential model biases. |
FakeBench consists of three datasets: FakeClass (for evaluating detection ability), FakeClue (for evaluating reasoning and interpretation abilities), and FakeQA (for evaluating open-ended question answering on authenticity details). The researchers collected diverse fake images and created natural language annotations, including questions, answers, and detailed descriptions of forgery signs. 13 well-known LMMs were then evaluated on these datasets. |
Current LMMs show moderate ability in detecting fake images, but their performance varies significantly across different generation models.
LMMs exhibit only preliminary abilities in interpreting and reasoning about fake images using human-understandable language.
Explicit chain-of-thought reasoning, while generally beneficial in other tasks, does not significantly improve the fake image detection accuracy for most LMMs. |
The reasoning ability of LMMs is still limited by their understanding of the real world and their capability to describe image irrationality.
Future work should focus on introducing conflict awareness and real-world knowledge to guide LMMs towards better fake image detection and explanation. |
large multimodal models, fake image detection, reasoning and interpretation, benchmark, transparent defake |
2404.13299
Report |
PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition |
Xi Fang, Weigang Wang, Xiaoxin Lv, Jun Yan |
The development of Large Language Models (LLMs) and Diffusion Models has brought
a boom in Artificial Intelligence Generated Content (AIGC). It is essential
to build an effective quality assessment framework to provide a quantifiable
evaluation of different images or videos based on the AIGC technologies. The
content generated by AIGC methods is driven by the crafted prompts. Therefore,
it is intuitive that the prompts can also serve as the foundation of the AIGC
quality assessment. This study proposes an effective AIGC quality assessment
(QA) framework. First, we propose a hybrid prompt encoding method based on a
dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to
understand and respond to the prompt conditions. Second, we propose an
ensemble-based feature mixer module to effectively blend the adapted prompt and
vision features. An empirical study on two datasets, AIGIQA-20K
(AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video
Quality Assessment DataBase), validates the effectiveness of our proposed
method: Prompt Condition Quality Assessment (PCQA). Our proposed simple and
feasible framework may promote research development in the multimodal
generation field. |
This paper introduces PCQA, a unified framework for assessing the quality of AI-generated images and videos by incorporating prompt information as a conditional factor. |
With the rise of AIGC, evaluating the quality of generated content, particularly in alignment with the creative intent expressed in prompts, is crucial. Existing UGC quality assessment methods fall short in addressing this need. |
The PCQA method leverages a hybrid CLIP text encoder to understand prompts and employs a feature mixer module to blend visual features with adapted prompt representations. The final quality score is obtained through a regression head trained on MOS values. |
PCQA significantly outperforms baseline methods on both AIGIQA-20K (image) and T2VQA-DB (video) datasets.
Ablation studies demonstrate the benefits of using a hybrid text encoder, feature adapter, and model ensemble techniques.
The proposed method secured top rankings in the NTIRE 2024 AIGC quality assessment competition. |
The current method resizes images, potentially losing aspect ratio information crucial for aesthetic evaluation. Future work should explore aspect-ratio-preserving techniques.
The model's reliance on global average pooling for feature extraction results in a loss of spatial information, potentially limiting its performance in video quality assessment. Future research should investigate incorporating spatial-temporal information. |
aigc quality assessment, prompt-conditional quality assessment, clip text encoder, feature mixer, model ensemble |
2404.13263
Report |
FilterPrompt: Guiding Image Transfer in Diffusion Models |
Xi Wang, Yichen Peng, Heng Fang, Haoran Xie, Xi Yang, Chuntao Li |
In controllable generation tasks, flexibly manipulating the generated images
to attain a desired appearance or structure based on a single input image cue
remains a critical and longstanding challenge. Achieving this requires the
effective decoupling of key attributes within the input image data in order to
obtain accurate representations. Previous research has predominantly
concentrated on disentangling image attributes within feature space. However,
the complex distribution present in real-world data often makes the application
of such decoupling algorithms to other datasets challenging. Moreover, the
granularity of control over feature encoding frequently fails to meet specific
task requirements. Upon scrutinizing the characteristics of various generative
models, we have observed that the input sensitivity and dynamic evolution
properties of the diffusion model can be effectively fused with the explicit
decomposition operation in pixel space. This integration enables image
processing operations to be performed in pixel space on a specific feature
distribution of the input image, achieving the desired control effect in
the generated results. Therefore, we propose FilterPrompt, an approach to
enhance the model control effect. It can be universally applied to any
diffusion model, allowing users to adjust the representation of specific image
features in accordance with task requirements, thereby facilitating more
precise and controllable generation outcomes. In particular, our designed
experiments demonstrate that FilterPrompt optimizes feature correlation,
mitigates content conflicts during the generation process, and enhances the
model's control capability. |
The paper introduces FilterPrompt, a novel approach that enhances control in diffusion models by manipulating frequency and distribution characteristics of image attributes in pixel space, influencing their representation during generation. |
This approach addresses the limitations of feature space manipulation, offering a more intuitive, controllable, and universally applicable method for enhancing control in diffusion models. |
FilterPrompt integrates filtering operations with a baseline architecture combining ControlNet and IP-Adapter. It applies filters to input images, guiding the diffusion process by modulating feature expression based on specific tasks. |
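As a rough illustration of what a pixel-space filtering operation can look like before an image is passed to the conditioning branches, here is a minimal sketch. The summary above does not say which filters FilterPrompt uses, so the two chosen here (decolorization and a box-blur low-pass filter) are assumptions, and the function names `to_grayscale` and `box_blur` are hypothetical.

```python
def to_grayscale(image):
    """Remove color information: rows of (r, g, b) tuples -> rows of luma values.

    Uses the ITU-R BT.601 luma weights.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]


def box_blur(gray, radius=1):
    """Low-pass filter: replace each pixel by the mean of its neighborhood.

    The window is (2 * radius + 1) pixels wide and is clipped at the borders.
    """
    height, width = len(gray), len(gray[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred
```

Decolorizing a structure reference, for example, would weaken its color cues so that they compete less with the appearance reference, while blurring suppresses fine texture. Which attribute to attenuate depends on the task, which matches the manual tuning noted in the limitations.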
FilterPrompt excels in preserving structure, shape, and edge similarity, as evidenced by higher SP and lower CD scores.
It effectively transfers color distribution and texture features, exhibiting lower FID and GLCM values compared to other methods.
The method demonstrates strong performance in both style transfer and appearance transfer tasks across diverse domains. |
Designing FilterPrompt necessitates manual adjustments based on specific task requirements and data characteristics, involving a degree of trial and error.
While the current study focuses on a specific baseline architecture, integrating FilterPrompt with more advanced diffusion models holds potential for further improvement. |
image transfer, controllable generation, diffusion models, explicit decomposition, visual prompt |
2404.13153
Report |
Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring |
Chengxu Liu, Xuan Wang, Xiangyu Xu, Ruhao Tian, Shuai Li, Xueming Qian, Ming-Hsuan Yang |
Eliminating image blur produced by various kinds of motion has been a
challenging problem. Dominant approaches rely heavily on model capacity to
remove blurring by reconstructing residual from blurry observation in feature
space. These practices not only prevent the capture of spatially variable
motion in the real world but also ignore the tailored handling of various
motions in image space. In this paper, we propose a novel real-world deblurring
filtering model called the Motion-adaptive Separable Collaborative (MISC)
Filter. In particular, we use a motion estimation network to capture motion
information from neighborhoods, thereby adaptively estimating spatially-variant
motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The
MISC Filter first aligns the motion-induced blurring patterns to the motion
middle along the predicted flow direction, and then collaboratively filters the
aligned image through the predicted kernels, weights, and offsets to generate
the output. This design can handle more generalized and complex motion in a
spatially differentiated manner. Furthermore, we analyze the relationships
between the motion estimation network and the residual reconstruction network.
Extensive experiments on four widely used benchmarks demonstrate that our
method provides an effective solution for real-world motion blur removal and
achieves state-of-the-art performance. Code is available at
https://github.com/ChengxuLiu/MISCFilter |
This paper introduces the Motion-adaptive Separable Collaborative (MISC) Filter for blind motion deblurring. It tackles the limitations of previous methods by directly addressing motion blur in the image space instead of solely focusing on feature space. |
Existing deblurring methods struggle to handle the spatially varying and complex motion found in real-world scenarios. This method provides a novel approach by estimating spatially-variant motion information and applying a tailored filtering process. |
The MISC filter uses a motion estimation network to predict motion flow, mask, kernels, weights, and offsets. It aligns blurring patterns to the motion middle and then collaboratively filters the aligned image using predicted parameters. The paper also analyzes different coupling strategies between the motion estimation network and a residual reconstruction network. |
The MISC Filter significantly outperforms state-of-the-art methods on complex real-world motion blur datasets like RealBlur-R and RealBlur-J.
The method demonstrates strong performance on the GoPro and HIDE datasets, achieving comparable results while requiring half the runtime of some leading methods.
Ablation studies validate the contribution of each component within the MISC Filter and demonstrate the effectiveness of the shared-based network coupling strategy. |
The method currently shows limitations in addressing the low-light degradation often present in hardware-induced blurring (e.g., under-display cameras).
Further exploration is needed to optimize the MISC filter for broader applicability in scenarios involving both motion and low-light challenges. |
motion deblurring, image restoration, misc filter, motion estimation, collaborative filtering |
2404.13046
Report |
MoVA: Adapting Mixture of Vision Experts to Multimodal Context |
Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu |
As the key component in multimodal large language models (MLLMs), the ability
of the visual encoder greatly affects an MLLM's understanding of diverse image
content. Although some large-scale pretrained vision encoders such as vision
encoders in CLIP and DINOv2 have brought promising performance, we found that
there is still no single vision encoder that can dominate various image content
understanding, e.g., the CLIP vision encoder leads to outstanding results on
general image understanding but poor performance on document or chart content.
To alleviate the bias of CLIP vision encoder, we first delve into the inherent
behavior of different pre-trained vision encoders and then propose the MoVA, a
powerful and novel MLLM, adaptively routing and fusing task-specific vision
experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design
a context-aware expert routing strategy to dynamically select the most suitable
vision experts according to the user instruction, input image, and expertise of
vision experts. This benefits from the powerful model function understanding
ability of the large language model (LLM) equipped with expert-routing low-rank
adaptation (LoRA). In the fine-grained stage, we carefully design the
mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse
task-specific knowledge from various experts. This coarse-to-fine paradigm
effectively leverages representations from experts based on multimodal context
and model expertise, further enhancing the generalization ability. We conduct
extensive experiments to evaluate the effectiveness of the proposed approach.
Without any bells and whistles, MoVA can achieve significant performance gains
over current state-of-the-art methods in a wide range of challenging multimodal
benchmarks. Codes and models will be available at
https://github.com/TempleX98/MoVA. |
Proposes MoVA, a multimodal large language model that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism to enhance multimodal understanding and generalization. |
Existing MLLMs often rely on single vision encoders (e.g., CLIP) that exhibit inconsistent performance across different tasks and domains, limiting their generalization ability. |
Uses a context-aware expert routing strategy to select relevant experts based on user input and model expertise, followed by fine-grained expert fusion with MoV-Adapter to extract and integrate task-specific knowledge. |
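The coarse-to-fine idea can be sketched as a two-step procedure: pick a few experts, then blend their features. This is only an illustration under strong assumptions. In MoVA the routing is done by the LLM itself, whereas here it is replaced by hypothetical numeric relevance scores, and the softmax-weighted sum stands in for the MoV-Adapter. The names `route_experts`, `fuse_features`, `relevance`, and `gate_logits` are all made up for this sketch.

```python
import math


def route_experts(relevance, max_experts=3, threshold=0.0):
    """Coarse stage: keep the highest-scoring experts above a threshold."""
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked[:max_experts] if score > threshold]


def fuse_features(base, expert_features, gate_logits):
    """Fine stage: add a softmax-weighted sum of expert features to the base."""
    names = list(expert_features)
    if not names:
        return list(base)
    # Numerically stable softmax over the selected experts only.
    peak = max(gate_logits[name] for name in names)
    exps = {name: math.exp(gate_logits[name] - peak) for name in names}
    total = sum(exps.values())
    fused = list(base)
    for name in names:
        weight = exps[name] / total
        for i, value in enumerate(expert_features[name]):
            fused[i] += weight * value
    return fused
```

The default cap of three experts mirrors the limit mentioned in the limitations of this record.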
Achieves state-of-the-art performance on various MLLM benchmarks, including MMBench, MME, and QBench.
Outperforms specialist models on text-oriented VQA benchmarks while exhibiting strong performance on general VQA and visual grounding tasks.
Demonstrates robust generalization capabilities across diverse domains, including medical visual question answering and image segmentation. |
The number of experts used for fusion is limited to three to manage computational costs.
Exploring alternative expert routing strategies and incorporating more diverse experts could further improve performance. |
multimodal large language models, vision encoder, mixture-of-experts, context-aware routing, expert fusion |
2404.13044
Report |
Unified Scene Representation and Reconstruction for 3D Large Language Models |
Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang |
Enabling Large Language Models (LLMs) to interact with 3D environments is
challenging. Existing approaches extract point clouds either from ground truth
(GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image
aligned 2D features from CLIP are then lifted to point clouds, which serve as
inputs for LLMs. However, this solution lacks the establishment of 3D
point-to-point connections, leading to a deficiency of spatial structure
information. Concurrently, the absence of integration and unification between
the geometric and semantic representations of the scene culminates in a
diminished level of 3D scene understanding. In this paper, we demonstrate the
importance of having a unified scene representation and reconstruction
framework, which is essential for LLMs in 3D scenes. Specifically, we introduce
Uni3DR^2, which extracts 3D geometric and semantic aware representation features via
the frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a
multi-scale aggregate 3D decoder. Our learned 3D representations not only
contribute to the reconstruction process but also provide valuable knowledge
for LLMs. Experimental results validate that our Uni3DR^2 yields convincing
gains over the baseline on the 3D reconstruction dataset ScanNet (increasing
F-Score by +1.8%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior
performance over the baseline on the 3D vision-language understanding dataset
ScanQA (increasing BLEU-1 by +4.0% and +4.2% on the val set and test set,
respectively). Furthermore, it outperforms the state-of-the-art method that
uses additional GT point clouds on both ScanQA and 3DMV-VQA. |
This paper presents Uni3DR^2, a unified scene representation and reconstruction framework for enhancing 3D Large Language Models (LLMs). |
Existing methods for enabling LLMs to interact with 3D environments suffer from limitations in establishing spatial connections and integrating geometric and semantic information, hindering their performance. |
Uni3DR^2 leverages frozen pre-trained 2D foundation models (CLIP and SAM) and a multi-scale 3D decoder to extract 3D geometric and semantic representations. These representations are then used for both scene reconstruction and as input for the LLM. |
Uni3DR^2 achieves superior 3D reconstruction results on ScanNet, improving F-Score by +1.8% over the baseline.
Uni3DR^2-LLM surpasses the baseline and state-of-the-art methods on 3D vision-language understanding benchmarks ScanQA and 3DMV-VQA, even without relying on ground truth point clouds.
Ablation studies confirm the importance of unified representation and reconstruction, highlighting the contribution of each component to the overall performance. |
The current method focuses on indoor scenes and may require adaptation for diverse and complex outdoor environments.
Future work will explore scaling the approach to enhance more 3D capabilities with LLMs, including 3D scene perception and generation. |
3d reconstruction, 3d representation, large language models, vision-language understanding, 3d vision |
2404.13040
Report |
Analysis of Classifier-Free Guidance Weight Schedulers |
Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernandez Abrevaya, David Picard, Vicky Kalogeiton |
Classifier-Free Guidance (CFG) enhances the quality and condition adherence
of text-to-image diffusion models. It operates by combining the conditional and
unconditional predictions using a fixed weight. However, recent works vary the
weights throughout the diffusion process, reporting superior results but
without providing any rationale or analysis. By conducting comprehensive
experiments, this paper provides insights into CFG weight schedulers. Our
findings suggest that simple, monotonically increasing weight schedulers
consistently lead to improved performances, requiring merely a single line of
code. In addition, more complex parametrized schedulers can be optimized for
further improvement, but do not generalize across different models and tasks. |
This paper investigates the impact of dynamic guidance weight schedulers in Classifier-Free Guidance (CFG) for diffusion models, proposing simple yet effective schedulers to improve image generation quality. |
Static guidance weight in CFG often presents a trade-off between detail and sharpness in generated images. Dynamic schedulers have shown promise but lack comprehensive analysis and justification. |
The paper explores various heuristic (linear, cosine, etc.) and parameterized (power-cosine, clamping) dynamic schedulers. Their effects are evaluated on class-conditioned image generation (CIFAR-10, ImageNet) and text-to-image generation (Stable Diffusion 1.5 and SDXL) using FID, CLIP-Score, and Diversity metrics. |
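To make the scheduler idea concrete, the sketch below implements a static weight, two monotonically increasing schedules, and a clamped variant as functions of sampling progress. Several details are assumptions rather than the paper's exact definitions: `progress` runs from 0 (pure noise) to 1 (final sample), each dynamic schedule is scaled so that its mean over the trajectory equals the static `base_weight`, and `cfg_combine` uses the convention in which a weight of 1 recovers the conditional prediction (some papers offset the weight by one).

```python
import math


def cfg_combine(eps_uncond, eps_cond, weight):
    """Move from the unconditional prediction toward the conditional one."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def static_schedule(progress, base_weight):
    """Classic CFG: the same weight at every step."""
    return base_weight


def linear_schedule(progress, base_weight):
    """Rises linearly from 0 to 2 * base_weight; mean over [0, 1] is base_weight."""
    return 2.0 * base_weight * progress


def cosine_schedule(progress, base_weight):
    """Half-cosine ramp from 0 to 2 * base_weight; mean over [0, 1] is base_weight."""
    return base_weight * (1.0 - math.cos(math.pi * progress))


def clamp_linear_schedule(progress, base_weight, floor):
    """Linear ramp that never drops below `floor` (a tunable parameter)."""
    return max(floor, linear_schedule(progress, base_weight))


if __name__ == "__main__":
    steps = 5
    for i in range(steps):
        p = i / (steps - 1)
        print(p, linear_schedule(p, 7.5), cosine_schedule(p, 7.5))
```

Inside a sampling loop, swapping `static_schedule` for `linear_schedule` changes only the line that computes the weight, which is consistent with the abstract's remark that the simple schedulers need a single line of code.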
Monotonically increasing schedulers (linear, cosine) consistently outperform static guidance and decreasing schedulers.
A simple linear scheduler significantly improves results without additional computational cost or tuning.
Parameterized schedulers, like clamp-linear, can further boost performance but require parameter tuning specific to the model and task. |
Optimal parameters for parameterized schedulers do not generalize across models and datasets.
Further investigation is needed to understand the theoretical underpinnings of why dynamic schedulers improve performance. |
diffusion models, classifier-free guidance, text-to-image generation, dynamic schedulers, image generation |
2404.13026
Report |
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation |
Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y. Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, William T. Freeman |
Realistic object interactions are crucial for creating immersive virtual
experiences, yet synthesizing realistic 3D object dynamics in response to novel
interactions remains a significant challenge. Unlike unconditional or
text-conditioned dynamics generation, action-conditioned dynamics requires
perceiving the physical material properties of objects and grounding the 3D
motion prediction on these properties, such as object stiffness. However,
estimating physical material properties is an open problem due to the lack of
material ground-truth data, as measuring these properties for real objects is
highly difficult. We present PhysDreamer, a physics-based approach that endows
static 3D objects with interactive dynamics by leveraging the object dynamics
priors learned by video generation models. By distilling these priors,
PhysDreamer enables the synthesis of realistic object responses to novel
interactions, such as external forces or agent manipulations. We demonstrate
our approach on diverse examples of elastic objects and evaluate the realism of
the synthesized interactions through a user study. PhysDreamer takes a step
towards more engaging and realistic virtual experiences by enabling static 3D
objects to dynamically respond to interactive stimuli in a physically plausible
manner. See our project page at https://physdreamer.github.io/. |
PhysDreamer is a novel method for synthesizing interactive 3D dynamics by imbuing static 3D objects with physically-based material properties learned from video generation models. |
Realistic object interaction is crucial for immersive virtual experiences. However, existing methods struggle to generate convincing action-conditioned dynamics that realistically capture how objects respond to external forces. |
PhysDreamer leverages the object dynamics priors learned by video generation models. It generates a plausible motion sequence for a static 3D object using a video generation model, then optimizes a spatially varying material field for the object. This optimization leverages differentiable simulation (MPM) and rendering to match the rendered object motion to the generated motion. |
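The optimization loop described above (simulate, compare against a reference motion, update the material parameter) can be illustrated on a toy problem. This sketch deliberately swaps in much simpler pieces: a single damped spring replaces the MPM simulation, one scalar stiffness replaces the spatially varying material field, a trajectory simulated with a known stiffness replaces the video-generated motion, and finite-difference gradients with backtracking replace differentiable simulation and rendering. All function names and constants are hypothetical.

```python
def simulate(stiffness, steps=200, dt=0.01, damping=0.1):
    """Semi-implicit Euler rollout of a unit-mass damped spring released at x = 1."""
    x, v = 1.0, 0.0
    trajectory = []
    for _ in range(steps):
        v += dt * (-stiffness * x - damping * v)
        x += dt * v
        trajectory.append(x)
    return trajectory


def motion_loss(stiffness, target):
    """Mean squared error between the simulated and the reference motion."""
    simulated = simulate(stiffness, steps=len(target))
    return sum((a - b) ** 2 for a, b in zip(simulated, target)) / len(target)


def fit_stiffness(target, init=2.0, iters=200, lr=5.0, h=1e-4):
    """Gradient descent on stiffness; a step is accepted only if the loss drops."""
    k = init
    loss = motion_loss(k, target)
    for _ in range(iters):
        # Central finite difference stands in for a differentiable simulator.
        grad = (motion_loss(k + h, target) - motion_loss(k - h, target)) / (2 * h)
        step = lr
        while step > 1e-8:
            candidate = k - step * grad
            if candidate > 0:
                candidate_loss = motion_loss(candidate, target)
                if candidate_loss < loss:
                    k, loss = candidate, candidate_loss
                    break
            step *= 0.5  # backtrack when the step does not improve the loss
    return k, loss
```

Calling `fit_stiffness(simulate(4.0))` starts from a stiffness of 2.0 and moves toward the value that reproduces the reference motion. In the actual method the same idea is applied to a full material field, with the rendered video of the simulated object compared against the generated video.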
PhysDreamer successfully synthesizes realistic interactive dynamics for various elastic objects, including flowers, a plant, a telephone cord, and a beanie hat.
User study results show that PhysDreamer significantly outperforms state-of-the-art methods in terms of motion realism and visual quality.
The method can benefit from multi-view supervision, improving results for objects with self-occlusion. |
The approach requires manual object segmentation and specification of boundary conditions.
The method is computationally demanding, requiring further optimization for real-time applications. |
3d object interaction, physics-based simulation, video generation, material estimation, differentiable rendering |
2404.13024
Report |
BANF: Band-limited Neural Fields for Levels of Detail Reconstruction |
Ahan Shabanov, Shrisudhan Govindarajan, Cody Reading, Lily Goli, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi |
Largely due to their implicit nature, neural fields lack a direct mechanism
for filtering, as Fourier analysis from discrete signal processing is not
directly applicable to these representations. Effective filtering of neural
fields is critical to enable level-of-detail processing in downstream
applications, and support operations that involve sampling the field on regular
grids (e.g. marching cubes). Existing methods that attempt to decompose neural
fields in the frequency domain either resort to heuristics or require extensive
modifications to the neural field architecture. We show that via a simple
modification, one can obtain neural fields that are low-pass filtered, and in
turn show how this can be exploited to obtain a frequency decomposition of the
entire signal. We demonstrate the validity of our technique by investigating
level-of-detail reconstruction, and showing how coarser representations can be
computed effectively. |
This paper introduces BANF, a method for band-limited frequency decomposition in neural fields using a sampling-aware training process that enables low-pass filtering. |
Effective filtering in neural fields is crucial for level-of-detail processing, anti-aliasing, and applications like marching cubes, but traditional Fourier analysis is not directly applicable. |
BANF samples the neural field on a regular grid, applies a band-limited interpolation kernel (e.g., linear, sinc), and incorporates this interpolated output into the training loss, approximating low-pass filtering during optimization. A cascaded training scheme then enables multi-scale representation. |
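The sampling-aware loss can be sketched in one dimension: evaluate the field on a regular grid, interpolate those samples at the query points, and compare the interpolated values (not the raw field) against the target. This is a minimal sketch under assumptions: an ordinary Python function stands in for the neural field, the domain is [0, 1], the kernel is linear, and the names `sample_on_grid`, `linear_interpolate`, and `band_limited_loss` are hypothetical.

```python
import math


def sample_on_grid(field, n_points):
    """Evaluate a 1D field at n_points evenly spaced positions on [0, 1]."""
    return [field(i / (n_points - 1)) for i in range(n_points)]


def linear_interpolate(samples, x):
    """Linearly interpolate the grid samples at a position x in [0, 1]."""
    n = len(samples)
    position = min(max(x, 0.0), 1.0) * (n - 1)
    index = min(int(position), n - 2)
    frac = position - index
    return (1.0 - frac) * samples[index] + frac * samples[index + 1]


def band_limited_loss(field, target, n_points, queries):
    """Mean squared error between the grid-interpolated field and the target."""
    samples = sample_on_grid(field, n_points)
    errors = [(linear_interpolate(samples, x) - target(x)) ** 2 for x in queries]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # A sine with 8 periods on [0, 1] is sampled at its zero crossings by a
    # 9-point grid, so the interpolated signal is almost exactly flat.
    wiggle = lambda x: math.sin(2 * math.pi * 8 * x)
    samples = sample_on_grid(wiggle, 9)
    print(max(abs(linear_interpolate(samples, q / 100)) for q in range(101)))
```

Because the loss only sees the interpolated signal, detail finer than the grid spacing receives no supervision at that level, which is the sense in which the output is low-pass filtered. Repeating the procedure with progressively finer grids gives the cascaded multi-scale scheme mentioned above.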
BANF successfully decomposes signals into frequency bands, enabling multi-scale reconstruction for images and signed distance fields (SDFs).
It outperforms baselines in level-of-detail surface reconstruction from multi-view images, especially at coarser scales, demonstrating its anti-aliasing capabilities.
The method is agnostic to the underlying neural field architecture, working with both fully-connected and hybrid representations. |
The current implementation is memory intensive at high resolutions.
The paper primarily focuses on uniformly sampled signals, and extending it to contracted representations used in NeRFs for unbounded signals is left for future work. |
neural fields, frequency decomposition, anti-aliasing, level-of-detail, multi-scale representation |
2404.13013
Report |
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models |
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi |
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded
and fine-grained visual perception ability. Beyond holistic image
understanding, Groma is adept at region-level tasks such as region captioning
and visual grounding. Such capabilities are built upon a localized visual
tokenization mechanism, where an image input is decomposed into regions of
interest and subsequently encoded into region tokens. By integrating region
tokens into user instructions and model responses, we seamlessly enable Groma
to understand user-specified region inputs and ground its textual output to
images. Besides, to enhance the grounded chat ability of Groma, we curate a
visually grounded instruction dataset by leveraging the powerful GPT-4V and
visual prompting techniques. Compared with MLLMs that rely on the language
model or external module for localization, Groma consistently demonstrates
superior performances in standard referring and grounding benchmarks,
highlighting the advantages of embedding localization into image tokenization.
Project page: https://groma-mllm.github.io/. |
Introducing Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception abilities for tasks like region captioning and visual grounding. |
Current MLLMs lack localization capabilities, limiting their real-world applications in areas like robotics and augmented reality. Groma addresses this by enabling region-level understanding and grounding. |
Groma integrates localized visual tokenization: an image is decomposed into regions of interest, encoded into region tokens, and integrated into user instructions and model responses. |
Outperforms comparable MLLMs on referring and grounding benchmarks.
Demonstrates strong image-level understanding and reasoning on conversational VQA benchmarks.
Exhibits robust and precise localization capabilities, surpassing alternative methods on the LVIS-Ground benchmark by a significant margin (over 10% AR). |
The current implementation does not support free-form region inputs or pixel-level grounding.
Future work involves exploring visual samplers for the region encoder and mask region proposers to address these limitations. |
multimodal large language models, visual grounding, region captioning, localized visual tokenization, grounded chat |
2404.12940
Report |
Neural Flow Diffusion Models: Learnable Forward Process for Improved Diffusion Modelling |
Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth |
Conventional diffusion models typically rely on a fixed forward process,
which implicitly defines complex marginal distributions over latent variables.
This can often complicate the reverse process' task in learning generative
trajectories, and results in costly inference for diffusion models. To address
these limitations, we introduce Neural Flow Diffusion Models (NFDM), a novel
framework that enhances diffusion models by supporting a broader range of
forward processes beyond the fixed linear Gaussian. We also propose a novel
parameterization technique for learning the forward process. Our framework
provides an end-to-end, simulation-free optimization objective, effectively
minimizing a variational upper bound on the negative log-likelihood.
Experimental results demonstrate NFDM's strong performance, evidenced by
state-of-the-art likelihood estimation. Furthermore, we investigate NFDM's
capacity for learning generative dynamics with specific characteristics, such
as deterministic straight-line trajectories. This exploration underscores
NFDM's versatility and its potential for a wide range of applications. |
The paper introduces Neural Flow Diffusion Models (NFDM), a novel framework that enhances diffusion models by allowing for flexible and learnable forward processes, going beyond fixed linear Gaussian processes. |
The fixed forward process in conventional diffusion models limits the flexibility of the latent space and complicates the learning process for the reverse process. NFDM addresses this limitation, leading to improved performance and versatility. |
NFDM implicitly defines the forward process through a learnable transformation. The paper proposes an end-to-end, simulation-free optimization procedure that minimizes a variational upper bound on the negative log-likelihood. |
NFDM achieves state-of-the-art likelihood estimation results on CIFAR-10, ImageNet 32, and ImageNet 64 datasets.
The framework allows for learning generative processes with specific characteristics, such as deterministic straight-line trajectories.
NFDM with curvature penalization on trajectories (NFDM-OT) demonstrates improved computational efficiency and enhanced generation quality with fewer sampling steps. |
The use of neural networks for parameterizing the forward process increases computational costs compared to conventional diffusion models.
The chosen Gaussian parameterization for the forward process, while effective, may not be optimal, and exploring alternative parameterizations is left for future research. |
diffusion models, generative models, variational inference, learnable forward process, likelihood estimation |
2404.12887
Report |
3D Multi-frame Fusion for Video Stabilization |
Zhan Peng, Xinyi Ye, Weiyue Zhao, Tianqi Liu, Huiqiang Sun, Baopu Li, Zhiguo Cao |
In this paper, we present RStab, a novel framework for video stabilization
that integrates 3D multi-frame fusion through volume rendering. Departing from
conventional methods, we introduce a 3D multi-frame perspective to generate
stabilized images, addressing the challenge of full-frame generation while
preserving structure. The core of our RStab framework lies in Stabilized
Rendering (SR), a volume rendering module that fuses multi-frame information in
3D space and extends beyond image fusion by incorporating feature fusion.
Specifically, SR involves warping features and colors
from multiple frames by projection, fusing them into descriptors to render the
stabilized image. However, the precision of warped information depends on the
projection accuracy, a factor significantly influenced by dynamic regions. In
response, we introduce the Adaptive Ray Range (ARR) module to integrate depth
priors, adaptively defining the sampling range for the projection process.
Additionally, we propose Color Correction (CC) assisting geometric constraints
with optical flow for accurate color aggregation. Thanks to the three modules,
our RStab demonstrates superior performance compared with previous stabilizers
in the field of view (FOV), image quality, and video stability across various
datasets. |
This paper introduces RStab, a novel video stabilization framework that uses 3D multi-frame fusion via volume rendering for full-frame generation and structure preservation. |
Existing 2D video stabilization methods struggle with either full-frame generation or preserving structure, while 3D methods often have limited field of view. RStab overcomes these limitations. |
RStab leverages Stabilized Rendering (SR), a 3D multi-frame fusion module based on volume rendering. It incorporates the Adaptive Ray Range (ARR) module for defining sampling ranges using depth priors and the Color Correction (CC) module for accurate color aggregation via optical flow. |
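For readers unfamiliar with volume rendering, the sketch below shows the standard compositing step along one ray, plus a simple way to restrict the sampling range around a depth prior. The compositing formula is the usual one (alpha from density, weights from accumulated transmittance). The `sample_depths` helper is only an assumed stand-in for the Adaptive Ray Range module: the symmetric `margin` around the prior is a hypothetical choice, and the per-sample colors are taken as given rather than fused from multiple frames.

```python
import math


def sample_depths(depth_prior, margin, n_samples):
    """Place samples at bin centers within [prior - margin, prior + margin]."""
    near = max(depth_prior - margin, 1e-6)
    far = depth_prior + margin
    step = (far - near) / n_samples
    return [near + (i + 0.5) * step for i in range(n_samples)], step


def composite(densities, colors, step):
    """Front-to-back alpha compositing of per-sample colors along a ray.

    Returns the rendered RGB color and the accumulated opacity.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rendered = [0.0, 0.0, 0.0]
    for sigma, color in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        rendered = [r + weight * c for r, c in zip(rendered, color)]
        transmittance *= 1.0 - alpha
    return rendered, 1.0 - transmittance
```

Narrowing the sampling range concentrates the samples near the expected surface, so fewer of them project onto unrelated content in the neighboring frames.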
RStab achieves full-frame video stabilization without aggressive cropping.
RStab outperforms previous state-of-the-art methods on various benchmark datasets (NUS, Selfie, DeepStab).
Ablation studies confirm the importance of each module (SR, ARR, CC) for achieving superior performance. |
The reliance on pre-trained depth and optical flow models might impact performance if those models fail.
Future work could explore joint optimization of depth/flow estimation with the proposed modules for better efficiency. |
video stabilization, 3d multi-frame fusion, volume rendering, structure preservation, full-frame generation |
2404.12803
Report |
TextSquare: Scaling up Text-Centric Visual Instruction Tuning |
Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang |
Text-centric visual question answering (VQA) has made great strides with the
development of Multimodal Large Language Models (MLLMs), yet open-source models
still fall short of leading models like GPT4V and Gemini, partly due to a lack
of extensive, high-quality instruction tuning data. To this end, we introduce a
new approach for creating a massive, high-quality instruction-tuning dataset,
Square-10M, which is generated using closed-source MLLMs. The data construction
process, termed Square, consists of four steps: Self-Questioning, Answering,
Reasoning, and Evaluation. Our experiments with Square-10M led to three key
findings: 1) Our model, TextSquare, considerably surpasses previous open-source
state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%).
It even outperforms top-tier models like GPT4V and Gemini in 6 of 10
text-centric benchmarks. 2) Additionally, we demonstrate the critical role of
VQA reasoning data in offering comprehensive contextual insights for specific
questions. This not only improves accuracy but also significantly mitigates
hallucinations. Specifically, TextSquare scores an average of 75.1% across four
general VQA and hallucination evaluation datasets, outperforming previous
state-of-the-art models. 3) Notably, the phenomenon observed in scaling
text-centric VQA datasets reveals a vivid pattern: the exponential increase of
instruction tuning data volume is directly proportional to the improvement in
model performance, thereby validating the necessity of the dataset scale and
the high quality of Square-10M. |
This paper introduces Square-10M, a large-scale, high-quality dataset for text-centric Visual Question Answering (VQA) instruction tuning, and TextSquare, a text-centric Multimodal Large Language Model (MLLM) trained on this dataset. |
Open-source MLLMs lag behind closed-source models in text-centric VQA due to the lack of extensive, high-quality instruction tuning data. This work aims to bridge this gap by providing such a dataset. |
The Square-10M dataset is created using a four-step process called Square: Self-Questioning, Answering, Reasoning, and Evaluation. This involves using a closed-source MLLM (Gemini Pro) to generate VQA pairs with reasoning and then filtering them for quality. TextSquare is then trained on Square-10M and a collection of in-domain datasets. |
TextSquare outperforms previous open-source text-centric MLLMs and achieves comparable or superior performance to state-of-the-art closed-source models on various benchmarks.
The inclusion of VQA reasoning data in Square-10M is shown to improve model performance and mitigate hallucinations.
Experiments reveal a scaling law: increasing the scale of instruction tuning data leads to better model performance, demonstrating the effectiveness and necessity of large, high-quality datasets like Square-10M. |
Training large-scale models on massive datasets requires significant computational resources.
While the Square strategy enhances data quality, it still falls short of human-level performance. |
multimodal large language models, text-centric visual question answering, instruction tuning, dataset creation, reasoning |
2404.12794
Report |
MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model |
Kang Zeng, Hao Shi, Jiacheng Lin, Siyu Li, Jintao Cheng, Kaiwei Wang, Zhiyong Li, Kailun Yang |
LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment
moving objects in point clouds of the current scan using motion information
from previous scans. Despite the promising results achieved by previous MOS
methods, several key issues, such as the weak coupling of temporal and spatial
information, still need further study. In this paper, we propose a novel
LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model,
termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue
Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial
information in point clouds and alleviate the issue of overlooked temporal
clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to
endow the model with the capacity to understand the temporal correlations of
the same object across different time steps. Specifically, MSSM emphasizes the
motion states of the same object at different time steps through two distinct
temporal modeling and correlation steps. We utilize an improved state space
model to represent these motion differences, effectively modeling the motion
states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road
benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art
performance. The source code of this work will be made publicly available at
https://github.com/Terminal-K/MambaMOS. |
This paper introduces MambaMOS, a novel LiDAR-based 3D Moving Object Segmentation framework with Motion-aware State Space Model to address the weak coupling of temporal and spatial information in existing methods. |
Moving object segmentation is crucial for autonomous driving systems, ensuring stable operation by providing accurate dynamic scene understanding and assisting in removing ghost effects during mapping. |
MambaMOS leverages a U-Net architecture with Time Clue Bootstrapping Embedding (TCBE) and a Motion-aware State Space Model (MSSM). TCBE enhances temporal-spatial coupling in shallow layers, while MSSM, based on the State Space Model, achieves deep-level coupling by interacting with single-scan and multi-scan features. |
MambaMOS achieves state-of-the-art performance on SemanticKITTI-MOS and KITTI-Road benchmarks.
It effectively segments distant moving objects even with sparse point clouds by emphasizing temporal information.
The method shows strong generalization ability, achieving superior results on KITTI-Road after fine-tuning with limited data. |
The reliance on accurate pose information for scan alignment.
Further exploration of more effective serialization techniques for better capturing spatial context. |
moving object segmentation, state space model, spatio-temporal fusion, lidar point cloud, autonomous driving |
2404.12784
Report |
Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation |
Myrna C. Silva, Mahtab Dahaghin, Matteo Toso, Alessio Del Bue |
We introduce Contrastive Gaussian Clustering, a novel approach capable of
providing segmentation masks from any viewpoint and of enabling 3D segmentation
of the scene. Recent works in novel-view synthesis have shown how to model the
appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate
images from a given viewpoint by projecting on it the Gaussians before $\alpha$
blending their color. Following this example, we train a model to include also
a segmentation feature vector for each Gaussian. These can then be used for 3D
scene segmentation, by clustering Gaussians according to their feature vectors;
and to generate 2D segmentation masks, by projecting the Gaussians on a plane
and $\alpha$ blending over their segmentation features. Using a combination of
contrastive learning and spatial regularization, our method can be trained on
inconsistent 2D segmentation masks, and still learn to generate segmentation
masks consistent across all views. Moreover, the resulting model is extremely
accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the
state of the art. Code and trained models will be released soon. |
Introduces Contrastive Gaussian Clustering, a novel method for 3D scene segmentation using 3D Gaussian Splatting with a 3D feature field and contrastive learning. |
Addresses the challenge of limited annotated 3D scene datasets by leveraging readily available 2D image segmentation data and handles inconsistent 2D masks to learn consistent 3D segmentation. |
Augments 3D Gaussians with feature vectors, uses contrastive learning to maximize feature similarity within segments and minimize it between segments, and employs spatial regularization for feature continuity. |
Significantly outperforms LERF, Gaussian Grouping, and LangSplat in mIoU and mBIoU on LERF-Mask and 3D-OVS datasets.
Learns multi-view consistency, enabling accurate 3D segmentation from inconsistent 2D masks.
Enables real-time rendering of novel segmentation masks and 3D object selection. |
Higher computational cost compared to standard 3DGS.
Performance depends on the accuracy of initial 2D segmentations and object localization. |
3d scene segmentation, contrastive learning, 3d gaussian splatting, novel view synthesis, weakly supervised learning |
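The abstract above describes rendering 2D segmentation masks by projecting the Gaussians and $\alpha$ blending their segmentation features. A minimal sketch of that blending step for a single pixel follows, assuming the Gaussians are already projected and sorted from near to far. The function name and input layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def alpha_blend_features(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian features for one pixel.

    features: (N, D) feature vectors, sorted near to far (assumed layout).
    alphas:   (N,) opacities in [0, 1] after projection (assumed given).
    """
    features = np.asarray(features, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching Gaussian i: product of (1 - alpha_j) for all j in front of it.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    # Blended feature is the weighted sum of the per-Gaussian features.
    return weights @ features

# Two Gaussians with opacity 0.5 each: weights are 0.5 and 0.25.
pixel_feature = alpha_blend_features([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

The same weights are used for color in standard Gaussian Splatting rendering, which is why a segmentation feature can be rendered alongside the image at little extra cost.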
2404.12547
Report |
Does Gaussian Splatting need SFM Initialization? |
Yalda Foroutan, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi |
3D Gaussian Splatting has recently been embraced as a versatile and effective
method for scene reconstruction and novel view synthesis, owing to its
high-quality results and compatibility with hardware rasterization. Despite its
advantages, Gaussian Splatting's reliance on high-quality point cloud
initialization by Structure-from-Motion (SFM) algorithms is a significant
limitation to be overcome. To this end, we investigate various initialization
strategies for Gaussian Splatting and delve into how volumetric reconstructions
from Neural Radiance Fields (NeRF) can be utilized to bypass the dependency on
SFM data. Our findings demonstrate that random initialization can perform much
better if carefully designed and that by employing a combination of improved
initialization strategies and structure distillation from low-cost NeRF models,
it is possible to achieve equivalent results, or at times even superior, to
those obtained from SFM initialization. |
This paper investigates initialization strategies for 3D Gaussian Splatting, aiming to remove the dependence on Structure-from-Motion (SfM) data by leveraging Neural Radiance Fields (NeRF). |
Gaussian Splatting relies on high-quality point cloud initialization from SfM, which is computationally expensive and can be unreliable in certain scenarios like SLAM or autonomous vehicle applications. |
The authors experiment with different initialization strategies: 1) improved random initialization within a large bounding box, 2) point cloud initialization from a pre-trained NeRF model, and 3) depth supervision from a pre-trained NeRF model during Gaussian Splatting training. |
Carefully designed random initialization, specifically a large uniform initialization, outperforms previous attempts and achieves competitive results.
Initializing Gaussian Splatting with points sampled from a pre-trained NeRF model surpasses random initialization and, in some cases, even outperforms SfM initialization.
Adding depth supervision from the pre-trained NeRF model further improves the performance of Gaussian Splatting, achieving the best overall results. |
The proposed method still requires camera calibration, which is often obtained from SfM.
The performance of the NeRF pre-training can be sensitive to the scene, requiring further research to automate the NeRF configuration process. |
gaussian splatting, nerf, sfm, initialization, depth supervision |
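The first strategy listed above, random initialization within a large bounding box, can be sketched as follows. This is only an illustration under stated assumptions: the box is derived from the camera centers, and the enlargement factor `scale`, the point count, and the function name are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def uniform_init_points(camera_centers, num_points=10000, scale=3.0, seed=0):
    """Sample initial Gaussian centers uniformly in an enlarged bounding box.

    camera_centers: (M, 3) camera positions, assumed known from calibration.
    scale: hypothetical factor by which the camera bounding box is enlarged.
    """
    cams = np.asarray(camera_centers, dtype=float)
    lo, hi = cams.min(axis=0), cams.max(axis=0)
    center = (lo + hi) / 2.0
    half_extent = (hi - lo) / 2.0 * scale
    rng = np.random.default_rng(seed)
    # Low and high bounds broadcast per axis across the (num_points, 3) output.
    return rng.uniform(center - half_extent, center + half_extent, size=(num_points, 3))

points = uniform_init_points([[-1, -1, -1], [1, 1, 1]], num_points=100)
```

A box much larger than the camera hull gives distant background regions some initial Gaussians, which is the intuition behind the "large uniform initialization" reported in the results.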
2404.12541
Report |
GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models |
Sai Sree Harsha, Ambareesh Revanur, Dhwanit Agarwal, Shradha Agrawal |
Video editing methods based on diffusion models that rely solely on a text
prompt for the edit are hindered by the limited expressive power of text
prompts. Thus, incorporating a reference target image as a visual guide becomes
desirable for precise control over edit. Also, most existing methods struggle
to accurately edit a video when the shape and size of the object in the target
image differ from the source object. To address these challenges, we propose
"GenVideo" for editing videos leveraging target-image aware T2I models. Our
approach handles edits with target objects of varying shapes and sizes while
maintaining the temporal consistency of the edit using our novel target and
shape aware InvEdit masks. Further, we propose a novel target-image aware
latent noise correction strategy during inference to improve the temporal
consistency of the edits. Experimental analyses indicate that GenVideo can
effectively handle edits with objects of varying shapes, where existing
approaches fail. |
This paper introduces GenVideo, a novel framework for editing videos using target-image aware text-to-image (T2I) diffusion models. |
Existing video editing methods based on diffusion models struggle to make temporally consistent edits when the shape and size of the object in the target image differ from the source object. They are also often limited by the expressive power of text prompts. |
GenVideo leverages target-image aware T2I models and introduces two novel components: InvEdit and latent correction. InvEdit generates target-image and shape-aware masks to identify regions of interest. Latent correction improves temporal consistency by blending inter-frame latents. |
GenVideo can effectively handle edits with objects of varying shapes and sizes, outperforming existing methods.
InvEdit masks accurately identify regions of interest, enabling localized edits.
Latent correction strategy improves the temporal consistency of edits, even for objects with substantial shape differences. |
The quality of edits is limited by the underlying T2I model.
Fine-grained inconsistencies may remain, especially for complex objects. |
video editing, diffusion models, target-image awareness, temporal consistency, invedit |
2404.12391
Report |
On the Content Bias in Fréchet Video Distance |
Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang |
Fréchet Video Distance (FVD), a prominent metric for evaluating video
generation models, is known to conflict with human perception occasionally. In
this paper, we aim to explore the extent of FVD's bias toward per-frame quality
over temporal realism and identify its sources. We first quantify the FVD's
sensitivity to the temporal axis by decoupling the frame and motion quality and
find that the FVD increases only slightly with large temporal corruption. We
then analyze the generated videos and show that via careful sampling from a
large set of generated videos that do not contain motions, one can drastically
decrease FVD without improving the temporal quality. Both studies suggest FVD's
bias towards the quality of individual frames. We further observe that the bias
can be attributed to the features extracted from a supervised video classifier
trained on the content-biased dataset. We show that FVD with features extracted
from the recent large-scale self-supervised video models is less biased toward
image quality. Finally, we revisit a few real-world examples to validate our
hypothesis. |
This paper presents a systematic study quantifying the bias of Fréchet Video Distance (FVD) towards per-frame quality over temporal realism in video generation. |
Accurately evaluating the quality and diversity of generated videos is crucial with the rapid progress in video generation, and understanding the limitations of widely used metrics like FVD is essential. |
The authors analyze FVD's sensitivity to temporal aspects by: (1) Distorting videos with controlled spatial and spatiotemporal corruptions, (2) Probing the perceptual null space by resampling generated videos to minimize FVD without improving temporal quality, and (3) Examining real-world examples where FVD contradicts human perception. |
FVD exhibits low sensitivity to temporal inconsistencies, often favoring videos with better frame quality over temporal realism.
Resampling generated videos without motion can still significantly reduce FVD, indicating a large perceptual null space where temporal quality is disregarded.
FVD computed with features from self-supervised video models (e.g., VideoMAE-v2) trained on diverse datasets is less biased towards frame quality and more sensitive to temporal inconsistencies. |
The impact of resizing high-resolution generated videos and handling non-square aspect ratios on FVD remains unexplored.
Computing FVD with self-supervised features for long videos is computationally expensive and requires further investigation. |
video generation, evaluation metrics, fréchet video distance (fvd), content bias, self-supervised learning |
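For context, FVD is a Fréchet distance between two Gaussians fitted to video features: $d^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)$. The sketch below computes only this distance from given means and covariances. Feature extraction, which is where the bias discussed above enters, is omitted, and the function name is an illustrative assumption.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1, sigma2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative for positive semi-definite S1, S2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

identity = np.eye(2)
shifted = frechet_distance([0, 0], identity, [3, 4], identity)  # mean shift only
```

Because the statistics are computed on features, swapping the feature extractor (for example a self-supervised video model instead of a supervised classifier) changes what the metric is sensitive to without changing this formula.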
2404.12390
Report |
BLINK: Multimodal Large Language Models Can See but Not Perceive |
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna |
We introduce Blink, a new benchmark for multimodal language models (LLMs)
that focuses on core visual perception abilities not found in other
evaluations. Most of the Blink tasks can be solved by humans "within a blink"
(e.g., relative depth estimation, visual correspondence, forensics detection,
and multi-view reasoning). However, we find these perception-demanding tasks
pose significant challenges for current multimodal LLMs because they resist
mediation through natural language. Blink reformats 14 classic computer vision
tasks into 3,807 multiple-choice questions, paired with single or multiple
images and visual prompting. While humans get 95.70% accuracy on average, Blink
is surprisingly challenging for existing multimodal LLMs: even the
best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only
13.17% and 7.63% higher than random guessing, indicating that such perception
abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also
highlights that specialist CV models could solve these problems much better,
suggesting potential pathways for future improvements. We believe Blink will
stimulate the community to help multimodal LLMs catch up with human-level
visual perception. |
BLINK, a new benchmark for multimodal language models (LLMs), focuses on core visual perception abilities like depth estimation, correspondence, and 3D reasoning, which are often overlooked in other evaluations. |
Existing multimodal LLM benchmarks often conflate perception with language knowledge and reasoning, primarily evaluating perception as a dense captioning task. BLINK aims to highlight and assess the nuanced perception capabilities of LLMs, going beyond recognition-based tasks. |
BLINK reimagines 14 classic computer vision problems, ranging from low-level pattern matching to high-level visual understanding, into 3,807 multiple-choice questions paired with images and visual prompts. These tasks are designed to be easily solvable by humans but difficult to address through dense captioning alone. |
Humans achieve 95.70% average accuracy on BLINK, while even the best-performing LLMs (GPT-4V, Gemini) struggle, achieving accuracies of 51.26% and 45.72% respectively.
Multimodal LLMs show relative strengths in mid-level perception tasks like spatial reasoning and counting but struggle with pixel-level tasks like relative reflectance.
Specialist computer vision models significantly outperform LLMs on BLINK tasks, suggesting potential for improvement by integrating insights from specialized models. |
BLINK relies on existing image datasets and may not encompass all real-world visual perception abilities.
Future work could explore incorporating a wider range of visual perception tasks and developing novel evaluation metrics that better capture the nuanced aspects of visual understanding in LLMs. |
multimodal llms, visual perception, benchmarking, computer vision, artificial intelligence |
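The accuracy figures quoted in the abstract above can be cross-checked with a little arithmetic: subtracting each model's margin over random guessing from its accuracy should give the same chance-level baseline. The snippet below uses only numbers stated in the abstract; the variable names are illustrative.

```python
# (accuracy %, margin over random guessing %) as stated in the abstract.
reported = {"GPT-4V": (51.26, 13.17), "Gemini": (45.72, 7.63)}
human_accuracy = 95.70

# Chance-level accuracy implied by each model's reported margin.
implied_chance = {name: round(acc - margin, 2) for name, (acc, margin) in reported.items()}

# Remaining gap to human accuracy, in percentage points.
human_gap = {name: round(human_accuracy - acc, 2) for name, (acc, _) in reported.items()}
```

Both models imply the same random-guessing baseline of about 38.09%, so the two margins are mutually consistent, and the gap to human accuracy remains above 44 percentage points for both.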
2404.12389
Report |
Moving Object Segmentation: All You Need Is SAM (and Flow) |
Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman |
The objective of this paper is motion segmentation -- discovering and
segmenting the moving objects in a video. This is a much studied area with
numerous careful, and sometimes complex, approaches and training schemes
including: self-supervised learning, learning from synthetic datasets,
object-centric representations, amodal representations, and many more. Our
interest in this paper is to determine if the Segment Anything model (SAM) can
contribute to this task. We investigate two models for combining SAM with
optical flow that harness the segmentation power of SAM with the ability of
flow to discover and group moving objects. In the first model, we adapt SAM to
take optical flow, rather than RGB, as an input. In the second, SAM takes RGB
as an input, and flow is used as a segmentation prompt. These surprisingly
simple methods, without any further modifications, outperform all previous
approaches by a considerable margin in both single and multi-object benchmarks.
We also extend these frame-level segmentations to sequence-level segmentations
that maintain object identity. Again, this simple model outperforms previous
methods on multiple video object segmentation benchmarks. |
This paper explores adapting the Segment Anything Model (SAM) for moving object segmentation in videos, introducing two methods: FlowSAM, which uses optical flow as input, and MotionSAM, which uses optical flow as a prompt for guiding SAM on RGB inputs. |
Moving object segmentation is a challenging task, and SAM, despite its success in image segmentation, needs adaptation for video. This paper investigates simple yet effective ways to leverage SAM’s power for this task. |
The paper introduces FlowSAM, which fine-tunes SAM on optical flow inputs, and MotionSAM, which uses a trainable prompt generator to feed flow-derived prompts to SAM processing RGB frames. They further propose a sequence-level mask association method for maintaining object identity across frames. |
FlowSAM with flow-only inputs outperforms previous methods by a large margin (>10%) on moving object segmentation benchmarks.
MotionSAM, using RGB+flow, achieves state-of-the-art performance, especially excelling at multi-object benchmarks.
Combining FlowSAM and MotionSAM further boosts performance, demonstrating the complementary roles of flow and RGB modalities. |
The methods suffer from extended running time due to SAM’s computationally heavy image encoder.
The sequence-wise association, while strong, can be improved with longer temporal context. |
motion segmentation, video object segmentation, segment anything model (sam), optical flow, motion-based object discovery |
2404.12388
Report |
VideoGigaGAN: Towards Detail-rich Video Super-Resolution |
Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu |
Video super-resolution (VSR) approaches have shown impressive temporal
consistency in upsampled videos. However, these approaches tend to generate
blurrier results than their image counterparts as they are limited in their
generative capability. This raises a fundamental question: can we extend the
success of a generative image upsampler to the VSR task while preserving the
temporal consistency? We introduce VideoGigaGAN, a new generative VSR model
that can produce videos with high-frequency details and temporal consistency.
VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply
inflating GigaGAN to a video model by adding temporal modules produces severe
temporal flickering. We identify several key issues and propose techniques that
significantly improve the temporal consistency of upsampled videos. Our
experiments show that, unlike previous VSR methods, VideoGigaGAN generates
temporally consistent videos with more fine-grained appearance details. We
validate the effectiveness of VideoGigaGAN by comparing it with
state-of-the-art VSR models on public datasets and showcasing video results
with $8\times$ super-resolution. |
Introduces VideoGigaGAN, the first large-scale GAN-based model for video super-resolution, which produces videos with both high-frequency details and temporal consistency. |
Existing VSR models struggle to balance temporal consistency with generating realistic high-frequency details. |
Building upon GigaGAN, the authors add: 1) temporal modules (convolution and attention) to the decoder, 2) a flow-guided feature propagation module, 3) anti-aliasing blocks in the encoder, and 4) a high-frequency shuttle mechanism. |
VideoGigaGAN generates sharper, more detailed videos than state-of-the-art VSR methods.
The model successfully performs 8x video upsampling with good detail and temporal consistency.
A new metric, Referenced Warping Error (RWE), is proposed for evaluating temporal consistency in VSR. |
Struggles with extremely long videos due to optical flow inaccuracies.
Performance degrades with very small objects (e.g. text) due to information loss in the LR input. |
video super-resolution, generative adversarial networks, temporal consistency, high-frequency details, gigagan |
2404.12387
Report |
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models |
Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie |
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal
language models trained from scratch by Reka. Reka models are able to process
and reason with text, images, video, and audio inputs. This technical report
discusses details of training some of these models and provides comprehensive
evaluation results. We show that Reka Edge and Reka Flash are not only
state-of-the-art but also outperform many much larger models, delivering
outsized values for their respective compute class. Meanwhile, our most capable
and largest model, Reka Core, approaches the best frontier models on both
automatic evaluations and blind human evaluations. On image question answering
benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V.
Meanwhile, on multimodal chat, Core ranks as the second most preferred model
under a blind third-party human evaluation setup, outperforming other models
such as Claude 3 Opus. On text benchmarks, Core not only performs competitively
to other frontier models on a set of well-established benchmarks (e.g. MMLU,
GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question
answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped
in production at http://chat.reka.ai . A showcase of non cherry picked
qualitative examples can also be found at http://showcase.reka.ai . |
This paper introduces Reka Core, Flash, and Edge, a series of multimodal language models (MLLMs) trained from scratch by Reka. |
Reka models are important because they are state-of-the-art for their compute class, outperforming many much larger models. They can process text, images, video, and audio, achieving competitive performance to other frontier models on various benchmarks. |
The models use a modular encoder-decoder transformer architecture, trained on a massive dataset of text and multimodal data. They are aligned and instruction-tuned using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). |
Reka Core approaches the performance of GPT-4V on image question answering and outperforms Claude 3 Opus on multimodal chat.
Reka Flash outperforms GPT-3.5 Turbo and models much larger in size, like Grok-1 and Gemini Pro 1.0.
Reka Edge surpasses other state-of-the-art 7B models such as Gemma 7B and Mistral 7B. |
Reka Core has not finished training and is still being improved.
Limited details about the tool-use, function calling, and web search capabilities are provided in the report. |
multimodal language models, large language models, computer vision, natural language processing, benchmarking |
2404.12386
Report |
SOHES: Self-supervised Open-world Hierarchical Entity Segmentation |
Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang |
Open-world entity segmentation, as an emerging computer vision task, aims at
segmenting entities in images without being restricted by pre-defined classes,
offering impressive generalization capabilities on unseen images and concepts.
Despite its promise, existing entity segmentation methods like Segment Anything
Model (SAM) rely heavily on costly expert annotators. This work presents
Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel
approach that eliminates the need for human annotations. SOHES operates in
three phases: self-exploration, self-instruction, and self-correction. Given a
pre-trained self-supervised representation, we produce abundant high-quality
pseudo-labels through visual feature clustering. Then, we train a segmentation
model on the pseudo-labels, and rectify the noises in pseudo-labels via a
teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES
also captures their constituent parts, providing a hierarchical understanding
of visual entities. Using raw images as the sole training data, our method
achieves unprecedented performance in self-supervised open-world segmentation,
marking a significant milestone towards high-quality open-world entity
segmentation in the absence of human-annotated masks. Project page:
https://SOHES.github.io. |
This paper introduces SOHES, a self-supervised open-world hierarchical entity segmentation approach that eliminates the reliance on human annotations. |
Existing open-world entity segmentation models rely heavily on costly human-annotated datasets, limiting their scalability and practicality. |
SOHES operates in three self-supervised phases: 1) Self-exploration: generates initial pseudo-labels by clustering visual features from a pre-trained DINO representation. 2) Self-instruction: trains a segmentation model (DINO backbone + Mask2Former) on the pseudo-labels to refine segmentation. 3) Self-correction: further refines the model using a teacher-student mutual-learning framework. |
SOHES achieves state-of-the-art performance in self-supervised open-world segmentation, significantly closing the gap with supervised methods.
The method effectively segments both whole entities and their constituent parts, providing a hierarchical understanding of visual scenes.
SOHES-trained ViT backbones demonstrate improved performance on downstream dense prediction tasks like semantic segmentation and object detection. |
SOHES may struggle with discontinuous or occluded entities, text overlays, and blurry backgrounds.
Future work will explore improved pseudo-labeling strategies to address these limitations. |
self-supervised learning, open-world segmentation, hierarchical segmentation, entity segmentation, teacher-student learning |
2404.12385
Report |
MeshLRM: Large Reconstruction Model for High-Quality Mesh |
Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, Zexiang Xu |
We propose MeshLRM, a novel LRM-based approach that can reconstruct a
high-quality mesh from merely four input images in less than one second.
Different from previous large reconstruction models (LRMs) that focus on
NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction
and rendering within the LRM framework. This allows for end-to-end mesh
reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering.
Moreover, we improve the LRM architecture by simplifying several complex
designs in previous LRMs. MeshLRM's NeRF initialization is sequentially trained
with low- and high-resolution images; this new LRM training strategy enables
significantly faster convergence and thereby leads to better quality with less
compute. Our approach achieves state-of-the-art mesh reconstruction from
sparse-view inputs and also allows for many downstream applications, including
text-to-3D and single-image-to-3D generation. Project page:
https://sarahweiii.github.io/meshlrm/ |
Presents MeshLRM, a novel LRM-based framework that integrates differentiable mesh extraction and rendering for end-to-end few-shot high-quality mesh reconstruction. |
High-quality 3D meshes are essential for various applications, and existing methods for mesh reconstruction are either time-consuming or require dense input images. |
The method leverages a transformer-based LRM architecture with simplified image tokenization and triplane decoding. It incorporates differentiable marching cubes and rendering for end-to-end mesh optimization and introduces a ray opacity loss to stabilize training. |
Achieves state-of-the-art mesh reconstruction from sparse-view inputs, outperforming existing feed-forward and optimization-based methods.
Significantly faster than per-scene optimization methods, enabling mesh reconstruction in less than one second.
Demonstrates strong generalization ability on real datasets and enables high-quality text-to-3D and image-to-3D generation. |
Limited robustness for scenes with complex materials due to the assumption of Lambertian appearance.
Requires input camera poses, which can be challenging to obtain accurately for real captures. |
sparse-view reconstruction, high-quality mesh, large reconstruction models, differentiable rendering, 3d generation |
2404.12382
Report |
Lazy Diffusion Transformer for Interactive Image Editing |
Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi |
We introduce a novel diffusion transformer, LazyDiffusion, that generates
partial image updates efficiently. Our approach targets interactive image
editing applications in which, starting from a blank canvas or an image, a user
specifies a sequence of localized image modifications using binary masks and
text prompts. Our generator operates in two phases. First, a context encoder
processes the current canvas and user mask to produce a compact global context
tailored to the region to generate. Second, conditioned on this context, a
diffusion-based transformer decoder synthesizes the masked pixels in a "lazy"
fashion, i.e., it only generates the masked region. This contrasts with
previous works that either regenerate the full canvas, wasting time and
computation, or confine processing to a tight rectangular crop around the mask,
ignoring the global image context altogether. Our decoder's runtime scales with
the mask size, which is typically small, while our encoder introduces
negligible overhead. We demonstrate that our approach is competitive with
state-of-the-art inpainting methods in terms of quality and fidelity while
providing a 10x speedup for typical user interactions, where the editing mask
represents 10% of the image. |
Introduces LazyDiffusion, a diffusion transformer model that efficiently generates partial image updates for interactive image editing by processing only masked regions. |
Existing diffusion-based inpainting methods are computationally expensive, regenerating the entire image or relying on limited local context, hindering interactivity and global consistency. |
LazyDiffusion employs an encoder-decoder architecture. The encoder compresses the full image and mask into a compact global context. The decoder, a diffusion transformer, then iteratively generates only the masked pixels conditioned on this context and the text prompt. |
Achieves up to 10x speedup over full-image inpainting methods for small masks typical in interactive editing.
Maintains competitive image quality and fidelity compared to state-of-the-art inpainting models.
Demonstrates the effectiveness of compressed global context in preserving semantic information for coherent inpainting. |
The context encoder's quadratic scaling with input size may limit scalability to very high-resolution images.
Occasional color discrepancies between generated and visible regions require further investigation for more principled solutions. |
image inpainting, diffusion models, transformers, interactive image editing, context encoding |
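The claim in the LazyDiffusion entry above, that decoder runtime scales with mask size, can be illustrated with a small token-counting sketch. It assumes a hypothetical patch grid and simply counts the patches that intersect the user mask; the function name, canvas size, and patch size are illustrative choices, not details from the paper.

```python
def masked_patch_indices(mask, patch):
    """Return sorted indices of patches that contain at least one masked pixel.

    mask  : list of rows, each a list of 0/1 values (1 = pixel to generate)
    patch : side length of a square patch in pixels (assumed to divide H and W)
    """
    height, width = len(mask), len(mask[0])
    cols = width // patch
    selected = set()
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Row-major index of the patch that owns pixel (y, x).
                selected.add((y // patch) * cols + (x // patch))
    return sorted(selected)


# A 64x64 canvas split into 4x4-pixel patches gives a 16x16 grid of tokens.
H = W = 64
PATCH = 4
mask = [[0] * W for _ in range(H)]
# Mark a 20x20-pixel square, roughly 10% of the canvas.
for y in range(8, 28):
    for x in range(8, 28):
        mask[y][x] = 1

tokens_total = (H // PATCH) * (W // PATCH)
tokens_lazy = len(masked_patch_indices(mask, PATCH))
print(tokens_total, tokens_lazy)
```

A decoder whose cost grows with the number of generated tokens would handle 25 of the 256 tokens in this toy setup. The actual speedup depends on the attention cost and on the encoder overhead, neither of which this sketch models.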
2404.12352
Report |
Point-In-Context: Understanding Point Cloud via In-Context Learning |
Mengyuan Liu, Zhongbin Fang, Xia Li, Joachim M. Buhmann, Xiangtai Li, Chen Change Loy |
With the emergence of large-scale models trained on diverse datasets,
in-context learning has emerged as a promising paradigm for multitasking,
notably in natural language processing and image processing. However, its
application in 3D point cloud tasks remains largely unexplored. In this work,
we introduce Point-In-Context (PIC), a novel framework for 3D point cloud
understanding via in-context learning. We address the technical challenge of
effectively extending masked point modeling to 3D point clouds by introducing a
Joint Sampling module and proposing a vanilla version of PIC called
Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model
for various 3D point cloud tasks, with inputs and outputs modeled as
coordinates. In this paradigm, the challenging segmentation task is achieved by
assigning label points with XYZ coordinates for each category; the final
prediction is then chosen based on the label point closest to the predictions.
To overcome the limitation imposed by the fixed label-coordinate assignment,
which generalizes poorly to novel classes, we propose two novel training
strategies, In-Context Labeling and In-Context Enhancing, forming an extended
version of PIC named Point-In-Context-Segmenter (PIC-S), which aims to improve
dynamic context labeling and model training. By utilizing dynamic in-context
labels and extra in-context pairs, PIC-S achieves enhanced performance and
generalization capability in and across part segmentation datasets. PIC is a
general framework so that other tasks or datasets can be seamlessly introduced
into our PIC through a unified data format. We conduct extensive experiments to
validate the versatility and adaptability of our proposed methods in handling a
wide range of tasks and segmenting multiple datasets. PIC-S is capable of
generalizing to unseen datasets and performing novel part segmentation by
customizing prompts. |
This paper introduces Point-In-Context (PIC), the first in-context learning framework for 3D point cloud understanding. |
In-context learning enables efficient model adaptation and generalization without parameter updates, addressing resource constraints associated with large-scale model fine-tuning. |
PIC leverages a Joint Sampling module to overcome information leakage and data disarray in 3D point clouds. Two versions are proposed: PIC-G for multitasking and PIC-S for part segmentation. PIC-S further introduces In-Context Labeling and In-Context Enhancing strategies for dynamic context-aware segmentation. |
PIC-G achieves state-of-the-art results on a multitask benchmark comprising reconstruction, denoising, registration, and part segmentation tasks.
PIC-S outperforms existing methods on a large-scale Human & Object Segmentation benchmark.
PIC-S demonstrates strong generalization capabilities, effectively segmenting unseen datasets like AKB-48. |
The performance of PIC is dependent on the quality of prompts, suggesting potential improvements through better prompt selection.
The random label assignment in PIC-S, while enabling generalization, can pose challenges for model training. |
in-context learning, point cloud analysis, multi-task learning, part segmentation, 3d vision |
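The PIC-G entry above describes segmentation as coordinate regression: each category is assigned a label point in XYZ space, and the final class is the label point closest to the prediction. A minimal sketch of that decoding step follows. The category names and label coordinates are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical label points: one fixed XYZ coordinate per part category.
LABEL_POINTS = {
    "wing": (1.0, 0.0, 0.0),
    "body": (0.0, 1.0, 0.0),
    "tail": (0.0, 0.0, 1.0),
}


def decode_labels(predicted_points, label_points):
    """Map each predicted XYZ coordinate to the category of its nearest label point."""
    names = list(label_points)
    decoded = []
    for point in predicted_points:
        # Euclidean distance to every label point; keep the closest category.
        nearest = min(names, key=lambda name: math.dist(point, label_points[name]))
        decoded.append(nearest)
    return decoded


predictions = [(0.9, 0.1, 0.0), (0.2, 0.7, 0.1), (0.1, 0.2, 0.8)]
decoded = decode_labels(predictions, LABEL_POINTS)
print(decoded)
```

Because the mapping from category to coordinate is fixed in this sketch, it also shows why the assignment cannot cover classes unseen during training, which is the limitation that the dynamic labels of PIC-S are meant to address.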
2404.12347
Report |
AniClipart: Clipart Animation with Text-to-Video Priors |
Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao |
Clipart, a pre-made graphic art form, offers a convenient and efficient way
of illustrating visual content. Traditional workflows to convert static clipart
images into motion sequences are laborious and time-consuming, involving
numerous intricate steps like rigging, key animation and in-betweening. Recent
advancements in text-to-video generation hold great potential in resolving this
problem. Nevertheless, direct application of text-to-video generation models
often struggles to retain the visual identity of clipart images or generate
cartoon-style motions, resulting in unsatisfactory animation outcomes. In this
paper, we introduce AniClipart, a system that transforms static clipart images
into high-quality motion sequences guided by text-to-video priors. To generate
cartoon-style and smooth motion, we first define Bézier curves over
keypoints of the clipart image as a form of motion regularization. We then
align the motion trajectories of the keypoints with the provided text prompt by
optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes
adequate knowledge of natural motion within a pretrained text-to-video
diffusion model. With a differentiable As-Rigid-As-Possible shape deformation
algorithm, our method can be end-to-end optimized while maintaining deformation
rigidity. Experimental results show that the proposed AniClipart consistently
outperforms existing image-to-video generation models, in terms of text-video
alignment, visual identity preservation, and motion consistency. Furthermore,
we showcase the versatility of AniClipart by adapting it to generate a broader
array of animation formats, such as layered animation, which allows topological
changes. |
AniClipart is a system that transforms static clipart images into high-quality motion sequences guided by text prompts, preserving visual identity and achieving motion consistency. |
Automating clipart animation is crucial due to the increasing demand and the labor-intensive nature of traditional methods. Existing text-to-video models struggle to retain clipart's visual style and generate cartoon-style motions. |
AniClipart defines keypoints on clipart, assigns Bézier curve trajectories, and leverages Video Score Distillation Sampling (VSDS) loss to optimize trajectories based on text-to-video priors. A differentiable As-Rigid-As-Possible deformation maintains shape rigidity during animation. |
AniClipart outperforms existing image-to-video generation models in text-video alignment, visual identity preservation, and motion consistency.
Ablation studies confirm the importance of ARAP deformation, Bézier-driven animation, skeleton loss, and VSDS loss for high-quality results.
AniClipart is extended to handle layered animation, accommodating topological changes for more complex animations. |
AniClipart's animation diversity is limited by the capabilities of current text-to-video models.
Generating motions that significantly deviate from the initial clipart pose remains challenging due to limitations in video model capacity. |
clipart animation, text-to-video generation, score distillation sampling, as-rigid-as-possible deformation, bézier curves |
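The AniClipart entry above regularizes motion by moving each keypoint along a Bézier curve. The sketch below evaluates a cubic Bézier trajectory for a single keypoint. It assumes cubic curves and fixed control points for illustration; in the described method the trajectories are optimized with the VSDS loss, which is not modeled here.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bézier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis weights; they sum to 1 for every t.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def keypoint_trajectory(control_points, num_frames):
    """Sample one keypoint position per frame along the curve (num_frames >= 2)."""
    p0, p1, p2, p3 = control_points
    return [
        cubic_bezier(p0, p1, p2, p3, frame / (num_frames - 1))
        for frame in range(num_frames)
    ]


# Hypothetical control points for a single keypoint.
controls = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
trajectory = keypoint_trajectory(controls, 5)
for position in trajectory:
    print(position)
```

The curve starts at the first control point and ends at the last, so the per-frame positions stay smooth regardless of how the two inner control points are moved during optimization.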
2404.12333
Report |
Customizing Text-to-Image Diffusion with Camera Viewpoint Control |
Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu |
Model customization introduces new concepts to existing text-to-image models,
enabling the generation of the new concept in novel contexts. However, such
methods lack accurate camera view control w.r.t. the object, and users must
resort to prompt engineering (e.g., adding "top-view") to achieve coarse view
control. In this work, we introduce a new task -- enabling explicit control of
camera viewpoint for model customization. This allows us to modify object
properties amongst various background scenes via text prompts, all while
incorporating the target camera pose as additional control. This new task
presents significant challenges in merging a 3D representation from the
multi-view images of the new concept with a general, 2D text-to-image model. To
bridge this gap, we propose to condition the 2D diffusion process on rendered,
view-dependent features of the new object. During training, we jointly adapt
the 2D diffusion modules and 3D feature predictions to reconstruct the object's
appearance and geometry while reducing overfitting to the input multi-view
images. Our method outperforms existing image editing and model personalization
baselines in preserving the custom object's identity while following the input
text prompt and the object's camera pose. |
This paper introduces CustomDiffusion360, a method for customizing text-to-image diffusion models with explicit control over the camera viewpoint of newly introduced objects. |
Existing model customization techniques lack precise control over the camera viewpoint of the generated objects, hindering users' ability to generate diverse and specific outputs. |
CustomDiffusion360 bridges the gap between 3D neural representations of custom objects and 2D text-to-image diffusion models by leveraging a novel pose-conditioned transformer block. This block uses FeatureNeRF, a module that learns to predict 3D features from multi-view images of the custom object and renders them into 2D features conditioned on the target camera pose. These rendered features are then fused with the diffusion model's internal features to guide the generation process. |
CustomDiffusion360 outperforms existing image editing and model personalization techniques in generating high-quality images that accurately reflect the target object's identity, camera pose, and input text prompt.
The method generalizes well to novel camera viewpoints, even those outside the training distribution.
CustomDiffusion360 can be combined with other image editing techniques for tasks like object in-painting and panorama generation, enabling more creative applications. |
The method may struggle to generalize to extreme camera poses significantly different from the training data.
Generating scenes with multiple custom objects and ensuring their accurate pose control remains an open challenge. |
text-to-image synthesis, model customization, camera pose control, diffusion models, nerf |
2404.12168
Report |
Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization |
Insoo Kim, Jae Seok Choi, Geonseok Seo, Kinam Kwon, Jinwoo Shin, Hyong-Euk Lee |
As recent advances in mobile camera technology have enabled the capability to
capture high-resolution images, such as 4K images, the demand for an efficient
deblurring model handling large motion has increased. In this paper, we
discover that the image residual errors, i.e., blur-sharp pixel differences,
can be grouped into some categories according to their motion blur type and how
complex their neighboring pixels are. Inspired by this, we decompose the
deblurring (regression) task into blur pixel discretization (pixel-level blur
classification) and discrete-to-continuous conversion (regression with blur
class map) tasks. Specifically, we generate the discretized image residual
errors by identifying the blur pixels and then transform them to a continuous
form, which is computationally more efficient than naively solving the original
regression problem with continuous values. Here, we found that the
discretization result, i.e., blur segmentation map, remarkably exhibits visual
similarity with the image residual errors. As a result, our efficient model
shows comparable performance to state-of-the-art methods in realistic
benchmarks, while our method is up to 10 times computationally more efficient. |
This paper presents a novel deblurring scheme that decomposes the regression task into two simpler tasks: blur pixel discretization (classifying blur at the pixel level) and discrete-to-continuous conversion (regression guided by a blur class map). This approach is more computationally efficient than directly solving the regression problem. |
With the increasing demand for efficient deblurring models that can handle large motion in high-resolution images, particularly on resource-constrained devices, this paper addresses the need for efficient and effective deblurring solutions. |
The authors propose a two-stage model. First, a blur pixel discretizer generates a blur segmentation map reflecting image residual errors. Second, a discrete-to-continuous (D2C) converter transforms this map into a continuous form to refine the deblurred image. The method leverages the logarithmic Fourier space to simplify the relationship between blurred and sharp images during training. |
The proposed method achieves competitive deblurring results compared to state-of-the-art methods while being up to 10 times more computationally efficient.
The generated blur segmentation map, acting as a form of ground truth, significantly improves deblurring performance, especially for efficient models.
The method shows promising results in both objective evaluations on standard benchmarks and visual comparisons against commercial deblurring applications. |
The model's performance might be affected by using different datasets for training the blur pixel discretizer and the D2C converter due to variations in image characteristics and blur types.
Further acceleration is possible by deploying the model on NPUs instead of GPUs to enhance its real-time applicability on mobile devices. |
image deblurring, motion blur, efficient deep learning, blur segmentation, discrete-to-continuous conversion |
2404.12154
Report |
StyleBooth: Image Style Editing with Multimodal Instruction |
Zhen Han, Chaojie Mao, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang |
Given an original image, image editing aims to generate an image that aligns
with the provided instruction. The challenges are accepting multimodal inputs
as instructions and the scarcity of high-quality training data, including crucial
triplets of source/target image pairs and multimodal (text and image)
instructions. In this paper, we focus on image style editing and present
StyleBooth, a method that proposes a comprehensive framework for image editing
and a feasible strategy for building a high-quality style editing dataset. We
integrate the encoded textual instruction and image exemplar as a unified condition
for the diffusion model, enabling editing of the original image following
multimodal instructions. Furthermore, by iterative style-destyle tuning and
editing and usability filtering, the StyleBooth dataset provides
content-consistent stylized/plain image pairs in various categories of styles.
To show the flexibility of StyleBooth, we conduct experiments on diverse tasks,
such as text-based style editing, exemplar-based style editing and
compositional style editing. The results demonstrate that the quality and
variety of training data significantly enhance the ability to preserve content
and improve the overall quality of generated images in editing tasks. Project
page can be found at https://ali-vilab.github.io/stylebooth-page/. |
This paper introduces StyleBooth, a novel approach for image style editing that leverages multimodal instructions, encompassing both textual descriptions and exemplar images. |
Existing image editing methods often struggle to handle both text and image-based instructions effectively or lack sufficient training data with diverse and high-quality examples. This work addresses these limitations. |
StyleBooth employs a unified conditioning scheme for diffusion models, enabling the integration of text and image exemplars as instructions. It uses a novel dataset construction pipeline based on iterative style-destyle tuning and usability filtering to ensure high-quality training data. |
StyleBooth achieves state-of-the-art performance in text-based style editing, outperforming baselines in terms of accuracy and user preference.
It excels in exemplar-based style editing, accurately transferring styles from exemplars while preserving content fidelity better than competing methods.
The multimodal instruction mechanism allows for compositional style editing, enabling users to blend and interpolate styles from different sources. |
The current dataset, although diverse, is primarily built upon textual descriptions of styles, potentially limiting the range of styles covered.
Future work will focus on expanding the dataset with a broader spectrum of styles and exploring additional image editing tasks beyond style editing. |
image style editing, multimodal instruction, diffusion models, dataset generation, style composition |
2404.11958
Report |
Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation |
Song Wang, Jiawei Yu, Wentong Li, Wenyu Liu, Xiaolu Liu, Junbo Chen, Jianke Zhu |
Semantic scene completion, also known as semantic occupancy prediction, can
provide dense geometric and semantic information for autonomous vehicles, which
attracts the increasing attention of both academia and industry. Unfortunately,
existing methods usually formulate this task as a voxel-wise classification
problem and treat each voxel equally in 3D space during training. Because hard
voxels receive too little attention, performance in some challenging
regions is limited. The dense 3D space typically contains a large number of
empty voxels, which are easy to learn but consume a large amount of computation
because existing models handle all voxels uniformly. Furthermore, the
voxels in the boundary region are more challenging to differentiate than those
in the interior. In this paper, we propose the HASSC approach to train the semantic
scene completion model with a hardness-aware design. The global hardness from the
network optimization process is defined for dynamic hard voxel selection.
Then, the local hardness with geometric anisotropy is adopted for voxel-wise
refinement. In addition, a self-distillation strategy is introduced to make the training
process stable and consistent. Extensive experiments show that our HASSC scheme
can effectively improve the accuracy of the baseline model without incurring
extra inference cost. Source code is available at:
https://github.com/songw-zju/HASSC. |
This paper introduces HASSC, a hardness-aware semantic scene completion scheme designed to enhance the performance of existing methods in challenging regions. |
Existing semantic scene completion methods treat all voxels equally during training, neglecting the varying difficulty in predicting different voxels. This leads to suboptimal performance in challenging regions, especially for vision-centric methods. |
HASSC utilizes a hard voxel mining (HVM) head that identifies hard voxels based on global hardness (prediction uncertainty) and local hardness (geometric anisotropy). A refinement module then focuses on these hard voxels, improving their prediction accuracy. Additionally, a self-distillation strategy enhances training stability and consistency. |
HASSC consistently improves the accuracy of various baseline models, including VoxFormer and StereoScene, on the SemanticKITTI benchmark.
The method shows significant improvements at closer ranges, crucial for autonomous driving safety.
HASSC achieves these gains without incurring additional computational costs during inference. |
The performance gap between camera-based and LiDAR-based methods remains significant in the full range.
The method's performance is limited by inaccurate geometry estimation and the long-tail distribution of certain object categories. Future work will explore incorporating neural radiance fields (NeRFs) to improve geometric and semantic understanding from image sequences. |
semantic scene completion, hard voxel mining, self-distillation, autonomous driving, 3d vision |
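The HASSC entry above selects hard voxels from prediction uncertainty. A minimal sketch of such a selection step follows. The uncertainty measure used here (one minus the gap between the two largest class probabilities) is an assumption made for illustration, not the paper's definition of global hardness, and the function name is hypothetical.

```python
def select_hard_voxels(voxel_probs, num_hard):
    """Return sorted indices of the num_hard most uncertain voxels.

    voxel_probs : list of per-voxel class probability lists (at least 2 classes)
    num_hard    : number of voxels to pass on for refinement
    """
    def uncertainty(probs):
        top = sorted(probs, reverse=True)
        # Assumed proxy: a small margin between the best two classes means "hard".
        return 1.0 - (top[0] - top[1])

    ranked = sorted(
        range(len(voxel_probs)),
        key=lambda index: uncertainty(voxel_probs[index]),
        reverse=True,
    )
    return sorted(ranked[:num_hard])


# Four voxels, three classes each; voxels 1 and 3 have nearly tied predictions.
probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
]
hard = select_hard_voxels(probs, 2)
print(hard)
```

Restricting refinement to the selected indices is what lets the confidently predicted voxels, such as the many empty ones, be skipped during the extra training-time processing.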
2404.11949
Report |
Sketch-guided Image Inpainting with Partial Discrete Diffusion Process |
Nakul Sharma, Aditay Tripathi, Anirban Chakraborty, Anand Mishra |
In this work, we study the task of sketch-guided image inpainting. Unlike the
well-explored natural language-guided image inpainting, which excels in
capturing semantic details, the relatively less-studied sketch-guided
inpainting offers greater user control in specifying the object's shape and
pose to be inpainted. As one of the early solutions to this task, we introduce
a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP
corrupts the masked regions of the image and the backward pass reconstructs
these masked regions conditioned on hand-drawn sketches using our proposed
sketch-guided bi-directional transformer. The proposed novel transformer module
accepts two inputs -- the image containing the masked region to be inpainted
and the query sketch to model the reverse diffusion process. This strategy
effectively addresses the domain gap between sketches and natural images,
thereby enhancing the quality of inpainting results. In the absence of a
large-scale dataset specific to this task, we synthesize a dataset from the
MS-COCO to train and extensively evaluate our proposed framework against
various competent approaches in the literature. The qualitative and
quantitative results and user studies establish that the proposed method
inpaints realistic objects that fit the context in terms of the visual
appearance of the provided sketch. To aid further research, we have made our
code publicly available at https://github.com/vl2g/Sketch-Inpainting . |
This paper introduces a novel method for sketch-guided image inpainting using a partial discrete diffusion process (PDDP), allowing users to control the shape and pose of inpainted objects. |
Existing image inpainting methods often lack control over the semantic details and visual attributes of the inpainted regions. This work provides a solution by incorporating sketch guidance, offering greater user control and addressing a gap in the field. |
The method involves training a two-stage model. The first stage learns a discrete latent space of images. The second stage utilizes this latent space to perform sketch-guided inpainting using a novel PDDP and a sketch-guided bi-directional transformer. |
The proposed method outperforms existing image inpainting approaches adapted for sketch guidance, achieving state-of-the-art results on a curated MS-COCO dataset.
The model effectively utilizes visual information from hand-drawn sketches, resulting in inpainted images with high visual fidelity and faithfulness to the query sketches.
User studies confirm the superiority of the proposed method, with participants preferring its generated inpainted images for their naturalness, visual fidelity, and alignment with the input sketches. |
The quality of inpainted images can be further improved, particularly in capturing intricate object details.
The current sketch embedding method could be enhanced to better represent stroke-level details and improve conditioning mechanisms. |
image inpainting, sketch guidance, discrete diffusion models, bidirectional transformer, generative models |
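The PDDP entry above states that the forward pass corrupts only the masked region of the image. The sketch below shows one way to express that on a sequence of discrete tokens. Replacing tokens with a dedicated mask id, and the specific probability parameter, are assumptions borrowed from common discrete diffusion setups rather than details confirmed by the source.

```python
import random

MASK_TOKEN = -1  # hypothetical id standing in for a [MASK] token


def partial_forward_step(tokens, region, corrupt_prob, rng):
    """Corrupt tokens only inside the region to inpaint.

    tokens       : list of discrete token ids
    region       : list of bools, True where the token lies in the masked region
    corrupt_prob : probability of replacing an in-region token with MASK_TOKEN
    rng          : random.Random instance, for reproducibility
    """
    return [
        MASK_TOKEN if inside and rng.random() < corrupt_prob else token
        for token, inside in zip(tokens, region)
    ]


tokens = [11, 42, 7, 93, 5, 60]
region = [False, False, True, True, True, False]
rng = random.Random(0)
noisy = partial_forward_step(tokens, region, 0.5, rng)
print(noisy)
```

Tokens outside the region pass through unchanged at every noise level, so the visible context stays available to the reverse process.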
2404.11936
Report |
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights |
Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim, Shinkook Choi |
Latent Diffusion Models (LDMs) have emerged as powerful generative models,
known for delivering remarkable results under constrained computational
resources. However, deploying LDMs on resource-limited devices remains a
complex issue, presenting challenges such as memory consumption and inference
speed. To address this issue, we introduce LD-Pruner, a novel
performance-preserving structured pruning method for compressing LDMs.
Traditional pruning methods for deep neural networks are not tailored to the
unique characteristics of LDMs, such as the high computational cost of training
and the absence of a fast, straightforward and task-agnostic method for
evaluating model performance. Our method tackles these challenges by leveraging
the latent space during the pruning process, enabling us to effectively
quantify the impact of pruning on model performance, independently of the task
at hand. This targeted pruning of components with minimal impact on the output
allows for faster convergence during training, as the model has less
information to re-learn, thereby addressing the high computational cost of
training. Consequently, our approach achieves a compressed model that offers
improved inference speed and reduced parameter count, while maintaining minimal
performance degradation. We demonstrate the effectiveness of our approach on
three different tasks: text-to-image (T2I) generation, Unconditional Image
Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce
the inference time of Stable Diffusion (SD) by 34.9% while simultaneously
improving its FID by 5.2% on MS-COCO T2I benchmark. This work paves the way for
more efficient pruning methods for LDMs, enhancing their applicability. |
This paper introduces LD-Pruner, a novel structured pruning method for compressing Latent Diffusion Models (LDMs) while preserving performance. |
Deploying LDMs on resource-limited devices is challenging due to memory consumption and inference speed. Existing pruning methods are not tailored to LDMs and lack efficient, task-agnostic performance evaluation. |
LD-Pruner leverages the latent space to evaluate the impact of pruning individual operators (e.g., convolutional layers) on model performance. It modifies each operator, generates latent representations, and quantifies the divergence from the original representations using a novel scoring formula. |
Achieves 34.9% inference speedup with 5.2% FID improvement on text-to-image generation (Stable Diffusion) compared to the original model.
Demonstrates successful compression and performance preservation for unconditional image generation (LDM-4) and unconditional audio generation (AudioDiffusion).
Highlights the importance of weight preservation during pruning for faster and better fine-tuning. |
The current method does not prune the decoder part of LDMs.
It does not explicitly account for potential dependencies between operators during pruning. |
latent diffusion models, model compression, pruning, task-agnostic, latent space |
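The LD-Pruner entry above scores each operator by how far the latents drift when that operator is modified. The sketch below ranks operators with a simple stand-in score, the distance between the mean and standard deviation of the original and modified latents. This score is an assumption for illustration and should not be read as the paper's scoring formula; the operator names are hypothetical.

```python
import math
import statistics


def latent_shift_score(original_latents, modified_latents):
    """Stand-in score: distance between (mean, std) of two flattened latents."""
    d_mean = statistics.fmean(original_latents) - statistics.fmean(modified_latents)
    d_std = statistics.pstdev(original_latents) - statistics.pstdev(modified_latents)
    return math.hypot(d_mean, d_std)


def rank_operators(original_latents, latents_by_operator):
    """Order operator names from least to most impact on the latents."""
    scores = {
        name: latent_shift_score(original_latents, latents)
        for name, latents in latents_by_operator.items()
    }
    return sorted(scores, key=scores.get)


# Flattened latents from the unmodified model and from three hypothetical
# variants, each with a single operator modified.
original = [0.0, 1.0, 2.0, 3.0]
variants = {
    "conv_a": [0.0, 1.0, 2.0, 3.1],
    "attn_b": [5.0, 5.0, 5.0, 5.0],
    "conv_c": [0.0, 1.0, 2.0, 3.0],
}
ranking = rank_operators(original, variants)
print(ranking)
```

Operators at the front of the ranking change the latents least and are the natural pruning candidates. Because the comparison happens in latent space, no task-specific metric or decoded output is needed.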
2404.11925
Report |
EdgeFusion: On-Device Text-to-Image Generation |
Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim |
The intensive computational burden of Stable Diffusion (SD) for text-to-image
generation poses a significant hurdle for its practical application. To tackle
this challenge, recent research focuses on methods to reduce sampling steps,
such as Latent Consistency Model (LCM), and on employing architectural
optimizations, including pruning and knowledge distillation. Diverging from
existing approaches, we uniquely start with a compact SD variant, BK-SDM. We
observe that directly applying LCM to BK-SDM with commonly used crawled
datasets yields unsatisfactory results. It leads us to develop two strategies:
(1) leveraging high-quality image-text pairs from leading generative models and
(2) designing an advanced distillation process tailored for LCM. Through our
thorough exploration of quantization, profiling, and on-device deployment, we
achieve rapid generation of photo-realistic, text-aligned images in just two
steps, with latency under one second on resource-limited edge devices. |
This paper presents EdgeFusion, an optimized Stable Diffusion model for fast text-to-image generation on resource-limited devices, achieving under one second latency on Samsung Exynos NPU. |
Stable Diffusion models are computationally expensive, hindering their deployment on edge devices. EdgeFusion addresses this by reducing sampling steps, optimizing architecture, and employing efficient deployment strategies. |
The study leverages Block-removed Knowledge-distilled SDM and Latent Consistency Model for model compression and step reduction. It utilizes high-quality synthetic image-text pairs for improved training and employs model-level tiling and quantization for efficient NPU deployment. |
EdgeFusion generates high-quality images in just two steps with latency under one second on edge devices.
Using high-quality synthetic data significantly improves generation quality and text-image alignment compared to using solely LAION datasets.
The proposed advanced distillation process, including fine-tuning the student model with a superior teacher and using the original large model during LCM training, significantly enhances few-step generation quality. |
The study primarily focuses on the Samsung Exynos NPU, potentially limiting the generalizability of findings to other edge devices.
Further investigation into the trade-off between dataset size and quality, particularly for manual curation, is needed. |
text-to-image generation, stable diffusion, edge computing, model compression, knowledge distillation |
2404.11895
Report |
FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models |
Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, Antoni B. Chan |
Precise image editing with text-to-image models has attracted increasing
interest due to their remarkable generative capabilities and user-friendly
nature. However, such attempts face the pivotal challenge of misalignment
between the intended precise editing target regions and the broader area
impacted by the guidance in practice. Although excellent attention-based
methods have been developed to refine the editing guidance,
these approaches require complex modifications to the network architecture
and are limited to specific editing tasks. In this work, we re-examine the
diffusion process and misalignment problem from a frequency perspective,
revealing that, due to the power law of natural images and the decaying noise
schedule, the denoising network primarily recovers low-frequency image
components during the earlier timesteps and thus brings excessive low-frequency
signals for editing. Leveraging this insight, we introduce a novel
fine-tuning-free approach that employs progressive Frequency
truncation to refine the guidance of Diffusion models for universal
editing tasks (FreeDiff). Our method achieves comparable results
with state-of-the-art methods across a variety of editing tasks and on a
diverse set of images, highlighting its potential as a versatile tool in image
editing applications. |
This paper introduces FreeDiff, a fine-tuning-free approach that refines the guidance of diffusion models for universal editing tasks by employing progressive frequency truncation. |
Existing text-guided image editing methods often struggle with misalignment between the intended precise editing target regions and the broader area impacted by the guidance, while attention manipulation methods lack versatility and generality. |
FreeDiff builds on the observation that the denoising network primarily recovers low-frequency image components during the earlier timesteps, which introduces excessive low-frequency signal into the editing guidance. It therefore applies progressive frequency truncation to the guidance in frequency space during the image generation process. |
FreeDiff achieves comparable results with state-of-the-art methods across various editing tasks on a diverse set of images.
Analysis of intermediate features during diffusion confirms that the network prioritizes low-frequency components, explaining the misalignment in editing.
Ablation studies confirm the effectiveness of progressive frequency truncation and its sensitivity to editing prompts. |
FreeDiff's performance relies on successful image reconstruction and can be sensitive to editing prompts, especially those describing non-target regions.
Future work includes exploring the combination of FreeDiff with attention manipulation techniques for enhanced control. |
diffusion models, image editing, frequency truncation, text-guided image editing, guidance refinement |
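For the FreeDiff entry above, the idea of truncating low-frequency components of the guidance can be sketched on a 1D toy signal. This is only an illustration under stated assumptions: the naive DFT, the function names, and especially the linear `cutoff_schedule` are hypothetical and are not taken from the paper, which operates on image-space guidance inside a diffusion sampler.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (fine for tiny toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse of dft; returns the real part of the reconstructed signal."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def truncate_low_freq(guidance, cutoff):
    """Zero every frequency bin whose symmetric index is below `cutoff`."""
    n = len(guidance)
    spec = dft(guidance)
    for k in range(n):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency
        if freq < cutoff:
            spec[k] = 0
    return idft(spec)

def cutoff_schedule(step, num_steps, max_cutoff):
    """Hypothetical linear schedule (assumption): strongest truncation at the
    first step, no truncation at the last step. Requires num_steps > 1."""
    return round(max_cutoff * (1 - step / (num_steps - 1)))

# Toy guidance: a constant offset (lowest frequency) plus a fast oscillation.
guidance = [1.0 + (-1.0) ** t for t in range(8)]
filtered = truncate_low_freq(guidance, cutoff=1)  # drops only the constant part
```

With `cutoff=1` only the constant component is removed, so the oscillation survives; a larger cutoff would remove progressively more of the low-frequency content.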
2404.11824
Report |
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation |
Tianyi Liang, Jiangqi Liu, Sicheng Song, Shiqi Jiang, Yifei Huang, Changbo Wang, Chenhui Li |
Recent advancements in Text-to-image (T2I) generation have witnessed a shift
from adapting text to fixed backgrounds to creating images around text.
Traditional approaches are often limited to generating layouts within static
images for effective text placement. Our proposed approach, TextCenGen,
introduces a dynamic adaptation of the blank region for text-friendly image
generation, emphasizing text-centric design and visual harmony generation. Our
method employs force-directed attention guidance in T2I models to generate
images that strategically reserve whitespace for pre-defined text areas, even
for text or icons at the golden ratio. Observing how cross-attention maps
affect object placement, we detect and repel conflicting objects using a
force-directed graph approach, combined with a Spatial Excluding
Cross-Attention Constraint for smooth attention in whitespace areas. As a novel
task in graphic design, experiments indicate that TextCenGen outperforms
existing methods with more harmonious compositions. Furthermore, our method
significantly enhances T2I model outcomes on our specially collected prompt
datasets, catering to varied text positions. These results demonstrate the
efficacy of TextCenGen in creating more harmonious and integrated text-image
compositions. |
TextCenGen is a novel, training-free framework for text-centric text-to-image generation. It dynamically adapts image composition around predefined text regions for visually harmonious text integration, addressing a gap in existing methods that struggle with text-background conflicts. |
Effective text-image synergy is crucial in graphic design. Traditional methods often result in text-background competition. TextCenGen addresses this by prioritizing text placement in image generation, ensuring clear communication and aesthetic appeal. |
TextCenGen utilizes cross-attention maps and force-directed graphs to guide object placement during the denoising process of text-to-image generation. It identifies and relocates objects conflicting with designated text regions and applies a spatial constraint for smooth attention in those areas. |
TextCenGen outperforms existing state-of-the-art methods in quantitative metrics, demonstrating superior performance in background smoothness, saliency harmony, and semantic fidelity.
Qualitative analysis highlights TextCenGen's ability to create more natural and harmonious text layouts while preserving image content and quality.
The ablation study confirms the significant contribution of both the Force-Directed Cross-Attention Guidance and Spatial Excluding Cross-Attention Constraint in achieving these results. |
The assumption of convex object shapes in the force-directed guidance may not be suitable for all scenarios.
The generation of unintended objects in blank areas requires further investigation and refinement. |
text-to-image generation, text-centric design, force-directed attention, cross-attention maps, graphic design |
2404.11778
Report |
CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration |
Rui Deng, Tianpei Gu |
Reconstructing degraded images is a critical task in image processing.
Although CNN and Transformer-based models are prevalent in this field, they
exhibit inherent limitations, such as inadequate long-range dependency modeling
and high computational costs. To overcome these issues, we introduce the
Channel-Aware U-Shaped Mamba (CU-Mamba) model, which incorporates a dual State
Space Model (SSM) framework into the U-Net architecture. CU-Mamba employs a
Spatial SSM module for global context encoding and a Channel SSM component to
preserve channel correlation features, both in linear computational complexity
relative to the feature map size. Extensive experimental results validate
CU-Mamba's superiority over existing state-of-the-art methods, underscoring the
importance of integrating both spatial and channel contexts in image
restoration. |
Introduced Channel-Aware U-Shaped Mamba (CU-Mamba), integrating dual State Space Models (SSM) within a U-Net for image restoration, capturing long-range dependencies and preserving channel correlations. |
Addresses limitations of CNNs (limited receptive fields) and Transformers (high computational cost) in image restoration by efficiently encoding global and channel-specific features. |
Employs Spatial SSM for global context encoding in linear complexity and Channel SSM to enhance feature mixing across channels within the U-Net architecture. |
Outperforms state-of-the-art methods on image denoising (SIDD, DND) and deblurring (GoPro, HIDE, RealBlur-R, RealBlur-J) benchmarks.
Demonstrates faster inference speed compared to Transformer-based methods while achieving superior restoration quality.
Ablation studies validate the effectiveness of both Spatial and Channel SSM modules in enhancing model performance. |
Exploration of alternative SSM discretization techniques for potential performance improvement.
Investigating the application of CU-Mamba to other image restoration tasks beyond denoising and deblurring. |
image restoration, state space models, u-net, channel learning, deep learning |
2404.11615
Report |
Factorized Diffusion: Perceptual Illusions by Noise Decomposition |
Daniel Geng, Inbum Park, Andrew Owens |
Given a factorization of an image into a sum of linear components, we present
a zero-shot method to control each individual component through diffusion model
sampling. For example, we can decompose an image into low and high spatial
frequencies and condition these components on different text prompts. This
produces hybrid images, which change appearance depending on viewing distance.
By decomposing an image into three frequency subbands, we can generate hybrid
images with three prompts. We also use a decomposition into grayscale and color
components to produce images whose appearance changes when they are viewed in
grayscale, a phenomenon that naturally occurs under dim lighting. And we explore
a decomposition by a motion blur kernel, which produces images that change
appearance under motion blurring. Our method works by denoising with a
composite noise estimate, built from the components of noise estimates
conditioned on different prompts. We also show that for certain decompositions,
our method recovers prior approaches to compositional generation and spatial
control. Finally, we show that we can extend our approach to generate hybrid
images from real images. We do this by holding one component fixed and
generating the remaining components, effectively solving an inverse problem. |
This paper introduces Factorized Diffusion, a zero-shot method to control individual components of an image during generation with diffusion models by manipulating the noise estimates for different image decompositions. |
This method enables the creation of various perceptual illusions, like hybrid images that change with viewing distance, color hybrids that change under different lighting, and motion hybrids that change when blurred, offering insights into human and machine perception. |
The method decomposes an image into components (e.g., frequency bands, color spaces) and generates separate noise estimates for each component conditioned on different text prompts. These noise estimates are then combined to guide the denoising process, resulting in an image where each component reflects its corresponding prompt. |
Factorized Diffusion successfully synthesizes hybrid images that outperform traditional methods in quality and alignment with prompts, as evaluated through human studies and CLIP score.
The method generalizes to other decompositions, generating color hybrids and motion hybrids, showcasing its ability to create new classes of perceptual illusions.
The technique can be extended to solve inverse problems by fixing one component and generating others, demonstrated by creating hybrid images from real images and performing text-guided colorization. |
The success rate of generating high-quality illusions can be low due to the out-of-distribution nature of the generated images and the lack of control over prompt interactions.
Future work includes improving robustness, exploring other decompositions, and addressing ethical considerations related to the generation of potentially deceptive content. |
diffusion models, perceptual illusions, hybrid images, image decomposition, text-conditional image generation |
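For the Factorized Diffusion entry above, the composite noise estimate can be sketched with a 1D toy decomposition into low and high frequencies. The box blur, the function names, and the sample values are assumptions chosen for illustration; real noise estimates are image tensors produced by a diffusion model, and the paper's filters may differ.

```python
def low_pass(signal, radius=1):
    """Moving-average blur with edge clamping (a linear low-pass filter)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def high_pass(signal, radius=1):
    """Residual after low-pass filtering, so low + high reproduces the input."""
    return [s - l for s, l in zip(signal, low_pass(signal, radius))]

def composite_noise(eps_low_prompt, eps_high_prompt):
    """Take the low band from one noise estimate and the high band from another."""
    low = low_pass(eps_low_prompt)
    high = high_pass(eps_high_prompt)
    return [l + h for l, h in zip(low, high)]

# Placeholder noise estimates, standing in for two differently prompted model outputs.
eps_a = [0.5, -1.0, 0.25, 2.0, -0.75]
eps_b = [1.5, 0.0, -0.5, 0.1, 0.9]
eps_mix = composite_noise(eps_a, eps_b)
```

Because the two components sum to the identity, passing the same estimate for both prompts returns that estimate unchanged, which is a quick sanity check on the decomposition.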
2404.11614
Report |
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior |
Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu |
Text animation serves as an expressive medium, transforming static
communication into dynamic experiences by infusing words with motion to evoke
emotions, emphasize meanings, and construct compelling narratives. Crafting
animations that are semantically aware poses significant challenges, demanding
expertise in graphic design and animation. We present an automated text
animation scheme, termed "Dynamic Typography", which combines two challenging
tasks. It deforms letters to convey semantic meaning and infuses them with
vibrant movements based on user prompts. Our technique harnesses vector
graphics representations and an end-to-end optimization-based framework. This
framework employs neural displacement fields to convert letters into base
shapes and applies per-frame motion, encouraging coherence with the intended
textual concept. Shape preservation techniques and perceptual loss
regularization are employed to maintain legibility and structural integrity
throughout the animation process. We demonstrate the generalizability of our
approach across various text-to-video models and highlight the superiority of
our end-to-end methodology over baseline methods, which might comprise separate
tasks. Through quantitative and qualitative evaluations, we demonstrate the
effectiveness of our framework in generating coherent text animations that
faithfully interpret user prompts while maintaining readability. Our code is
available at: https://animate-your-word.github.io/demo/. |
This paper introduces "Dynamic Typography," an automated system that animates individual letters within words based on user prompts, deforming them to embody semantic meaning while maintaining legibility. |
This technique addresses the challenge of creating semantically aware and visually engaging text animations, a task typically requiring significant design and animation expertise. |
The system uses an end-to-end optimization framework with two neural displacement fields: one for shaping the letter to reflect the prompt's meaning and another for applying per-frame motion. It leverages score-distillation sampling with a text-to-video model, incorporates legibility regularization using LPIPS, and employs mesh-based structure preservation to maintain visual consistency. |
The generated animations accurately and aesthetically interpret text prompts while preserving letter readability.
Quantitative evaluation demonstrates superiority over baseline methods in maintaining legibility and prompt-video alignment.
The framework is generalizable across different text-to-video models, allowing for improvements with future advancements in video generation. |
Motion quality is limited by the capabilities of the video foundation model.
Balancing semantic accuracy with legibility becomes challenging when prompts significantly deviate from original letter forms. |
text animation, kinetic typography, video diffusion prior, svg, text-to-video generation |
2404.11613
Report |
InFusion: Inpainting 3D Gaussians via Learning Depth Completion from Diffusion Prior |
Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, Yang Cao |
3D Gaussians have recently emerged as an efficient representation for novel
view synthesis. This work studies its editability with a particular focus on
the inpainting task, which aims to supplement an incomplete set of 3D Gaussians
with additional points for visually harmonious rendering. Compared to 2D
inpainting, the crux of inpainting 3D Gaussians is to figure out the
rendering-relevant properties of the introduced points, whose optimization
largely benefits from their initial 3D positions. To this end, we propose to
guide the point initialization with an image-conditioned depth completion
model, which learns to directly restore the depth map based on the observed
image. Such a design allows our model to fill in depth values at an aligned
scale with the original depth, and also to harness strong generalizability from
large-scale diffusion prior. Thanks to the more accurate depth completion, our
approach, dubbed InFusion, surpasses existing alternatives with sufficiently
better fidelity and efficiency under various complex scenarios. We further
demonstrate the effectiveness of InFusion with several practical applications,
such as inpainting with user-specific texture or with novel object insertion. |
Presents InFusion, a novel approach for inpainting 3D Gaussian representations by leveraging depth completion learned from diffusion priors, enabling efficient and photorealistic editing of 3D scenes. |
Addresses the limitations of existing 3D Gaussian inpainting methods that often produce blurry textures or misaligned depth, hindering the seamless integration of edited elements. |
Inpaints the reference image and depth map, unprojects them to initialize 3D points, and fine-tunes the Gaussian model using a diffusion-based depth completion model trained on a large-scale dataset. |
Achieves superior image quality with sharper textures and better 3D consistency compared to baseline methods.
Demonstrates significant speed improvements, being up to 20 times faster than existing techniques.
Enables practical applications such as user-interactive texture editing and object insertion. |
Faces challenges in scenarios with significant lighting variations across different views, leading to inconsistencies in the inpainted regions.
Limited in text-guided inpainting of highly complex objects within 360-degree scenes due to the current constraints of inpainting models. |
gaussian splatting, 3d inpainting, depth completion, diffusion models, novel view synthesis |
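For the InFusion entry above, the step of unprojecting a completed depth map to initialize 3D points can be sketched with a standard pinhole camera model. The function name, the mask convention, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions for illustration; the sketch produces camera-space points only and does not cover the depth completion model or the Gaussian fine-tuning.

```python
def unproject_depth(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels with positive depth to camera-space 3D points."""
    points = []
    for v, row in enumerate(depth):        # v indexes rows (image y)
        for u, d in enumerate(row):        # u indexes columns (image x)
            if mask[v][u] and d > 0:
                x = (u - cx) * d / fx
                y = (v - cy) * d / fy
                points.append((x, y, d))
    return points

# Toy 2x2 depth map; the mask marks the region to be filled in.
depth = [[2.0, 2.0],
         [4.0, 0.0]]
mask = [[False, True],
        [True, True]]
points = unproject_depth(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Pixels outside the mask or with zero depth are skipped, so only the inpainted region contributes new points.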
2404.11593
Report |
IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination |
Xi Chen, Sida Peng, Dongchen Yang, Yuan Liu, Bowen Pan, Chengfei Lv, Xiaowei Zhou |
This paper aims to recover object materials from posed images captured under
an unknown static lighting condition. Recent methods solve this task by
optimizing material parameters through differentiable physically based
rendering. However, due to the coupling between object geometry, materials, and
environment lighting, there is inherent ambiguity during the inverse rendering
process, preventing previous methods from obtaining accurate results. To
overcome this ill-posed problem, our key idea is to learn the material prior
with a generative model for regularizing the optimization process. We observe
that the general rendering equation can be split into diffuse and specular
shading terms, and thus formulate the material prior as diffusion models of
albedo and specular. Thanks to this design, our model can be trained using the
existing abundant 3D object data, and naturally acts as a versatile tool to
resolve the ambiguity when recovering material representations from RGB images.
In addition, we develop a coarse-to-fine training strategy that leverages
estimated materials to guide diffusion models to satisfy multi-view consistent
constraints, leading to more stable and accurate results. Extensive experiments
on real-world and synthetic datasets demonstrate that our approach achieves
state-of-the-art performance on material recovery. The code will be available
at https://zju3dv.github.io/IntrinsicAnything. |
This paper introduces IntrinsicAnything, a novel method that leverages diffusion models to learn material priors for single-view inverse rendering under unknown lighting, effectively addressing the inherent ambiguities in material and lighting decomposition. |
Inverse rendering under unknown lighting is crucial for various applications like VR/AR and video games but suffers from inherent ambiguities that hinder accurate material recovery. |
IntrinsicAnything utilizes conditional diffusion models to learn priors for albedo and specular shading. It employs a two-stage optimization process: first recovering coarse material and lighting, then using them to guide the diffusion model for refined, multi-view consistent results. |
IntrinsicAnything achieves state-of-the-art performance on both synthetic and real-world datasets, outperforming existing optimization-based and data-driven methods.
The method effectively disentangles materials and lighting, avoiding common issues like baking shadows or shading into the albedo.
IntrinsicAnything demonstrates strong generalization capabilities, enabling high-quality single-view intrinsic image decomposition for diverse objects and scenes, including challenging in-the-wild images. |
The current method doesn't handle transparent objects, necessitating further exploration of geometry representations and joint optimization.
Performance relies on the accuracy of reconstructed geometry, suggesting future research on using diffusion models for improved geometry priors in 3D reconstruction. |
inverse rendering, diffusion models, material prior, generative models, single-view reconstruction |
2404.11589
Report |
Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding |
Zezhong Fan, Xiaohan Li, Chenhao Fang, Topojoy Biswas, Kaushiki Nag, Jianpeng Xu, Kannan Achan |
The rapid evolution of text-to-image diffusion models has opened the door to
generative AI, enabling the translation of textual descriptions into visually
compelling images with remarkable quality. However, a persistent challenge
within this domain is the optimization of prompts to effectively convey
abstract concepts into concrete objects. For example, text encoders can hardly
express "peace", while they can easily illustrate olive branches and white doves.
This paper introduces a novel approach named Prompt Optimizer for Abstract
Concepts (POAC) specifically designed to enhance the performance of
text-to-image diffusion models in interpreting and generating images from
abstract concepts. We propose a Prompt Language Model (PLM), which is
initialized from a pre-trained language model, and then fine-tuned with a
curated dataset of abstract concept prompts. The dataset is created with GPT-4
to extend the abstract concept to a scene and concrete objects. Our framework
employs a Reinforcement Learning (RL)-based optimization strategy, focusing on
the alignment between the generated images by a stable diffusion model and
optimized prompts. Through extensive experiments, we demonstrate that our
proposed POAC significantly improves the accuracy and aesthetic quality of
generated images, particularly in the description of abstract concepts and
alignment with optimized prompts. We also present a comprehensive analysis of
our model's performance across diffusion models under different settings,
showcasing its versatility and effectiveness in enhancing abstract concept
representation. |
This paper introduces POAC (Prompt Optimizer for Abstract Concepts) to improve how text-to-image models understand and generate images from abstract concepts. |
Existing text-to-image models struggle to depict abstract ideas because they are trained mainly on concrete objects and lack a mapping between abstract and concrete representations. |
POAC uses a two-stage approach: 1) It fine-tunes a Prompt Language Model (PLM), on a curated dataset created with GPT-4, to rewrite prompts containing abstract concepts into prompts with concrete objects. 2) It uses Reward Feedback Learning (ReFL) to fine-tune a Stable Diffusion XL model to align with the optimized prompts and improve image quality. |
POAC enables the generation of images that are more faithful to abstract concept prompts, including relevant concrete details.
Fine-tuning with ReFL further improves the alignment between optimized prompts and generated images, leading to more accurate depictions.
Quantitative evaluation shows improvements in both relevance and aesthetic scores of generated images compared to baseline SDXL. |
Future work will address broader alignment challenges beyond abstract concepts, such as mitigating biases in generated images.
The authors will explore improving the prompt language model to optimize for balanced and fair representations across different demographic groups. |
image generation, diffusion models, prompt optimization, abstract concepts, reinforcement learning |
2404.11554
Report |
Predicting Long-horizon Futures by Conditioning on Geometry and Time |
Tarasha Khurana, Deva Ramanan |
Our work explores the task of generating future sensor observations
conditioned on the past. We are motivated by 'predictive coding' concepts from
neuroscience as well as robotic applications such as self-driving vehicles.
Predictive video modeling is challenging because the future may be multi-modal
and learning at scale remains computationally expensive for video processing.
To address both challenges, our key insight is to leverage the large-scale
pretraining of image diffusion models which can handle multi-modality. We
repurpose image models for video prediction by conditioning on new frame
timestamps. Such models can be trained with videos of both static and dynamic
scenes. To allow them to be trained with modestly-sized datasets, we introduce
invariances by factoring out illumination and texture by forcing the model to
predict (pseudo) depth, readily obtained for in-the-wild videos via
off-the-shelf monocular depth networks. In fact, we show that simply modifying
networks to predict grayscale pixels already improves the accuracy of video
prediction. Given the extra controllability with timestamp conditioning, we
propose sampling schedules that work better than the traditional autoregressive
and hierarchical sampling strategies. Motivated by probabilistic metrics from
the object forecasting literature, we create a benchmark for video prediction
on a diverse set of videos spanning indoor and outdoor scenes and a large
vocabulary of objects. Our experiments illustrate the effectiveness of learning
to condition on timestamps, and show the importance of predicting the future
with invariant modalities. |
This paper presents a video prediction model that leverages pre-trained 2D image diffusion models and incorporates timestamp conditioning to generate future frames from past observations. |
Predicting future sensor observations is crucial for robotics applications like self-driving, and this work offers an efficient solution by repurposing readily available image diffusion models for video prediction. |
The method involves fine-tuning pre-trained image diffusion models by adding (1) conditioning on input context frames using a two-stream approach with CLIP embeddings and (2) conditioning on frame timestamps using positional encoding. The model is trained to predict frames at random timestamps, enabling flexible sampling strategies at inference time. |
The model outperforms state-of-the-art video prediction methods on short-horizon forecasting tasks.
Introducing invariances in the data, such as using pseudo-depth or luminance instead of RGB, significantly improves performance.
The proposed mixed sampling strategy, enabled by timestamp conditioning, outperforms traditional autoregressive and hierarchical sampling for long-horizon forecasting. |
The model exhibits bias towards hallucinating commonly seen object categories like people and cars due to dataset bias.
The generated pseudo-depth lacks high-frequency details, potentially due to the limitations of neural networks in modeling such functions. |
video prediction, diffusion models, timestamp conditioning, pseudo-depth, forecasting |
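For the video prediction entry above, conditioning on frame timestamps through a positional encoding can be sketched with the common sinusoidal form. The dimension, the `max_period` constant, and the function name are assumptions for illustration; the entry does not specify the exact encoding the authors use.

```python
import math

def timestamp_encoding(t, dim=8, max_period=10000.0):
    """Map a scalar timestamp to `dim` features: sines first, then cosines."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

# Encodings for the current frame (t = 0) and a frame 2.5 time units ahead.
enc_now = timestamp_encoding(0.0)
enc_future = timestamp_encoding(2.5)
```

Such a vector would typically be fed to the denoising network alongside the context frames, so the same model can be queried for any future timestamp.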
2404.11475
Report |
AdaIR: Exploiting Underlying Similarities of Image Restoration Tasks with Adapters |
Hao-Wei Chen, Yu-Syuan Xu, Kelvin C. K. Chan, Hsien-Kai Kuo, Chun-Yi Lee, Ming-Hsuan Yang |
Existing image restoration approaches typically employ extensive networks
specifically trained for designated degradations. Despite being effective, such
methods inevitably entail considerable storage costs and computational
overheads due to the reliance on task-specific networks. In this work, we go
beyond this well-established framework and exploit the inherent commonalities
among image restoration tasks. The primary objective is to identify components
that are shareable across restoration tasks and augment the shared components
with modules specifically trained for individual tasks. Towards this goal, we
propose AdaIR, a novel framework that enables low storage cost and efficient
training without sacrificing performance. Specifically, a generic restoration
network is first constructed through self-supervised pre-training using
synthetic degradations. Subsequent to the pre-training phase, adapters are
trained to adapt the pre-trained network to specific degradations. AdaIR
requires solely the training of lightweight, task-specific modules, ensuring a
more efficient storage and training regimen. We have conducted extensive
experiments to validate the effectiveness of AdaIR and analyze the influence of
the pre-training strategy on discovering shareable components. Extensive
experimental results show that AdaIR achieves outstanding results on multi-task
restoration while utilizing significantly fewer parameters (1.9 MB) and less
training time (7 hours) for each restoration task. The source codes and trained
models will be released. |
This paper proposes AdaIR, a novel framework for multi-task image restoration that leverages adapters for efficient adaptation to unseen degradations. |
Existing image restoration methods often rely on separate, computationally expensive models for each degradation type. AdaIR aims to improve efficiency by exploiting shareable components across restoration tasks. |
AdaIR uses a two-phase training strategy. First, a generic restoration network (using Restormer architecture) is pre-trained with synthetic degradations. Second, lightweight adapters are fine-tuned to adapt the pre-trained model to specific degradation tasks. |
AdaIR achieves comparable performance to state-of-the-art multi-task restoration methods like Restormer and PromptIR.
AdaIR demonstrates significant reduction in training time (7 hours per task) and trainable parameters (1.9 MB) compared to training from scratch.
Analysis of pre-training strategies shows that using diverse degradations during pre-training improves performance on downstream tasks. |
The performance gap between AdaIR and other methods is smaller on simpler tasks, suggesting potential limitations in handling complex degradation types.
Further research could explore different adapter architectures and pre-training schemes to improve performance on highly challenging degradations. |
image restoration, multi-task learning, adapter, parameter-efficient tuning, low-level vision |
2404.11419
Report |
SLAIM: Robust Dense Neural SLAM for Online Tracking and Mapping |
Vincent Cartillier, Grant Schindler, Irfan Essa |
We present SLAIM - Simultaneous Localization and Implicit Mapping. We propose
a novel coarse-to-fine tracking model tailored for Neural Radiance Field SLAM
(NeRF-SLAM) to achieve state-of-the-art tracking performance. Notably, existing
NeRF-SLAM systems consistently exhibit inferior tracking performance compared
to traditional SLAM algorithms. NeRF-SLAM methods solve camera tracking via
image alignment and photometric bundle-adjustment. Such processes are
difficult to optimize due to the narrow basin of attraction of the
optimization loss in image space (local minima) and the lack of initial
correspondences. We mitigate these limitations by implementing a Gaussian
pyramid filter on top of NeRF, facilitating a coarse-to-fine tracking
optimization strategy. Furthermore, NeRF systems encounter challenges in
converging to the right geometry with limited input views. While prior
approaches use a Signed-Distance Function (SDF)-based NeRF and directly
supervise SDF values by approximating ground truth SDF through depth
measurements, this often results in suboptimal geometry. In contrast, our
method employs a volume density representation and introduces a novel KL
regularizer on the ray termination distribution, constraining scene geometry to
consist of empty space and opaque surfaces. Our solution implements both local
and global bundle-adjustment to produce a robust (coarse-to-fine) and accurate
(KL regularizer) SLAM solution. We conduct experiments on multiple datasets
(ScanNet, TUM, Replica) showing state-of-the-art results in tracking and in
reconstruction accuracy. |
Introduces SLAIM, a novel coarse-to-fine tracking model for NeRF-SLAM that achieves state-of-the-art tracking performance. |
Existing NeRF-SLAM systems have inferior tracking performance compared to traditional SLAM algorithms due to the narrow basin of attraction in image alignment and lack of initial correspondences. |
Implements a Gaussian pyramid filter on top of NeRF for coarse-to-fine tracking, and introduces a KL regularizer on the ray termination distribution to constrain scene geometry. |
Achieves state-of-the-art tracking performance on ScanNet and TUM datasets.
Shows superior reconstruction accuracy on Replica dataset compared to previous NeRF-SLAM methods.
Demonstrates the effectiveness of the Gaussian Pyramid filter and the custom KL regularizer through ablation studies. |
Struggles with reconstructing completely unobserved regions.
Performance slightly degrades with Gaussian Pyramid levels beyond a certain threshold. |
slam, nerf, 3d reconstruction, tracking, computer vision |
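For the SLAIM entry above, the Gaussian pyramid behind a coarse-to-fine strategy can be sketched on a 1D signal. The 5-tap binomial kernel, the downsampling by two, and the function names are assumptions for illustration; the entry does not say whether SLAIM downsamples or only blurs the rendered images, and the tracking optimization itself is not shown.

```python
# Binomial weights approximating a Gaussian; they sum to 1.
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]

def blur(signal):
    """Convolve with KERNEL, clamping indices at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def gaussian_pyramid(signal, levels):
    """Level 0 is the input; each further level is blurred, then subsampled by 2."""
    pyramid = [list(signal)]
    for _ in range(levels - 1):
        pyramid.append(blur(pyramid[-1])[::2])
    return pyramid

def coarse_to_fine(pyramid):
    """Order levels from coarsest to finest, the order an optimizer would visit them."""
    return pyramid[::-1]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
schedule = coarse_to_fine(gaussian_pyramid(signal, levels=3))
```

Starting the alignment at the coarsest level smooths the loss landscape, which is the usual motivation for widening a narrow basin of attraction before refining at full resolution.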
2404.11375
Report |
Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion |
Xinghan Wang, Zixi Kang, Yadong Mu |
Human motion understanding is a fundamental task with diverse practical
applications, facilitated by the availability of large-scale motion capture
datasets. Recent studies focus on text-motion tasks, such as text-based motion
generation, editing and question answering. In this study, we introduce the
novel task of text-based human motion grounding (THMG), aimed at precisely
localizing temporal segments corresponding to given textual descriptions within
untrimmed motion sequences. Capturing global temporal information is crucial
for the THMG task. However, transformer-based models that rely on global
temporal self-attention face challenges when handling long untrimmed sequences
due to the quadratic computational cost. We address these challenges by
proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that
integrates temporal global context, language query control, and spatial graph
topology with only linear memory cost. The core of the model is a
text-controlled selection mechanism which dynamically incorporates global
temporal information based on text query. The model is further enhanced to be
topology-aware through the integration of relational embeddings. For
evaluation, we introduce BABEL-Grounding, the first text-motion dataset that
provides detailed textual descriptions of human actions along with their
corresponding temporal segments. Extensive evaluations demonstrate the
effectiveness of TM-Mamba on BABEL-Grounding. |
This paper introduces a new task called text-based human motion grounding (THMG) and proposes TM-Mamba, a novel state-space model with linear memory cost to address this task. |
THMG seeks to locate temporal segments in untrimmed motion sequences matching textual descriptions, which is crucial for real-world applications where actions occur sparsely within long sequences. Existing transformer-based methods struggle here because the cost of global temporal self-attention grows quadratically with sequence length. |
The authors propose TM-Mamba, which incorporates a text-controlled selection mechanism into the Mamba algorithm, allowing dynamic information propagation based on text queries to extract relevant global context. Additionally, relational embeddings are integrated to model the human skeleton's graph topology. A new dataset, BABEL-Grounding, is also introduced for evaluation. |
TM-Mamba outperforms baseline methods, including those adapted from video moment retrieval and those based on SSMs or graph convolutions, demonstrating its effectiveness for THMG.
Ablation studies confirm the benefits of the text-controlled selection mechanism, bidirectional modeling, and relational embeddings.
Analysis of memory consumption shows that TM-Mamba maintains linear memory usage with increasing sequence length, unlike transformer-based models which quickly run out of memory. |
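The quadratic-versus-linear contrast above can be illustrated with simple arithmetic. The following is a rough sketch, not a measurement from the paper: the function names, the hidden size of 256, and the 4-byte values are all assumptions chosen for illustration.

```python
def attention_score_bytes(seq_len, bytes_per_value=4):
    """Memory for one seq_len x seq_len self-attention score matrix."""
    return seq_len * seq_len * bytes_per_value


def recurrent_activation_bytes(seq_len, hidden_dim=256, bytes_per_value=4):
    """Memory for per-frame activations of a linear-time sequence model.

    hidden_dim=256 is a hypothetical value, not taken from the paper.
    """
    return seq_len * hidden_dim * bytes_per_value


# Doubling the number of frames quadruples the attention matrix
# but only doubles the per-frame activations.
for frames in (1_000, 2_000, 4_000):
    print(frames, attention_score_bytes(frames), recurrent_activation_bytes(frames))
```

Under these assumptions, doubling the sequence length multiplies the attention term by four and the linear term by two, which is the scaling behaviour the memory analysis describes.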
The performance of TM-Mamba, though superior, degrades with increasing sequence length, suggesting further research on handling very long sequences.
The current work focuses on single-person motion; future work could explore extending TM-Mamba to multi-person scenarios for grounding actions involving interactions. |
human motion analysis, temporal grounding, state space models, mamba, text-motion multi-modal learning |
2404.11358
Report |
DeblurGS: Gaussian Splatting for Camera Motion Blur |
Jeongtaek Oh, Jaeyoung Chung, Dongwoo Lee, Kyoung Mu Lee |
Although significant progress has been made in reconstructing sharp 3D scenes
from motion-blurred images, a transition to real-world applications remains
challenging. The primary obstacle stems from the severe blur which leads to
inaccuracies in the acquisition of initial camera poses through
Structure-from-Motion, a critical aspect often overlooked by previous
approaches. To address this challenge, we propose DeblurGS, a method to
optimize sharp 3D Gaussian Splatting from motion-blurred images, even with the
noisy camera pose initialization. We restore a fine-grained sharp scene by
leveraging the remarkable reconstruction capability of 3D Gaussian Splatting.
Our approach estimates the 6-Degree-of-Freedom camera motion for each blurry
observation and synthesizes corresponding blurry renderings for the
optimization process. Furthermore, we propose Gaussian Densification Annealing
strategy to prevent the generation of inaccurate Gaussians at erroneous
locations during the early training stages when camera motion is still
imprecise. Comprehensive experiments demonstrate that our DeblurGS achieves
state-of-the-art performance in deblurring and novel view synthesis for
real-world and synthetic benchmark datasets, as well as field-captured blurry
smartphone videos. |
DeblurGS: a novel method to reconstruct sharp 3D scenes from motion-blurred images using Gaussian Splatting, addressing the challenge of inaccurate camera pose initialization from SfM. |
Existing NeRF-based methods struggle with inaccurate camera poses common in real-world blurry images, limiting their practical application. |
Jointly optimizes 3D Gaussian Splatting and camera motion (trajectory and sub-frame alignment) from blurry inputs, utilizing a Gaussian Densification Annealing strategy for robust optimization under noisy pose initialization. |
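One common way to synthesize a blurry rendering is to average sharp renderings taken at several sub-frame camera poses along the estimated trajectory. The sketch below assumes that blur model; the summary does not spell out the exact formulation. `render_sharp` is a hypothetical stand-in renderer in which the "pose" is just an integer shift of a 1D signal.

```python
import numpy as np


def render_sharp(scene, shift):
    """Hypothetical stand-in for a sharp render: the pose is a 1D integer shift."""
    return np.roll(scene, shift)


def synthesize_blur(scene, start_shift, end_shift, num_subframes):
    """Average sharp renders at poses interpolated across the exposure."""
    shifts = np.linspace(start_shift, end_shift, num_subframes).round().astype(int)
    frames = [render_sharp(scene, int(s)) for s in shifts]
    return np.mean(frames, axis=0)


scene = np.zeros(8)
scene[2] = 1.0  # a single bright point
blurry = synthesize_blur(scene, start_shift=0, end_shift=3, num_subframes=4)
print(blurry)  # the point is smeared evenly over four positions
```

In an optimization loop, a loss between such a synthesized blurry image and the observed blurry frame would drive updates to both the scene and the per-frame camera motion.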
Outperforms state-of-the-art methods in novel view synthesis and deblurring on benchmark datasets.
Achieves high-quality deblurring even with noisy camera poses from SfM, unlike previous methods.
Demonstrates successful application on real-world blurry videos captured by smartphones. |
Assumes a constant amount of blur throughout the exposure time.
Future work includes investigating varying blur kernels within a single exposure. |
3d gaussian splatting, camera motion deblurring, novel view synthesis, structure-from-motion, blurry image restoration |
2404.11207
Report |
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models |
Yichi Zhang, Yinpeng Dong, Siyuan Zhang, Tianzan Min, Hang Su, Jun Zhu |
Although Multimodal Large Language Models (MLLMs) have demonstrated promising
versatile capabilities, their performance is still inferior to specialized
models on downstream tasks, which makes adaptation necessary to enhance their
utility. However, fine-tuning methods require independent training for every
model, leading to huge computation and memory overheads. In this paper, we
propose a novel setting where we aim to improve the performance of diverse
MLLMs with a group of shared parameters optimized for a downstream task. To
achieve this, we propose Transferable Visual Prompting (TVP), a simple and
effective approach to generate visual prompts that can transfer to different
models and improve their performance on downstream tasks after being trained on only
one model. We introduce two strategies to address the issue of cross-model
feature corruption of existing visual prompting methods and enhance the
transferability of the learned prompts, including 1) Feature Consistency
Alignment, which imposes constraints on the prompted feature changes to
maintain task-agnostic knowledge; 2) Task Semantics Enrichment, which
encourages the prompted images to contain richer task-specific semantics with
language guidance. We validate the effectiveness of TVP through extensive
experiments with 6 modern MLLMs on a wide variety of tasks ranging from object
recognition and counting to multimodal reasoning and hallucination correction. |
The paper proposes Transferable Visual Prompting (TVP), a method for adapting Multimodal Large Language Models (MLLMs) to downstream tasks using transferable visual prompts. This approach aims to improve the performance of various MLLMs with a single set of shared parameters, reducing computation and storage overheads compared to fine-tuning. |
Current adaptation methods for MLLMs require individual fine-tuning for each model, leading to significant resource demands. This work aims to develop a more efficient and flexible solution for adapting multiple MLLMs simultaneously. |
TVP integrates two key strategies: 1) Feature Consistency Alignment (FCA) to mitigate cross-model feature corruption by aligning prompted features with original features, preserving general knowledge; 2) Task Semantics Enrichment (TSE) to enhance task-specific information in visual prompts by leveraging CLIP's image-text alignment. |
TVP effectively improves the performance of 6 diverse MLLMs on 10 datasets across various tasks, including recognition, counting, reasoning, and hallucination correction.
TVP demonstrates superior performance compared to existing visual prompting methods (VP and EVP), especially when transferring prompts to unseen models.
Model ensembling further enhances the transferability of visual prompts, leading to even greater performance improvements. |
Transferability to models with significantly different architectures (e.g., different language models) remains challenging.
TVP introduces additional computational overhead from forward passes through vision encoders compared to baseline visual prompting methods, though the increase is relatively small. |
multimodal large language models, visual prompting, transferability, parameter-efficient fine-tuning, model adaptation |
2404.11151
Report |
REACTO: Reconstructing Articulated Objects from a Single Video |
Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu |
In this paper, we address the challenge of reconstructing general articulated
3D objects from a single video. Existing works employing dynamic neural
radiance fields have advanced the modeling of articulated objects like humans
and animals from videos, but face challenges with piece-wise rigid general
articulated objects due to limitations in their deformation models. To tackle
this, we propose Quasi-Rigid Blend Skinning, a novel deformation model that
enhances the rigidity of each part while maintaining flexible deformation of
the joints. Our primary insight combines three distinct approaches: 1) an
enhanced bone rigging system for improved component modeling, 2) the use of
quasi-sparse skinning weights to boost part rigidity and reconstruction
fidelity, and 3) the application of geodesic point assignment for precise
motion and seamless deformation. Our method outperforms previous works in
producing higher-fidelity 3D reconstructions of general articulated objects, as
demonstrated on both real and synthetic datasets. Project page:
https://chaoyuesong.github.io/REACTO. |
This paper proposes REACTO, a novel method for reconstructing general articulated 3D objects from single casual videos by employing Quasi-Rigid Blend Skinning (QRBS) and a new rigging system defined on bones. |
Existing methods struggle to model the piece-wise rigidity and complex motion of general articulated objects in casual videos, often leading to artifacts and inaccuracies. |
REACTO defines a rig on bones for each rigid part and utilizes QRBS to combine the rigidity of Rigid Skinning with the flexibility of Dual Quaternion Blend Skinning. Geodesic distance is employed for precise point assignment to bones or joints. |
REACTO outperforms state-of-the-art methods in reconstructing detailed shapes and motions of articulated objects.
QRBS effectively models the piece-wise rigidity and smooth deformation on the joints.
Defining the rig on bones rather than on joints enhances the rigidity and motion integrity of each component. |
Reconstruction quality may degrade on the unseen sides of objects due to partial views in casual videos.
Future work could explore extending REACTO to handle more complex object interactions and occlusions. |
3d reconstruction, articulated objects, single-view reconstruction, deformation modeling, quasi-rigid blend skinning |
2404.11120
Report |
TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing |
Sherry X. Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, Pradeep Sen |
Despite many attempts to leverage pre-trained text-to-image models (T2I) like
Stable Diffusion (SD) for controllable image editing, producing good
predictable results remains a challenge. Previous approaches have focused on
either fine-tuning pre-trained T2I models on specific datasets to generate
certain kinds of images (e.g., with a specific object or person), or on
optimizing the weights, text prompts, and/or learning features for each input
image in an attempt to coax the image generator to produce the desired result.
However, these approaches all have shortcomings and fail to produce good
results in a predictable and controllable manner. To address this problem, we
present TiNO-Edit, an SD-based method that focuses on optimizing the noise
patterns and diffusion timesteps during editing, something previously
unexplored in the literature. With this simple change, we are able to generate
results that both better align with the original images and reflect the desired
result. Furthermore, we propose a set of new loss functions that operate in the
latent domain of SD, greatly speeding up the optimization when compared to
prior approaches, which operate in the pixel domain. Our method can be easily
applied to variations of SD including Textual Inversion and DreamBooth that
encode new concepts and incorporate them into the edited results. We present a
host of image-editing capabilities enabled by our approach. Our code is
publicly available at https://github.com/SherryXTChen/TiNO-Edit. |
This paper presents TiNO-Edit, a novel Stable Diffusion-based image editing method that optimizes noise patterns and diffusion timesteps for improved controllability and predictability. |
Controllable and predictable image editing with pre-trained text-to-image models remains a challenge. TiNO-Edit addresses it by optimizing the noise patterns and diffusion timesteps during editing, which prior work had not explored. |
The method optimizes the noise and timesteps used in the Stable Diffusion denoising process by minimizing a set of loss functions operating in the latent domain. |
TiNO-Edit demonstrates robust performance across various image editing tasks, including object replacement, addition, style transfer, stroke-based editing, and image composition.
The method outperforms existing baselines in both qualitative and quantitative comparisons, showing better alignment with user intent and image content.
By operating in the latent domain, the method offers significant computational advantages over pixel-domain optimization approaches. |
The reliance on CLIP for semantic guidance might limit the method's ability to capture complex or nuanced semantic relationships.
Exploring different optimization strategies and loss functions could further improve the method's performance. |
image editing, stable diffusion, diffusion models, text-to-image synthesis, latent space optimization |
2404.11098
Report |
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models |
Dingkun Zhang, Sijia Li, Chen Chen, Qingsong Xie, Haonan Lu |
In the era of AIGC, the demand for low-budget or even on-device applications
of diffusion models has emerged. In terms of compressing the Stable Diffusion
models (SDMs), several approaches have been proposed, and most of them
leveraged the handcrafted layer removal methods to obtain smaller U-Nets, along
with knowledge distillation to recover the network performance. However, such a
handcrafting manner of layer removal is inefficient and lacks scalability and
generalization, and the feature distillation employed in the retraining phase
faces an imbalance issue that a few numerically significant feature loss terms
dominate over others throughout the retraining process. To this end, we
propose layer pruning and normalized distillation for compressing
diffusion models (LAPTOP-Diff). We 1) introduce a layer pruning method to
compress SDM's U-Net automatically and propose an effective one-shot pruning
criterion whose one-shot performance is guaranteed by its good additivity
property, surpassing other layer pruning and handcrafted layer removal methods,
and 2) propose normalized feature distillation for retraining, which alleviates
the imbalance issue. Using the proposed LAPTOP-Diff, we compressed the U-Nets of
SDXL and SDM-v1.5 for the most advanced performance, achieving a minimal 4.0%
decline in PickScore at a pruning ratio of 50% while the comparative methods'
minimal PickScore decline is 8.2%. We will release our code. |
Presents LAPTOP-Diff, a method for compressing Stable Diffusion Models (SDMs) using layer pruning and normalized distillation. |
SDMs, while powerful, have high memory consumption and latency, limiting their deployment on resource-constrained devices. Existing compression methods are often handcrafted, inefficient, and lack scalability. |
1. Formulates layer pruning as a combinatorial optimization problem and solves it using a one-shot approach with an output loss based pruning criterion. 2. Introduces normalized feature distillation during retraining to alleviate the imbalance issue in feature loss terms. |
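The imbalance issue and its normalized remedy can be sketched as follows. This is a minimal illustration under an assumption: each feature loss term is divided by the mean squared magnitude of the teacher feature. The paper's exact normalization may differ, and the function names are hypothetical.

```python
import numpy as np


def vanilla_terms(teacher_feats, student_feats):
    """Per-layer mean squared error between teacher and student features."""
    return [float(np.mean((t - s) ** 2)) for t, s in zip(teacher_feats, student_feats)]


def normalized_terms(teacher_feats, student_feats, eps=1e-8):
    """Per-layer error divided by teacher feature magnitude (assumed normalizer)."""
    return [
        float(np.mean((t - s) ** 2) / (np.mean(t ** 2) + eps))
        for t, s in zip(teacher_feats, student_feats)
    ]


# Two layers with the same 10% relative error but very different scales.
teacher = [np.ones(4), 100.0 * np.ones(4)]
student = [0.9 * np.ones(4), 90.0 * np.ones(4)]

print(vanilla_terms(teacher, student))     # the large-scale layer dominates
print(normalized_terms(teacher, student))  # both layers contribute equally
```

With the vanilla loss, the second layer's term is four orders of magnitude larger than the first; after normalization both terms have the same size, so neither dominates the retraining signal.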
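Achieves state-of-the-art compression results on the U-Nets of SDXL and SDM-v1.5: a minimal 4.0% decline in PickScore at a 50% pruning ratio, versus a minimal 8.2% decline for the comparative methods.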
Demonstrates the effectiveness of the output loss criterion, attributed to its strong additivity property.
Shows that normalized feature distillation significantly improves performance compared to vanilla distillation. |
The additivity assumption might not hold for other downstream tasks or datasets.
Alternative pruning criteria beyond output loss, task loss, and CLIP score remain to be explored. |
model compression, diffusion models, layer pruning, knowledge distillation, stable diffusion |
2404.10947
Report |
Residual Connections Harm Self-Supervised Abstract Feature Learning |
Xiao Zhang, Ruoxi Jiang, William Gao, Rebecca Willett, Michael Maire |
We demonstrate that adding a weighting factor to decay the strength of
identity shortcuts within residual networks substantially improves semantic
feature learning in the state-of-the-art self-supervised masked autoencoding
(MAE) paradigm. Our modification to the identity shortcuts within a ViT-B/16
backbone of an MAE boosts linear probing accuracy on ImageNet from 67.3% to
72.3%. This significant gap suggests that, while residual connection structure
serves an essential role in facilitating gradient propagation, it may have a
harmful side effect of reducing capacity for abstract learning by virtue of
injecting an echo of shallower representations into deeper layers. We
ameliorate this downside via a fixed formula for monotonically decreasing the
contribution of identity connections as layer depth increases. Our design
promotes the gradual development of feature abstractions, without impacting
network trainability. Analyzing the representations learned by our modified
residual networks, we find correlation between low effective feature rank and
downstream task performance. |
This paper proposes decayed identity shortcuts for residual networks, improving semantic feature learning in self-supervised masked autoencoding. |
Residual connections, while good for gradient propagation, can hinder abstract feature learning by injecting shallow representations into deeper layers. |
The authors introduce a depth-dependent scaling factor to gradually decrease the weight of identity shortcuts as layer depth increases. |
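A decayed identity shortcut can be sketched as a residual update `x = alpha * x + f(x)` whose shortcut weight `alpha` shrinks with depth. The linear schedule and the `alpha_min` parameter below are assumptions for illustration; the source only states that the formula is fixed and monotonically decreasing.

```python
import numpy as np


def shortcut_weights(num_layers, alpha_min):
    """Linearly decay the shortcut weight from just below 1 down to alpha_min.

    The linear form is an assumed schedule, not necessarily the paper's formula.
    """
    layers = np.arange(1, num_layers + 1)
    return 1.0 - (1.0 - alpha_min) * layers / num_layers


def forward(x, blocks, alphas):
    """Residual stack where each identity shortcut is scaled by its alpha."""
    for block, alpha in zip(blocks, alphas):
        x = alpha * x + block(x)
    return x


alphas = shortcut_weights(num_layers=4, alpha_min=0.6)
print(alphas)  # decreasing weights: deeper layers keep less of their input

# With blocks that output zero, only the decayed shortcuts carry the signal.
zero_blocks = [lambda x: np.zeros_like(x)] * 4
out = forward(np.ones(3), zero_blocks, alphas)
```

Setting `alpha_min=1.0` recovers a standard residual network, which makes the schedule easy to ablate.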
Boosts linear probing accuracy on ImageNet from 67.3% to 72.3% for a ViT-B/16 backbone in an MAE framework.
Smaller models with decayed identity shortcuts outperform larger models with standard residual connections (ViT-S/16 outperforms baseline ViT-B/16).
Correlation between low effective feature rank and improved downstream task performance is observed. |
The optimal decay rate might require tuning for different architectures and datasets.
Further theoretical analysis on the relationship between low effective rank and abstract representation learning is needed. |
self-supervised learning, masked autoencoding, residual networks, representation learning, low-rank features |
2404.10864
Report |
Vocabulary-free Image Classification and Semantic Segmentation |
Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci |
Large vision-language models revolutionized image classification and semantic
segmentation paradigms. However, they typically assume a pre-defined set of
categories, or vocabulary, at test time for composing textual prompts. This
assumption is impractical in scenarios with unknown or evolving semantic
context. Here, we address this issue and introduce the Vocabulary-free Image
Classification (VIC) task, which aims to assign a class from an unconstrained
language-induced semantic space to an input image without needing a known
vocabulary. VIC is challenging due to the vastness of the semantic space, which
contains millions of concepts, including fine-grained categories. To address
VIC, we propose Category Search from External Databases (CaSED), a
training-free method that leverages a pre-trained vision-language model and an
external database. CaSED first extracts the set of candidate categories from
the most semantically similar captions in the database and then assigns the
image to the best-matching candidate category according to the same
vision-language model. Furthermore, we demonstrate that CaSED can be applied
locally to generate a coarse segmentation mask that classifies image regions,
introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its
variants outperform other, more complex vision-language models on
classification and semantic segmentation benchmarks while using far fewer
parameters. |
The paper introduces two novel tasks: Vocabulary-free Image Classification (VIC) and Vocabulary-free Semantic Segmentation (VSS), aiming to classify and segment images without predefined categories. |
These tasks are crucial for handling scenarios with unknown or evolving semantic contexts, common in real-world applications like autonomous agents in unconstrained environments. |
The proposed method, Category Search from External Databases (CaSED), leverages a pre-trained vision-language model (VLM) and an external database of image captions to extract candidate categories and score them based on multimodal similarity. CaSED is extended to VSS through various strategies, including DenseCaSED, which processes multi-scale image patches with the VLM and performs local category retrieval and scoring. |
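The retrieve-then-score pipeline described above can be sketched with toy embeddings. Everything here is an illustrative assumption: the 2D vectors, the `word_table` stand-in for a text encoder, the whitespace-based candidate extraction, and the equal weighting of image and caption similarity.

```python
import numpy as np


def unit(v):
    """L2-normalize a vector or each row of a matrix."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def classify_without_vocabulary(image_emb, captions, caption_embs, embed_text, top_k=2):
    image_emb = unit(image_emb)
    caption_embs = unit(caption_embs)
    # 1) Retrieve the captions most similar to the image.
    nearest = np.argsort(-(caption_embs @ image_emb))[:top_k]
    # 2) Extract candidate categories (here: every word of the retrieved captions).
    candidates = sorted({word for i in nearest for word in captions[i].split()})
    candidate_embs = unit([embed_text(word) for word in candidates])
    # 3) Score candidates against the image and the retrieved-caption centroid.
    caption_centroid = unit(caption_embs[nearest].mean(axis=0))
    scores = 0.5 * (candidate_embs @ image_emb + candidate_embs @ caption_centroid)
    return candidates[int(np.argmax(scores))]


# Hypothetical toy "text encoder" and caption database.
word_table = {"dog": [1.0, 0.0], "park": [0.8, 0.6], "ball": [0.6, 0.8],
              "cat": [0.0, 1.0], "sofa": [0.1, 0.9]}
captions = ["dog park", "dog ball", "cat sofa"]
caption_embs = [[0.9, 0.3], [0.8, 0.4], [0.05, 0.95]]

label = classify_without_vocabulary([1.0, 0.1], captions, caption_embs, word_table.__getitem__)
print(label)
```

Because no class list is passed in, the label space is defined entirely by what the retrieval step surfaces, which is why database quality and coverage matter so much for this approach.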
CaSED and its variants outperform other VLMs in classification benchmarks, achieving higher cluster accuracy, semantic similarity, and semantic IoU.
For VSS, CaSED combined with an open-vocabulary segmentation model performs best, while DenseCaSED shows promise despite lacking a dedicated segmentation component.
Prompt ensembling consistently improves performance across datasets and tasks. |
The effectiveness of CaSED depends on the quality and coverage of the retrieval database.
Future work includes addressing label inconsistencies, handling class granularity, and improving the computational efficiency of DenseCaSED. |
vision and language, vocabulary-free classification, vocabulary-free segmentation, cased, densecased |
2404.10772
Report |
Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes |
Zehao Yu, Torsten Sattler, Andreas Geiger |
Recently, 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view
synthesis results, while allowing the rendering of high-resolution images in
real-time. However, leveraging 3D Gaussians for surface reconstruction poses
significant challenges due to the explicit and disconnected nature of 3D
Gaussians. In this work, we present Gaussian Opacity Fields (GOF), a novel
approach for efficient, high-quality, and compact surface reconstruction in
unbounded scenes. Our GOF is derived from ray-tracing-based volume rendering of
3D Gaussians, enabling direct geometry extraction from 3D Gaussians by
identifying its levelset, without resorting to Poisson reconstruction or TSDF
fusion as in previous work. We approximate the surface normal of Gaussians as
the normal of the ray-Gaussian intersection plane, enabling the application of
regularization that significantly enhances geometry. Furthermore, we develop an
efficient geometry extraction method utilizing marching tetrahedra, where the
tetrahedral grids are induced from 3D Gaussians and thus adapt to the scene's
complexity. Our evaluations reveal that GOF surpasses existing 3DGS-based
methods in surface reconstruction and novel view synthesis. Further, it
compares favorably to, or even outperforms, neural implicit methods in both
quality and speed. |
Presents Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes using 3D Gaussians. |
Addresses limitations of existing 3D Gaussian surface reconstruction methods, which struggle with fine-grained geometry and background reconstruction and rely on computationally expensive or inconsistent post-processing techniques such as Poisson reconstruction or TSDF fusion. |
1. Establishes a Gaussian opacity field consistent with volume rendering, enabling direct surface extraction via level set identification.
2. Employs ray-Gaussian intersection normals for regularization, enhancing geometry reconstruction.
3. Develops an efficient tetrahedra-based mesh extraction method using 3D Gaussian positions and scales, resulting in compact and adaptive meshes. |
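Level-set extraction along a tetrahedron edge can be illustrated with a binary search between a vertex outside the surface and one inside it. This is a simplified sketch: the isotropic toy opacity field, the level value of 0.5, and the function names are assumptions, not the paper's implementation.

```python
import numpy as np


def toy_opacity(point):
    """Hypothetical opacity field: a single isotropic Gaussian at the origin."""
    return float(np.exp(-np.dot(point, point)))


def find_level_crossing(opacity, p_out, p_in, level=0.5, iters=30):
    """Binary-search the point on segment p_out -> p_in where opacity hits level.

    Assumes opacity(p_out) < level <= opacity(p_in).
    """
    lo = np.asarray(p_out, dtype=float)
    hi = np.asarray(p_in, dtype=float)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if opacity(mid) < level:
            lo = mid  # still outside the surface
        else:
            hi = mid  # inside or on the surface
    return 0.5 * (lo + hi)


surface_point = find_level_crossing(toy_opacity, p_out=[2.0, 0.0, 0.0], p_in=[0.0, 0.0, 0.0])
print(surface_point)
```

Each search step needs one opacity evaluation, so the number of iterations per edge directly determines the cost of mesh extraction.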
GOF outperforms existing 3DGS-based methods in surface reconstruction and novel view synthesis on Tanks and Temples, DTU, and Mip-NeRF 360 datasets.
GOF achieves competitive surface reconstruction quality compared to SOTA neural implicit methods while being significantly faster.
Ablation studies confirm the effectiveness of GOF's mesh extraction, regularization, and decoupled appearance modeling. |
Delaunay triangulation for tetrahedral grid generation poses a computational bottleneck.
Opacity evaluation during marching tetrahedra binary search could be optimized. |
3d gaussian splatting, surface reconstruction, novel view synthesis, unbounded scenes, mesh extraction |
2404.10765
Report |
RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting |
Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic |
Neural reconstruction approaches are rapidly emerging as the preferred
representation for 3D scenes, but their limited editability is still posing a
challenge. In this work, we propose an approach for 3D scene inpainting -- the
task of coherently replacing parts of the reconstructed scene with desired
content. Scene inpainting is an inherently ill-posed task as there exist many
solutions that plausibly replace the missing content. A good inpainting method
should therefore not only enable high-quality synthesis but also a high degree
of control. Based on this observation, we focus on enabling explicit control
over the inpainted content and leverage a reference image as an efficient means
to achieve this goal. Specifically, we introduce RefFusion, a novel 3D
inpainting method based on a multi-scale personalization of an image inpainting
diffusion model to the given reference view. The personalization effectively
adapts the prior distribution to the target scene, resulting in a lower
variance of score distillation objective and hence significantly sharper
details. Our framework achieves state-of-the-art results for object removal
while maintaining high controllability. We further demonstrate the generality
of our formulation on other downstream tasks such as object insertion, scene
outpainting, and sparse view reconstruction. |
Introduces RefFusion, a novel 3D scene inpainting method using multi-scale personalization of an image inpainting diffusion model, achieving high-quality, controllable inpaintings. |
3D scene inpainting is crucial for editing neural scene representations, but existing methods struggle with balancing controllability, detail, and multi-view consistency. |
RefFusion adapts an inpainting diffusion model to a reference view, then distills its priors to the 3D scene using a multi-scale score distillation objective. It further leverages Gaussian splatting to isolate masked regions and applies depth and adversarial regularization. |
Outperforms previous 3D inpainting methods on the SPIn-NeRF dataset in both quantitative metrics and user studies.
Demonstrates superior performance on scenes with large camera motion compared to single-view reference-based approaches.
Shows generalization capabilities for object insertion, sparse view reconstruction, and scene outpainting. |
Removing large objects covering significant portions of the reference image remains challenging.
Personalizing the diffusion model can be time-consuming. |
3d inpainting, diffusion models, score distillation sampling, gaussian splatting, neural scene representation |
2404.10763
Report |
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? |
Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun |
Diffusion models have exhibited remarkable capabilities in text-to-image
generation. However, their performance in image-to-text generation,
specifically image captioning, has lagged behind Auto-Regressive (AR) models,
casting doubt on their applicability for such tasks. In this work, we revisit
diffusion models, highlighting their capacity for holistic context modeling and
parallel decoding. With these benefits, diffusion models can alleviate the
inherent limitations of AR methods, including their slow inference speed, error
propagation, and unidirectional constraints. Furthermore, we trace the prior
underperformance of diffusion models to the absence of an effective
latent space for image-text alignment, and the discrepancy between continuous
diffusion processes and discrete textual data. In response, we introduce a
novel architecture, LaDiC, which utilizes a split BERT to create a dedicated
latent space for captions and integrates a regularization module to manage
varying text lengths. Our framework also includes a diffuser for semantic
image-to-text conversion and a Back&Refine technique to enhance token
interactivity during inference. LaDiC achieves state-of-the-art performance for
diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2
CIDEr, demonstrating exceptional performance without pre-training or ancillary
modules. This indicates strong competitiveness with AR models, revealing the
previously untapped potential of diffusion models in image-to-text generation. |
Introduces LaDiC, a novel diffusion-based image captioning model utilizing a split BERT for a dedicated text latent space and a regularization module for variable text lengths, outperforming previous diffusion methods. |
Addresses the limitations of auto-regressive models in image captioning, such as slow inference speed, error propagation, and unidirectional constraints, while overcoming the shortcomings of existing diffusion-based methods. |
Employs a split BERT to create a text latent space, trains a diffuser to map image representations to this latent space, and uses a Non-Auto-Regressive (NAR) decoder to generate captions. Introduces Back&Refine for improved token interactivity during inference. |
Achieves state-of-the-art performance for diffusion-based methods on MS COCO, with 38.2 BLEU@4 and 126.2 CIDEr.
Demonstrates faster inference speed compared to auto-regressive models, especially for longer captions.
Exhibits flexibility in caption generation, enabling custom generation based on tokens in nearly any position. |
Limited exploration of other modalities and pure text generation.
Reliance on relatively small models and datasets compared to large-scale autoregressive models. |
image captioning, diffusion models, non-autoregressive generation, vision-language models, bert |
2404.10716
Report |
MOWA: Multiple-in-One Image Warping Model |
Kang Liao, Zongsheng Yue, Zhonghua Wu, Chen Change Loy |
While recent image warping approaches achieved remarkable success on existing
benchmarks, they still require training separate models for each specific task
and cannot generalize well to different camera models or customized
manipulations. To address diverse types of warping in practice, we propose a
Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we
mitigate the difficulty of multi-task learning by disentangling the motion
estimation at both the region level and pixel level. To further enable dynamic
task-aware image warping, we introduce a lightweight point-based classifier
that predicts the task type, serving as prompts to modulate the feature maps
for better estimation. To our knowledge, this is the first work that solves
multiple practical warping tasks in one single model. Extensive experiments
demonstrate that our MOWA, which is trained on six tasks for multiple-in-one
image warping, outperforms state-of-the-art task-specific models across most
tasks. Moreover, MOWA also exhibits promising potential to generalize into
unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code
will be made publicly available. |
This paper proposes MOWA, the first practical multiple-in-one image warping framework that can address various warping tasks within a single model. |
Existing image warping approaches require training separate models for each task and lack generalization ability. MOWA tackles these limitations by enabling a single model to handle diverse warping tasks. |
MOWA disentangles motion estimation at region and pixel levels using TPS transformation and residual flow. It employs a lightweight point-based classifier for task-type prediction and a prompt learning module for task-aware warping. |
MOWA outperforms state-of-the-art task-specific models on most of the six evaluated warping tasks.
It exhibits promising generalization to unseen scenes, as demonstrated by cross-domain and zero-shot evaluations.
The hierarchical motion estimation and task-aware prompt learning strategy contribute to MOWA's effectiveness in multi-task image warping. |
MOWA may struggle with extremely complex image boundaries due to the limited number of control points.
Scaling up the input resolution could potentially improve the warping performance, which is left for future work. |
image warping, multiple-in-one model, prompt learning, tps transformation, computational photography |
2404.10700
Report |
Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs |
Georgy Perevozchikov, Nancy Mehta, Mahmoud Afifi, Radu Timofte |
Modern smartphone camera quality heavily relies on the image signal processor
(ISP) to enhance captured raw images, utilizing carefully designed modules to
produce final output images encoded in a standard color space (e.g., sRGB).
Neural-based end-to-end learnable ISPs offer promising advancements,
potentially replacing traditional ISPs with their ability to adapt without
requiring extensive tuning for each new camera model, as is often the case for
nearly every module in traditional ISPs. However, the key challenge with the
recent learning-based ISPs is the need to collect large paired datasets for
each distinct camera model due to the influence of intrinsic camera
characteristics on the formation of input raw images. This paper tackles this
challenge by introducing a novel method for unpaired learning of raw-to-raw
translation across diverse cameras. Specifically, we propose Rawformer, an
unsupervised Transformer-based encoder-decoder method for raw-to-raw
translation. It accurately maps raw images captured by a certain camera to the
target camera, facilitating the generalization of learnable ISPs to new unseen
cameras. Our method demonstrates superior performance on real camera datasets,
achieving higher accuracy compared to previous state-of-the-art techniques, and
preserving a more robust correlation between the original and translated raw
images. |
This paper introduces Rawformer, a novel unsupervised Transformer-based method for unpaired raw-to-raw image translation across diverse cameras, enabling generalization of learnable ISPs to unseen cameras without retraining. |
Modern smartphone camera ISPs require extensive tuning for each new camera model. Rawformer addresses this by enabling the use of pre-trained neural-based ISPs on new cameras without the need for paired datasets from each new model, simplifying ISP development and reducing costs. |
Rawformer utilizes an unsupervised encoder-decoder Transformer architecture with contextual-scale aware downsampler and upsampler blocks for efficient encoding of global and local image information. It also introduces a cross-domain attention-driven discriminator for stable training. |
Rawformer achieves state-of-the-art results on raw-to-raw translation benchmarks, significantly outperforming previous methods.
The method effectively maps raw images to the target camera's raw space, enabling accurate rendering using neural-based ISPs trained on different camera models.
Rawformer demonstrates robust improvement in cross-camera ISP rendering, with only a marginal reduction in accuracy compared to camera-specific ISP models. |
The model's inference time, while feasible on GPUs, may be impractical for real-time rendering on devices with limited computational power.
Future work will focus on developing lighter models for real-time performance on CPUs, broadening its applicability. |
raw image processing, image signal processor (isp), unsupervised learning, domain adaptation, transformers |
2404.10690
Report |
MathWriting: A Dataset For Handwritten Mathematical Expression Recognition |
Philippe Gervais, Asya Fadeeva, Andrii Maksai |
We introduce MathWriting, the largest online handwritten mathematical
expression dataset to date. It consists of 230k human-written samples and an
additional 400k synthetic ones. MathWriting can also be used for offline HME
recognition and is larger than all existing offline HME datasets like
IM2LATEX-100K. We introduce a benchmark based on MathWriting data in order to
advance research on both online and offline HME recognition. |
Introduces MathWriting, the largest online handwritten mathematical expression dataset to date, containing 230k human-written and 400k synthetic samples, along with a benchmark for online and offline HME recognition. |
Addresses the lack of large, diverse datasets for handwritten mathematical expression recognition, crucial for advancing research and development in this area. |
Collected human-written expressions using an Android app, synthesized expressions by stitching together isolated symbols, normalized LaTeX labels, and split data into train/validation/test sets. |
MathWriting significantly expands the size and symbol coverage compared to existing datasets like CROHME23.
Benchmark results show superior performance of online recognition models (CTC Transformer, PaLI) over offline methods (OCR).
Analysis reveals common recognition errors include character confusion and incorrect subexpression nesting. |
Label normalization, while improving model performance, could be further refined for specific applications.
Inherent ambiguities in handwritten expressions pose challenges for achieving human-level recognition accuracy. |
handwriting recognition, mathematical expressions, dataset, benchmark, latex |
2404.10685
Report |
Generating Human Interaction Motions in Scenes with Text Control |
Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe |
We present TeSMo, a method for text-controlled scene-aware motion generation
based on denoising diffusion models. Previous text-to-motion methods focus on
characters in isolation without considering scenes due to the limited
availability of datasets that include motion, text descriptions, and
interactive scenes. Our approach begins with pre-training a scene-agnostic
text-to-motion diffusion model, emphasizing goal-reaching constraints on
large-scale motion-capture datasets. We then enhance this model with a
scene-aware component, fine-tuned using data augmented with detailed scene
information, including ground plane and object shapes. To facilitate training,
we embed annotated navigation and interaction motions within scenes. The
proposed method produces realistic and diverse human-object interactions, such
as navigation and sitting, in different scenes with various object shapes,
orientations, initial body positions, and poses. Extensive experiments
demonstrate that our approach surpasses prior techniques in terms of the
plausibility of human-scene interactions, as well as the realism and variety of
the generated motions. Code will be released upon publication of this work at
https://research.nvidia.com/labs/toronto-ai/tesmo. |
This paper introduces TeSMo, a novel text-controlled and scene-aware method for generating human-scene interaction motions based on denoising diffusion models. |
Generating realistic and controllable human motion within 3D scenes is crucial for various applications, from gaming to embodied AI. Previous methods struggle to simultaneously offer text controllability and scene awareness with high motion quality, especially for complex interactions. |
The approach decomposes the task into navigation and interaction stages, each using a diffusion model with an augmented scene-aware branch. First, a scene-agnostic text-to-motion model is trained on large-scale motion capture data. Then, a separate branch is fine-tuned with scene information (2D floor maps for navigation and 3D object geometry for interaction) using data augmented with realistic interactions in scenes. |
The method generates plausible motions that navigate through scenes, avoid obstacles, and interact realistically with objects, all while adhering to textual descriptions.
Experiments demonstrate superior goal-reaching accuracy and fewer object penetrations compared to state-of-the-art methods.
A user study reveals a preference for interactions generated by TeSMo over a leading reinforcement learning approach. |
The two-stage navigation approach can lead to inconsistencies between the generated pelvis trajectory and full-body poses.
The current method is limited to static objects and a fixed set of actions. |
motion synthesis, human-scene interaction, diffusion models, text-to-motion, scene-aware motion generation |
2404.10667
Report |
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time |
Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo |
We introduce VASA, a framework for generating lifelike talking faces with
appealing visual affective skills (VAS) given a single static image and a
speech audio clip. Our premier model, VASA-1, is capable of not only producing
lip movements that are exquisitely synchronized with the audio, but also
capturing a large spectrum of facial nuances and natural head motions that
contribute to the perception of authenticity and liveliness. The core
innovations include a holistic facial dynamics and head movement generation
model that works in a face latent space, and the development of such an
expressive and disentangled face latent space using videos. Through extensive
experiments including evaluation on a set of new metrics, we show that our
method significantly outperforms previous methods along various dimensions
comprehensively. Our method not only delivers high video quality with realistic
facial and head dynamics but also supports the online generation of 512x512
videos at up to 40 FPS with negligible starting latency. It paves the way for
real-time engagements with lifelike avatars that emulate human conversational
behaviors. |
This paper introduces VASA, a framework that generates realistic talking face videos from a single static image and a speech audio clip, featuring accurate lip synchronization, expressive facial dynamics, and natural head movements. |
Realistic AI-generated talking faces have broad applications in communication, education, healthcare, and beyond, enhancing human-computer interaction and accessibility. |
The method constructs an expressive and disentangled face latent space from videos. It then uses a Diffusion Transformer model to generate holistic facial dynamics and head movements in this latent space, conditioned on audio and optional control signals like gaze direction and emotion offset. A face decoder then generates video frames based on these latent motions and the input image. |
The method achieves superior audio-lip synchronization, significantly outperforming existing methods.
It generates more natural and varied head movements synchronized with the audio compared to previous approaches.
VASA produces high-quality videos with realistic facial expressions and subtle nuances like eye blinks and gaze shifts, achieving state-of-the-art video quality scores (FVD). |
The method currently only models human regions up to the torso and lacks explicit modeling of non-rigid elements like hair.
Incorporating more diverse talking styles and emotions in the training data could further enhance expressiveness and control. |
talking face generation, audio-driven animation, diffusion models, latent space representation, visual affective skills |
2404.10625
Report |
Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks |
Florian Barthel, Arian Beckmann, Wieland Morgenstern, Anna Hilsmann, Peter Eisert |
NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or
GIRAFFE have shown very high rendering quality under large representational
variety. However, rendering with Neural Radiance Fields poses challenges for 3D
applications: First, the significant computational demands of NeRF rendering
preclude its use on low-power devices, such as mobiles and VR/AR headsets.
Second, implicit representations based on neural networks are difficult to
incorporate into explicit 3D scenes, such as VR environments or video games. 3D
Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit
3D representation that can be rendered efficiently at high frame rates. In this
work, we present a novel approach that combines the high rendering quality of
NeRF-based 3D-aware GANs with the flexibility and computational advantages of
3DGS. By training a decoder that maps implicit NeRF representations to explicit
3D Gaussian Splatting attributes, we can integrate the representational
diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting
for the first time. Additionally, our approach allows for a high resolution GAN
inversion and real-time GAN editing with 3D Gaussian Splatting scenes. |
This paper presents a novel method for synthesizing explicit 3D scenes of human heads from a latent space by combining the advantages of 3D-aware GANs (high quality and representational variety) and 3D Gaussian Splatting (efficient rendering and flexibility). |
This approach addresses the limitations of NeRF-based GANs, which are difficult to integrate into 3D modeling environments due to their implicit representations and slow rendering speeds. |
The method involves training a sequential decoder network that maps implicit NeRF representations from a pre-trained 3D GAN to explicit 3D Gaussian Splatting attributes (position, color, rotation, scale, and opacity). The decoder leverages the geometric information from the GAN's tri-plane features for position initialization and employs a combination of loss functions for training. |
The proposed method achieves high visual similarity between the decoded Gaussian Splatting scenes and the target GAN renderings.
It achieves rendering speeds up to 5 times faster than the target GANs with the flexibility of arbitrary rendering resolutions.
It enables the application of GAN editing and inversion methods to explicit 3D Gaussian Splatting scenes. |
The output fidelity of the method is currently limited by the fidelity of the pre-trained 3D GAN used.
The lack of view-dependent spherical harmonics in the decoder can lead to uncanny or blurry eye renderings. |
3d gaussian splatting, 3d-aware gans, neural radiance fields, 3d head synthesis, real-time rendering |
2404.10618
Report |
Private Attribute Inference from Images with Vision-Language Models |
Batuhan Tömekçe, Mark Vero, Robin Staab, Martin Vechev |
As large language models (LLMs) become ubiquitous in our daily tasks and
digital interactions, associated privacy risks are increasingly in focus. While
LLM privacy research has primarily focused on the leakage of model training
data, it has recently been shown that the increase in models' capabilities has
enabled LLMs to make accurate privacy-infringing inferences from previously
unseen texts. With the rise of multimodal vision-language models (VLMs),
capable of understanding both images and text, a pertinent question is whether
such results transfer to the previously unexplored domain of benign images
posted online. To investigate the risks associated with the image reasoning
capabilities of newly emerging VLMs, we compile an image dataset with
human-annotated labels of the image owner's personal attributes. In order to
understand the additional privacy risk posed by VLMs beyond traditional human
attribute recognition, our dataset consists of images where the inferable
private attributes do not stem from direct depictions of humans. On this
dataset, we evaluate the inferential capabilities of 7 state-of-the-art VLMs,
finding that they can infer various personal attributes at up to 77.6%
accuracy. Concerningly, we observe that accuracy scales with the general
capabilities of the models, implying that future models can be misused as
stronger adversaries, establishing an imperative for the development of
adequate defenses. |
This paper presents the first investigation into the privacy risks posed by Vision-Language Models (VLMs) inferring personal information from images posted on pseudonymized platforms. |
With the increasing adoption of VLMs, their ability to deduce private information from seemingly innocuous images raises significant privacy concerns that challenge current online privacy understandings. |
The authors created a dataset of images and annotated them with personal attributes, then tested 7 state-of-the-art VLMs on their ability to infer these attributes. They also developed methods to circumvent safety filters and enhance inference accuracy. |
Both proprietary and open-source VLMs demonstrated the ability to infer private attributes from images with high accuracy (up to 77.6%).
Current safety filters in VLMs are easily circumvented, even with simple prompt engineering techniques.
Inference accuracy is strongly correlated with a model's general capabilities, suggesting future, more powerful models will pose a greater privacy risk. |
The dataset used for evaluation, while reflecting real-world data, was not released publicly due to privacy concerns.
Future work could focus on developing robust defenses against VLM-based privacy inferences, potentially through user-side and model provider-side mitigations. |
privacy, vision-language models, personal attribute inference, safety filters, online privacy |
2404.10603
Report |
Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences |
Seungwook Kim, Kejie Li, Xueqing Deng, Yichun Shi, Minsu Cho, Peng Wang |
Leveraging multi-view diffusion models as priors for 3D optimization has
alleviated the problem of 3D consistency, e.g., the Janus face problem or the
content drift problem, in zero-shot text-to-3D models. However, the 3D
geometric fidelity of the output remains an unresolved issue; albeit the
rendered 2D views are realistic, the underlying geometry may contain errors
such as unreasonable concavities. In this work, we propose CorrespondentDream,
an effective method to leverage annotation-free, cross-view correspondences
yielded from the diffusion U-Net to provide additional 3D prior to the NeRF
optimization process. We find that these correspondences are strongly
consistent with human perception, and by adopting it in our loss design, we are
able to produce NeRF models with geometries that are more coherent with common
sense, e.g., more smoothed object surface, yielding higher 3D fidelity. We
demonstrate the efficacy of our approach through various comparative
qualitative results and a solid user study. |
This paper introduces CorrespondentDream, a method that improves the 3D geometric fidelity of text-to-3D generation by leveraging cross-view correspondences from multi-view diffusion models. |
Existing text-to-3D methods often produce models with geometric inconsistencies despite generating realistic 2D views. CorrespondentDream addresses this issue by incorporating 3D geometric priors during the optimization process. |
CorrespondentDream extracts features from a pre-trained multi-view diffusion model and computes cross-view correspondences between adjacent NeRF-rendered views. These correspondences are then used to guide the NeRF optimization process and correct geometric inconsistencies. |
CorrespondentDream effectively removes 3D geometric errors such as unnatural concavities and missing surfaces, as demonstrated through qualitative results.
The method outperforms baseline models in a user study, with participants preferring its output in terms of 3D fidelity and overall quality.
Analysis shows that alternating optimization using both Score Distillation Sampling (SDS) loss and cross-view correspondence loss is crucial for the method's effectiveness. |
The alternating optimization strategy increases the number of optimization iterations, potentially affecting computational efficiency.
The method may struggle with objects containing shiny homogeneous surfaces or repetitive patterns, as it becomes challenging to establish robust correspondences in such cases. |
text-to-3d generation, diffusion models, nerf, cross-view correspondence, 3d geometric fidelity |
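As a rough illustration of the correspondence step described in the CorrespondentDream entry above, the sketch below matches feature vectors from two adjacent rendered views by mutual nearest neighbours under cosine similarity. The matching rule, the function names, and the toy features are assumptions made for illustration; the summary does not state how the paper computes its correspondences.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mutual_nearest_neighbors(feats_a, feats_b):
    """Return (i, j) pairs where feats_a[i] and feats_b[j] pick each other.

    Hypothetical stand-in for cross-view matching: feats_a and feats_b are
    lists of per-location feature vectors from two adjacent views.
    """
    sim = [[cosine_similarity(a, b) for b in feats_b] for a in feats_a]
    # Best match in view B for every location in view A.
    best_b = [max(range(len(feats_b)), key=row.__getitem__) for row in sim]
    # Best match in view A for every location in view B.
    best_a = [
        max(range(len(feats_a)), key=lambda i: sim[i][j])
        for j in range(len(feats_b))
    ]
    # Keep only matches that agree in both directions.
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]


# Toy features (assumed): two locations per view, 2-D descriptors.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.0, 2.0], [3.0, 0.1]]
matches = mutual_nearest_neighbors(view_a, view_b)
```

Keeping only mutual matches discards one-sided, ambiguous pairs, which is one plausible way to cope with the repetitive patterns the entry lists as a limitation.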
2404.10518
Report |
MobileNetV4 -- Universal Models for the Mobile Ecosystem |
Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard |
We present the latest generation of MobileNets, known as MobileNetV4 (MNv4),
featuring universally efficient architecture designs for mobile devices. At its
core, we introduce the Universal Inverted Bottleneck (UIB) search block, a
unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext,
Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant.
Alongside UIB, we present Mobile MQA, an attention block tailored for mobile
accelerators, delivering a significant 39% speedup. An optimized neural
architecture search (NAS) recipe is also introduced which improves MNv4 search
effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe
results in a new suite of MNv4 models that are mostly Pareto optimal across
mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural
Engine and Google Pixel EdgeTPU - a characteristic not found in any other
models tested. Finally, to further boost accuracy, we introduce a novel
distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model
delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just
3.8ms. |
Introduces MobileNetV4 (MNv4), a series of universally efficient architecture designs for mobile devices, featuring the Universal Inverted Bottleneck (UIB) and Mobile MQA, achieving mostly Pareto optimal performance across diverse hardware platforms. |
Efficient on-device neural networks are crucial for fast, real-time, and interactive experiences on mobile devices while addressing privacy concerns by avoiding streaming of private data. |
Develops UIB block unifying prominent micro-architectures and Mobile MQA optimized for mobile accelerators. Employs a refined two-phase NAS approach for efficient architecture search and introduces a novel distillation technique mixing datasets with different augmentations. |
MNv4 models demonstrate mostly Pareto-optimal performance across CPUs, DSPs, GPUs, and specialized accelerators.
MNv4-Conv-M achieves over 50% speedup compared to MobileOne-S4 and FastViT-S12 on EdgeTPUs at a similar accuracy level.
MNv4-Hybrid-L achieves 87% top-1 accuracy on ImageNet-1K, only a 0.5% drop compared to its teacher model, EfficientNet-L2, despite having 39x fewer MACs. |
MNv4-Hybrid models lack compatibility with DSPs.
Future work can explore integrating other state-of-the-art techniques and exploring model scaling for even higher accuracy. |
mobilenet, neural architecture search, model efficiency, on-device ai, computer vision |
2404.10484
Report |
AbsGS: Recovering Fine Details for 3D Gaussian Splatting |
Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou |
The 3D Gaussian Splatting (3D-GS) technique couples 3D Gaussian primitives with
differentiable rasterization to achieve high-quality novel view synthesis
results while providing advanced real-time rendering performance. However, due
to a flaw in its adaptive density control strategy, 3D-GS frequently suffers
from over-reconstruction in intricate scenes containing high-frequency
details, leading to blurry rendered images. The underlying reason for the flaw
remains under-explored. In this work, we present a
comprehensive analysis of the cause of aforementioned artifacts, namely
gradient collision, which prevents large Gaussians in over-reconstructed
regions from splitting. To address this issue, we propose the novel
homodirectional view-space positional gradient as the criterion for
densification. Our strategy efficiently identifies large Gaussians in
over-reconstructed regions, and recovers fine details by splitting. We evaluate
our proposed method on various challenging datasets. The experimental results
indicate that our approach achieves the best rendering quality with reduced or
similar memory consumption. Our method is easy to implement and can be
incorporated into a wide variety of most recent Gaussian Splatting-based
methods. We will open source our codes upon formal publication. Our project
page is available at: https://ty424.github.io/AbsGS.github.io/ |
This paper proposes AbsGS, a novel method to recover fine details in 3D Gaussian Splatting by addressing the issue of over-reconstruction, where large Gaussians inadequately represent high-frequency details, leading to blurry rendering. |
3D Gaussian Splatting (3D-GS) is a powerful technique for novel view synthesis, but its adaptive density control strategy struggles to accurately represent intricate scenes due to over-reconstruction artifacts. |
AbsGS introduces the use of a "homodirectional view-space positional gradient" to guide the densification process. By taking the absolute value of each pixel-wise sub-gradient before summation, this method mitigates "gradient collision," allowing for accurate identification and splitting of large Gaussians in over-reconstructed regions. |
AbsGS consistently outperforms baselines like Mip-NeRF360 and Instant-NGP in novel view synthesis quality, as evidenced by higher SSIM and PSNR and lower LPIPS on datasets like Mip-NeRF 360, Tanks & Temples, and Deep Blending.
The method effectively eliminates large Gaussians in over-reconstructed areas, resulting in sharper details and less blurriness compared to 3D-GS, as visualized through point cloud and ellipsoid representations.
AbsGS achieves superior results with similar or even reduced memory consumption compared to 3D-GS, demonstrating its efficiency in addressing over-reconstruction without relying on a significantly larger number of Gaussians. |
The paper primarily focuses on improving the split operation for densification, leaving the exploration of applying the homodirectional gradient to the clone operation for future work.
Further investigation into the impact of scale threshold and gradient threshold on the performance of AbsGS, particularly in relation to different scene complexities, is warranted. |
novel view synthesis, 3d gaussian splatting, point-based radiance field, 3d reconstruction, densification strategy |
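The gradient-collision idea in the AbsGS entry above can be shown with a small numeric sketch. It compares the norm of the plain sum of per-pixel sub-gradients with the norm obtained after taking absolute values component-wise before summing. The function names, the toy gradients, and the threshold value are assumptions for illustration; the real method accumulates these quantities inside the rasterizer across training views.

```python
import math


def summed_gradient_norm(sub_grads):
    """Norm of the plain sum of per-pixel (gx, gy) sub-gradients."""
    gx = sum(g[0] for g in sub_grads)
    gy = sum(g[1] for g in sub_grads)
    return math.hypot(gx, gy)


def homodirectional_gradient_norm(sub_grads):
    """Norm after taking |.| of each component before summation.

    Opposite-sign contributions can no longer cancel each other.
    """
    gx = sum(abs(g[0]) for g in sub_grads)
    gy = sum(abs(g[1]) for g in sub_grads)
    return math.hypot(gx, gy)


def should_densify(sub_grads, grad_threshold, use_abs=True):
    """Hypothetical densification test against a gradient threshold."""
    norm_fn = homodirectional_gradient_norm if use_abs else summed_gradient_norm
    return norm_fn(sub_grads) > grad_threshold


# A large Gaussian covering many pixels whose sub-gradients point in
# opposing directions (assumed toy values).
colliding = [(0.3, 0.0), (-0.3, 0.0), (0.0, 0.2), (0.0, -0.2)]
threshold = 0.5  # assumed value, not taken from the paper

plain = summed_gradient_norm(colliding)               # contributions cancel
homodir = homodirectional_gradient_norm(colliding)    # contributions add up
```

With the plain sum, this Gaussian would never cross the threshold and so would never be split, even though every covered pixel is asking for a change; the absolute-value variant flags it.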
2404.10441
Report |
1st Place Solution for ICCV 2023 OmniObject3D Challenge: Sparse-View Reconstruction |
Hang Du, Yaping Xue, Weidong Dai, Xuejun Yan, Jingjing Wang |
In this report, we present the 1st place solution for ICCV 2023 OmniObject3D
Challenge: Sparse-View Reconstruction. The challenge aims to evaluate
approaches for novel view synthesis and surface reconstruction using only a few
posed images of each object. We utilize Pixel-NeRF as the basic model, and
apply depth supervision as well as coarse-to-fine positional encoding. The
experiments demonstrate the effectiveness of our approach in improving
sparse-view reconstruction quality. We ranked first in the final test with a
PSNR of 25.44614. |
The paper presents the 1st place solution for the ICCV 2023 OmniObject3D Challenge Track-1 Sparse-View Reconstruction, achieving a PSNR of 25.44614 on the final test set. |
The challenge addresses the difficult task of novel view synthesis and surface reconstruction from a limited number of input images (1-3), which has significant implications for various applications. |
The solution utilizes a Pixel-NeRF model pre-trained on a curated subset of the OmniObject3D dataset and fine-tuned on each test scene. It incorporates depth supervision and coarse-to-fine positional encoding to improve reconstruction quality. |
Training on a representative subset of 48 object categories selected from the OmniObject3D dataset outperforms training on a smaller, less diverse subset.
Depth supervision and coarse-to-fine positional encoding further improve the fidelity of surface reconstruction.
Fine-tuning the pre-trained model on each test scene significantly boosts performance, highlighting the importance of adapting to scene-specific characteristics. |
The study is limited by computational resources, particularly when evaluating different test-time optimization strategies.
Future work could explore more advanced techniques for pre-training and fine-tuning NeRF models, as well as investigate alternative network architectures. |
sparse-view reconstruction, novel view synthesis, nerf, pixel-nerf, omniobject3d challenge |
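The coarse-to-fine positional encoding mentioned in the entry above can be sketched as follows. The summary does not give the schedule used in the report, so this block assumes a commonly used cosine window over frequency bands, controlled by a progress parameter `alpha`; the function names and the window form are illustrative assumptions.

```python
import math


def frequency_weight(k, alpha):
    """Smooth 0-to-1 weight for frequency band k at progress alpha.

    Assumed cosine window: band k is off while alpha <= k and fully on
    once alpha >= k + 1.
    """
    t = min(max(alpha - k, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * t))


def coarse_to_fine_encoding(x, num_freqs, alpha):
    """Encode scalar x with sin/cos bands that are enabled gradually.

    Returns [x, w0*sin, w0*cos, w1*sin, w1*cos, ...].
    """
    feats = [x]
    for k in range(num_freqs):
        w = frequency_weight(k, alpha)
        freq = (2.0 ** k) * math.pi
        feats.append(w * math.sin(freq * x))
        feats.append(w * math.cos(freq * x))
    return feats


# Early in training (small alpha) only low frequencies contribute;
# alpha would be raised towards num_freqs as training progresses.
early = coarse_to_fine_encoding(0.5, num_freqs=4, alpha=1.0)
late = coarse_to_fine_encoding(0.5, num_freqs=4, alpha=4.0)
```

Masking the high-frequency bands early on pushes the model to settle coarse geometry first, which is the usual motivation for this kind of schedule when only a few input views are available.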
2404.10438
Report |
The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement |
Gabriele Trivigno, Carlo Masone, Barbara Caputo, Torsten Sattler |
Pose refinement is an interesting and practically relevant research
direction. Pose refinement can be used to (1) obtain a more accurate pose
estimate from an initial prior (e.g., from retrieval), (2) as pre-processing,
i.e., to provide a better starting point to a more expensive pose estimator,
(3) as post-processing of a more accurate localizer. Existing approaches focus
on learning features / scene representations for the pose refinement task. This
involves training an implicit scene representation or learning features while
optimizing a camera pose-based loss. A natural question is whether training
specific features / representations is truly necessary or whether similar
results can be already achieved with more generic features. In this work, we
present a simple approach that combines pre-trained features with a particle
filter and a renderable representation of the scene. Despite its simplicity, it
achieves state-of-the-art results, demonstrating that one can easily build a
pose refiner without the need for specific training. The code is at
https://github.com/ga1i13o/mcloc_poseref |
This paper introduces a novel pose refinement approach that leverages pre-trained, generic deep features for visual localization, challenging the necessity of specialized features in pose refinement. |
Existing pose refinement techniques often rely on computationally expensive and scene-specific feature learning. This work explores the use of readily available pre-trained features for a more efficient and generalizable solution. |
The method integrates a pre-trained CNN with a particle filter optimizer within a render-and-compare framework. It utilizes a coarse-to-fine strategy, progressively refining pose estimates by comparing rendered views with query images using features from deeper to shallower layers. |
Despite its simplicity, the approach achieves state-of-the-art results on benchmark datasets, demonstrating the effectiveness of generic features for pose refinement.
The method proves robust to rendering domain shifts, indicating its applicability across diverse scene representations.
It can be seamlessly integrated with existing localization pipelines, either as pre-processing for coarse pose estimation or post-processing for refinement, further enhancing their performance. |
The method's performance faces challenges in indoor environments with repetitive, texture-less surfaces, where perceptual similarity cues are limited.
Future work can explore incorporating task-specific fine-tuning during test time to potentially further improve accuracy, capitalizing on the strengths of both generic and specialized features. |
visual localization, pose refinement, deep features, particle filter, render-and-compare |
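The render-and-compare loop with a particle filter, as described in the pose refinement entry above, can be sketched in a few lines. Here `cost_fn` is a stand-in for "render the scene at this pose, extract pre-trained features, and measure their distance to the query image features". The function name, the parameters, the shrinking noise schedule, and the toy cost are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def refine_pose(initial_pose, cost_fn, num_particles=50, num_iters=20,
                init_sigma=1.0, decay=0.8, seed=0):
    """Particle-filter style search for a pose minimising cost_fn.

    The pose is a flat list of floats. The best candidate seen so far is
    always kept, so the returned cost never exceeds the initial cost.
    """
    rng = random.Random(seed)
    best_pose, best_cost = list(initial_pose), cost_fn(initial_pose)
    particles = [list(initial_pose) for _ in range(num_particles)]
    sigma = init_sigma
    for _ in range(num_iters):
        # Perturb every particle with Gaussian noise.
        particles = [[v + rng.gauss(0.0, sigma) for v in p] for p in particles]
        costs = [cost_fn(p) for p in particles]
        i_best = min(range(num_particles), key=costs.__getitem__)
        if costs[i_best] < best_cost:
            best_pose, best_cost = list(particles[i_best]), costs[i_best]
        # Weight particles by cost (lower cost means higher weight), then resample.
        c_min = costs[i_best]
        weights = [math.exp(-(c - c_min)) for c in costs]
        resampled = rng.choices(particles, weights=weights, k=num_particles)
        particles = [list(p) for p in resampled]
        # Shrink the search radius, loosely mirroring a coarse-to-fine schedule.
        sigma *= decay
    return best_pose, best_cost


# Toy cost (assumed): squared distance to a hidden "true" pose.
true_pose = [1.0, -2.0, 0.5]


def toy_cost(pose):
    return sum((a - b) ** 2 for a, b in zip(pose, true_pose))


refined_pose, refined_cost = refine_pose([0.0, 0.0, 0.0], toy_cost)
```

Because the search only needs a scalar cost per candidate pose, any feature extractor can be plugged in without gradients or task-specific training, which matches the claim that generic pre-trained features suffice.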
2404.10394
Report |
Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior |
Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin |
Existing neural rendering-based text-to-3D-portrait generation methods
typically make use of human geometry prior and diffusion models to obtain
guidance. However, relying solely on geometry information introduces issues
such as the Janus problem, over-saturation, and over-smoothing. We present
Portrait3D, a novel neural rendering-based framework with a novel joint
geometry-appearance prior to achieve text-to-3D-portrait generation that
overcomes the aforementioned issues. To accomplish this, we train a 3D portrait
generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable
of producing 360° canonical 3D portraits, serving as a starting point for
the subsequent diffusion-based generation process. To mitigate the "grid-like"
artifact caused by the high-frequency information in the feature-map-based 3D
representation commonly used by most 3D-aware GANs, we integrate a novel
pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D
portraits from text, we first project a randomly generated image aligned with
the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The
resulting latent code is then used to synthesize a pyramid tri-grid. Beginning
with the obtained pyramid tri-grid, we use score distillation sampling to
distill the diffusion model's knowledge into the pyramid tri-grid. Following
that, we utilize the diffusion model to refine the rendered images of the 3D
portrait and then use these refined images as training data to further optimize
the pyramid tri-grid, effectively eliminating issues with unrealistic color and
unnatural artifacts. Our experimental results show that Portrait3D can produce
realistic, high-quality, and canonical 3D portraits that align with the prompt. |
This paper proposes Portrait3D, a text-guided 3D portrait generation framework that leverages 3D-aware GANs to provide robust joint geometry-appearance prior information. |
Existing neural rendering-based text-to-3D-portrait generation methods often lead to issues like inconsistent textures, over-saturation, and over-smoothing due to relying solely on geometry priors. |
The method trains a 3D portrait generator (3DPortraitGAN-Pyramid) with a novel pyramid tri-grid 3D representation to alleviate artifacts. For text-to-3D generation, it first projects a randomly generated image (aligned with the text prompt) to 3DPortraitGAN-Pyramid's latent space. The resulting latent code synthesizes a pyramid tri-grid, which is then refined via score distillation sampling and a diffusion model. |
Generates high-quality, realistic 3D portraits consistent with text prompts.
Successfully mitigates issues like over-saturation and the Janus problem.
Demonstrates superior performance compared to state-of-the-art methods in both qualitative and quantitative evaluations. |
Generated portraits may exhibit distortions if the initial inversion from the 3D portrait generator is not perfectly canonical.
Semantic attributes of the background in the input prompt can sometimes influence the final 3D portrait results. |
3d portrait generation, 3d-aware gans, diffusion models, neural rendering, text-to-3d synthesis |
2404.10342
Report |
Referring Flexible Image Restoration |
Runwei Guan, Rongsheng Hu, Zhuhao Zhou, Tianlang Xue, Ka Lok Man, Jeremy Smith, Eng Gee Lim, Weiping Ding, Yutao Yue |
In reality, images often exhibit multiple degradations, such as rain and fog
at night (triple degradations). However, in many cases, individuals may not
want to remove all degradations, for instance, a blurry lens revealing a
beautiful snowy landscape (double degradations). In such scenarios, people may
only desire to deblur. These situations and requirements shed light on a new
challenge in image restoration, where a model must perceive and remove specific
degradation types specified by human commands in images with multiple
degradations. We term this task Referring Flexible Image Restoration (RFIR). To
address this, we first construct a large-scale synthetic dataset called RFIR,
comprising 153,423 samples with the degraded image, text prompt for specific
degradation removal and restored image. RFIR consists of five basic degradation
types: blur, rain, haze, low light and snow while six main sub-categories are
included for varying degrees of degradation removal. To tackle the challenge,
we propose a novel transformer-based multi-task model named TransRFIR, which
simultaneously perceives degradation types in the degraded image and removes
specific degradation upon text prompt. TransRFIR is based on two devised
attention modules, Multi-Head Agent Self-Attention (MHASA) and Multi-Head Agent
Cross Attention (MHACA), where MHASA and MHACA introduce an agent token and
reach linear complexity, achieving lower computation cost than vanilla
self-attention and cross-attention while obtaining competitive performance. Our
TransRFIR achieves state-of-the-art performances compared with other
counterparts and is proven as an effective architecture for image restoration.
We release our project at https://github.com/GuanRunwei/FIR-CP. |
This paper introduces Referring Flexible Image Restoration (RFIR), a novel task focused on removing specific image degradations based on user-provided text prompts. |
Current image restoration models struggle to selectively remove degradations in images with multiple degradation types. RFIR aims to address this limitation by enabling user-controlled, flexible restoration according to specific preferences. |
The authors create a large-scale synthetic dataset, RFIR, containing images with single, double, and triple degradations along with text prompts specifying which degradation(s) to remove. They also propose TransRFIR, a multi-task transformer-based model, that simultaneously predicts degradation types and performs text-guided image restoration. TransRFIR utilizes novel, computationally efficient attention modules, MHASA and MHACA. |
TransRFIR achieves state-of-the-art performance on the RFIR dataset, outperforming adapted task-agnostic, all-in-one, and text-driven models.
The proposed MHASA and MHACA modules demonstrate both computational efficiency and effectiveness for self-attention and feature fusion, respectively.
The TransRFIR pipeline exhibits good generalization capabilities, achieving competitive results on benchmark datasets for deblurring, deraining, low-light enhancement, and dehazing. |
The current pipeline is primarily designed for U-Net-based architectures and may not be directly applicable to GAN-based or diffusion-based models.
The RFIR dataset is synthetic, potentially limiting its ability to fully represent real-world degradation complexities. |
referring flexible image restoration, multi-modal learning, cross attention, prompt learning, image restoration |
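The linear-complexity claim for agent-token attention in this record can be made concrete with a generic sketch. It shows the general agent-attention idea rather than the exact MHASA or MHACA modules. All function names are hypothetical, the model is single-head, and forming agents by pooling queries is an assumption. A small set of agent tokens first summarizes the keys and values, then each query attends only to the agents, so no n x n attention matrix is ever built.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def attention(q, k, v):
    """Plain scaled dot-product attention on lists of row vectors."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def agent_attention(q, k, v, agents):
    """Two small attentions (n_agents x n, then n x n_agents) replace one n x n."""
    # Step 1: agents gather information from all keys and values.
    agent_values = attention(agents, k, v)
    # Step 2: every query reads from the few agents instead of all keys.
    return attention(q, agents, agent_values)


# Usage: 4 tokens, 2 agents formed by mean-pooling pairs of queries (assumption).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
v = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
agents = [[(a + b) / 2 for a, b in zip(q[i], q[i + 1])] for i in (0, 2)]
out = agent_attention(q, q, v, agents)
```

With a fixed number of agents, the cost grows linearly in the number of tokens n, at the price of routing all information through the agent bottleneck.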
2404.10318
Report |
SRGS: Super-Resolution 3D Gaussian Splatting |
Xiang Feng, Yongbo He, Yubo Wang, Yan Yang, Zhenzhong Kuang, Yu Jun, Jianping Fan, Jiajun Ding |
Recently, 3D Gaussian Splatting (3DGS) has gained popularity as a novel
explicit 3D representation. This approach relies on the representation power of
Gaussian primitives to provide a high-quality rendering. However, primitives
optimized at low resolution inevitably exhibit sparsity and texture deficiency,
posing a challenge for achieving high-resolution novel view synthesis (HRNVS).
To address this problem, we propose Super-Resolution 3D Gaussian Splatting
(SRGS) to perform the optimization in a high-resolution (HR) space. The
sub-pixel constraint is introduced for the increased viewpoints in HR space,
exploiting the sub-pixel cross-view information of the multiple low-resolution
(LR) views. The gradient accumulated from more viewpoints will facilitate the
densification of primitives. Furthermore, a pre-trained 2D super-resolution
model is integrated with the sub-pixel constraint, enabling these dense
primitives to learn faithful texture features. In general, our method focuses
on densification and texture learning to effectively enhance the representation
ability of primitives. Experimentally, our method achieves high rendering
quality on HRNVS only with LR inputs, outperforming state-of-the-art methods on
challenging datasets such as Mip-NeRF 360 and Tanks & Temples. Related codes
will be released upon acceptance. |
Introduces Super-Resolution 3D Gaussian Splatting (SRGS) for high-resolution novel view synthesis (HRNVS) using only low-resolution (LR) inputs. |
Addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods, which struggle to render high-resolution details and textures when trained on LR images. |
Employs a two-pronged strategy: (1) Super-Resolution Gaussian Densification increases the density of Gaussian primitives in HR space through super-splatting and sub-pixel constraints. (2) Texture-Guided Gaussian Learning leverages a pre-trained 2D super-resolution model to guide Gaussian primitives in learning faithful texture features, while sub-pixel constraints ensure spatial consistency. |
Significantly improves rendering quality on HRNVS tasks compared to baseline 3DGS and other state-of-the-art methods.
Achieves high PSNR, SSIM, and LPIPS scores on benchmark datasets, including Synthetic NeRF, Tanks & Temples, and Mip-NeRF 360.
Effectively reconstructs fine-grained details and textures, even with large super-resolution factors (e.g., 4x and 8x). |
Reliance on a 2D super-resolution model, which might introduce limitations based on the model's capabilities.
Future work could explore HRNVS without relying on a 2D super-resolution model. |
3d gaussian splatting, super-resolution, novel view synthesis, texture synthesis, gaussian densification |
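One plausible reading of the sub-pixel constraint in this record is that a high-resolution render, once downsampled, should agree with the observed low-resolution view. The sketch below illustrates that reading only. The average-pooling operator, the L1 loss, and the function names are assumptions, not details confirmed by the summary.

```python
def downsample(img, factor):
    """Average-pool a 2D image (list of rows) by an integer factor."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h // factor):
        row = []
        for j in range(w // factor):
            block = [img[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def subpixel_l1(hr_render, lr_target, factor):
    """Mean absolute error between the pooled HR render and the LR view."""
    lr_pred = downsample(hr_render, factor)
    diffs = [abs(p - t)
             for pred_row, tgt_row in zip(lr_pred, lr_target)
             for p, t in zip(pred_row, tgt_row)]
    return sum(diffs) / len(diffs)


# Usage: a 2x2 HR patch whose mean is 2.0, compared with a 1x1 LR pixel of 1.0.
loss = subpixel_l1([[0.0, 2.0], [2.0, 4.0]], [[1.0]], factor=2)
```

Because each LR pixel constrains a whole block of HR pixels, such a loss alone leaves texture detail underdetermined, which is consistent with the record's use of a pre-trained 2D super-resolution model as additional guidance.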
2404.10267
Report |
OneActor: Consistent Character Generation via Cluster-Conditioned Guidance |
Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang |
Text-to-image diffusion models benefit artists with high-quality image
generation. Yet their stochastic nature prevents artists from creating consistent
images of the same character. Existing methods try to tackle this challenge and
generate consistent content in various ways. However, they either depend on
external data or require expensive tuning of the diffusion model. For this
issue, we argue that a lightweight but intricate guidance is enough to
function. Aiming at this, we lead the way to formalize the objective of
consistent generation, derive a clustering-based score function and propose a
novel paradigm, OneActor. We design a cluster-conditioned model which
incorporates posterior samples to guide the denoising trajectories towards the
target cluster. To overcome the overfitting challenge shared by one-shot tuning
pipelines, we devise auxiliary components to simultaneously augment the tuning
and regulate the inference. This technique is later verified to significantly
enhance the content diversity of generated images. Comprehensive experiments
show that our method outperforms a variety of baselines with satisfactory
character consistency, superior prompt conformity as well as high image
quality. Our method is also at least 4 times faster than tuning-based baselines.
Furthermore, to the best of our knowledge, we are the first to prove that the semantic space has
the same interpolation property as the latent space does. This property can
serve as another promising tool for fine generation control. |
This paper proposes OneActor, a novel one-shot tuning paradigm for consistent character generation in text-to-image diffusion models, achieving faster and more efficient results compared to existing methods. |
Existing methods for consistent character generation in text-to-image synthesis either rely on external data or require time-consuming fine-tuning of the entire diffusion model, limiting their practicality. This paper addresses these limitations. |
The authors formalize consistent generation mathematically, derive a cluster-based score function, and introduce a cluster-conditioned model. They utilize semantic representations of target and auxiliary images to guide the denoising process towards a desired character cluster while maintaining diversity. |
OneActor achieves superior character consistency and prompt conformity compared to baseline methods, establishing a new Pareto front.
The method maintains high image quality and diversity without compromising the original diffusion model's capabilities.
OneActor significantly reduces tuning time, requiring only 3-8 minutes compared to 20-60 minutes for existing methods. |
The study primarily focuses on character-centric generation, and further research is needed to extend its applicability to other domains.
Future work could explore alternative clustering methods or incorporate user feedback for improved control over character generation. |
text-to-image synthesis, diffusion models, consistent character generation, semantic control, one-shot tuning |
2404.10157
Report |
Salient Object-Aware Background Generation using Text-Guided Diffusion Models |
Amir Erfan Eshratifar, Joao V. B. Soares, Kapil Thadani, Shaunak Mishra, Mikhail Kuznetsov, Yueh-Ning Ku, Paloma de Juan |
Generating background scenes for salient objects plays a crucial role across
various domains including creative design and e-commerce, as it enhances the
presentation and context of subjects by integrating them into tailored
environments. Background generation can be framed as a task of text-conditioned
outpainting, where the goal is to extend image content beyond a salient
object's boundaries on a blank background. Although popular diffusion models
for text-guided inpainting can also be used for outpainting by mask inversion,
they are trained to fill in missing parts of an image rather than to place an
object into a scene. Consequently, when used for background creation,
inpainting models frequently extend the salient object's boundaries and thereby
change the object's identity, which is a phenomenon we call "object expansion."
This paper introduces a model for adapting inpainting diffusion models to the
salient object outpainting task using Stable Diffusion and ControlNet
architectures. We present a series of qualitative and quantitative results
across models and datasets, including a newly proposed metric to measure object
expansion that does not require any human labeling. Compared to Stable
Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by
3.6x on average with no degradation in standard visual metrics across multiple
datasets. |
This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task, focusing on generating natural backgrounds while preserving the object's identity and boundaries. |
Salient object outpainting is crucial for applications like e-commerce and design, enabling personalized backgrounds and enhancing visual presentation. |
The proposed model leverages Stable Diffusion and ControlNet architectures. It utilizes a salient object mask as input to ControlNet, guiding the inpainting process to maintain object boundaries and prevent unwanted modifications. |
Reduces object expansion by 3.6x compared to Stable Diffusion 2.0 Inpainting.
Achieves comparable or superior performance on standard visual metrics (FID, LPIPS).
Demonstrates effectiveness across different datasets and types of text prompts. |
Reliance on synthetic captions for some training data may impact prompt alignment.
Background diversity can be further improved. |
image outpainting, salient objects, diffusion models, controlnet, object expansion |
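The record does not define its object-expansion metric, so the sketch below is only an illustrative stand-in, not the paper's definition. It assumes two binary masks are available: the salient object mask of the input, and a mask of the object as it appears in the generated image (for example from a segmentation model). The function name and the normalization by input area are hypothetical.

```python
def object_expansion(mask_in, mask_out):
    """Fraction of newly added object pixels, relative to the input object area.

    mask_in / mask_out: equally sized 2D lists of 0/1 values.
    Returns 0.0 when the object did not grow beyond its original boundary.
    """
    area_in = sum(sum(row) for row in mask_in)
    if area_in == 0:
        raise ValueError("input mask contains no object pixels")
    grown = sum(1
                for row_in, row_out in zip(mask_in, mask_out)
                for i, o in zip(row_in, row_out)
                if o and not i)
    return grown / area_in


# Usage: the object grows by one pixel to the right of a one-pixel object.
before = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
after = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]
expansion = object_expansion(before, after)
```

A measure of this kind needs no human labels as long as the output mask comes from an automatic segmenter, which matches the record's description of the metric's requirements.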
2404.09995
Report |
Taming Latent Diffusion Model for Neural Radiance Field Inpainting |
Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, Hung-Yu Tseng |
Neural Radiance Field (NeRF) is a representation for 3D reconstruction from
multi-view images. Although recent work shows preliminary success in
editing a reconstructed NeRF with a diffusion prior, these methods still struggle to
synthesize reasonable geometry in completely uncovered regions. One major
reason is the high diversity of synthetic contents from the diffusion model,
which hinders the radiance field from converging to a crisp and deterministic
geometry. Moreover, applying latent diffusion models on real data often yields
a textural shift incoherent to the image condition due to auto-encoding errors.
These two problems are further reinforced with the use of pixel-distance
losses. To address these issues, we propose tempering the diffusion model's
stochasticity with per-scene customization and mitigating the textural shift
with masked adversarial training. During the analyses, we also found the
commonly used pixel and perceptual losses are harmful in the NeRF inpainting
task. Through rigorous experiments, our framework yields state-of-the-art NeRF
inpainting results on various real-world scenes. Project page:
https://hubert0527.github.io/MALD-NeRF |
This paper introduces MALD-NeRF, a novel framework for NeRF inpainting that utilizes latent diffusion models and masked adversarial training to generate high-quality novel views with realistic inpainted regions. |
Existing NeRF inpainting methods struggle to synthesize reasonable geometry and textures in completely uncovered regions due to the high diversity of synthetic content from diffusion models and textural inconsistencies with real data. |
The proposed method employs a per-scene customized latent diffusion model for 2D image inpainting and a masked adversarial training scheme during NeRF optimization to address 3D and textural inconsistencies. It also utilizes iterative dataset updates and partial DDIM for improved convergence. |
MALD-NeRF achieves state-of-the-art NeRF inpainting performance on SPIn-NeRF and LLFF datasets, outperforming existing methods in both qualitative and quantitative evaluations.
The per-scene customization effectively guides the latent diffusion model to generate consistent and in-context contents across different viewpoints.
The masked adversarial training scheme proves crucial in enhancing the visual quality and reducing texture discrepancies between the reconstructed and inpainted regions. |
The performance of MALD-NeRF can be unstable due to the inherent stochasticity of adversarial training.
The method may not generalize well to low-shot NeRF reconstructions or scenarios with large inpainting masks.
Future work could explore strategies to further reduce the blurriness of the inpainted textures compared to real-world textures. |
nerf, neural radiance fields, 3d inpainting, latent diffusion model, adversarial training |
2404.09990
Report |
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing |
Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie |
This study introduces HQ-Edit, a high-quality instruction-based image editing
dataset with around 200,000 edits. Unlike prior approaches relying on attribute
guidance or human feedback on building datasets, we devise a scalable data
collection pipeline leveraging advanced foundation models, namely GPT-4V and
DALL-E 3. To ensure its high quality, diverse examples are first collected
online, expanded, and then used to create high-quality diptychs featuring input
and output images with detailed text prompts, followed by precise alignment
ensured through post-processing. In addition, we propose two evaluation
metrics, Alignment and Coherence, to quantitatively assess the quality of image
edit pairs using GPT-4V. HQ-Edit's high-resolution images, rich in detail and
accompanied by comprehensive editing prompts, substantially enhance the
capabilities of existing image editing models. For example, an HQ-Edit
finetuned InstructPix2Pix can attain state-of-the-art image editing
performance, even surpassing those models fine-tuned with human-annotated data.
The project page is https://thefllood.github.io/HQEdit_web. |
Introduces HQ-Edit, a high-quality dataset for instruction-based image editing, featuring ~200,000 edits with high-resolution images and detailed prompts, generated using GPT-4V and DALL-E 3. |
Existing datasets for instruction-based image editing lack high-quality, high-resolution images paired with detailed editing instructions, hindering the training of robust editing models. |
A three-stage pipeline: 1) **Expansion:** Seed image/instruction triplets are expanded using GPT-4. 2) **Generation:** Expanded triplets are refined by GPT-4 into prompts for DALL-E 3 to generate image diptychs. 3) **Post-processing:** Diptychs are split, aligned, and instructions are refined with GPT-4V. |
HQ-Edit significantly outperforms existing datasets in Alignment and Coherence, two proposed metrics for evaluating image-edit instruction alignment.
Fine-tuning InstructPix2Pix on HQ-Edit surpasses models trained on human-annotated datasets, demonstrating its high quality.
The proposed Alignment metric shows stronger correlation with human preference than CLIP Directional Similarity. |
Reliance on DALL-E 3 API limits prompt control and potential diversity.
Future work could explore user-interactive editing and generating edits across multiple images. |
image editing, generative models, dataset, gpt-4v, dall-e 3 |
2404.09977
Report |
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models |
Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M Patel |
Large diffusion-based Text-to-Image (T2I) models have shown impressive
generative powers for text-to-image generation as well as spatially conditioned
image generation. For most applications, we can train the model end-to-end with
paired data to obtain photorealistic generation quality. However, to add an
additional task, one often needs to retrain the model from scratch using paired
data across all modalities to retain good generation performance. In this
paper, we tackle this issue and propose a novel strategy to scale a generative
model across new tasks with minimal compute. During our experiments, we
discovered that the variance maps of intermediate feature maps of diffusion
models capture the intensity of conditioning. Utilizing this prior information,
we propose MaxFusion, an efficient strategy to scale up text-to-image
generation models to accommodate new modality conditions. Specifically, we
combine aligned features of multiple models, hence bringing a compositional
effect. Our fusion strategy can be integrated into off-the-shelf models to
enhance their generative prowess. |
Proposes MaxFusion, a training-free method to scale text-to-image diffusion models for multi-modal generation by leveraging variance maps of intermediate features to fuse information from single-task models. |
Addresses the limitations of retraining diffusion models for multi-modal generation, which is data-intensive and prone to catastrophic forgetting. |
Analyzes variance maps of diffusion models to estimate conditioning intensity and uses this information to fuse features from different models based on correlation and variance. |
Enables zero-shot multi-modal generation by combining information from models trained on separate tasks.
Outperforms single-modal and multi-modal baselines (SPADE, PITI, T2I-Adapter, ControlNet) in qualitative and quantitative evaluations.
Demonstrates scalability beyond two modalities and effectiveness with both spatial and style conditioning. |
Inherits limitations of Stable Diffusion (e.g., difficulty generating hands and faces) and may exhibit discrepancies with semantic maps.
A trade-off between conditioning strength and sampling fidelity arises as the number of modalities increases. |
multimodal generation, diffusion models, text-to-image synthesis, zero-shot learning, feature fusion |
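The variance- and correlation-based fusion described in this record can be sketched per spatial location as follows. This is a simplified illustration under stated assumptions, not the authors' exact rule. Each feature map is represented as a list of per-location channel vectors, cosine similarity stands in for "correlation", and the threshold value and the function names are hypothetical.

```python
import math


def variance(vec):
    mean = sum(vec) / len(vec)
    return sum((x - mean) ** 2 for x in vec) / len(vec)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def fuse(feat_a, feat_b, sim_threshold=0.7):
    """Fuse aligned features from two single-condition models, location by location."""
    fused = []
    for a, b in zip(feat_a, feat_b):
        if cosine(a, b) >= sim_threshold:
            # The two models agree at this location: average them.
            fused.append([(x + y) / 2 for x, y in zip(a, b)])
        elif variance(a) >= variance(b):
            # Higher channel variance is read as stronger conditioning.
            fused.append(list(a))
        else:
            fused.append(list(b))
    return fused


# Usage: three locations covering the agree / pick-A / pick-B cases.
feat_a = [[1.0, 2.0], [5.0, -5.0], [0.0, 0.1]]
feat_b = [[2.0, 4.0], [0.1, 0.2], [3.0, -3.0]]
fused = fuse(feat_a, feat_b)
```

Since the rule involves no learned parameters, it can be applied to off-the-shelf single-condition models without retraining, which is the training-free property the record emphasizes.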
2404.09976
Report |
Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers |
Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel |
Recently, diffusion transformers have gained wide attention for their
excellent performance in text-to-image and text-to-video models, emphasizing
the need for transformers as backbone for diffusion models. Transformer-based
models have shown better generalization capability compared to CNN-based models
for general vision tasks. However, much less has been explored in the existing
literature regarding the capabilities of transformer-based diffusion backbones
and expanding their generative prowess to other datasets. This paper focuses on
enabling a single pre-trained diffusion transformer model to scale across
multiple datasets swiftly, allowing for the completion of diverse generative
tasks using just one model. To this end, we propose DiffScaler, an efficient
scaling strategy for diffusion models where we train a minimal amount of
parameters to adapt to different tasks. In particular, we learn task-specific
transformations at each layer by incorporating the ability to utilize the
learned subspaces of the pre-trained model, as well as the ability to learn
additional task-specific subspaces, which may be absent in the pre-training
dataset. As these parameters are independent, a single diffusion model with
these task-specific parameters can be used to perform multiple tasks
simultaneously. Moreover, we find that transformer-based diffusion models
significantly outperform CNN-based diffusion models while performing
fine-tuning over smaller datasets. We perform experiments on four unconditional
image generation datasets. We show that using our proposed method, a single
pre-trained model can scale up to perform these conditional and unconditional
tasks, respectively, with minimal parameter tuning while performing nearly as well as
fine-tuning an entire diffusion model for that particular task. |
This paper proposes DiffScaler, a novel scaling strategy for diffusion models that enables a single pre-trained model to be adapted to various image generation tasks and datasets with minimal parameter tuning. |
Current diffusion models are typically trained separately for each dataset or task, demanding significant computational resources and potentially leading to catastrophic forgetting when fine-tuned. DiffScaler addresses this by allowing a single model to handle diverse tasks effectively. |
DiffScaler introduces a lightweight module called 'Affiner' to each trainable layer of the diffusion model. Affiner learns task-specific transformations by scaling and shifting weights and biases of existing subspaces while also having the capability to learn additional task-specific subspaces. By training only these Affiner parameters, DiffScaler enables efficient adaptation to new tasks and datasets. |
DiffScaler achieves high-quality image generation across diverse datasets (FFHQ, Oxford Flowers, CUB-200, Caltech-101) and for various conditional generation tasks (using Canny edges, HED, depth, and segmentation maps).
The method demonstrates superior performance compared to existing efficient fine-tuning techniques like DiffFit and LoRA, particularly in high-resolution image generation.
Experiments show that transformer-based diffusion backbones adapt better than CNN-based models to smaller datasets during parameter-efficient fine-tuning. |
The paper primarily focuses on image generation tasks, leaving exploration for other applications as future work.
Potential misuse of the technology for generating harmful content needs careful consideration. |
diffusion models, transformers, parameter efficient finetuning, conditional image generation, unconditional image generation |
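The Affiner idea in this record (rescale and shift the frozen weights and biases, plus a small additional learned subspace) can be sketched for a single linear layer. The exact parameterization is not given in the summary, so the form below is an assumption: per-output-row scales `a_w` and `a_b`, a bias shift, and a low-rank term `U V` standing in for the "additional task-specific subspaces". All names are hypothetical.

```python
def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def adapted_linear(x, W, b, a_w, a_b, shift, U, V):
    """Linear layer y = W_eff x + b_eff with frozen W, b and small trainable extras.

    W_eff[i][j] = a_w[i] * W[i][j] + sum_r U[i][r] * V[r][j]
    b_eff[i]    = a_b[i] * b[i] + shift[i]
    """
    rank = len(V)
    W_eff = [[a_w[i] * W[i][j] + sum(U[i][r] * V[r][j] for r in range(rank))
              for j in range(len(W[0]))]
             for i in range(len(W))]
    b_eff = [a_b[i] * b[i] + shift[i] for i in range(len(b))]
    return [y + bi for y, bi in zip(matvec(W_eff, x), b_eff)]


# Usage: frozen identity layer, adapted with a scale, a shift and a rank-1 term.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
y = adapted_linear([1.0, 2.0], W, b,
                   a_w=[2.0, 1.0], a_b=[1.0, 1.0], shift=[0.5, 0.0],
                   U=[[1.0], [0.0]], V=[[0.0, 1.0]])
```

Initializing the scales to 1 and the shift and `U` to 0 reproduces the frozen layer exactly, so each task can keep its own small parameter set while sharing one pre-trained backbone.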
2404.09884
Report |
Map-Relative Pose Regression for Visual Re-Localization |
Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann |
Pose regression networks predict the camera pose of a query image relative to
a known environment. Within this family of methods, absolute pose regression
(APR) has recently shown promising accuracy in the range of a few centimeters
in position error. APR networks encode the scene geometry implicitly in their
weights. To achieve high accuracy, they require vast amounts of training data
that, realistically, can only be created using novel view synthesis in a
days-long process. This process has to be repeated for each new scene again and
again. We present a new approach to pose regression, map-relative pose
regression (marepo), that satisfies the data hunger of the pose regression
network in a scene-agnostic fashion. We condition the pose regressor on a
scene-specific map representation such that its pose predictions are relative
to the scene map. This allows us to train the pose regressor across hundreds of
scenes to learn the generic relation between a scene-specific map
representation and the camera pose. Our map-relative pose regressor can be
applied to new map representations immediately or after mere minutes of
fine-tuning for the highest accuracy. Our approach outperforms previous pose
regression methods by far on two public datasets, indoor and outdoor. Code is
available: https://nianticlabs.github.io/marepo |
This paper introduces marepo, a novel absolute pose regression (APR) approach for visual relocalization that achieves state-of-the-art accuracy by leveraging a scene-agnostic map-relative pose regressor conditioned on a scene-specific metric map representation. |
Existing APR methods often suffer from low accuracy due to limited training data and struggle to generalize to unseen scenes. marepo addresses these limitations by training a generic pose regressor on a large dataset of scene coordinates, allowing for fast adaptation to new scenes with high accuracy. |
marepo consists of two main components: (1) a scene-specific geometry prediction network that predicts 3D scene coordinates for an input image and (2) a scene-agnostic map-relative pose regressor that takes the predicted coordinates and estimates the camera pose. The pose regressor is trained on a large dataset of scene coordinates and can generalize to new scenes after a short fine-tuning step. |
marepo outperforms previous APR methods on the indoor 7-Scenes dataset and the outdoor Wayspots dataset, achieving accuracy comparable to structure-based methods.
The method exhibits fast mapping times (minutes) compared to traditional APR approaches (hours or days).
The proposed architecture, featuring a transformer-based regressor with dynamic positional encoding, effectively leverages 3D geometric information for accurate and robust pose estimation. |
The reliance on a separate scene-specific coordinate regression network introduces an additional training step, albeit a quick one.
While the scene-agnostic nature of the pose regressor shows strong generalization, its performance may vary depending on the quality of the input scene coordinates. |
visual relocalization, pose regression, scene coordinate regression, transformers, deep learning |
2404.09833
Report |
Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video |
Hongchi Xia, Zhi-Hao Lin, Wei-Chiu Ma, Shenlong Wang |
Creating high-quality and interactive virtual environments, such as games and
simulators, often involves complex and costly manual modeling processes. In
this paper, we present Video2Game, a novel approach that automatically converts
videos of real-world scenes into realistic and interactive game environments.
At the heart of our system are three core components: (i) a neural radiance
fields (NeRF) module that effectively captures the geometry and visual
appearance of the scene; (ii) a mesh module that distills the knowledge from
NeRF for faster rendering; and (iii) a physics module that models the
interactions and physical dynamics among the objects. By following the
carefully designed pipeline, one can construct an interactable and actionable
digital replica of the real world. We benchmark our system on both indoor and
large-scale outdoor scenes. We show that we can not only produce
highly-realistic renderings in real-time, but also build interactive games on
top. |
Video2Game: a novel approach that automatically transforms real-world videos into interactive and realistic game environments. |
Creating realistic and interactive virtual environments is crucial for immersive experiences but traditionally involves complex and costly manual modeling. |
The system uses three core components: 1) a neural radiance fields (NeRF) module to capture scene geometry and appearance; 2) a mesh module for efficient rendering; 3) a physics module to model object interactions. |
The approach produces high-fidelity renderings in real-time, enabling interactive experiences.
The system supports object-level interaction through scene decomposition and rigid-body physics.
The generated environments are compatible with game engines and run smoothly in web browsers. |
The system currently does not model physics-informed relighting, such as simulating objects' metallic properties.
Creating unbounded, relightable scenes from single videos remains an open challenge for future work. |
neural rendering, nerf, video game development, physics simulation, interactive environments |
2404.09732
Report |
Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models |
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön |
Though diffusion models have been successfully applied to various image
restoration (IR) tasks, their performance is sensitive to the choice of
training datasets. Typically, diffusion models trained in specific datasets
fail to recover images that have out-of-distribution degradations. To address
this problem, this work leverages a capable vision-language model and a
synthetic degradation pipeline to learn image restoration in the wild (wild
IR). More specifically, all low-quality images are simulated with a synthetic
degradation pipeline that contains multiple common degradations such as blur,
resize, noise, and JPEG compression. Then we introduce robust training for a
degradation-aware CLIP model to extract enriched image content features to
assist high-quality image restoration. Our base diffusion model is the image
restoration SDE (IR-SDE). Built upon it, we further present a posterior
sampling strategy for fast noise-free image generation. We evaluate our model
on both synthetic and real-world degradation datasets. Moreover, experiments on
the unified image restoration task illustrate that the proposed posterior
sampling improves image generation quality for various degradations. |
This paper introduces a new method for photo-realistic image restoration in the wild using a degradation-aware CLIP model and a synthetic degradation pipeline. |
Existing diffusion models for image restoration are often trained on specific datasets and struggle to generalize to real-world images with complex and unknown degradations. This work aims to improve the robustness and generalization ability of these models. |
The authors propose a new synthetic image degradation pipeline with diverse degradations and a random shuffle strategy. They also introduce a robust degradation-aware CLIP (DACLIP) model that minimizes the embedding distance between low-quality and high-quality image pairs. Additionally, they present an optimal posterior sampling approach for the IR-SDE model to enhance image generation. |
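The degradation pipeline with random shuffling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four operations are simplified stand-ins (a box blur, nearest-neighbor resizing, Gaussian noise, and coarse quantization in place of real JPEG compression), and all function names and parameter ranges are assumptions made for the example.

```python
import numpy as np

def blur(img, rng):
    # 3x3 box blur: average each pixel with its neighbors (edges are replicated).
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    acc = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def resize(img, rng):
    # Nearest-neighbor downsample by 2, then upsample back to the original size.
    h, w = img.shape
    small = img[::2, ::2]
    return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)[:h, :w]

def noise(img, rng):
    # Additive Gaussian noise with a randomly drawn strength (assumed range).
    sigma = rng.uniform(0.01, 0.1)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def quantize(img, rng):
    # Coarse quantization as a crude stand-in for JPEG compression artifacts.
    levels = int(rng.integers(8, 32))
    return np.round(img * levels) / levels

OPS = [blur, resize, noise, quantize]

def degrade(img, rng):
    # Apply every degradation once, in a freshly shuffled order per image.
    order = rng.permutation(len(OPS))
    applied = []
    out = img.astype(float)
    for i in order:
        out = OPS[i](out, rng)
        applied.append(OPS[i].__name__)
    return out, applied

rng = np.random.default_rng(0)
hq = rng.random((9, 8))          # toy grayscale "high-quality" image in [0, 1]
lq, applied = degrade(hq, rng)   # paired low-quality image and the order used
```

Shuffling the order per sample means the same four operations yield many distinct degradation chains, which is the property the summary credits for better generalization.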
The proposed method achieves state-of-the-art performance on both synthetic and real-world image restoration benchmarks.
The introduced degradation pipeline effectively simulates complex real-world degradations, improving model generalization.
The optimal posterior sampling strategy significantly enhances the performance of unified image restoration by improving the efficiency of the reverse diffusion process. |
The model's performance heavily relies on the quality of the synthetic degradation pipeline and its ability to represent real-world degradations.
Further research is needed to explore the use of larger and more powerful vision-language models for improved guidance in image restoration. |
image restoration, diffusion models, vision-language models, clip, synthetic data |
2404.09632
Report |
Bridging Vision and Language Spaces with Assignment Prediction |
Jungin Park, Jiyoung Lee, Kwanghoon Sohn |
This paper introduces VLAP, a novel approach that bridges pretrained vision
models and large language models (LLMs) to make frozen LLMs understand the
visual world. VLAP transforms the embedding space of pretrained vision models
into the LLMs' word embedding space using a single linear layer for efficient
and general-purpose visual and language understanding. Specifically, we harness
well-established word embeddings to bridge two modality embedding spaces. The
visual and text representations are simultaneously assigned to a set of word
embeddings within pretrained LLMs by formulating the assigning procedure as an
optimal transport problem. We predict the assignment of one modality from the
representation of another modality data, enforcing consistent assignments for
paired multimodal data. This allows vision and language representations to
contain the same information, grounding the frozen LLMs' word embedding space
in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved
with visual data since the LLMs interpret and reason linguistic information
from correlations between word embeddings. Experimental results show that VLAP
achieves substantial improvements over the previous linear transformation-based
approaches across a range of vision-language tasks, including image captioning,
visual question answering, and cross-modal retrieval. We also demonstrate the
learned visual representations hold a semantic taxonomy of LLMs, making visual
semantic arithmetic possible. |
This paper introduces VLAP, a novel approach that bridges pretrained vision models and frozen large language models (LLMs) for visual understanding, using a single linear layer to map visual embeddings into the LLMs' word embedding space. |
Bridging the gap between independently pretrained vision and language models is crucial for efficient and general-purpose visual and language understanding without the high cost of training large multimodal models from scratch. |
VLAP utilizes an optimal transport-based assignment prediction objective. It assigns both visual and text representations to a set of word embeddings within pretrained LLMs, enforcing consistent assignments for paired multimodal data. This grounds the LLMs' word embedding space in visual data, allowing the LLMs to interpret visual inputs. |
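To make the assignment-prediction idea concrete, here is a small sketch under stated assumptions. It uses Sinkhorn iterations to compute balanced soft assignments of features to word embeddings, then a swapped cross-entropy in which each modality predicts the other's assignment. The function names, the temperature `tau`, the regularization `eps`, and the swapped-prediction form are illustrative choices; the exact objective in VLAP may differ.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sinkhorn(scores, eps=0.05, n_iters=50):
    # scores: (N, K) similarities between N samples and K word embeddings.
    # Alternately rescale columns and rows so that the K words share the
    # N samples roughly equally; returned rows sum to 1.
    q = np.exp((scores - scores.max()) / eps)
    n, k = q.shape
    for _ in range(n_iters):
        q = q / (q.sum(axis=0, keepdims=True) * k)
        q = q / (q.sum(axis=1, keepdims=True) * n)
    return q * n

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def assignment_prediction_loss(img_feats, txt_feats, word_embs, tau=0.1):
    w = l2_normalize(word_embs)
    s_img = l2_normalize(img_feats) @ w.T
    s_txt = l2_normalize(txt_feats) @ w.T
    q_img, q_txt = sinkhorn(s_img), sinkhorn(s_txt)
    # Swapped prediction: image scores predict the text assignment and vice versa.
    loss_img = -(q_txt * log_softmax(s_img / tau)).sum(axis=1).mean()
    loss_txt = -(q_img * log_softmax(s_txt / tau)).sum(axis=1).mean()
    return loss_img + loss_txt

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 16))                 # stand-in for projected visual features
txt = img + 0.1 * rng.normal(size=(6, 16))     # paired text features (toy data)
words = rng.normal(size=(4, 16))               # stand-in for LLM word embeddings
q = sinkhorn(l2_normalize(img) @ l2_normalize(words).T)
loss = assignment_prediction_loss(img, txt, words)
```

In a real setup, `img_feats` would come from the trainable linear layer on top of the frozen vision encoder, so minimizing the loss pulls visual features toward the word-embedding space.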
VLAP achieves substantial improvements over previous linear transformation-based approaches on image captioning, significantly outperforming methods like LiMBeR.
VLAP demonstrates strong performance on visual question answering, surpassing previous methods in both zero-shot and few-shot settings.
VLAP excels in cross-modal retrieval tasks, achieving competitive results on image-and-text-to-text retrieval and outperforming prior works on text-to-image retrieval. |
While computationally efficient, VLAP still lags behind modular-based methods (e.g., Flamingo, BLIP-2) in terms of performance, potentially due to the limited capacity of a single linear layer and smaller training datasets.
Future work could explore scaling VLAP with modular-based models and larger multimodal datasets to further enhance performance. |
vision-language models, large language models, optimal transport, zero-shot learning, cross-modal retrieval |
2404.09619
Report |
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark |
Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang |
As an alternative to expensive expert evaluation, Image Aesthetic Assessment
(IAA) stands out as a crucial task in computer vision. However, traditional IAA
methods are typically constrained to a single data source or task, restricting
the universality and broader application. In this work, to better align with
human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment
(UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named
UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs
with both visual perception and language ability for IAA and establish a
low-cost paradigm for transforming the existing datasets into unified and
high-quality visual instruction tuning data, from which the UNIAA-LLaVA is
trained. To further evaluate the IAA capability of MLLMs, we construct the
UNIAA-Bench, which consists of three aesthetic levels: Perception, Description,
and Assessment. Extensive experiments validate the effectiveness and
rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all
levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model
performs better than GPT-4V in aesthetic perception and even approaches the
junior-level human. We find MLLMs have great potential in IAA, yet there
remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench
will be released. |
The paper introduces Unified Multi-modal Image Aesthetic Assessment (UNIAA), a framework designed to enhance and evaluate the visual aesthetic capabilities of Multi-modal Large Language Models (MLLMs). |
Existing Image Aesthetic Assessment (IAA) methods are limited to single datasets or tasks, hindering their universality. UNIAA aims to align with human aesthetic processes and integrate diverse aesthetic data for holistic image evaluation. |
The authors propose a novel IAA Datasets Conversion Paradigm (IDCP) to transform existing datasets into MLLM-compatible formats. They introduce UNIAA-LLaVA, an MLLM fine-tuned on converted aesthetic data, and UNIAA-Bench, a benchmark to evaluate aesthetic perception, description, and assessment abilities of MLLMs. |
UNIAA-LLaVA achieves superior performance compared to other MLLMs on UNIAA-Bench across aesthetic perception, description, and assessment tasks.
IDCP effectively converts existing aesthetic datasets, leading to significant improvement in MLLMs' aesthetic capabilities.
Despite progress, MLLMs still lag behind human experts in visual aesthetics, highlighting the need for further research. |
The converted IDCP dataset primarily comprises natural images, limiting the model's generalization to other image types like artistic works or AI-generated content.
Evaluating aesthetic description remains subjective, and while a 5-round GPT-assisted protocol is used, potential hallucinations from GPT might affect evaluation accuracy. |
image aesthetics assessment, multi-modal large language model, instruct tuning, benchmarking, visual aesthetics |
2404.09591
Report |
3D Gaussian Splatting as Markov Chain Monte Carlo |
Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Jeff Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi |
While 3D Gaussian Splatting has recently become popular for neural rendering,
current methods rely on carefully engineered cloning and splitting strategies
for placing Gaussians, which does not always generalize and may lead to
poor-quality renderings. In addition, for real-world scenes, they rely on a
good initial point cloud to perform well. In this work, we rethink 3D Gaussians
as random samples drawn from an underlying probability distribution describing
the physical representation of the scene -- in other words, Markov Chain Monte
Carlo (MCMC) samples. Under this view, we show that the 3D Gaussian updates are
strikingly similar to a Stochastic Gradient Langevin Dynamics (SGLD) update. As
with MCMC, samples are nothing but past visit locations, adding new Gaussians
under our framework can simply be realized without heuristics as placing
Gaussians at existing Gaussian locations. To encourage using fewer Gaussians
for efficiency, we introduce an L1-regularizer on the Gaussians. On various
standard evaluation scenes, we show that our method provides improved rendering
quality, easy control over the number of Gaussians, and robustness to
initialization. |
Reformulates 3D Gaussian Splatting (3DGS) as a Markov Chain Monte Carlo (MCMC) sampling process using Stochastic Gradient Langevin Dynamics (SGLD), removing the reliance on heuristics for Gaussian placement. |
Current 3DGS methods rely on engineered heuristics for Gaussian placement, leading to suboptimal results and requiring careful tuning. This work aims to develop a more principled and robust approach. |
The authors reinterpret Gaussians as samples from a distribution representing the 3D scene. They then reformulate the 3DGS update rule as an SGLD update, enabling a more natural and theoretically grounded exploration of the scene. |
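For readers unfamiliar with SGLD, the generic update is a gradient step plus scaled Gaussian noise. The sketch below applies it to a toy 1D Gaussian target rather than to 3DGS parameters; the target, step size, and function names are assumptions chosen only to show that the iterates behave like samples from the target distribution.

```python
import numpy as np

def grad_energy(x, mu=2.0, sigma=1.0):
    # Gradient of the negative log-density of N(mu, sigma^2), up to a constant.
    return (x - mu) / sigma**2

def sgld_step(x, lr, rng):
    # SGLD: gradient descent on the energy plus noise with variance 2 * lr.
    noise = rng.normal(size=x.shape)
    return x - lr * grad_energy(x) + np.sqrt(2.0 * lr) * noise

rng = np.random.default_rng(0)
x = np.full(1000, -5.0)      # 1000 independent chains, all started far from the mode
for _ in range(2000):
    x = sgld_step(x, lr=0.01, rng=rng)
# After burn-in, the chain states are approximately distributed as N(2, 1).
```

Without the noise term the chains would all collapse to the mode; the noise is what turns optimization into sampling, which is the view the summary applies to Gaussian placement.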
Achieves improved rendering quality compared to conventional 3DGS, especially with random Gaussian initialization.
Demonstrates robustness to initialization, eliminating the need for a good initial point cloud.
Provides easy control over the number of Gaussians used through L1 regularization on opacity and scale. |
The method's performance with a very limited number of Gaussians is not explored.
Future work could investigate extensions to dynamic scenes. |
3d gaussian splatting, neural rendering, markov chain monte carlo, stochastic gradient langevin dynamics, novel view synthesis |
2404.09570
Report |
The revenge of BiSeNet: Efficient Multi-Task Image Segmentation |
Gabriele Rosi, Claudia Cuttano, Niccolò Cavagnero, Giuseppe Averta, Fabio Cermelli |
Recent advancements in image segmentation have focused on enhancing the
efficiency of the models to meet the demands of real-time applications,
especially on edge devices. However, existing research has primarily
concentrated on single-task settings, especially on semantic segmentation,
leading to redundant efforts and specialized architectures for different tasks.
To address this limitation, we propose a novel architecture for efficient
multi-task image segmentation, capable of handling various segmentation tasks
without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which
leverages the efficiency of two-stream semantic segmentation architectures and
extends them into a mask classification framework. Our approach maintains
the efficient spatial and context paths to capture detailed and semantic
information, respectively, while leveraging an efficient transformer-based
segmentation head that computes the binary masks and class probabilities. By
seamlessly supporting multiple tasks, namely semantic and panoptic
segmentation, BiSeNetFormer offers a versatile solution for multi-task
segmentation. We evaluate our approach on popular datasets, Cityscapes and
ADE20K, demonstrating impressive inference speeds while maintaining competitive
accuracy compared to state-of-the-art architectures. Our results indicate that
BiSeNetFormer represents a significant advancement towards fast, efficient, and
multi-task segmentation networks, bridging the gap between model efficiency and
task adaptability. |
The paper proposes BiSeNetFormer, a novel, efficient architecture for multi-task image segmentation that leverages two-stream semantic segmentation and mask classification. |
Existing image segmentation models are either task-specific (limiting their application) or computationally intensive (hindering real-time performance). This work bridges this gap. |
BiSeNetFormer uses a spatial path for detailed information and a context path for semantic information. It then employs a transformer decoder and segmentation head to generate binary masks and class probabilities for each segment. |
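In a mask classification framework, each query yields one binary mask and one class distribution, and a semantic map is obtained by combining the two. The sketch below shows one common way to do this (weighting each mask by its class probabilities and taking the per-pixel argmax); it is an assumption about the inference step, not a description confirmed by this summary, and the function name is hypothetical.

```python
import numpy as np

def semantic_from_masks(class_probs, mask_logits):
    # class_probs: (Q, C + 1) per-query class distribution; the last column
    #              is a "no object" class and is dropped.
    # mask_logits: (Q, H, W) per-query binary mask logits.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))
    # Per-class score at each pixel: sum over queries of p(class) * p(mask).
    scores = np.einsum("qc,qhw->chw", class_probs[:, :-1], mask_probs)
    return scores.argmax(axis=0)

# Two queries, two classes: query 0 covers the left half, query 1 the right half.
class_probs = np.array([[0.90, 0.05, 0.05],
                        [0.10, 0.80, 0.10]])
mask_logits = np.full((2, 2, 4), -10.0)
mask_logits[0, :, :2] = 10.0
mask_logits[1, :, 2:] = 10.0
seg = semantic_from_masks(class_probs, mask_logits)
```

The same per-query outputs can also be grouped into instances for panoptic segmentation, which is why a single head can serve both tasks.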
BiSeNetFormer achieves impressive inference speeds (up to 100 FPS) while maintaining competitive accuracy compared to state-of-the-art models.
It demonstrates strong performance on both semantic and panoptic segmentation tasks on Cityscapes and ADE20K datasets.
The architecture exhibits remarkable adaptability across various hardware, including edge devices like Jetson ORIN. |
While BiSeNetFormer excels in speed, it shows a slight performance drop in panoptic segmentation on ADE20K, warranting further investigation and optimization.
Future work will focus on refining BiSeNetFormer and exploring its application on additional tasks and datasets. |
image segmentation, multi-task learning, efficient architectures, real-time segmentation, mask classification |
2404.09512
Report |
Magic Clothing: Controllable Garment-Driven Image Synthesis |
Weifeng Chen, Tao Gu, Yuhao Xu, Chengcai Chen |
We propose Magic Clothing, a latent diffusion model (LDM)-based network
architecture for an unexplored garment-driven image synthesis task. Aiming at
generating customized characters wearing the target garments with diverse text
prompts, the image controllability is the most critical issue, i.e., to
preserve the garment details and maintain faithfulness to the text prompts. To
this end, we introduce a garment extractor to capture the detailed garment
features, and employ self-attention fusion to incorporate them into the
pretrained LDMs, ensuring that the garment details remain unchanged on the
target character. Then, we leverage the joint classifier-free guidance to
balance the control of garment features and text prompts over the generated
results. Meanwhile, the proposed garment extractor is a plug-in module
applicable to various finetuned LDMs, and it can be combined with other
extensions like ControlNet and IP-Adapter to enhance the diversity and
controllability of the generated characters. Furthermore, we design
Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency
of the target image to the source garment. Extensive experiments demonstrate
that our Magic Clothing achieves state-of-the-art results under various
conditional controls for garment-driven image synthesis. Our source code is
available at https://github.com/ShineChen1024/MagicClothing. |
Presents Magic Clothing, an LDM-based architecture for garment-driven image synthesis, enabling character generation with specific garments and text prompts. |
Addresses the unexplored task of garment-driven image synthesis, crucial for e-commerce and metaverse, with a focus on preserving garment details and text prompt fidelity. |
Introduces a garment extractor to capture detailed features, fused into pretrained LDMs via self-attention. Employs joint classifier-free guidance to balance garment and text control. Proposes MP-LPIPS metric for robust evaluation. |
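Classifier-free guidance with two conditions can be written as a weighted combination of noise predictions. The sketch below shows one widely used composition with separate garment and text scales; the exact formula used by Magic Clothing is not given in this summary, so the decomposition, the argument names, and the scale values are assumptions for illustration only.

```python
import numpy as np

def joint_guidance(eps_uncond, eps_garment, eps_full, s_garment, s_text):
    # eps_uncond:  noise prediction with neither garment nor text condition
    # eps_garment: prediction with the garment condition only
    # eps_full:    prediction with both garment and text conditions
    return (eps_uncond
            + s_garment * (eps_garment - eps_uncond)
            + s_text * (eps_full - eps_garment))

rng = np.random.default_rng(0)
e_u, e_g, e_gt = (rng.normal(size=(4, 4)) for _ in range(3))
guided = joint_guidance(e_u, e_g, e_gt, s_garment=2.5, s_text=7.5)
```

With both scales set to 1 the expression reduces to the fully conditioned prediction, and raising one scale strengthens that condition's influence relative to the other.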
Outperforms state-of-the-art subject-driven methods in preserving garment details and text prompt adherence.
Demonstrates high controllability by seamlessly integrating with various finetuned LDMs and extensions like ControlNet and IP-Adapter.
Proposes a robust metric MP-LPIPS for evaluating garment consistency while mitigating the influence of pose and background. |
Image quality depends on the base diffusion model; stronger pretrained models could improve it.
Limited training data may hinder performance on complex garments, necessitating more comprehensive datasets. |
image synthesis, latent diffusion models, garment-driven, controllable generation, virtual try-on |
2404.09504
Report |
Learning Tracking Representations from Single Point Annotations |
Qiangqiang Wu, Antoni B. Chan |
Existing deep trackers are typically trained with large-scale video frames
with annotated bounding boxes. However, these bounding boxes are expensive and
time-consuming to annotate, in particular for large scale datasets. In this
paper, we propose to learn tracking representations from single point
annotations (i.e., 4.5x faster to annotate than the traditional bounding box)
in a weakly supervised manner. Specifically, we propose a soft contrastive
learning (SoCL) framework that incorporates target objectness prior into
end-to-end contrastive learning. Our SoCL consists of adaptive positive and
negative sample generation, which is memory-efficient and effective for
learning tracking representations. We apply the learned representation of SoCL
to visual tracking and show that our method can 1) achieve better performance
than the fully supervised baseline trained with box annotations under the same
annotation time cost; 2) achieve comparable performance of the fully supervised
baseline by using the same number of training frames and meanwhile reducing
annotation time cost by 78% and total fees by 85%; 3) be robust to annotation
noise. |
This paper proposes SoCL, a soft contrastive learning framework, to learn tracking representations from single point annotations instead of expensive bounding boxes. |
Bounding box annotations are expensive and time-consuming. Point annotations are significantly faster and cheaper to obtain, enabling efficient training of deep trackers. |
SoCL leverages a target objectness prior (TOP) map to generate soft templates and negative samples for contrastive learning. It uses global and local soft templates to represent the target, and generates hard negative samples from the background for discrimination. |
SoCL-Siam, using SoCL representations, achieves performance comparable to the fully supervised baseline (Box-Siam) trained with bounding boxes on GOT-10k, while reducing annotation time by 78%.
Under the same annotation time budget, SoCL-Siam consistently outperforms Box-Siam on various benchmarks.
SoCL-TransT, trained with a hybrid annotation scheme using SoCL, achieves state-of-the-art performance with significantly lower annotation cost compared to other trackers. |
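As a quick consistency check, the 78% annotation-time reduction matches the 4.5x per-annotation speedup quoted in the abstract, assuming the same number of frames is annotated in both settings. This is a back-of-the-envelope calculation, not the authors' derivation, and the function name is hypothetical.

```python
def relative_time_saving(speedup):
    # If one point takes 1/speedup of the time of one box, the fraction of
    # time saved for the same number of annotations is 1 - 1/speedup.
    return 1.0 - 1.0 / speedup

saving = relative_time_saving(4.5)
print(f"{saving:.1%}")  # prints 77.8%
```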
The impact of using a projection head in SoCL for different tracker architectures (Siamese vs. CF) is not fully understood.
Future work could explore more sophisticated methods to generate pseudo bounding boxes from point annotations for improved scale estimation. |
visual tracking, weakly supervised learning, contrastive learning, point annotation, siamese tracker |
2404.09502
Report |
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction |
Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma |
Vision-based perception for autonomous driving requires an explicit modeling
of a 3D space, where 2D latent representations are mapped and subsequent 3D
operators are applied. However, operating on dense latent spaces introduces a
cubic time and space complexity, which limits scalability in terms of
perception range or spatial resolution. Existing approaches compress the dense
representation using projections like Bird's Eye View (BEV) or Tri-Perspective
View (TPV). Although efficient, these projections result in information loss,
especially for tasks like semantic occupancy prediction. To address this, we
propose SparseOcc, an efficient occupancy network inspired by sparse point
cloud processing. It utilizes a lossless sparse latent representation with
three key innovations. Firstly, a 3D sparse diffuser performs latent completion
using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature
pyramid and sparse interpolation enhance scales with information from others.
Finally, the transformer head is redesigned as a sparse variant. SparseOcc
achieves a remarkable 74.9% reduction on FLOPs over the dense baseline.
Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which in
part can be attributed to the sparse representation's ability to avoid
hallucinations on empty voxels. |
This paper proposes SparseOcc, an efficient occupancy network for autonomous driving that leverages a lossless sparse latent representation, inspired by sparse point cloud processing, to reduce computational cost without sacrificing accuracy. |
Operating on dense 3D latent spaces for vision-based perception in autonomous driving is computationally expensive, and existing compression methods like BEV and TPV result in information loss. SparseOcc addresses this by using a sparse representation, enabling efficiency and potentially higher accuracy. |
SparseOcc utilizes a 3D sparse diffuser with spatially decomposed convolutional kernels for efficient latent completion. It incorporates a sparse feature pyramid with interpolation for multi-scale feature enhancement and employs a sparse transformer head for occupancy prediction, focusing on occupied voxels. |
SparseOcc achieves a 74.9% reduction in FLOPs compared to dense baselines on nuScenes-Occupancy.
It outperforms state-of-the-art methods on nuScenes-Occupancy, achieving 21.8% IoU and 14.1% mIoU.
The sparse representation naturally avoids hallucinations on empty voxels, potentially contributing to improved accuracy. |
The improvement in IoU with higher image resolution is limited due to potential hallucinations on empty voxels caused by dense features.
Further investigation is needed to address the hallucination issue and explore the application of SparseOcc in dynamic scenarios. |
autonomous driving, occupancy prediction, sparse representation, 3d vision, deep learning |
2404.09476
Report |
FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining |
Zou Zhen, Yu Hu, Zhao Feng |
Images corrupted by rain streaks often lose vital frequency information for
perception, and image deraining aims to solve this issue, which relies on global
and local degradation modeling. Recent studies have shown the effectiveness
and efficiency of Mamba for perceiving global and local information by
exploiting local correlation among patches. However, few attempts have
been made to extend it with frequency analysis for image deraining,
limiting its ability to perceive global degradation that is relevant to
frequency modeling (e.g., Fourier transform). In this paper, we propose
FreqMamba, an effective and efficient paradigm that leverages the complementarity
between Mamba and frequency analysis for image deraining. The core of our
method lies in extending Mamba with frequency analysis from two perspectives:
extending it with frequency-band for exploiting frequency correlation, and
connecting it with Fourier transform for global degradation modeling.
Specifically, FreqMamba introduces complementary triple interaction structures
including spatial Mamba, frequency band Mamba, and Fourier global modeling.
Frequency band Mamba decomposes the image into sub-bands of different
frequencies to allow 2D scanning from the frequency dimension. Furthermore,
leveraging Mamba's unique data-dependent properties, we use rainy images at
different scales to provide degradation priors to the network, thereby
facilitating efficient training. Extensive experiments show that our method
outperforms state-of-the-art methods both visually and quantitatively. |
Presents FreqMamba, a novel image deraining network that integrates spatial domain sequence modeling with frequency domain global modeling through a unique Frequency-SSM block and a multi-scale degradation prior attention mechanism. |
Image deraining is crucial for improving visual quality and the performance of computer vision tasks, and existing methods often struggle to effectively handle both global and local degradation caused by rain. |
FreqMamba employs a three-branch architecture: spatial Mamba for local detail extraction, frequency band Mamba for bridging spatial and frequency domains, and Fourier modeling for global degradation correction. It also uses Mamba's data-dependent property to generate attention maps from multi-scale input for guiding degradation-aware training. |
FreqMamba achieves state-of-the-art performance on benchmark datasets, outperforming existing methods in terms of both PSNR and SSIM.
The method effectively removes rain streaks while preserving scene details and fidelity, as demonstrated by visual comparisons.
FreqMamba demonstrates versatility and strong generalization ability by extending to other image restoration tasks like low-light image enhancement and real-world image dehazing. |
The current model primarily focuses on single-image deraining and could be extended to video deraining in future work.
Exploring alternative frequency analysis techniques beyond Fourier and wavelet transforms might further enhance performance. |
image deraining, frequency analysis, state space model, deep learning, computer vision |
2404.09469
Report |
Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? |
Dmitry Ignatov, Andrey Ignatov, Radu Timofte |
We present ANYU, a new virtually augmented version of the NYU depth v2
dataset, designed for monocular depth estimation. In contrast to the well-known
approach where full 3D scenes of a virtual world are utilized to generate
artificial datasets, ANYU was created by incorporating RGB-D representations of
virtual reality objects into the original NYU depth v2 images. We specifically
did not match each generated virtual object with an appropriate texture and a
suitable location within the real-world image. Instead, an assignment of
texture, location, lighting, and other rendering parameters was randomized to
maximize a diversity of the training data, and to show that it is randomness
that can improve the generalizing ability of a dataset. By conducting extensive
experiments with our virtually modified dataset and validating on the original
NYU depth v2 and iBims-1 benchmarks, we show that ANYU improves the monocular
depth estimation performance and generalization of deep neural networks with
considerably different architectures, especially for the current
state-of-the-art VPD model. To the best of our knowledge, this is the first
work that augments a real-world dataset with randomly generated virtual 3D
objects for monocular depth estimation. We make our ANYU dataset publicly
available in two training configurations with 10% and 100% additional
synthetically enriched RGB-D pairs of training images, respectively, for
efficient training and empirical exploration of virtual augmentation at
https://github.com/ABrain-One/ANYU |
This paper introduces ANYU, a virtually augmented version of the NYU Depth V2 dataset designed for monocular depth estimation. ANYU is created by incorporating randomly generated virtual 3D objects into real-world images, enhancing training data diversity without relying solely on full virtual scenes. |
The standard NYU Depth V2 dataset, despite its popularity, suffers from depth map inaccuracies and limited training data diversity. ANYU aims to address these limitations by introducing virtual objects, leading to more accurate depth values and improved model generalization. |
ANYU leverages a game engine to generate virtual 3D objects. These objects are randomly assigned textures, placed randomly within real NYU Depth V2 images, and rendered with varying lighting and shadow parameters, maximizing data diversity. |
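Inserting a rendered virtual object into a real RGB-D pair amounts to overwriting color and depth wherever the object is present. The sketch below is a simplified assumption of how that compositing could look: it adds a per-pixel depth test so the object is only drawn where it is closer than the real scene, and it picks a random location. Whether ANYU applies such an occlusion test is not stated in the summary, and all names here are hypothetical.

```python
import numpy as np

def random_placement(rng, scene_shape, obj_shape):
    # Uniformly random top-left corner such that the object fits in the image.
    top = int(rng.integers(0, scene_shape[0] - obj_shape[0] + 1))
    left = int(rng.integers(0, scene_shape[1] - obj_shape[1] + 1))
    return top, left

def insert_object(rgb, depth, obj_rgb, obj_depth, obj_mask, top, left):
    # rgb: (H, W, 3), depth: (H, W); obj_*: (h, w[, 3]) rendered object layers.
    out_rgb, out_depth = rgb.copy(), depth.copy()
    h, w = obj_mask.shape
    region_rgb = out_rgb[top:top + h, left:left + w]      # views into the copies
    region_depth = out_depth[top:top + h, left:left + w]
    # Draw the object only where it exists and is closer than the real scene.
    visible = obj_mask & (obj_depth < region_depth)
    region_rgb[visible] = obj_rgb[visible]
    region_depth[visible] = obj_depth[visible]
    return out_rgb, out_depth

rng = np.random.default_rng(0)
rgb = np.zeros((4, 4, 3))
depth = np.full((4, 4), 2.0)                # real scene 2 m away everywhere
obj_rgb = np.ones((2, 2, 3))
obj_depth = np.array([[1.0, 1.0],
                      [1.0, 3.0]])          # one object pixel lies behind the scene
obj_mask = np.ones((2, 2), dtype=bool)
new_rgb, new_depth = insert_object(rgb, depth, obj_rgb, obj_depth, obj_mask, 1, 1)
top, left = random_placement(rng, depth.shape, obj_mask.shape)
```

Because the object's depth comes from the renderer, the inserted pixels carry exact depth values, which fits the summary's point about more accurate depth in the augmented data.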
Augmenting the NYU Depth V2 dataset with ANYU consistently improves depth prediction metrics for different model architectures, including the state-of-the-art VPD model.
The benefits of ANYU are particularly pronounced when training data is limited, highlighting the importance of diversity in smaller datasets.
Models trained on ANYU exhibit improved generalization, as demonstrated by cross-dataset validation on the iBims-1 benchmark. |
The rendering quality of virtual objects might not fully match real-world objects, potentially limiting performance gains at very high augmentation levels.
Future work could explore more sophisticated methods for integrating virtual objects, such as aligning them with the scene context or using more realistic rendering techniques. |
monocular depth estimation, data augmentation, virtual reality, dataset, nyu depth v2 |
2404.09465
Report |
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI |
Yandan Yang, Baoxiong Jia, Peiyuan Zhi, Siyuan Huang |
With recent developments in Embodied Artificial Intelligence (EAI) research,
there has been a growing demand for high-quality, large-scale interactive scene
generation. While prior methods in scene synthesis have prioritized the
naturalness and realism of the generated scenes, the physical plausibility and
interactivity of scenes have been largely left unexplored. To address this
disparity, we introduce PhyScene, a novel method dedicated to generating
interactive 3D scenes characterized by realistic layouts, articulated objects,
and rich physical interactivity tailored for embodied agents. Based on a
conditional diffusion model for capturing scene layouts, we devise novel
physics- and interactivity-based guidance mechanisms that integrate constraints
from object collision, room layout, and object reachability. Through extensive
experiments, we demonstrate that PhyScene effectively leverages these guidance
functions for physically interactable scene synthesis, outperforming existing
state-of-the-art scene synthesis methods by a large margin. Our findings
suggest that the scenes generated by PhyScene hold considerable potential for
facilitating diverse skill acquisition among agents within interactive
environments, thereby catalyzing further advancements in embodied AI research.
Project website: http://physcene.github.io. |
Introduces PhyScene, a guided diffusion model for creating interactive 3D scenes with realistic layouts, articulated objects, and strong adherence to physical constraints. |
Addresses the growing need in Embodied AI for large-scale, physically plausible, and interactive 3D scenes that go beyond visual realism, enabling agents to learn diverse skills within simulated environments. |
Leverages a conditional diffusion model guided by three novel functions ensuring: 1) collision avoidance between objects, 2) adherence to room layouts, and 3) object reachability for embodied agents. |
Achieves state-of-the-art results on traditional scene synthesis metrics (FID, KID, etc.) while significantly improving physical plausibility.
Significantly outperforms existing methods in generating interactive scenes with reduced object collisions and improved object reachability.
Demonstrates the ability to effectively integrate articulated objects into scenes, further enhancing interactivity. |
Currently restricted to a small number of room types due to data constraints.
Lacks the inclusion of small objects, posing challenges for tasks involving fine-grained manipulation. |
scene synthesis, embodied ai, diffusion models, physical plausibility, interactive environments |
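The collision-avoidance guidance named in the PhyScene method summary above can be illustrated with a toy cost and its gradient. This is a sketch under stated assumptions, not the paper's guidance function: objects are reduced to 2D axis-aligned boxes `(cx, cy, w, h)`, the cost is plain pairwise overlap area, and the gradient is applied directly to the layout, whereas a guided diffusion model would use it to steer each denoising step. All function names are hypothetical.

```python
from itertools import combinations


def overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (cx, cy, w, h)."""
    ox = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    oy = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    return max(0.0, ox) * max(0.0, oy)


def collision_cost(boxes):
    """Sum of overlap areas over all object pairs; zero when nothing collides."""
    return sum(overlap_area(a, b) for a, b in combinations(boxes, 2))


def guidance_step(boxes, lr=0.1, eps=1e-4):
    """One gradient-descent step on box centres via central finite differences."""
    updated = []
    for i, box in enumerate(boxes):
        new_box = list(box)
        for axis in (0, 1):  # only the centre coordinates are moved
            plus = [list(b) for b in boxes]
            minus = [list(b) for b in boxes]
            plus[i][axis] += eps
            minus[i][axis] -= eps
            grad = (collision_cost(plus) - collision_cost(minus)) / (2 * eps)
            new_box[axis] -= lr * grad
        updated.append(tuple(new_box))
    return updated
```

Calling `guidance_step` repeatedly pushes overlapping boxes apart until the cost reaches zero. Room-layout and reachability constraints could be added as further cost terms in the same way.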
2404.09458
Report |
CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting |
Xiangrui Liu, Xinju Wu, Pingping Zhang, Shiqi Wang, Zhu Li, Sam Kwong |
Gaussian splatting, renowned for its exceptional rendering quality and
efficiency, has emerged as a prominent technique in 3D scene representation.
However, the substantial data volume of Gaussian splatting impedes its
practical utility in real-world applications. Herein, we propose an efficient
3D scene representation, named Compressed Gaussian Splatting (CompGS), which
harnesses compact Gaussian primitives for faithful 3D scene modeling with a
remarkably reduced data size. To ensure the compactness of Gaussian primitives,
we devise a hybrid primitive structure that captures predictive relationships
between each other. Then, we exploit a small set of anchor primitives for
prediction, allowing the majority of primitives to be encapsulated into highly
compact residual forms. Moreover, we develop a rate-constrained optimization
scheme to eliminate redundancies within such hybrid primitives, steering our
CompGS towards an optimal trade-off between bitrate consumption and
representation efficacy. Experimental results show that the proposed CompGS
significantly outperforms existing methods, achieving superior compactness in
3D scene representation without compromising model accuracy and rendering
quality. Our code will be released on GitHub for further research. |
This paper introduces CompGS, a novel 3D scene representation method using compressed Gaussian splatting, achieving efficient representation with significantly reduced data size. |
Gaussian splatting suffers from large data volumes, hindering its practicality. Existing compression methods overlook inherent redundancies in Gaussian primitives, leading to suboptimal compression efficiency. |
CompGS leverages a hybrid primitive structure with anchor primitives to predict coupled primitives' attributes, enabling compact residual representations. It also employs rate-constrained optimization, minimizing rendering distortion and bitrate costs for optimal compactness. |
CompGS achieves up to 110x compression ratio on popular datasets without sacrificing rendering quality.
The hybrid primitive structure significantly reduces bitstream size by exploiting inter-primitive redundancies.
Rate-constrained optimization further enhances compactness by learning efficient primitive representations. |
Training time is slightly longer than some existing methods due to joint optimization of primitives and neural networks.
The impact of varying the number of coupled primitives per anchor on different scenes requires further investigation. |
3d scene representation, gaussian splatting, compression, rate-distortion optimization, hybrid primitive structure |
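The anchor-plus-residual idea in the CompGS method summary above can be sketched as simple predictive coding. This is only an illustration under stated assumptions: the prediction here is the anchor's own attribute vector, whereas the summary indicates that learned networks are optimized jointly with the primitives, and the uniform quantization step `step` stands in for the rate-constrained optimization. The function names are hypothetical.

```python
def encode_residuals(anchor, coupled, step):
    """Quantize each coupled primitive's offset from its anchor to integers."""
    return [[round((c - a) / step) for a, c in zip(anchor, prim)] for prim in coupled]


def decode_primitives(anchor, residuals, step):
    """Rebuild coupled primitives from the anchor and the integer residuals."""
    return [[a + q * step for a, q in zip(anchor, res)] for res in residuals]
```

When coupled primitives sit close to their anchor, the residuals are small integers that cost fewer bits than full attributes. A larger `step` lowers the bitrate further at the price of reconstruction error, which is the trade-off that rate-constrained optimization balances.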
2404.09447
Report |
kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies |
Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr |
Rapid advancements in continual segmentation have yet to bridge the gap of
scaling to large continually expanding vocabularies under compute-constrained
scenarios. We discover that traditional continual training leads to
catastrophic forgetting under compute constraints, unable to outperform
zero-shot segmentation methods. We introduce a novel strategy for semantic and
panoptic segmentation with zero forgetting, capable of adapting to continually
growing vocabularies without the need for retraining or large memory costs. Our
training-free approach, kNN-CLIP, leverages a database of instance embeddings
to enable open-vocabulary segmentation approaches to continually expand their
vocabulary on any given domain with a single pass through the data, while only
storing embeddings, minimizing both compute and memory costs. This method
achieves state-of-the-art mIoU performance across large-vocabulary semantic and
panoptic segmentation datasets. We hope kNN-CLIP represents a step forward in
enabling more efficient and adaptable continual segmentation, paving the way
for advances in real-world large-vocabulary continual segmentation methods. |
This paper introduces kNN-CLIP, a training-free method for continual vocabulary expansion in semantic and panoptic segmentation. It utilizes a retrieval database of instance embeddings, enabling the adaptation to new concepts without retraining or large memory costs. |
Existing continual segmentation methods struggle to scale to large vocabularies and often suffer from catastrophic forgetting under compute constraints, limiting their practical applicability. |
kNN-CLIP leverages a database of instance embeddings generated using a pre-trained DINOv2 model. At inference, query mask embeddings are matched against the database, and retrieved information augments the base model's predictions, enhancing performance on novel concepts. |
kNN-CLIP achieves state-of-the-art mIoU performance across large-vocabulary semantic segmentation datasets (A-847, PC-459, A-150).
The method demonstrates significant improvements in panoptic segmentation on ADE20K and COCO Panoptic datasets.
kNN-CLIP effectively addresses catastrophic forgetting, outperforming traditional continual learning approaches while maintaining efficiency. |
The reliance on brute-force kNN search can lead to slower inference times for large databases.
Future work can explore approximate nearest neighbor search methods to balance speed and accuracy. |
continual learning, open-vocabulary segmentation, semantic segmentation, panoptic segmentation, image retrieval |
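The retrieval step in the kNN-CLIP method summary above can be sketched as a brute-force nearest-neighbour lookup over stored embeddings. This is a minimal illustration, not the authors' code: the class `EmbeddingDatabase`, the majority-vote rule, and the plain-list embeddings are assumptions, and the sketch omits how the retrieved labels are combined with the base model's predictions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class EmbeddingDatabase:
    """Stores (embedding, label) pairs; new concepts are added without training."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, label):
        self.entries.append((list(embedding), label))

    def query(self, embedding, k=3):
        """Return the majority label among the k most similar stored embeddings."""
        ranked = sorted(self.entries, key=lambda e: cosine(embedding, e[0]), reverse=True)
        top = ranked[:k]
        if not top:
            return None
        votes = Counter(label for _, label in top)
        return votes.most_common(1)[0][0]
```

Expanding the vocabulary is a single `add` call per instance, which is why no earlier concept is forgotten. The linear scan in `query` also shows why the limitations mention slower inference on large databases.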
2404.09426
Report |
ViFu: Multiple 360° Objects Reconstruction with Clean Background via Visible Part Fusion |
Tianhan Xu, Takuya Ikeda, Koichi Nishiwaki |
In this paper, we propose a method to segment and recover a static, clean
background and multiple 360° objects from observations of scenes at
different timestamps. Recent works have used neural radiance fields to model 3D
scenes and improved the quality of novel view synthesis, while few studies have
focused on modeling the invisible or occluded parts of the training images.
These under-reconstructed parts constrain both scene editing and rendering
view selection, thereby limiting their utility for synthetic data generation
for downstream tasks. Our basic idea is that, by observing the same set of
objects in various arrangements, parts that are invisible in one scene
may become visible in others. By fusing the visible parts from each scene,
occlusion-free rendering of both background and foreground objects can be
achieved.
We decompose the multi-scene fusion task into two main components: (1)
objects/background segmentation and alignment, where we leverage point
cloud-based methods tailored to our novel problem formulation; (2) radiance
fields fusion, where we introduce a visibility field to quantify the visible
information of radiance fields, and propose visibility-aware rendering for the
fusion of a series of scenes, ultimately obtaining clean background and
360° object rendering. Comprehensive experiments were conducted on
synthetic and real datasets, and the results demonstrate the effectiveness of
our method. |
This paper presents ViFu, a method to recover clean backgrounds and 360° foreground objects from multi-timestamp scene observations, addressing the issue of unseen part reconstruction in NeRF. |
This is important for generating high-quality synthetic data for robot learning tasks, such as pose estimation and object detection, as it allows for diverse object placement and occlusion-free rendering. |
The method uses point cloud registration for object/background alignment, introduces a novel "visibility field" to quantify visibility in radiance fields, and proposes "visibility-aware rendering" for fusing visible parts from different scenes. |
ViFu can automatically segment backgrounds and recover clean, 360° renderings of multiple objects.
The proposed visibility field effectively quantifies visibility in 3D scenes.
Experiments on synthetic and real datasets demonstrate the effectiveness of the method. |
The method doesn't explicitly consider lighting conditions, which may affect rendering quality under extreme lighting.
It relies on accurate scene segmentation and pose alignment, which may be challenging for closely placed objects or simple shapes. |
neural radiance fields, 3d reconstruction, scene segmentation, visibility field, synthetic data generation |
2404.09412
Report |
DeferredGS: Decoupled and Editable Gaussian Splatting with Deferred Shading |
Tong Wu, Jia-Mu Sun, Yu-Kun Lai, Yuewen Ma, Leif Kobbelt, Lin Gao |
Reconstructing and editing 3D objects and scenes both play crucial roles in
computer graphics and computer vision. Neural radiance fields (NeRFs) can
achieve realistic reconstruction and editing results but suffer from
inefficiency in rendering. Gaussian splatting significantly accelerates
rendering by rasterizing Gaussian ellipsoids. However, Gaussian splatting
utilizes a single Spherical Harmonic (SH) function to model both texture and
lighting, limiting independent editing capabilities of these components.
Recently, attempts have been made to decouple texture and lighting with the
Gaussian splatting representation but may fail to produce plausible geometry
and decomposition results on reflective scenes. Additionally, the forward
shading technique they employ introduces noticeable blending artifacts during
relighting, as the geometry attributes of Gaussians are optimized under the
original illumination and may not be suitable for novel lighting conditions. To
address these issues, we introduce DeferredGS, a method for decoupling and
editing the Gaussian splatting representation using deferred shading. To
achieve successful decoupling, we model the illumination with a learnable
environment map and define additional attributes such as texture parameters and
normal direction on Gaussians, where the normal is distilled from a jointly
trained signed distance function. More importantly, we apply deferred shading,
resulting in more realistic relighting effects compared to previous methods.
Both qualitative and quantitative experiments demonstrate the superior
performance of DeferredGS in novel view synthesis and editing tasks. |
DeferredGS is a novel method that introduces a decoupled and editable Gaussian Splatting representation using deferred shading, enabling independent editing of geometry, texture, and lighting. |
Existing Gaussian Splatting methods struggle with independent texture and lighting editing and suffer from blending artifacts during relighting.
DeferredGS addresses these limitations, enhancing editing capabilities and relighting quality. |
The method uses a normal distillation module to enhance geometry reconstruction by leveraging an SDF network. It employs deferred shading for realistic relighting effects, rasterizing geometry and texture buffers before shading calculation at the pixel level. |
DeferredGS shows superior novel view synthesis quality compared to previous methods, particularly on challenging scenes with reflections.
It enables more faithful decomposition of geometry, texture, and lighting, evident in the high-quality normal maps and diffuse albedo estimations.
DeferredGS demonstrates improved relighting quality by mitigating blending artifacts common in previous Gaussian Splatting methods that use forward shading. |
The method exhibits limitations in scenes with strong shadows, where shadows might be baked into the texture.
Texture editing can introduce noise due to the global nature of Gaussian Splatting representation. |
gaussian splatting, inverse rendering, deferred shading, 3d scene reconstruction, scene editing |
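The deferred shading step in the DeferredGS method summary above can be sketched for a single pixel: per-Gaussian attributes are first blended into a buffer entry, and shading runs once on the blended result. This is a minimal sketch under stated assumptions, not the paper's renderer: it uses a Lambertian term with one directional light in place of the learnable environment map, and the fragment dictionaries and function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length (zero vector stays zero)."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n else (0.0, 0.0, 0.0)


def blend_attributes(fragments):
    """Alpha-composite albedo and normal front to back into one buffer entry."""
    albedo = [0.0, 0.0, 0.0]
    normal = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for frag in fragments:  # fragments are assumed sorted near to far
        weight = transmittance * frag["alpha"]
        for c in range(3):
            albedo[c] += weight * frag["albedo"][c]
            normal[c] += weight * frag["normal"][c]
        transmittance *= 1.0 - frag["alpha"]
    return tuple(albedo), normalize(normal)


def shade(albedo, normal, light_dir):
    """Lambertian shading of one pixel from its blended attributes."""
    light = normalize(light_dir)
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light)))
    return tuple(a * n_dot_l for a in albedo)


def deferred_shade_pixel(fragments, light_dir):
    """Blend first, then shade once per pixel (the deferred order)."""
    albedo, normal = blend_attributes(fragments)
    return shade(albedo, normal, light_dir)
```

Forward shading would instead call `shade` on every fragment and blend the resulting colours. Relighting in the deferred order only changes `light_dir` in the final pass, so the blended geometry and texture buffers are reused as they are.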
2404.09401
Report |
Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models |
Peifei Zhu, Tsubasa Takahashi, Hirokatsu Kataoka |
Diffusion Models (DMs) have shown remarkable capabilities in various
image-generation tasks. However, there are growing concerns that DMs could be
used to imitate unauthorized creations and thus raise copyright issues. To
address this issue, we propose a novel framework that embeds personal
watermarks in the generation of adversarial examples. Such examples can force
DMs to generate images with visible watermarks and prevent DMs from imitating
unauthorized images. We construct a generator based on conditional adversarial
networks and design three losses (adversarial loss, GAN loss, and perturbation
loss) to generate adversarial examples that have subtle perturbation but can
effectively attack DMs to prevent copyright violations. Training a generator
for a personal watermark with our method requires only 5-10 samples and 2-3
minutes, and once the generator is trained, it can generate adversarial
examples with that watermark very quickly (0.2s per image). We conduct
extensive experiments in various conditional image-generation scenarios.
Compared to existing methods that generate images with chaotic textures, our
method adds visible watermarks on the generated images, which is a more
straightforward way to indicate copyright violations. We also observe that our
adversarial examples exhibit good transferability across unknown generative
models. Therefore, this work provides a simple yet powerful way to protect
copyright from DM-based imitation. |
This paper introduces a novel method for embedding personal watermarks into adversarial examples to prevent copyright infringement by diffusion models (DMs). |
The widespread use of DMs raises concerns about copyright violations as they can be used to imitate unauthorized creations, potentially leading to illegal revenue generation. |
The authors propose a conditional GAN architecture with a generator, discriminator, and a target DM. They design three losses: a GAN loss for image quality, a perturbation loss to control perturbation visibility, and an adversarial loss to target the latent space of LDMs for improved attack transferability. |
The method effectively embeds visible watermarks in images generated by DMs, hindering unauthorized imitation and providing a clear indication of copyright violation.
The generation process is significantly faster (0.2s per image) than existing iterative optimization methods.
The generated adversarial examples exhibit good transferability across various DMs and image generation scenarios, including textual inversion and DreamBooth. |
There is a trade-off between watermark visibility and adversarial example quality.
Further investigation is needed on the effectiveness against more advanced defenses. |
copyright protection, diffusion models, adversarial examples, watermarking, generative models |
2404.09326
Report |
Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers |
Diana-Nicoleta Grigore, Mariana-Iuliana Georgescu, Jon Alvarez Justo, Tor Johansen, Andreea Iuliana Ionescu, Radu Tudor Ionescu |
Few-shot knowledge distillation recently emerged as a viable approach to
harness the knowledge of large-scale pre-trained models, using limited data and
computational resources. In this paper, we propose a novel few-shot feature
distillation approach for vision transformers. Our approach is based on two key
steps. Leveraging the fact that vision transformers have a consistent
depth-wise structure, we first copy the weights from intermittent layers of
existing pre-trained vision transformers (teachers) into shallower
architectures (students), where the intermittence factor controls the
complexity of the student transformer with respect to its teacher. Next, we
employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge
into the student in a few-shot scenario, aiming to recover the information
processing carried out by the skipped teacher layers. We present comprehensive
experiments with supervised and self-supervised transformers as teachers, on
five data sets from various domains, including natural, medical and satellite
images. The empirical results confirm the superiority of our approach over
competitive baselines. Moreover, the ablation results demonstrate the
usefulness of each component of the proposed pipeline. |
This paper introduces WeCoLoRA, a novel few-shot unsupervised feature distillation method for vision transformers, which combines intermittent weight copying from a teacher model with an enhanced low-rank adaptation (LoRA) approach. |
Training large-scale vision transformers demands extensive computational resources and data. WeCoLoRA addresses this by enabling efficient learning from limited data, making it valuable for resource-constrained environments and domains with data scarcity. |
WeCoLoRA operates in two steps: 1) intermittently copying weights from a pre-trained teacher transformer to a smaller student, 2) using an enhanced LoRA, applied to all components of the transformer block, to distill knowledge from the teacher to the student in a few-shot setting. |
WeCoLoRA outperforms state-of-the-art few-shot knowledge distillation methods, including DeiT and DMAE, achieving significantly higher accuracy on benchmarks like ImageNet.
The method demonstrates robustness across different compression ratios, teacher models (both supervised and self-supervised), and varying sizes of training data.
Visualization of the learned feature space reveals that WeCoLoRA produces more discriminative and robust embeddings compared to baseline approaches. |
The current design of WeCoLoRA, specifically the weight copying mechanism, limits its applicability to architectures with consistent configurations across layers, like transformers and ResNets.
Future work will focus on generalizing the weight copying mechanism using adaptor blocks to extend the method’s compatibility with a wider range of model architectures. |
knowledge distillation, low rank adaptation, vision transformers, few-shot learning, unsupervised learning |
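The two steps in the WeCoLoRA method summary above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the student is assumed to keep every `factor`-th teacher layer starting from the first, and the adapter shown is standard LoRA (`W + (alpha / rank) * B A`), not the enhanced variant the paper proposes. The function names and the nested-list matrices are hypothetical.

```python
def copy_intermittent_layers(teacher_layers, factor):
    """Keep every `factor`-th teacher layer to form a shallower student."""
    return [layer for i, layer in enumerate(teacher_layers) if i % factor == 0]


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, b, a, alpha):
    """Frozen weight plus scaled low-rank update: W + (alpha / rank) * B A."""
    rank = len(a)  # A has shape (rank, in_features)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]
```

With `B` initialized to zeros, the student starts from exactly the copied teacher weights. Only the small `A` and `B` matrices are trained during distillation, which suits the few-shot setting.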
2404.09227
Report |
DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling |
Xuening Yuan, Hongyu Yang, Yueming Zhao, Di Huang |
Recent progress in text-to-3D creation has been propelled by integrating the
potent prior of Diffusion Models from text-to-image generation into the 3D
domain. Nevertheless, generating 3D scenes characterized by multiple instances
and intricate arrangements remains challenging. In this study, we present
DreamScape, a method for creating highly consistent 3D scenes solely from
textual descriptions, leveraging the strong 3D representation capabilities of
Gaussian Splatting and the complex arrangement abilities of large language
models (LLMs). Our approach involves a 3D Gaussian Guide ($3{DG^2}$) for scene
representation, consisting of semantic primitives (objects) and their spatial
transformations and relationships derived directly from text prompts using
LLMs. This compositional representation allows for local-to-global optimization
of the entire scene. A progressive scale control is tailored during local
object generation, ensuring that objects of different sizes and densities adapt
to the scene, which addresses the training instability issue arising from simple
blending in the subsequent global optimization stage. To mitigate potential
biases of LLM priors, we model collision relationships between objects at the
global level, enhancing physical correctness and overall realism. Additionally,
to generate pervasive objects like rain and snow distributed extensively across
the scene, we introduce a sparse initialization and densification strategy.
Experiments demonstrate that DreamScape offers high usability and
controllability, enabling the generation of high-fidelity 3D scenes from only
text prompts and achieving state-of-the-art performance compared to other
methods. |
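This paper introduces DreamScape, a method for creating consistent 3D scenes from text prompts alone by combining Gaussian Splatting with the arrangement abilities of large language models (LLMs). |
Text-to-3D methods built on diffusion priors still struggle to generate scenes with multiple instances and intricate arrangements. |
DreamScape represents a scene with a 3D Gaussian Guide ($3{DG^2}$) of semantic primitives and their spatial transformations and relationships, derived from the text prompt by an LLM, and optimizes the scene from local to global. It applies progressive scale control during local object generation, models collision relationships between objects at the global level, and uses a sparse initialization and densification strategy for pervasive objects such as rain and snow. |
Experiments show that DreamScape generates high-fidelity 3D scenes from text prompts alone and achieves state-of-the-art performance compared to other methods.
Progressive scale control addresses the training instability that arises from simple blending in the global optimization stage.
Modeling collisions at the global level mitigates biases of LLM priors and improves physical correctness. |
|
text-to-3d, gaussian splatting, large language models, 3d scene generation, diffusion models |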
2404.09216
Report |
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection |
Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu |
Existing open-vocabulary object detectors typically require a predefined set
of categories from users, significantly confining their application scenarios.
In this paper, we introduce DetCLIPv3, a high-performing detector that excels
not only at open-vocabulary object detection but also at generating
hierarchical labels for detected objects. DetCLIPv3 is characterized by three
core designs: 1. Versatile model architecture: we derive a robust open-set
detection framework which is further empowered with generation ability via the
integration of a caption head. 2. High information density data: we develop an
auto-annotation pipeline leveraging a visual large language model to refine
captions for large-scale image-text pairs, providing rich, multi-granular
object labels to enhance the training. 3. Efficient training strategy: we
employ a pre-training stage with low-resolution inputs that enables the object
captioner to efficiently learn a broad spectrum of visual concepts from
extensive image-text paired data. This is followed by a fine-tuning stage that
leverages a small number of high-resolution samples to further enhance
detection performance. With these effective designs, DetCLIPv3 demonstrates
superior open-vocabulary detection performance, e.g., our Swin-T backbone model
achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark,
outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP,
respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense
captioning task on VG dataset, showcasing its strong generative capability. |
This paper introduces DetCLIPv3, an open-vocabulary object detector that can also generate hierarchical labels for detected objects. |
Existing open-vocabulary object detectors rely on predefined categories, limiting their real-world applicability. DetCLIPv3 overcomes this by generating object labels even without category input, allowing for richer interpretation of visual content. |
DetCLIPv3 leverages a versatile architecture with an open-vocabulary detector and an object captioner. It utilizes an auto-annotation pipeline with VLLMs to create a large-scale dataset (GranuCap50M) with rich object labels. A multi-stage training strategy (low-resolution pretraining and high-resolution fine-tuning) ensures efficient learning from massive image-text pairs. |
Achieves 47.0 zero-shot fixed AP on LVIS minival, outperforming prior work such as GLIPv2 and DetCLIPv2.
Achieves state-of-the-art 19.7 AP in dense captioning on VG, showcasing its strong generative capability.
Shows superior domain generalization, with Swin-L model achieving 48.8 AP on COCO-O, surpassing its COCO performance. |
Evaluation of generative capability is limited by existing benchmarks.
Current model lacks instruction-controlled detection. |
open-vocabulary object detection, generative detection, hierarchical object labels, auto-annotation pipeline, vision-language models |
2404.09204
Report |
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models |
Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng |
Multimodal Large Language Models (MLLMs) have shown impressive results on
various multimodal tasks. However, most existing MLLMs are not well suited for
document-oriented tasks, which require fine-grained image perception and
information compression. In this paper, we present TextHawk, an MLLM that is
specifically designed for document-oriented tasks, while preserving the general
capabilities of MLLMs. TextHawk aims to explore efficient fine-grained
perception by designing four dedicated components. Firstly, a ReSampling and
ReArrangement (ReSA) module is proposed to reduce the redundancy in the
document texts and lower the computational cost of the MLLM. We explore
encoding the positions of each local feature by presenting Scalable Positional
Embeddings (SPEs), which can preserve the scalability of various image sizes. A
Query Proposal Network (QPN) is then adopted to initialize the queries
dynamically among different sub-images. To further enhance the fine-grained
visual perceptual ability of the MLLM, we design a Multi-Level Cross-Attention
(MLCA) mechanism that captures the hierarchical structure and semantic
relations of document images. Furthermore, we create a new instruction-tuning
dataset for document-oriented tasks by enriching the multimodal document data
with Gemini Pro. We conduct extensive experiments on both general and
document-oriented MLLM benchmarks, and show that TextHawk outperforms the
state-of-the-art methods, demonstrating its effectiveness and superiority in
fine-grained document perception and general abilities. |
This paper introduces TextHawk, a novel Multimodal Large Language Model (MLLM) specifically designed to address the challenges of document-oriented tasks while retaining strong general vision-language capabilities. |
Document images, with their high resolution and information density, pose significant challenges for MLLMs, necessitating improved fine-grained visual perception and efficient information compression. |
TextHawk incorporates several key components: a ReSampling and ReArrangement (ReSA) module for information compression, Scalable Positional Embeddings (SPEs) for sub-image representation, a Query Proposal Network (QPN) for dynamic query generation, a Multi-Level Cross-Attention (MLCA) mechanism for enhanced perception, and a new instruction-tuning dataset enriched with Gemini Pro. |
TextHawk outperforms state-of-the-art methods on both document-oriented and general MLLM benchmarks.
The model excels in fine-grained tasks like document understanding and referring expression comprehension.
TextHawk achieves a good balance between general vision-language tasks and specialized document-oriented tasks. |
The visual encoder in TextHawk is frozen during training, potentially limiting its adaptability to new visual data.
Future work will focus on training the vision encoder to further enhance perception capabilities. |
multimodal large language models, document understanding, visual question answering, fine-grained visual perception, information compression |
2404.09172
Report |
LoopAnimate: Loopable Salient Object Animation |
Fanyi Wang, Peng Liu, Haotian Hu, Dan Meng, Jingwen Su, Jinjin Xu, Yanhao Zhang, Xiaoming Ren, Zhiwang Zhang |
Research on diffusion model-based video generation has advanced rapidly.
However, limitations in object fidelity and generation length hinder its
practical applications. Additionally, specific domains like animated wallpapers
require seamless looping, where the first and last frames of the video match
seamlessly. To address these challenges, this paper proposes LoopAnimate, a
novel method for generating videos with consistent start and end frames. To
enhance object fidelity, we introduce a framework that decouples multi-level
image appearance and textual semantic information. Building upon an
image-to-image diffusion model, our approach incorporates both pixel-level and
feature-level information from the input image, injecting image appearance and
textual semantic embeddings at different positions of the diffusion model.
Existing UNet-based video generation models require the entire video as input
during training to encode temporal and positional information at once. However,
due to limitations in GPU memory, the number of frames is typically restricted
to 16. To address this, this paper proposes a three-stage training strategy
with progressively increasing frame numbers and reducing fine-tuning modules.
Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend
the capacity for encoding temporal and positional information up to 36 frames.
The proposed LoopAnimate for the first time extends the single-pass
generation length of UNet-based video generation models to 35 frames while
maintaining high-quality video generation. Experiments demonstrate that
LoopAnimate achieves state-of-the-art performance in both objective metrics,
such as fidelity and temporal consistency, and subjective evaluation results. |
LoopAnimate is a novel image-to-video generation method that generates loopable videos with a length of 35 frames, improving object fidelity and extending generation length. |
Existing video generation models are limited in object fidelity and generation length, and cannot create the seamlessly looping videos needed in applications like dynamic wallpapers. |
The paper proposes a multi-level image representation and textual semantic decoupling framework, a three-stage training strategy progressively increasing the number of frames, and an Asymmetric Loop Sampling Strategy for loopable video generation. |
LoopAnimate outperforms state-of-the-art methods in object fidelity and motion quality, particularly for human portraits.
The three-stage training strategy successfully extends the generation length to 35 frames while preserving video quality.
A specially designed condition initialization method and asymmetric loop sampling strategy enable the generation of loopable videos. |
The model relies on accurate salient object detection for optimal performance.
Further research can explore extending the generation length beyond the current 35-frame limit. |
diffusion models, image-to-video generation, loopable video, long video generation, object fidelity |
2404.09111
Report |
Exploring Generative AI for Sim2Real in Driving Data Synthesis |
Haonan Zhao, Yiting Wang, Thomas Bashford-Rogers, Valentina Donzella, Kurt Debattista |
Datasets are essential for training and testing vehicle perception
algorithms. However, the collection and annotation of real-world images is
time-consuming and expensive. Driving simulators offer a solution by
automatically generating various driving scenarios with corresponding
annotations, but the simulation-to-reality (Sim2Real) domain gap remains a
challenge. While most generative Artificial Intelligence (AI) work follows
the de facto Generative Adversarial Nets (GANs)-based methods, the recently
emerging diffusion probabilistic models have not been fully explored for
mitigating Sim2Real challenges in driving data synthesis. To explore their
performance, this paper applies three different generative AI methods to
leverage semantic label maps from a driving simulator as a bridge for the
creation of realistic datasets. A comparative analysis of these methods is
presented from the perspective of image quality and perception. New synthetic
datasets, which include driving images and auto-generated high-quality
annotations, are produced with low costs and high scene variability. The
experimental results show that although GAN-based methods are adept at
generating high-quality images when provided with manually annotated labels,
ControlNet produces synthetic datasets with fewer artefacts and more structural
fidelity when using simulator-generated labels. This suggests that the
diffusion-based approach may provide improved stability and an alternative
method for addressing Sim2Real challenges. |
This paper explores three generative AI methods (two GAN-based, one diffusion-based) to generate realistic driving datasets from simulator semantic label maps, aiming to bridge the simulation-to-reality gap. |
Collecting and annotating real-world driving data is expensive and limited in scenario diversity. Simulators can generate diverse scenarios but often lack realism, hindering their use in training robust perception algorithms. |
The paper leverages semantic label maps from the CARLA simulator and Cityscapes dataset. It trains Pix2pixHD, OASIS (GAN-based), and ControlNet (diffusion-based) models to translate these maps into realistic images. |
GAN-based methods excel in image quality when trained on manually annotated Cityscapes labels but struggle with simulator labels.
ControlNet, while stylistically different from Cityscapes, generates images with fewer artefacts and better structural fidelity, especially with simulator labels.
Findings suggest ControlNet's diffusion process may offer better stability and robustness in handling variations in label accuracy. |
The study primarily focuses on semantic segmentation, limiting the assessment of other perception tasks.
Future work could explore modifying ControlNet to improve the realism and diversity of generated images while preserving structural accuracy. |
generative ai, sim2real, driving data synthesis, diffusion models, semantic segmentation |
2404.09105
Report |
EGGS: Edge Guided Gaussian Splatting for Radiance Fields |
Yuanhao Gong |
Gaussian splatting methods are becoming popular. However, their loss
function only contains the $\ell_1$ norm and the structural similarity between
the rendered and input images, without considering the edges in these images.
It is well-known that the edges in an image provide important information.
Therefore, in this paper, we propose an Edge Guided Gaussian Splatting (EGGS)
method that leverages the edges in the input images. More specifically, we give
the edge region a higher weight than the flat region. With such edge guidance,
the resulting Gaussian particles focus more on the edges instead of the flat
regions. Moreover, such edge guidance does not increase the computation cost
during the training and rendering stages. The experiments confirm that this
simple edge-weighted loss function indeed yields an improvement of about $1\sim2$ dB on several
different data sets. By simply plugging in the edge guidance, the proposed
method can improve all Gaussian splatting methods in different scenarios, such
as human head modeling, building 3D reconstruction, etc. |
Introduces Edge Guided Gaussian Splatting (EGGS), improving radiance field accuracy in 3D Gaussian splatting methods by weighting edges in the loss function. |
Edges are visually important, and existing Gaussian splatting methods treat all pixels equally in their loss functions, leading to suboptimal results. |
EGGS incorporates an edge-weighting function (e.g., image gradient) into the loss function, giving higher importance to edge pixels during optimization. |
EGGS achieves 1-2 dB PSNR improvement over standard 3DGS on various datasets.
Edge guidance encourages Gaussian particles to align with edges, improving scene geometry representation.
The method is computationally efficient, adding no overhead to training or rendering. |
PSNR improvement may vary depending on scene complexity, image resolution, and other factors.
Future work includes exploring more sophisticated edge detection methods and applying EGGS to other 3DGS variants. |
gaussian splatting, radiance fields, 3d reconstruction, edge detection, computer vision |
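As a rough illustration of the edge-weighted loss described in the EGGS entry above, the sketch below scales a per-pixel L1 error by the gradient magnitude of the target image. The weight form `1 + lam * |grad|`, the forward-difference gradient, and the names `gradient_magnitude`, `edge_weighted_l1`, and `lam` are assumptions made for illustration; the paper's actual weighting function and its structural-similarity term are not reproduced here.

```python
def gradient_magnitude(img):
    """Forward-difference gradient magnitude of a 2D grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp at the border so the last row/column gets a zero difference.
            dx = img[y][min(x + 1, w - 1)] - img[y][x]
            dy = img[min(y + 1, h - 1)][x] - img[y][x]
            out[y][x] = (dx * dx + dy * dy) ** 0.5
    return out


def edge_weighted_l1(rendered, target, lam=1.0):
    """Mean of (1 + lam * |grad target|) * |rendered - target| over all pixels.

    Assumed weighting for illustration only; lam = 0 gives the plain L1 loss.
    """
    grad = gradient_magnitude(target)
    h, w = len(target), len(target[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            weight = 1.0 + lam * grad[y][x]
            total += weight * abs(rendered[y][x] - target[y][x])
    return total / (h * w)


# A vertical step edge between columns 1 and 2.
target = [[0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]

# Same-sized error placed on a flat pixel versus on the edge pixel.
flat_error = [row[:] for row in target]
flat_error[0][0] += 0.5
edge_error = [row[:] for row in target]
edge_error[0][1] += 0.5

loss_flat = edge_weighted_l1(flat_error, target)
loss_edge = edge_weighted_l1(edge_error, target)
```

With this weighting, the same reconstruction error costs more when it falls on an edge pixel than on a flat one, which is the behavior the entry attributes to edge guidance.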
2404.08921
Report |
PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos |
Qi Zhao, M. Salman Asif, Zhan Ma |
The primary focus of Neural Representation for Videos (NeRV) is to
effectively model its spatiotemporal consistency. However, current NeRV systems
often face a significant issue of spatial inconsistency, leading to decreased
perceptual quality. To address this issue, we introduce the Pyramidal Neural
Representation for Videos (PNeRV), which is built on a multi-scale information
connection and comprises a lightweight rescaling operator, Kronecker
Fully-connected layer (KFc), and a Benign Selective Memory (BSM) mechanism. The
KFc, inspired by the tensor decomposition of the vanilla Fully-connected layer,
facilitates low-cost rescaling and global correlation modeling. BSM merges
high-level features with granular ones adaptively. Furthermore, we provide an
analysis based on the Universal Approximation Theory of the NeRV system and
validate the effectiveness of the proposed PNeRV. We conducted comprehensive
experiments to demonstrate that PNeRV surpasses the performance of contemporary
NeRV models, achieving the best results in video regression on UVG and DAVIS
under various metrics (PSNR, SSIM, LPIPS, and FVD). Compared to vanilla NeRV,
PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG, along
with a +3.28 dB PSNR and 634% FVD increase on DAVIS. |
This paper introduces PNeRV (Pyramidal Neural Representation for Videos) to address the spatial inconsistency issue in current NeRV systems, aiming for enhanced spatiotemporal consistency in video representation. |
Existing NeRV systems suffer from poor perceptual quality due to spatial inconsistency, stemming from a lack of global receptive field and multi-scale information communication. This limits their ability to model complex videos effectively. |
PNeRV leverages a multi-scale information connection using a lightweight Kronecker Fully-connected (KFc) layer for low-cost upsampling and global correlation modeling. It also employs a Benign Selective Memory (BSM) mechanism for adaptive merging of high-level and granular features. The paper also provides a Universal Approximation Theory analysis for NeRV. |
PNeRV outperforms state-of-the-art NeRV models in video regression tasks on UVG and DAVIS datasets, achieving superior performance in PSNR, SSIM, LPIPS, and FVD metrics.
PNeRV demonstrates significant improvement in spatial consistency, reducing noise and artifacts in reconstructed videos, especially in scenes with complex spatiotemporal features.
Ablation studies confirm the effectiveness of KFc and BSM in enhancing perceptual quality and demonstrate the superiority of the pyramidal structure for multi-scale feature learning. |
The hierarchical structure in PNeRV increases computational complexity, demanding further optimization for practical applications.
Future work will explore the theoretical analysis and enhancement of PNeRV's generalization abilities, particularly in video interpolation tasks. |
neural video representation, implicit neural representation, video coding, perceptual quality, multi-scale feature learning |
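The PNeRV entry above describes the KFc layer as inspired by a tensor decomposition of the fully-connected layer and used for low-cost rescaling. The sketch below shows only the generic idea of a Kronecker-factored linear map, `Y = A X B^T`, and compares its parameter count with a dense layer acting on the flattened map. It is not the paper's layer: the function names, the absence of channels and bias, and the nearest-neighbour factor matrices are all assumptions chosen for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def kron_factored_layer(x, a, b):
    """Rescale an H x W map to H_out x W_out as Y = A X B^T.

    a has shape H_out x H and b has shape W_out x W. This is equivalent to
    applying the Kronecker product of a and b to the flattened map.
    """
    return matmul(matmul(a, x), transpose(b))


def dense_param_count(h, w, h_out, w_out):
    """Weights of a fully-connected layer on the flattened map."""
    return (h_out * w_out) * (h * w)


def factored_param_count(h, w, h_out, w_out):
    """Weights of the two factor matrices."""
    return h_out * h + w_out * w


# Hypothetical factors that reproduce 2x nearest-neighbour upsampling.
up = [[1, 0],
      [1, 0],
      [0, 1],
      [0, 1]]
x = [[1, 2],
     [3, 4]]
y = kron_factored_layer(x, up, up)
```

In a learned layer the factor matrices would be trainable; the fixed upsampling matrices here only make the result easy to check by hand.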
2404.08819
Report |
The Illusion of State in State-Space Models |
William Merrill, Jackson Petty, Ashish Sabharwal |
State-space models (SSMs) have emerged as a potential alternative
architecture for building large language models (LLMs) compared to the
previously ubiquitous transformer architecture. One theoretical weakness of
transformers is that they cannot express certain kinds of sequential
computation and state tracking (Merrill and Sabharwal, 2023), which SSMs are
explicitly designed to address via their close architectural similarity to
recurrent neural networks (RNNs). But do SSMs truly have an advantage (over
transformers) in expressive power for state tracking? Surprisingly, the answer
is no. Our analysis reveals that the expressive power of SSMs is limited very
similarly to transformers: SSMs cannot express computation outside the
complexity class $\mathsf{TC}^0$. In particular, this means they cannot solve
simple state-tracking problems like permutation composition. It follows that
SSMs are provably unable to accurately track chess moves with certain notation,
evaluate code, or track entities in a long narrative. To supplement our formal
analysis, we report experiments showing that Mamba-style SSMs indeed struggle
with state tracking. Thus, despite its recurrent formulation, the "state" in an
SSM is an illusion: SSMs have similar expressiveness limitations to
non-recurrent models like transformers, which may fundamentally limit their
ability to solve real-world state-tracking problems. |
This paper demonstrates that state-space models (SSMs), like transformers, are limited in their expressive power for state tracking and cannot solve problems outside the complexity class TC^0. |
SSMs have been proposed as alternatives to transformers, with potential advantages in handling stateful and sequential problems. This work investigates whether these advantages hold true theoretically and practically. |
The authors employ circuit complexity analysis to prove that linear and Mamba-style SSMs fall within the TC^0 complexity class, limiting their ability to express complex state tracking. They also conduct experiments on permutation composition tasks to empirically evaluate the state-tracking capabilities of SSMs compared to transformers and RNNs. |
Theoretically, linear and Mamba-style SSMs are limited to TC^0 complexity, similar to transformers, preventing them from solving problems like permutation composition.
Empirically, SSMs and transformers fail to learn permutation composition with a fixed depth, unlike RNNs.
SSMs, while still limited, empirically perform better than transformers on approximate state tracking for less complex tasks. |
The analysis focuses on specific SSM architectures (linear and Mamba-style) and might not cover all variants.
Future work could explore alternative SSM designs that balance parallelizability and state-tracking expressiveness. |
state-space models, transformers, state tracking, circuit complexity, expressive power |
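To make the permutation composition task from the entry above concrete, the sketch below tracks a state by folding a sequence of permutations from left to right. It shows only what the task asks for; it does not demonstrate the paper's expressiveness result. The tuple encoding and the names `compose` and `track_state` are assumptions for illustration.

```python
from functools import reduce


def compose(p, q):
    """Return the permutation 'apply p first, then q'.

    A permutation is a tuple where position i holds the image of i.
    """
    return tuple(q[p[i]] for i in range(len(p)))


def track_state(perms, n):
    """Compose a sequence of permutations of n items, one step at a time."""
    identity = tuple(range(n))
    return reduce(compose, perms, identity)


swap01 = (1, 0, 2)  # exchange items 0 and 1
cycle = (1, 2, 0)   # 0 -> 1, 1 -> 2, 2 -> 0

# Order matters, so the running state cannot be recovered from move counts alone.
swap_then_cycle = track_state([swap01, cycle], 3)
cycle_then_swap = track_state([cycle, swap01], 3)
```

The fold keeps one running state and updates it once per input, which is the sequential behavior the entry contrasts with fixed-depth models.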
2404.08814
Report |
E3: Ensemble of Expert Embedders for Adapting Synthetic Image Detectors to New Generators Using Limited Data |
Aref Azizpour, Tai D. Nguyen, Manil Shrestha, Kaidi Xu, Edward Kim, Matthew C. Stamm |
As generative AI progresses rapidly, new synthetic image generators continue
to emerge at a swift pace. Traditional detection methods face two main
challenges in adapting to these generators: the forensic traces of synthetic
images from new techniques can vastly differ from those learned during
training, and access to data for these new generators is often limited. To
address these issues, we introduce the Ensemble of Expert Embedders (E3), a
novel continual learning framework for updating synthetic image detectors. E3
enables the accurate detection of images from newly emerged generators using
minimal training data. Our approach does this by first employing transfer
learning to develop a suite of expert embedders, each specializing in the
forensic traces of a specific generator. Then, all embeddings are jointly
analyzed by an Expert Knowledge Fusion Network to produce accurate and reliable
detection decisions. Our experiments demonstrate that E3 outperforms existing
continual learning methods, including those developed specifically for
synthetic image detection. |
The paper introduces Ensemble of Expert Embedders (E3), a novel continual learning framework for updating synthetic image detectors to accurately detect images from newly emerged generators using minimal training data. |
Traditional detection methods struggle to adapt to new synthetic image generators due to the vastly different forensic traces and limited access to data from these generators. This necessitates continual updating of detectors, which poses challenges like catastrophic forgetting and data inefficiency. |
E3 employs transfer learning to develop a suite of expert embedders, each specializing in forensic traces of a specific generator. Embeddings from all experts are jointly analyzed by an Expert Knowledge Fusion Network to produce accurate detection decisions. |
E3 significantly outperforms existing continual learning methods, including those designed for synthetic image detection.
E3 exhibits strong and stable performance across various new generators, including those with limited training data.
The framework demonstrates generality by achieving superior results across multiple detector architectures. |
The ensemble approach increases network size, although the increase in parameters is manageable and outweighed by the improved detection accuracy.
Future work could explore compressing the model or reducing the number of experts to address the increased network size. |
synthetic image detection, continual learning, generative adversarial networks (gans), transfer learning, ensemble learning |
2404.08639
Report |
COCONut: Modernizing COCO Segmentation |
Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen |
In recent decades, the vision community has witnessed remarkable progress in
visual recognition, partially owing to advancements in dataset benchmarks.
Notably, the established COCO benchmark has propelled the development of modern
detection and segmentation systems. However, the COCO segmentation benchmark
has seen comparatively slow improvement over the last decade. Originally
equipped with coarse polygon annotations for thing instances, it gradually
incorporated coarse superpixel annotations for stuff regions, which were
subsequently heuristically amalgamated to yield panoptic segmentation
annotations. These annotations, executed by different groups of raters, have
resulted not only in coarse segmentation masks but also in inconsistencies
between segmentation types. In this study, we undertake a comprehensive
reevaluation of the COCO segmentation annotations. By enhancing the annotation
quality and expanding the dataset to encompass 383K images with more than 5.18M
panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation
dataset. COCONut harmonizes segmentation annotations across semantic, instance,
and panoptic segmentation with meticulously crafted high-quality masks, and
establishes a robust benchmark for all segmentation tasks. To our knowledge,
COCONut stands as the inaugural large-scale universal segmentation dataset,
verified by human raters. We anticipate that the release of COCONut will
significantly contribute to the community's ability to assess the progress of
novel neural networks. |
The paper introduces COCONut, a large-scale universal segmentation dataset designed to modernize and improve upon the COCO segmentation annotations. |
The original COCO segmentation annotations suffer from limitations such as coarse masks, inconsistencies between segmentation types, and a relatively small dataset size, hindering the evaluation and training of modern segmentation models. |
The authors developed an assisted-manual annotation pipeline and a data engine to efficiently create high-quality segmentation masks. The pipeline leverages neural networks for generating proposals and allows human raters to edit and refine them. The data engine iteratively expands the dataset while maintaining high annotation quality. |
COCONut provides human-verified annotations for 383K images and 5.18M masks, surpassing the size and quality of existing datasets.
The assisted-manual pipeline significantly accelerates the annotation process while ensuring high-quality masks.
Experiments demonstrate that models trained on COCONut outperform those trained on COCO, highlighting the importance of large-scale, high-quality annotations. |
The dataset is currently limited to 133 semantic classes, potentially limiting its applicability to open-vocabulary segmentation tasks.
Future work could explore incorporating more diverse image sources and further expanding the dataset size. |
segmentation, dataset, coco, annotation, deep learning |
2404.08636
Report |
Probing the 3D Awareness of Visual Foundation Models |
Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani |
Recent advances in large-scale pretraining have yielded visual foundation
models with strong capabilities. Not only can recent models generalize to
arbitrary images for their training task, but their intermediate representations
are also useful for other visual tasks such as detection and segmentation. Given
that such models can classify, delineate, and localize objects in 2D, we ask
whether they also represent their 3D structure. In this work, we analyze the 3D
awareness of visual foundation models. We posit that 3D awareness implies that
representations (1) encode the 3D structure of the scene and (2) consistently
represent the surface across views. We conduct a series of experiments using
task-specific probes and zero-shot inference procedures on frozen features. Our
experiments reveal several limitations of the current models. Our code and
analysis can be found at https://github.com/mbanani/probe3d. |
This paper investigates the 3D awareness of visual foundation models, examining how well they capture 3D scene structure and exhibit consistency across different viewpoints. |
Understanding the 3D awareness of visual foundation models is crucial for assessing their capabilities and limitations in representing the 3D world, particularly as they are increasingly used in 3D-related tasks. |
The authors probe frozen representations of various large-scale pretrained models using task-specific probes and zero-shot inference for depth estimation, surface normal estimation, and 3D correspondence on both scene-level (NYUv2, ScanNet) and object-level (NAVI) datasets. |
Discriminative self-supervised models like DINOv2 exhibit the strongest 3D awareness, demonstrating impressive performance in encoding depth and surface normals.
Models show good correspondence estimation for small viewpoint changes but struggle with large viewpoint variations, indicating a lack of true 3D consistency.
Vision-language models like CLIP perform poorly in capturing 3D information despite their strong semantic generalization abilities. |
The study relies on publicly available checkpoints trained on diverse datasets with varying scales and recipes, limiting controlled comparisons.
The analysis focuses on specific aspects of 3D awareness and probing methods, potentially overlooking other facets of 3D understanding and evaluation techniques. |
3d vision, visual foundation models, self-supervised learning, representation learning, multiview consistency |
2404.08603
Report |
Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation |
Yanhao Zheng, Kai Liu |
Open-vocabulary object detection (OVOD) aims at localizing and recognizing
visual objects from novel classes unseen at training time. However,
empirical studies reveal that advanced detectors generally assign lower scores
to those novel instances, which are inadvertently suppressed during inference
by commonly adopted greedy strategies like Non-Maximum Suppression (NMS),
leading to sub-optimal detection performance for novel classes. This paper
systematically investigates this problem with the commonly-adopted two-stage
OVOD paradigm. Specifically, in the region-proposal stage, proposals that
contain novel instances showcase lower objectness scores, since they are
treated as background proposals during the training phase. Meanwhile, in the
object-classification stage, novel objects share lower region-text similarities
(i.e., classification scores) due to the biased visual-language alignment by
seen training samples. To alleviate this problem, this paper introduces two
advanced measures to adjust confidence scores and conserve erroneously
dismissed objects: (1) a class-agnostic localization quality estimate via
overlap degree of region/object proposals, and (2) a text-guided visual
similarity estimate with proxy prototypes for novel classes. Integrated with
adjusting techniques specifically designed for the region-proposal and
object-classification stages, this paper derives the aggregated confidence
estimate for the open-vocabulary object detection paradigm (AggDet). Our AggDet
is a generic and training-free post-processing scheme, which consistently
bolsters open-vocabulary detectors across model scales and architecture
designs. For instance, AggDet achieves 3.3% and 1.5% gains on OV-COCO and
OV-LVIS benchmarks respectively, without any training cost. |
This paper introduces AggDet, a training-free post-processing method for open-vocabulary object detection (OVOD), which aggregates confidence estimates from both region-proposal and object-classification stages to boost the detection performance on novel classes. |
Current OVOD detectors exhibit a significant performance gap between novel and base classes, due to the underestimated confidence scores for novel instances in both region-proposal and object-classification stages. |
AggDet leverages (1) a class-agnostic localization quality estimate via the overlap degree of region proposals, and (2) a text-guided visual similarity estimate with proxy prototypes for novel classes, to adjust the confidence scores during inference. |
AggDet consistently enhances various OVOD detectors across model scales and architectures, without any training cost.
The method achieves up to 3.3% and 1.5% gains on OV-COCO and OV-LVIS benchmarks, respectively.
AggDet introduces minimal computational overhead, with less than 1 ms latency during inference. |
The hyper-parameters in AggDet need to be slightly tuned for different datasets.
Future work could explore incorporating the aggregation techniques into the training paradigm for further performance improvement. |
open-vocabulary object detection, confidence aggregation, region proposal, object classification, zero-shot learning |
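The AggDet entry above mentions a class-agnostic localization quality estimate based on the overlap degree of region proposals. The sketch below is one hypothetical way to turn that idea into code: each proposal is scored by its mean IoU with the other proposals, and that score is blended with the detector confidence through a weighted geometric mean. The mean-IoU proxy, the geometric-mean blend, the parameter `alpha`, and all function names are assumptions, not the paper's formulas.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def overlap_quality(boxes):
    """Mean IoU of each box with every other box (hypothetical quality proxy)."""
    scores = []
    for i, box in enumerate(boxes):
        others = [iou(box, other) for j, other in enumerate(boxes) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def aggregate(score, quality, alpha=0.5):
    """Weighted geometric mean of detector score and overlap quality (assumed form)."""
    return (score ** (1.0 - alpha)) * (quality ** alpha)


# Two proposals agree on one object; a third sits alone elsewhere.
boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (100, 100, 110, 110)]
quality = overlap_quality(boxes)
```

Because the adjustment uses only box geometry and existing scores, it can run as post-processing without retraining, which matches the training-free framing in the entry.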
2404.08590
Report |
Improving Referring Image Segmentation using Vision-Aware Text Features |
Hai Nguyen-Truong, E-Ro Nguyen, Tuan-Anh Vu, Minh-Triet Tran, Binh-Son Hua, Sai-Kit Yeung |
Referring image segmentation is a challenging task that involves generating
pixel-wise segmentation masks based on natural language descriptions. Existing
methods have relied mostly on visual features to generate the segmentation
masks while treating text features as supporting components. This over-reliance
on visual features can lead to suboptimal results, especially in complex
scenarios where text prompts are ambiguous or context-dependent. To overcome
these challenges, we present a novel framework VATEX to improve referring image
segmentation by enhancing object and context understanding with Vision-Aware
Text Feature. Our method involves using CLIP to derive a CLIP Prior that
integrates an object-centric visual heatmap with text description, which can be
used as the initial query in DETR-based architecture for the segmentation task.
Furthermore, by observing that there are multiple ways to describe an instance
in an image, we enforce feature similarity between text variations referring to
the same visual input by two components: a novel Contextual Multimodal Decoder
that turns text embeddings into vision-aware text features, and a Meaning
Consistency Constraint to further ensure the coherent and consistent
interpretation of language expressions with the context understanding obtained
from the image. Our method achieves a significant performance improvement on
three benchmark datasets RefCOCO, RefCOCO+ and G-Ref. Code is available at:
https://nero1342.github.io/VATEX_RIS. |
This paper proposes VATEX, a novel framework that leverages vision-aware text features to enhance the performance of referring image segmentation. |
Existing methods often struggle with complex or ambiguous language expressions, leading to inaccurate segmentation results. VATEX addresses this by enhancing object and context understanding through a deeper integration of visual and textual information. |
The proposed method utilizes a CLIP Prior for object localization, a Contextual Multimodal Decoder (CMD) for hierarchical visual-textual feature fusion, and a Meaning Consistency Constraint (MCC) to enforce consistent representation of different expressions referring to the same object. |
VATEX achieves state-of-the-art performance on three referring image segmentation benchmarks: RefCOCO, RefCOCO+, and G-Ref.
The method also demonstrates strong performance on referring video object segmentation benchmarks, Ref-YouTube-VOS and Ref-DAVIS17.
Ablation studies and qualitative analysis validate the contribution of each proposed component (CLIP Prior, CMD, MCC) to the overall performance improvement. |
The method currently does not explicitly model relationships between objects or actions, limiting its accuracy in scenarios requiring such understanding.
Future work will focus on incorporating object interaction and action alignment into the framework for improved segmentation in more complex scenarios. |
referring image segmentation, vision-aware text features, clip localization, multimodal understanding, meaning consistency constraint |
2404.08580
Report |
Lossy Image Compression with Foundation Diffusion Models |
Lucas Relic, Roberto Azevedo, Markus Gross, Christopher Schroers |
Incorporating diffusion models in the image compression domain has the
potential to produce realistic and detailed reconstructions, especially at
extremely low bitrates. Previous methods focus on using diffusion models as
expressive decoders robust to quantization errors in the conditioning signals,
yet achieving competitive results in this manner requires costly training of
the diffusion model and long inference times due to the iterative generative
process. In this work we formulate the removal of quantization error as a
denoising task, using diffusion to recover lost information in the transmitted
image latent. Our approach allows us to perform less than 10\% of the full
diffusion generative process and requires no architectural changes to the
diffusion model, enabling the use of foundation models as a strong prior
without additional fine tuning of the backbone. Our proposed codec outperforms
previous methods in quantitative realism metrics, and we verify that our
reconstructions are qualitatively preferred by end users, even when other
methods use twice the bitrate. |
This paper proposes a novel lossy image compression codec leveraging foundation latent diffusion models for realistic image reconstruction, particularly at low bitrates. |
Existing image compression methods often produce unrealistic or distorted images at very low bitrates. This work addresses this by using diffusion models to synthesize lost details and enhance perceptual quality. |
The method combines a variational autoencoder from a pre-trained latent diffusion model, adaptive quantization, a learned timestep prediction module for optimal denoising, and an entropy model. It processes a quantized latent representation and uses the diffusion model for denoising, allowing for a significant reduction in the number of diffusion steps compared to previous works. |
The proposed codec achieves state-of-the-art results in image realism as measured by FID, outperforming previous generative compression methods.
It maintains competitive performance in traditional distortion metrics like LPIPS and MS-SSIM, especially compared to other diffusion-based codecs.
Subjective user study confirms that the reconstructions are visually preferred over other state-of-the-art methods, even at lower bitrates. |
The method might inaccurately reconstruct certain image details due to limitations of the foundation diffusion model's VAE.
Potential misgeneration of content at very low bitrates raises ethical concerns. |
image compression, latent diffusion, generative models, low bitrate, image realism |
2404.08540
Report |
On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation |
Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang |
Recent advances in monocular depth estimation have been made by incorporating
natural language as additional guidance. Although yielding impressive results,
the impact of the language prior, particularly in terms of generalization and
robustness, remains unexplored. In this paper, we address this gap by
quantifying the impact of this prior and introduce methods to benchmark its
effectiveness across various settings. We generate "low-level" sentences that
convey object-centric, three-dimensional spatial relationships, incorporate
them as additional language priors and evaluate their downstream impact on
depth estimation. Our key finding is that current language-guided depth
estimators perform optimally only with scene-level descriptions and
counter-intuitively fare worse with low level descriptions. Despite leveraging
additional data, these methods are not robust to directed adversarial attacks
and decline in performance with an increase in distribution shift. Finally, to
provide a foundation for future research, we identify points of failures and
offer insights to better understand these shortcomings. With an increasing
number of methods using language for depth estimation, our findings highlight
the opportunities and pitfalls that require careful consideration for effective
deployment in real-world settings. |
This paper investigates the impact of natural language guidance on monocular depth estimation, particularly its generalization and robustness. |
Understanding the role of language priors is crucial for effectively deploying depth estimation in real-world applications like autonomous driving and robotics. |
The authors systematically evaluate language-guided depth estimation by: (1) generating sentences describing spatial relationships between objects, image captions, and activity descriptions, (2) conducting supervised and zero-shot experiments with varying language inputs, (3) analyzing robustness under adversarial conditions like object masking and distribution shifts. |
Existing language-guided methods exhibit a strong scene-level bias, performing optimally with scene-level descriptions but deteriorating with low-level spatial relationships.
Language-guided models are less robust to distribution shifts and adversarial attacks compared to vision-only methods.
Performance improvement is observed with an increase in the number of low-level spatial sentences, suggesting a need for sufficient scene-level representation. |
The study primarily focuses on the VPD model and may not fully represent all language-guided depth estimators.
Future work should explore alternative methods for incorporating language priors and improving robustness to domain shifts. |
depth estimation, language guidance, robustness, distribution shift, adversarial attacks |
2404.08506
Report |
LaSagnA: Language-based Segmentation Assistant for Complex Queries |
Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, Lin Ma |
Recent advancements have empowered Large Language Models for Vision (vLLMs)
to generate detailed perceptual outcomes, including bounding boxes and masks.
Nonetheless, there are two constraints that restrict the further application of
these vLLMs: the inability to handle multiple targets per query and the
failure to identify the absence of query objects in the image. In this study,
we acknowledge that the main cause of these problems is the insufficient
complexity of training queries. Consequently, we define the general sequence
format for complex queries. Then we incorporate a semantic segmentation task in
the current pipeline to fulfill the requirements of training data. Furthermore,
we present three novel strategies to effectively handle the challenges arising
from the direct integration of the proposed format. The effectiveness of our
model in processing complex queries is validated by results comparable with
conventional methods on both closed-set and open-set semantic segmentation
datasets. Additionally, we outperform a series of vLLMs in reasoning and
referring segmentation, showcasing our model's remarkable capabilities. We
release the code at https://github.com/congvvc/LaSagnA. |
This paper presents LaSagnA, a Large Language Model for vision (vLLM) that handles complex queries involving multiple arbitrary targets, which may or may not exist in an image, by introducing a new input sequence format and incorporating semantic segmentation tasks into training. |
Existing vLLM-based segmentation assistants struggle with complex queries because their training primarily revolves around single-target scenarios where the queried object is always present in the image. This limits their applicability in real-world settings where multiple or even non-existent targets might be queried. |
The authors define a new sequence format that incorporates multiple classes and negative classes. They integrate semantic segmentation tasks into the training process, and to address challenges in training with this new format, they propose three strategies: sequence augmentation (adding negative classes to the response), random classes list (using a dynamic list of categories in the query), and target order consistency (aligning category order in response with the query). |
LaSagnA achieves comparable results to state-of-the-art segmentation specialists on both closed-set and open-set semantic segmentation benchmarks.
The model outperforms previous vLLMs on referring segmentation tasks, demonstrating its enhanced ability to locate and segment objects based on complex language descriptions.
LaSagnA exhibits promising zero-shot performance on the generalized referring segmentation benchmark (gRefCOCO), highlighting its capacity to handle unseen scenarios with multiple and non-existent targets. |
While LaSagnA excels in high-level understanding, its accuracy in capturing low-level visual details and handling small or crowded objects still lags behind specialized segmentation models.
Further research is needed to develop lighter and more efficient vLLMs and mask decoders to enhance computational efficiency. |
large language models for vision (vllms), semantic segmentation, referring segmentation, complex query handling, open-set segmentation |
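The three training strategies described in the LaSagnA record above (random classes list, sequence augmentation, target order consistency) can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the `[SEG]` placeholder for a predicted mask, the `none` marker for absent classes, and the function name `build_complex_query` are all assumptions made for illustration.

```python
import random

def build_complex_query(present, vocabulary, num_classes, rng):
    """Build one (query, response) training pair for a multi-class query.

    present: set of class names actually visible in the image.
    vocabulary: list of all class names in the dataset.
    """
    # Random classes list: draw a fresh subset of categories for each query.
    candidates = rng.sample(vocabulary, num_classes)
    query = "Segment these classes: " + ", ".join(candidates) + "."

    # Target order consistency: the response follows the query's class order.
    # Sequence augmentation: absent (negative) classes stay in the response,
    # explicitly marked as missing (hypothetical "none" marker).
    parts = []
    for name in candidates:
        marker = "[SEG]" if name in present else "none"
        parts.append(f"{name}: {marker}")
    return query, "; ".join(parts)

rng = random.Random(0)
query, response = build_complex_query(
    {"cat", "sky"}, ["cat", "dog", "sky", "road", "tree"], 4, rng
)
print(query)
print(response)
```

Because negatives are written into the response rather than dropped, a model trained on such pairs sees explicit examples of queried objects that are not in the image.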
2404.08449
Report |
OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering |
Jingrui Ye, Zongkai Zhang, Yujiao Jiang, Qingmin Liao, Wenming Yang, Zongqing Lu |
Rendering dynamic 3D humans from monocular videos is crucial for various
applications such as virtual reality and digital entertainment. Most methods
assume the person is in an unobstructed scene, while in real-life scenarios
various objects may occlude body parts. A previous method utilizes
NeRF for surface rendering to recover the occluded areas, but it requires more
than one day to train and several seconds to render, failing to meet the
requirements of real-time interactive applications. To address these issues, we
propose OccGaussian based on 3D Gaussian Splatting, which can be trained within
6 minutes and produces high-quality human renderings at up to 160 FPS with
occluded input. OccGaussian initializes 3D Gaussian distributions in the
canonical space, and we perform an occlusion feature query at occluded regions,
where the aggregated pixel-aligned feature is extracted to compensate for the
missing information. Then we use a Gaussian Feature MLP to further process the
feature along with occlusion-aware loss functions to better perceive the occluded
area. Extensive experiments on both simulated and real-world occlusions
demonstrate that our method achieves comparable or even superior performance
compared to the state-of-the-art method, while improving training and
inference speeds by 250x and 800x, respectively. Our code will be available for
research purposes. |
This paper presents OccGaussian, a novel method for rendering humans in monocular videos with occlusions using 3D Gaussian Splatting, achieving fast training and real-time rendering. |
Previous methods for rendering humans under occlusion are too slow in training and inference, limiting their real-world applications. |
OccGaussian leverages aggregated pixel-aligned features from visible points to recover occluded regions. It employs a K-nearest feature query and MLPs to model occluded points' colors and opacities. Additionally, it incorporates occlusion and consistency losses for enhanced rendering in occluded areas. |
OccGaussian achieves comparable or better rendering quality than the state-of-the-art method OccNeRF.
It significantly reduces training time to 6-13 minutes, approximately 250 times faster than OccNeRF.
It enables real-time rendering at up to 169 FPS, 800 times faster than OccNeRF. |
OccGaussian may struggle to fully recover regions occluded for extended periods due to weak supervision.
Reliance on accurate human poses and camera parameters can limit its performance on in-the-wild videos. |
human rendering, occlusion handling, 3d gaussian splatting, monocular video, real-time rendering |
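The K-nearest feature query mentioned in the OccGaussian record above can be sketched as follows. This is a minimal illustration, not the paper's code: the inverse-distance weighting, the function name `knn_feature_query`, and the toy coordinates are assumptions. The record only states that features from visible points are aggregated for occluded ones.

```python
import math

def knn_feature_query(query_point, visible_points, visible_features, k=3):
    """Aggregate features of the k visible points nearest to an occluded point."""
    # Euclidean distance from the occluded point to every visible point.
    dists = [math.dist(query_point, p) for p in visible_points]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]

    # Inverse-distance weights (an assumed aggregation rule).
    weights = [1.0 / (dists[i] + 1e-8) for i in nearest]
    total = sum(weights)

    dim = len(visible_features[0])
    return [
        sum(w * visible_features[i][d] for w, i in zip(weights, nearest)) / total
        for d in range(dim)
    ]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
features = [[1.0], [3.0], [100.0]]
print(knn_feature_query((0.5, 0.0, 0.0), points, features, k=2))
```

In the described pipeline the aggregated feature would then be passed to an MLP that predicts colour and opacity for the occluded point; that step is omitted here.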
2404.08312
Report |
GPN: Generative Point-based NeRF |
Haipeng Wang |
Scanning real-life scenes with modern registration devices typically gives
incomplete point cloud representations, primarily due to the limitations of
partial scanning, 3D occlusions, and dynamic light conditions. Recent works on
processing incomplete point clouds have always focused on point cloud
completion. However, these approaches do not ensure consistency between the
completed point cloud and the captured images regarding color and geometry. We
propose using Generative Point-based NeRF (GPN) to reconstruct and repair a
partial cloud by fully utilizing the scanning images and the corresponding
reconstructed cloud. The repaired point cloud can achieve multi-view
consistency with the captured images at high spatial resolution. For
fine-tuning on a single scene, we optimize the global latent condition by
incorporating an Auto-Decoder architecture while retaining multi-view
consistency. As a result, the generated point clouds are smooth, plausible, and
geometrically consistent with the partial scanning images. Extensive
experiments on ShapeNet demonstrate that our method achieves performance
competitive with other state-of-the-art point cloud-based neural scene
rendering and editing methods. |
This paper proposes GPN, a lightweight, generalizable point-based NeRF framework that reconstructs and repairs partial point clouds using scanning images and reconstructed clouds, ensuring multi-view consistency. |
Existing point cloud completion methods often lack consistency between the completed point cloud and captured images in terms of color and geometry. GPN addresses this limitation by leveraging both scanning images and point clouds. |
GPN uses a hypernetwork paradigm-based VAE architecture for generalization training and an auto-decoder-based fine-tuning strategy for per-scene optimization. It proposes two frameworks: "Generation Framework" for complete clouds and "Completion Framework" for repairing incomplete clouds. |
GPN achieves competitive performance on ShapeNet for point cloud rendering and editing.
The generated point clouds are smooth, plausible, and geometrically consistent with the input images.
GPN enables point cloud completion while maintaining multi-view consistency with the captured images. |
The current implementation of GPN requires further exploration to improve speed and accuracy using techniques like Gaussian splatting.
Future work can explore incorporating diffusion models for more diverse generation capabilities. |
point cloud, nerf, generative model, point cloud completion, multi-view consistency |
2404.08273
Report |
Struggle with Adversarial Defense? Try Diffusion |
Yujie Li, Yanbin Wang, Haitao Xu, Bin Liu, Jianguo Sun, Zhenhao Guo, Wenrui Ma |
Adversarial attacks induce misclassification by introducing subtle
perturbations. Recently, diffusion models have been applied to image classifiers
to improve adversarial robustness through adversarial training or by purifying
adversarial noise. However, diffusion-based adversarial training often
encounters convergence challenges and high computational expenses.
Additionally, diffusion-based purification inevitably causes data shift and is
deemed susceptible to stronger adaptive attacks. To tackle these issues, we
propose the Truth Maximization Diffusion Classifier (TMDC), a generative
Bayesian classifier that builds upon pre-trained diffusion models and
Bayes' theorem. Unlike data-driven classifiers, TMDC, guided by Bayesian
principles, utilizes the conditional likelihood from diffusion models to
determine the class probabilities of input images, thereby insulating against
the influences of data shift and the limitations of adversarial training.
Moreover, to enhance TMDC's resilience against more potent adversarial attacks,
we propose an optimization strategy for diffusion classifiers. This strategy
involves post-training the diffusion model on perturbed datasets with
ground-truth labels as conditions, guiding the diffusion model to learn the
data distribution and maximizing the likelihood under the ground-truth labels.
The proposed method achieves state-of-the-art performance on the CIFAR10
dataset against heavy white-box attacks and strong adaptive attacks.
Specifically, TMDC achieves robust accuracies of 82.81% against $l_{\infty}$
norm-bounded perturbations and 86.05% against $l_{2}$ norm-bounded
perturbations, respectively, with $\epsilon=0.05$. |
This paper proposes the Truth Maximization Diffusion Classifier (TMDC), a generative Bayesian classifier built on pre-trained diffusion models, to enhance adversarial robustness against image classification attacks. |
Existing defense strategies like adversarial training and image denoising are either computationally expensive, face convergence issues, or are susceptible to adaptive attacks. This highlights the need for a more robust and efficient defense mechanism. |
The authors leverage pre-trained diffusion models and Bayes' theorem to compute class probabilities, minimizing the influence of data shift and limitations of adversarial training. They further propose a Truth Maximization optimization strategy, training the diffusion model on perturbed datasets with ground-truth labels to maximize the likelihood under true labels. |
Diffusion Classifier demonstrates superior robustness against white-box attacks compared to traditional neural networks even without training.
Truth Maximization optimization significantly improves the adversarial robustness of the Diffusion Classifier, outperforming conventional adversarial training methods.
TMDC achieves state-of-the-art accuracy on CIFAR-10 against strong white-box and combined adaptive attacks (Auto Attack), reaching 82.81% and 86.05% accuracy for l-infinity and l2 norms, respectively, with epsilon=0.05. |
TMDC still requires training on adversarial samples, posing computational challenges.
Future work can explore decoupling training by optimizing the sampling strategy during inference to enhance both robustness and efficiency. |
diffusion models, adversarial robustness, generative classifier, adversarial attacks, truth maximization |
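The Bayesian classification step in the TMDC record above amounts to p(y | x) ∝ p(x | y) p(y). The sketch below shows only that normalization step. The per-class log-likelihood values are hand-picked placeholders; in a diffusion classifier they would be estimated from the conditional diffusion model, which is not modelled here. The function name `class_posterior` and the uniform default prior are assumptions.

```python
import math

def class_posterior(log_likelihoods, log_priors=None):
    """Turn per-class log p(x | y) estimates into posteriors p(y | x)."""
    n = len(log_likelihoods)
    if log_priors is None:
        # Assumed uniform prior over classes.
        log_priors = [math.log(1.0 / n)] * n

    # Bayes' rule in log space: log p(x | y) + log p(y), then normalize.
    logits = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder log-likelihoods for three classes.
posterior = class_posterior([-10.0, -12.0, -11.0])
predicted = max(range(len(posterior)), key=posterior.__getitem__)
print(posterior, predicted)
```

The Truth Maximization strategy described in the record would act upstream of this step, by post-training the diffusion model so that the likelihood under the ground-truth label is larger on perturbed inputs.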
2404.08252
Report |
MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance |
Yuqun Wu, Jae Yong Lee, Chuhang Zou, Shenlong Wang, Derek Hoiem |
The latest regularized Neural Radiance Field (NeRF) approaches produce poor
geometry and view extrapolation for multiview stereo (MVS) benchmarks such as
ETH3D. In this paper, we aim to create 3D models that provide accurate geometry
and view synthesis, partially closing the large geometric performance gap
between NeRF and traditional MVS methods. We propose a patch-based approach
that effectively leverages monocular surface normal and relative depth
predictions. The patch-based ray sampling also enables the appearance
regularization of normalized cross-correlation (NCC) and structural similarity
(SSIM) between randomly sampled virtual and training views. We further show
that "density restrictions" based on sparse structure-from-motion points can
help greatly improve geometric accuracy with a slight drop in novel view
synthesis metrics. Our experiments show 4x the performance of RegNeRF and 8x
that of FreeNeRF on average F1@2cm for the ETH3D MVS benchmark, suggesting a
fruitful research direction to improve the geometric accuracy of NeRF-based
models and shedding light on a potential future approach to enable NeRF-based
optimization to eventually outperform traditional MVS. |
Proposes MonoPatchNeRF, a patch-based regularized NeRF model that leverages monocular depth and normal predictions and virtual view appearance consistency priors for accurate 3D models from sparse views. |
NeRF struggles with accurate geometry and view extrapolation, especially in sparse view scenarios, while MVS methods, though better geometrically, often yield noisy and incomplete models with limited rendering capabilities. |
Employs patch-based ray sampling to effectively integrate monocular cues, utilizes NCC and SSIM losses for virtual view appearance consistency, and introduces density restrictions based on aligned sparse SfM points to refine geometry. |
Achieves 4x better geometric accuracy than RegNeRF and 8x better than FreeNeRF on ETH3D.
Outperforms other NeRF-based methods in novel view synthesis, ranking best in SSIM and LPIPS.
Demonstrates improved handling of challenging large-scale scenes, surpassing MonoSDF in TnT's advanced scenes. |
Geometric accuracy still falls short of MVS systems, even with MVS supervision.
The method is computationally slower than traditional MVS approaches. |
neural radiance fields, multi-view stereo, 3d reconstruction, monocular depth estimation, sparse view synthesis |
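The normalized cross-correlation (NCC) term used for appearance consistency in the MonoPatchNeRF record above can be sketched on flattened patches as follows. This shows the standard NCC definition only. How the paper samples virtual views and combines NCC with SSIM is not reproduced, and the loss form `1 - ncc` is an assumption.

```python
import math

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation of two equally sized, flattened patches."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    # Remove the mean so the score ignores brightness offsets.
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    num = sum(x * y for x, y in zip(da, db))
    # Dividing by the norms makes the score invariant to contrast scaling.
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db)) + eps
    return num / den

def ncc_loss(patch_a, patch_b):
    """Assumed loss form: small when patches are strongly correlated."""
    return 1.0 - ncc(patch_a, patch_b)

print(ncc([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
print(ncc_loss([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]))
```

Since NCC is computed over a neighbourhood of pixels, it needs rays sampled in patches rather than individually, which is consistent with the patch-based ray sampling the record describes.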
2404.08197
Report |
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies |
Zichao Li, Cihang Xie, Ekin Dogus Cubuk |
This paper investigates the performance of Contrastive Language-Image
Pre-training (CLIP) when scaled down to limited computation budgets. We explore
CLIP along three dimensions: data, architecture, and training strategies. With
regard to data, we demonstrate the significance of high-quality training data
and show that a smaller dataset of high-quality data can outperform a larger
dataset with lower quality. We also examine how model performance varies with
different dataset sizes, suggesting that smaller ViT models are better suited
for smaller datasets, while larger models perform better on larger datasets
with fixed compute. Additionally, we provide guidance on when to choose a
CNN-based architecture or a ViT-based architecture for CLIP training. We
compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data
Augmentation - and show that the choice of training strategy depends on the
available compute resource. Our analysis reveals that CLIP+Data Augmentation
can achieve comparable performance to CLIP using only half of the training
data. This work provides practical insights into how to effectively train and
deploy CLIP models, making them more accessible and affordable for practical
use in various applications. |
This paper provides a comprehensive study on scaling down Contrastive Language-Image Pre-training (CLIP) for limited computational budgets, focusing on data, architecture, and training strategies. |
The goal is to make CLIP models more accessible and affordable for practical use in various applications by providing insights into efficient training and deployment under resource constraints. |
The authors conduct experiments on the WebLI dataset, comparing different data sizes and qualities, various vision encoder architectures (ViT, CNN), and training strategies (SLIP, FLIP, CLIP, CLIP+Data Augmentation) while evaluating zero-shot, linear probing, and retrieval performances. |
High-quality data is crucial, as a smaller subset with higher quality can outperform a larger, lower-quality dataset.
The choice of vision encoder architecture depends on the dataset size and compute budget; CNNs can be advantageous for smaller datasets, while larger ViTs benefit from larger datasets.
Data augmentation techniques, particularly Stacked RandAugment, significantly improve CLIP performance with minimal computational overhead. |
The study primarily focuses on English language image-text pairs from the WebLI dataset, potentially limiting generalizability to other languages or domains.
Future work could explore other efficient architectures and self-supervised learning methods for further computational cost reduction. |
clip, contrastive learning, vision transformer, data augmentation, resource constraints |
2404.08187
Report |
Adapting CNNs for Fisheye Cameras without Retraining |
Ryan Griffiths, Donald G. Dansereau |
The majority of image processing approaches assume images are in or can be
rectified to a perspective projection. However, in many applications it is
beneficial to use non-conventional cameras, such as fisheye cameras, that have
a larger field of view (FOV). The issue arises that these large-FOV images
can't be rectified to a perspective projection without significant cropping of
the original image. To address this issue we propose Rectified Convolutions
(RectConv), a new approach for adapting pre-trained convolutional networks to
operate with new non-perspective images, without any retraining. Replacing the
convolutional layers of the network with RectConv layers allows the network to
see both rectified patches and the entire FOV. We demonstrate RectConv adapting
multiple pre-trained networks to perform segmentation and detection on fisheye
imagery from two publicly available datasets. Our approach requires no
additional data or training, and operates directly on the native image as
captured from the camera. We believe this work is a step toward adapting the
vast resources available for perspective images to operate across a broad range
of camera geometries. |
This paper proposes Rectified Convolutions (RectConv), a method for adapting pre-trained convolutional networks to operate with new non-perspective images without retraining. |
Adapting neural networks to new camera technologies typically requires gathering large datasets, even when the operating environment is the same. This work allows for the use of pre-trained networks on novel camera geometries without retraining or significant preprocessing. |
RectConv modifies convolutional layers to adapt kernel shape to local image geometry using camera calibration parameters. This allows for the processing of distorted images without the need for rectification. |
RectConv outperforms naive application of pre-trained networks and image rectification methods on fisheye imagery.
The method effectively adapts segmentation and detection networks trained on conventional imagery to work with fisheye images from the Woodscape and PIROPO datasets.
Converting only the backbone of the network to RectConv yields the most significant performance improvement. |
Bounding box conversion for object detection in RectConv networks requires further improvement.
Future work includes demonstrating RectConv on additional tasks and camera geometries, as well as expanding network conversion to handle more layer types (e.g., deconvolution). |
fisheye, convolutions, large-fov, cameras, deep learning |
2404.08181
Report |
Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation |
Sina Hajimiri, Ismail Ben Ayed, Jose Dolz |
Despite the significant progress in deep learning for dense visual
recognition problems, such as semantic segmentation, traditional methods are
constrained by fixed class sets. Meanwhile, vision-language foundation models,
such as CLIP, have showcased remarkable effectiveness in numerous zero-shot
image-level tasks, owing to their robust generalizability. Recently, a body of
work has investigated utilizing these models in open-vocabulary semantic
segmentation (OVSS). However, existing approaches often rely on impractical
supervised pre-training or access to additional pre-trained networks. In this
work, we propose a strong baseline for training-free OVSS, termed
Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of
CLIP tailored for this scenario. Our method enforces localization of patches in
the self-attention of CLIP's vision transformer which, despite being crucial
for dense prediction tasks, has been overlooked in the OVSS literature. By
incorporating design choices favouring segmentation, our approach significantly
improves performance without requiring additional data, auxiliary pre-trained
networks, or extensive hyperparameter tuning, making it highly practical for
real-world applications. Experiments are performed on 8 popular semantic
segmentation benchmarks, yielding state-of-the-art performance on most
scenarios. Our code is publicly available at https://github.com/sinahmr/NACLIP . |
This paper introduces NACLIP, a training-free open-vocabulary semantic segmentation method that enhances CLIP's localization capability for pixel-wise prediction by enforcing spatial consistency in attention maps within the visual encoder. |
Existing open-vocabulary semantic segmentation methods rely on impractical supervised pre-training or auxiliary pre-trained networks, limiting their real-world applicability. This work addresses the need for a more practical training-free approach. |
NACLIP removes the CLS token, modifies the self-attention module to incorporate spatial consistency using a Gaussian kernel, employs a key-based similarity measure, and simplifies the final encoder block architecture for better dense prediction. |
NACLIP achieves state-of-the-art performance on 7 out of 8 popular OVSS benchmarks without requiring additional data or fine-tuning.
It demonstrates robustness to different CLIP visual backbones.
Qualitative results highlight NACLIP's improved object boundary detection and contextual understanding compared to other methods. |
The study acknowledges the potential relevance of the CLS token for dense prediction and suggests further investigation.
Future work could explore incorporating additional cues or refining the model for improved performance on specific datasets. |
semantic segmentation, open-vocabulary, training-free, clip, vision transformer |
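The attention change described in the NACLIP record above (key-based similarity plus a Gaussian kernel that favours nearby patches) can be sketched for a tiny patch grid. This is an illustrative reading, not the released code: adding the Gaussian value directly to the pre-softmax logits, the single attention head, and the names `neighbour_aware_attention` and `sigma` are assumptions.

```python
import math

def neighbour_aware_attention(keys, h, w, sigma=1.0):
    """Attention over an h x w patch grid from key-key similarity
    plus a Gaussian bonus for spatially close patches."""
    d = len(keys[0])
    coords = [(i, j) for i in range(h) for j in range(w)]
    attn = []
    for p, kp in enumerate(keys):
        logits = []
        for q, kq in enumerate(keys):
            # Key-based similarity (keys compared with keys, not queries).
            sim = sum(x * y for x, y in zip(kp, kq)) / math.sqrt(d)
            # Gaussian kernel on squared grid distance between patches p and q.
            dist2 = (coords[p][0] - coords[q][0]) ** 2 + (coords[p][1] - coords[q][1]) ** 2
            logits.append(sim + math.exp(-dist2 / (2.0 * sigma ** 2)))
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

# Three identical keys on a 1 x 3 grid: only the spatial term differs.
rows = neighbour_aware_attention([[1.0, 0.0]] * 3, h=1, w=3, sigma=1.0)
print(rows[0])
```

With identical keys the similarity term is constant, so the printed row shows the spatial prior alone: each patch attends most to itself and progressively less to more distant patches.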
2404.08111
Report |
S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing |
Guangzhi Wang, Tianyi Chen, Kamran Ghasedi, HsiangTao Wu, Tianyu Ding, Chris Nuesmeyer, Ilya Zharkov, Mohan Kankanhalli, Luming Liang |
Face attribute editing plays a pivotal role in various applications. However,
existing methods encounter challenges in achieving high-quality results while
preserving identity, editing faithfulness, and temporal consistency. These
challenges are rooted in issues related to the training pipeline, including
limited supervision, architecture design, and optimization strategy. In this
work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training
framework for face video editing. S3Editor is a generic solution that
comprehensively addresses these challenges with three key contributions.
Firstly, S3Editor adopts a self-training paradigm to enhance the training
process through semi-supervision. Secondly, we propose a semantic disentangled
architecture with a dynamic routing mechanism that accommodates diverse editing
requirements. Thirdly, we present a structured sparse optimization schema that
identifies and deactivates malicious neurons to further disentangle impacts
from non-target attributes. S3Editor is model-agnostic and compatible with
various editing approaches. Our extensive qualitative and quantitative results
affirm that our approach significantly enhances identity preservation, editing
fidelity, as well as temporal consistency. |
This paper presents S3Editor, a novel Sparse Semantic-disentangled Self-training framework for improving existing face video editing approaches. |
Current face video editing methods struggle to balance high-quality results with identity preservation, editing faithfulness, and temporal consistency due to limitations in training data, architecture, and optimization strategies. |
S3Editor utilizes a self-training paradigm with pseudo-edited data, a semantic disentangled architecture for diverse edits, and a structured sparse learning schema to deactivate irrelevant neurons and minimize over-editing. |
S3Editor significantly enhances identity preservation and editing faithfulness compared to existing methods.
The framework improves temporal consistency across video frames, even without explicit temporal constraints.
The semantic disentanglement and sparse learning strategies allow for localized edits, minimizing unwanted changes to unrelated facial features. |
The current implementation requires a predefined set of attributes for clustering, potentially limiting its generalization to entirely novel edits.
Future work could explore alternative neuron grouping strategies beyond landmark-based partitioning for sparse learning. |
face video editing, self-training, semantic disentanglement, sparse learning, temporal consistency |
2404.08031
Report |
Latent Guard: a Safety Framework for Text-to-image Generation |
Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati |
With the ability to generate high-quality images, text-to-image (T2I) models
can be exploited for creating inappropriate content. To prevent misuse,
existing safety measures are either based on text blacklists, which can be
easily circumvented, or harmful content classification, requiring large
datasets for training and offering low flexibility. Hence, we propose Latent
Guard, a framework designed to improve safety measures in text-to-image
generation. Inspired by blacklist-based approaches, Latent Guard learns a
latent space on top of the T2I model's text encoder, where it is possible to
check the presence of harmful concepts in the input text embeddings. Our
proposed framework is composed of a data generation pipeline specific to the
task using large language models, ad-hoc architectural components, and a
contrastive learning strategy to benefit from the generated data. The
effectiveness of our method is verified on three datasets and against four
baselines. Code and data will be shared at
https://github.com/rt219/LatentGuard. |
Introduces Latent Guard, a framework for improving safety measures in text-to-image generation by detecting blacklisted concepts in the latent space of input text embeddings. |
Existing safety measures like text blacklists are easily circumvented, while harmful content classifiers require large datasets and lack flexibility. |
Uses contrastive learning to train an Embedding Mapping Layer on top of pretrained text encoders. This layer maps embeddings of blacklisted concepts and prompts containing them closer together in a latent space. |
Outperforms baselines like Text Blacklists, CLIPScore, BERTScore, and LLM-based classifiers in detecting unsafe prompts.
Demonstrates robustness against adversarial attacks targeting the text encoder.
Generalizes well to unseen datasets and concepts, allowing for flexible blacklist modifications at test time. |
Performance heavily relies on the comprehensiveness of the blacklisted concepts.
LLM-generated training data may not fully represent real-world input distributions. |
text-to-image generation, safety, contrastive learning, latent space, adversarial attacks |
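The test-time check implied by the Latent Guard record above can be sketched as a similarity lookup in the learned latent space. The vectors below are toy placeholders standing in for outputs of the trained mapping layer. The cosine score, the threshold value, and the names `is_unsafe` and `concept_embs` are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def is_unsafe(prompt_emb, concept_embs, threshold=0.5):
    """Flag a prompt whose mapped embedding lies close to any blacklisted concept.

    Returns (flagged, closest_concept, score).
    """
    if not concept_embs:
        return False, None, 0.0
    scores = {name: cosine(prompt_emb, emb) for name, emb in concept_embs.items()}
    closest = max(scores, key=scores.get)
    return scores[closest] >= threshold, closest, scores[closest]

# Toy 2-D embeddings; real ones would come from the text encoder plus mapping layer.
blacklist = {"concept_a": [1.0, 0.0], "concept_b": [0.0, 1.0]}
print(is_unsafe([0.9, 0.1], blacklist))
print(is_unsafe([-1.0, -1.0], blacklist))
```

In this sketch, changing the blacklist at test time only means adding or removing an entry in `concept_embs`, which matches the flexibility the record attributes to the method.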
2404.08030
Report |
Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models |
Mazda Moayeri, Samyadeep Basu, Sriram Balasubramanian, Priyatham Kattakinda, Atoosa Chengini, Robert Brauneis, Soheil Feizi |
Recent text-to-image generative models such as Stable Diffusion are extremely
adept at mimicking and generating copyrighted content, raising concerns amongst
artists that their unique styles may be improperly copied. Understanding how
generative models copy "artistic style" is more complex than duplicating a
single image, as style is composed of a set of elements (or signature) that
frequently co-occurs across a body of work, where each individual work may vary
significantly. In our paper, we first reformulate the problem of "artistic
copyright infringement" to a classification problem over image sets, instead of
probing image-wise similarities. We then introduce ArtSavant, a practical
(i.e., efficient and easy to understand) tool to (i) determine the unique style
of an artist by comparing it to a reference dataset of works from 372 artists
curated from WikiArt, and (ii) recognize if the identified style reappears in
generated images. We leverage two complementary methods to perform artistic
style classification over image sets, including TagMatch, which is a novel
inherently interpretable and attributable method, making it more suitable for
broader use by non-technical stakeholders (artists, lawyers, judges, etc.).
Leveraging ArtSavant, we then perform a large-scale empirical study to provide
quantitative insight on the prevalence of artistic style copying across 3
popular text-to-image generative models. Namely, amongst a dataset of prolific
artists (including many famous ones), only 20% appear to have their
styles at risk of being copied via simple prompting of today's popular
text-to-image generative models. |
This paper introduces ArtSavant, a tool designed to detect and articulate potential artistic style copying by text-to-image generative models. |
The rise of AI models capable of mimicking artistic styles raises copyright concerns for artists. This work addresses the need for a practical and interpretable tool to identify and analyze potential style infringements. |
The authors curate a dataset of artworks from 372 prolific artists and develop two complementary methods: DeepMatch (a black-box neural network classifier) and TagMatch (an interpretable tag-based classifier using CLIP and a novel tag composition method). They apply these methods to generated images from popular text-to-image models, analyzing match rates and confidences. |
DeepMatch achieves 89.3% accuracy on real art, indicating the existence of unique artistic styles for most artists.
Analysis of generated images reveals that only about 20% of the artists studied are at high risk of style copying by current generative models using simple prompting.
TagMatch provides interpretable and attributable evidence of style copying by identifying shared tag signatures between generated images and reference artists. |
The study's scope is limited to 372 artists, which may not fully represent the vast diversity of artistic styles.
The atomic tagging method, while precise, relies on CLIP and may not capture all nuances of artistic style. |
artistic style copying, copyright infringement, text-to-image generation, deep learning, interpretability |
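The set-level formulation in this entry (deciding on a whole set of images rather than on pairwise image similarity) can be sketched as follows. The aggregation rule (a majority vote whose vote share serves as a confidence value) and the name `aggregate_set_prediction` are assumptions made for illustration; the entry does not state how ArtSavant combines per-image outputs.

```python
from collections import Counter


def aggregate_set_prediction(per_image_labels):
    """Aggregate per-image artist predictions into one set-level prediction.

    Returns (label, confidence), where confidence is the share of images
    that voted for the winning label. Majority voting is an assumption.
    """
    if not per_image_labels:
        raise ValueError("need at least one per-image prediction")
    counts = Counter(per_image_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(per_image_labels)


# Hypothetical per-image outputs of some style classifier for one image set.
predictions = ["artist_a", "artist_a", "artist_b", "artist_a"]
label, confidence = aggregate_set_prediction(predictions)
print(label, confidence)
```

A set of generated images would count as matching an artist only if the winning label is that artist and the confidence clears a threshold chosen by the user.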
2404.07993
Report |
Connecting NeRFs, Images, and Text |
Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano |
Neural Radiance Fields (NeRFs) have emerged as a standard framework for
representing 3D scenes and objects, introducing a novel data type for
information exchange and storage. Concurrently, significant progress has been
made in multimodal representation learning for text and image data. This paper
explores a novel research direction that aims to connect the NeRF modality with
other modalities, similar to established methodologies for images and text. To
this end, we propose a simple framework that exploits pre-trained models for
NeRF representations alongside multimodal models for text and image processing.
Our framework learns a bidirectional mapping between NeRF embeddings and those
obtained from corresponding images and text. This mapping unlocks several novel
and useful applications, including NeRF zero-shot classification and NeRF
retrieval from images or text. |
This paper proposes a novel framework to connect Neural Radiance Fields (NeRFs) with other modalities like images and text, enabling applications like zero-shot NeRF classification and NeRF retrieval. |
As NeRFs become a standard for 3D scene representation, connecting them with existing modalities (like text and images) unlocks new possibilities for information exchange, storage, and multimodal applications. |
The framework leverages pre-trained models like CLIP and NF2Vec to learn bidirectional mapping between NeRF embeddings and embeddings from corresponding images and text using two simple MLPs. |
The framework enables zero-shot NeRF classification with accuracy comparable to methods relying on rendered images, but without rendering a single pixel.
It allows retrieval of NeRFs from both image and text queries, achieving competitive performance compared to baselines.
An adaptation technique using ControlNet is proposed to improve NeRF retrieval from real-world images. |
The current work is limited to synthetic objects due to reliance on NF2Vec trained on ShapeNet.
NeRF generation is constrained by the capabilities of the NF2Vec decoder. |
neural radiance fields, nerf, multimodal learning, vision-language models, zero-shot classification |
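A minimal sketch of how zero-shot NeRF classification could work once a NeRF embedding has been mapped into the text/image embedding space. The mapper `nerf_to_clip` stands in for the learned MLP and is hypothetical, as are the toy two-dimensional embeddings; real embeddings would come from the pre-trained NeRF and vision-language encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(nerf_embedding, class_text_embeddings, nerf_to_clip):
    """Pick the class whose text embedding is closest to the mapped NeRF embedding.

    nerf_to_clip is a hypothetical callable standing in for the learned mapping.
    """
    mapped = nerf_to_clip(nerf_embedding)
    scores = {
        name: cosine_similarity(mapped, emb)
        for name, emb in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: an identity mapper and made-up 2D embeddings.
classes = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
best, scores = zero_shot_classify([0.9, 0.1], classes, lambda e: e)
print(best)
```

No image is rendered at any point: the decision uses only the NeRF embedding and the class text embeddings.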
2404.07991
Report |
GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh |
Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G. Schwing, Shenlong Wang |
We introduce GoMAvatar, a novel approach for real-time, memory-efficient,
high-quality animatable human modeling. GoMAvatar takes as input a single
monocular video to create a digital avatar capable of re-articulation in new
poses and real-time rendering from novel viewpoints, while seamlessly
integrating with rasterization-based graphics pipelines. Central to our method
is the Gaussians-on-Mesh representation, a hybrid 3D model combining rendering
quality and speed of Gaussian splatting with geometry modeling and
compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap data and
various YouTube videos. GoMAvatar matches or surpasses current monocular human
modeling algorithms in rendering quality and significantly outperforms them in
computational efficiency (43 FPS) while being memory-efficient (3.63 MB per
subject). |
Introduces GoMAvatar, a novel framework for real-time, memory-efficient, high-quality animatable human modeling from a single monocular video. |
High-fidelity, animatable digital avatars are crucial for various applications, but conventional methods are slow, expensive, and cumbersome. Affordable methods using only monocular RGB videos are highly desirable. |
Presents the Gaussians-on-Mesh (GoM) representation, combining rendering quality and speed of Gaussian splatting with the geometry modeling and compatibility of deformable meshes. Leverages Gaussian splats for rendering and a skeleton-driven deformable mesh for articulation. Employs a differentiable shading module to handle view dependency. |
GoMAvatar matches or surpasses state-of-the-art monocular human modeling algorithms in rendering quality.
It significantly outperforms competitors in computational efficiency, achieving a rendering speed of 43 FPS on an NVIDIA A100 GPU.
GoMAvatar is memory-efficient, requiring only 3.63 MB per subject. |
Limited ability to hallucinate unseen regions.
Challenges in handling significant topology changes, such as dynamically moving clothing parts. |
human modeling, animatable avatars, monocular video, gaussians-on-mesh, real-time rendering |
2404.07990
Report |
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models |
Moreno D'Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, Nicu Sebe |
Text-to-image generative models are becoming increasingly popular and
accessible to the general public. As these models see large-scale deployments,
it is necessary to deeply investigate their safety and fairness to not
disseminate and perpetuate any kind of biases. However, existing works focus on
detecting closed sets of biases defined a priori, limiting the studies to
well-known concepts. In this paper, we tackle the challenge of open-set bias
detection in text-to-image generative models presenting OpenBias, a new
pipeline that identifies and quantifies the severity of biases agnostically,
without access to any precompiled set. OpenBias has three stages. In the first
phase, we leverage a Large Language Model (LLM) to propose biases given a set
of captions. Secondly, the target generative model produces images using the
same set of captions. Lastly, a Vision Question Answering model recognizes the
presence and extent of the previously proposed biases. We study the behavior of
Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated
before. Via quantitative experiments, we demonstrate that OpenBias agrees with
current closed-set bias detection methods and human judgement. |
Proposes OpenBias, the first open-set bias detection pipeline for text-to-image generative models that identifies, recognizes, and quantifies biases without predefined categories. |
Existing bias detection methods rely on pre-defined bias categories, limiting their scope and ability to uncover novel biases, which is crucial as AI-generated content becomes increasingly prevalent. |
OpenBias leverages a Large Language Model (LLM) to propose potential biases and generate related questions from a dataset of captions. Then, a Vision Question Answering (VQA) model assesses the presence and severity of those biases in images generated by the target generative model. |
OpenBias successfully identifies both well-known and previously unexplored biases across different versions of Stable Diffusion.
The pipeline demonstrates a strong agreement with FairFace, a classifier trained for fair predictions, and aligns well with human judgment in a user study.
The context-aware analysis highlights the influence of caption elements on bias perpetuation, revealing varying bias intensity depending on the context. |
OpenBias relies on the performance of the underlying LLM and VQA models, inheriting their limitations and potential biases.
The study primarily focuses on qualitative analysis of context-aware biases, leaving room for more systematic quantitative investigation in future work. |
bias detection, text-to-image generation, open-set recognition, large language models, vision question answering |
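To make the "quantifies the severity" step concrete, here is one possible severity score computed from VQA answers over a set of generated images. The normalized-entropy formula and the function name `bias_severity` are assumptions for illustration; this entry does not specify the exact metric OpenBias uses.

```python
import math
from collections import Counter


def bias_severity(vqa_answers, classes):
    """Return a score in [0, 1]: 0 for a uniform answer distribution,
    1 when every answer falls into a single class.

    Computed as 1 - H(p) / log(K), with H the entropy of the empirical
    class distribution and K the number of classes (an assumed metric).
    """
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    counts = Counter(vqa_answers)
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no answers fall into the given classes")
    entropy = 0.0
    for c in classes:
        p = counts[c] / total
        if p > 0:
            entropy -= p * math.log(p)
    return 1.0 - entropy / math.log(len(classes))


# Hypothetical VQA answers to one LLM-proposed question.
answers = ["male", "male", "male", "female"]
print(bias_severity(answers, ["male", "female"]))
```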
2404.07987
Report |
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback |
Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, Chen Chen |
To enhance the controllability of text-to-image diffusion models, existing
efforts like ControlNet incorporated image-based conditional controls. In this
paper, we reveal that existing methods still face significant challenges in
generating images that align with the image conditional controls. To this end,
we propose ControlNet++, a novel approach that improves controllable generation
by explicitly optimizing pixel-level cycle consistency between generated images
and conditional controls. Specifically, for an input conditional control, we
use a pre-trained discriminative reward model to extract the corresponding
condition of the generated images, and then optimize the consistency loss
between the input conditional control and extracted condition. A
straightforward implementation would be generating images from random noises
and then calculating the consistency loss, but such an approach requires
storing gradients for multiple sampling timesteps, leading to considerable time
and memory costs. To address this, we introduce an efficient reward strategy
that deliberately disturbs the input images by adding noise, and then uses the
single-step denoised images for reward fine-tuning. This avoids the extensive
costs associated with image sampling, allowing for more efficient reward
fine-tuning. Extensive experiments show that ControlNet++ significantly
improves controllability under various conditional controls. For example, it
achieves improvements over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE,
respectively, for segmentation mask, line-art edge, and depth conditions. |
ControlNet++ improves the controllability of text-to-image diffusion models by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls using a pre-trained discriminative reward model. |
Existing controllable generation methods struggle to accurately align generated images with input image conditions, hindering precise and fine-grained control. |
The method disrupts the consistency between training images and conditions by adding noise. Then, it uses single-step denoised images for efficient reward fine-tuning, optimizing the consistency between input and predicted conditions (e.g., segmentation masks). |
ControlNet++ significantly outperforms existing methods in terms of controllability across various conditional controls (e.g., segmentation masks, depth maps, edges).
It achieves this without compromising image quality, as evidenced by FID scores comparable or superior to baselines.
Images generated by ControlNet++ are effective for downstream tasks, demonstrated by improved performance when used to train a segmentation model. |
The method's current focus is primarily on controllability, with future work aiming to incorporate quality and aesthetics through human feedback.
Expanding the range of controllable conditions (e.g., human pose, scribbles) is another avenue for future development. |
controllable generation, diffusion model, controlnet, consistency feedback, reward fine-tuning |
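The noise-then-denoise step behind the efficient reward strategy can be sketched under the standard DDPM parameterization, where `alpha_bar_t` is the cumulative noise-schedule product at timestep t. Images are flattened to lists of floats, and the function names are illustrative; the reward model and the consistency loss are left out.

```python
import math


def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]


def single_step_denoise(x_t, predicted_noise, alpha_bar_t):
    """One-step estimate of the clean image from x_t and a noise prediction.

    Inverts the forward equation above; no multi-step sampling chain is run.
    """
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, predicted_noise)]


# Toy check: with a perfect noise prediction the clean signal is recovered.
x0 = [0.2, -0.5, 1.0]
noise = [0.1, 0.3, -0.7]
x_t = add_noise(x0, noise, alpha_bar_t=0.8)
x0_hat = single_step_denoise(x_t, noise, alpha_bar_t=0.8)
print(x0_hat)
```

In training, `predicted_noise` would come from the diffusion model, and the estimate would be passed to the reward model to extract a condition for the consistency loss.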
2404.07973
Report |
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models |
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang |
While Ferret seamlessly integrates regional understanding into the Large
Language Model (LLM) to facilitate its referring and grounding capability, it
has certain limitations: it is constrained by the pre-trained fixed visual encoder
and fails to perform well on broader tasks. In this work, we unveil Ferret-v2,
a significant upgrade to Ferret, with three key designs. (1) Any resolution
grounding and referring: A flexible approach that effortlessly handles higher
image resolution, improving the model's ability to process and understand
images in greater detail. (2) Multi-granularity visual encoding: By integrating
the additional DINOv2 encoder, the model learns better and more diverse underlying
contexts for global and fine-grained visual information. (3) A three-stage
training paradigm: Besides image-caption alignment, an additional stage is
proposed for high-resolution dense alignment before the final instruction
tuning. Experiments show that Ferret-v2 provides substantial improvements over
Ferret and other state-of-the-art methods, thanks to its high-resolution
scaling and fine-grained visual processing. |
Introduces Ferret-v2, an upgraded version of the Ferret model for multimodal understanding, featuring enhanced capabilities for handling referring and grounding at any resolution. |
Addresses the limitations of existing MLLMs in handling high-resolution images and fine-grained visual details for tasks involving regional understanding, such as referring and grounding. |
Employs a multi-granularity visual encoding strategy using CLIP for global context and DINOv2 for local details, along with a three-stage training paradigm (image-caption alignment, high-resolution dense alignment, intent-enhanced instruction tuning) to bridge global and local visual understanding. |
Achieves significant performance improvements over the original Ferret and other state-of-the-art models in tasks like Referring Object Classification (ROC) and Referring Expression Comprehension (REC).
Demonstrates enhanced ability to handle higher image resolutions, leading to improved accuracy in identifying small objects and details.
Exhibits competitive performance on modern MLLM benchmarks by incorporating task-specific datasets and a strategic prompting approach to bridge the gap between regional and global understanding. |
Potential for generating harmful or counterfactual responses, a common limitation in MLLMs.
Limited exploration of different vision encoders for multi-granularity visual encoding. |
multimodal learning, large language models, referring and grounding, high-resolution image understanding, multi-granularity visual encoding |
2404.07949
Report |
Taming Stable Diffusion for Text to 360° Panorama Image Generation |
Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xiaoshui Huang, Dinh Phung, Wanli Ouyang, Jianfei Cai |
Generative models, e.g., Stable Diffusion, have enabled the creation of
photorealistic images from text prompts. Yet, the generation of 360-degree
panorama images from text remains a challenge, particularly due to the dearth
of paired text-panorama data and the domain gap between panorama and
perspective images. In this paper, we introduce a novel dual-branch diffusion
model named PanFusion to generate a 360-degree image from a text prompt. We
leverage the stable diffusion model as one branch to provide prior knowledge in
natural image generation and register it to another panorama branch for
holistic image generation. We propose a unique cross-attention mechanism with
projection awareness to minimize distortion during the collaborative denoising
process. Our experiments validate that PanFusion surpasses existing methods
and, thanks to its dual-branch structure, can integrate additional constraints
like room layout for customized panorama outputs. Code is available at
https://chengzhag.github.io/publication/panfusion. |
Introduces PanFusion, a novel dual-branch diffusion model for generating high-quality, consistent 360° panoramas from text prompts. |
Addresses the limitations of existing text-to-panorama generation methods, which struggle with issues like loop closure, repetitive elements, and visual inconsistency. |
Leverages a panorama branch for global layout guidance and a perspective branch to exploit Stable Diffusion's strengths in perspective image generation. Employs an Equirectangular-Perspective Projection Attention (EPPA) mechanism to ensure consistency between the branches. |
Outperforms state-of-the-art methods in terms of realism and consistency in both panorama and perspective views.
Effectively integrates room layout as an additional condition for customized panorama generation.
Demonstrates strong generalization ability to out-of-domain prompts, including outdoor scenes. |
Higher computational complexity due to the dual-branch architecture.
Occasional failure to generate entrances for indoor scenes. |
panorama generation, text-to-image synthesis, diffusion models, equirectangular projection, layout-conditioned generation |
2404.07933
Report |
Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation |
Keonhee Han, Dominik Muhle, Felix Wimbauer, Daniel Cremers |
Inferring scene geometry from images via Structure from Motion is a
long-standing and fundamental problem in computer vision. While classical
approaches and, more recently, depth map predictions only focus on the visible
parts of a scene, the task of scene completion aims to reason about geometry
even in occluded regions. With the popularity of neural radiance fields
(NeRFs), implicit representations also became popular for scene completion by
predicting so-called density fields. Unlike explicit approaches, e.g.,
voxel-based methods, density fields also allow for accurate depth prediction
and novel-view synthesis via image-based rendering. In this work, we propose to
fuse the scene reconstruction from multiple images and distill this knowledge
into a more accurate single-view scene reconstruction. To this end, we propose
Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed
images, trained fully self-supervised only from image data. Using knowledge
distillation, we use MVBTS to train a single-view scene completion network via
direct supervision called KDBTS. It achieves state-of-the-art performance on
occupancy prediction, especially in occluded regions. |
This paper presents a novel method for improving single-view 3D scene completion by leveraging information from multiple views. |
Accurate 3D scene understanding is essential for robotics and autonomous driving, and single-view methods often struggle with occlusions. |
The authors first train a multi-view density field reconstruction model (MVBTS) in a self-supervised manner. Then, they use knowledge distillation to train a single-view model (KDBTS) supervised by the MVBTS predictions. |
MVBTS effectively fuses density information from multiple views, leading to accurate scene reconstructions.
KDBTS achieves state-of-the-art occupancy prediction on the KITTI-360 benchmark, outperforming previous single-view methods.
Knowledge distillation from multi-view predictions provides a strong supervisory signal for single-view scene completion. |
The method assumes a static scene, limiting its performance in dynamic environments.
Future work can focus on modeling dynamic objects and view-dependent effects to further enhance reconstruction accuracy. |
scene completion, density fields, knowledge distillation, multi-view learning, self-supervised learning |
2404.07850
Report |
MindBridge: A Cross-Subject Brain Decoding Framework |
Shizun Wang, Songhua Liu, Zhenxiong Tan, Xinchao Wang |
Brain decoding, a pivotal field in neuroscience, aims to reconstruct stimuli
from acquired brain signals, primarily utilizing functional magnetic resonance
imaging (fMRI). Currently, brain decoding is confined to a
per-subject-per-model paradigm, limiting its applicability to the same
individual for whom the decoding model is trained. This constraint stems from
three key challenges: 1) the inherent variability in input dimensions across
subjects due to differences in brain size; 2) the unique intrinsic neural
patterns, influencing how different individuals perceive and process sensory
information; 3) limited data availability for new subjects in real-world
scenarios hampers the performance of decoding models. In this paper, we present
a novel approach, MindBridge, that achieves cross-subject brain decoding by
employing only one model. Our proposed framework establishes a generic paradigm
capable of addressing these challenges by introducing a biologically inspired
aggregation function and a novel cyclic fMRI reconstruction mechanism for
subject-invariant representation learning. Notably, by cycle reconstruction of
fMRI, MindBridge can enable novel fMRI synthesis, which also can serve as
pseudo data augmentation. Within the framework, we also devise a novel
reset-tuning method for adapting a pretrained model to a new subject.
Experimental results demonstrate MindBridge's ability to reconstruct images for
multiple subjects, which is competitive with dedicated subject-specific models.
Furthermore, with limited data for a new subject, we achieve a high level of
decoding accuracy, surpassing that of subject-specific models. This advancement
in cross-subject brain decoding suggests promising directions for wider
applications in neuroscience and indicates potential for more efficient
utilization of limited fMRI data in real-world scenarios. Project page:
https://littlepure2333.github.io/MindBridge |
This paper proposes MindBridge, a novel framework for cross-subject brain decoding using fMRI, overcoming the limitations of subject-specific models by learning subject-invariant representations. |
Current brain decoding requires training one model per subject, limiting its applicability. MindBridge allows a single model to decode brain signals from multiple subjects, enabling broader applications in neuroscience and efficient use of limited fMRI data. |
MindBridge utilizes an adaptive signal aggregation function to unify fMRI signal sizes across subjects and a cyclic fMRI reconstruction mechanism for subject-invariant representation learning. It also introduces a reset-tuning strategy for adapting to new subjects with limited data. |
MindBridge achieves comparable brain decoding performance to state-of-the-art subject-specific methods using only one model.
It effectively adapts to new subjects with limited data, surpassing methods trained from scratch.
MindBridge enables novel fMRI synthesis, transforming one subject's fMRI signal into another's while preserving semantic meaning. |
Evaluation is limited to the NSD dataset due to the scarcity of high-quality fMRI data.
The serialization of fMRI signals as 1D vectors might disrupt the original spatial relationships. |
brain decoding, fmri, cross-subject learning, diffusion models, neuroscience |
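The idea of unifying fMRI signal sizes across subjects can be illustrated with adaptive pooling over a 1D signal. Using max pooling as the aggregation, and the function name `adaptive_max_pool_1d`, are assumptions for this sketch; the entry only says that an adaptive aggregation function is used.

```python
def adaptive_max_pool_1d(signal, output_size):
    """Pool a 1D signal of any length down (or up) to exactly output_size values.

    Bin i covers indices [floor(i*n/out), ceil((i+1)*n/out)), so bins tile
    the whole signal and are never empty.
    """
    n = len(signal)
    if n == 0 or output_size <= 0:
        raise ValueError("signal must be non-empty and output_size positive")
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = -(-((i + 1) * n) // output_size)  # ceiling division
        pooled.append(max(signal[start:end]))
    return pooled


# Two hypothetical subjects with different voxel counts map to the same size.
subject_a = [1, 5, 2, 8, 3, 7]
subject_b = [4, 1, 9, 2, 6]
print(adaptive_max_pool_1d(subject_a, 3))
print(adaptive_max_pool_1d(subject_b, 3))
```

Because every subject's signal ends up with the same length, a single shared network can consume inputs from all subjects.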
2404.07794
Report |
DGMamba: Domain Generalization via Generalized State Space Model |
Shaocong Long, Qianyu Zhou, Xiangtai Li, Xuequan Lu, Chenhao Ying, Yuan Luo, Lizhuang Ma, Shuicheng Yan |
Domain generalization (DG) aims at solving distribution shift problems in
various scenes. Existing approaches are based on Convolutional Neural Networks
(CNNs) or Vision Transformers (ViTs), which suffer from limited receptive
fields or quadratic complexity. Mamba, as an emerging state space
model (SSM), possesses superior linear complexity and global receptive fields.
Despite this, it can hardly be applied to DG to address distribution shifts,
due to the hidden state issues and inappropriate scan mechanisms. In this
paper, we propose a novel framework for DG, named DGMamba, that generalizes
strongly to unseen domains while retaining the advantages
of global receptive fields and efficient linear complexity. Our DGMamba
comprises two core components: Hidden State Suppressing (HSS) and
Semantic-aware Patch Refining (SPR). In particular, HSS is introduced to
mitigate the influence of hidden states associated with domain-specific
features during output prediction. SPR strives to encourage the model to
concentrate more on objects rather than context, consisting of two designs:
Prior-Free Scanning (PFS) and Domain Context Interchange (DCI). Concretely,
PFS aims to shuffle the non-semantic patches within images, creating more
flexible and effective sequences from images, and DCI is designed to regularize
Mamba with the combination of mismatched non-semantic and semantic information
by fusing patches among domains. Extensive experiments on four commonly used DG
benchmarks demonstrate that the proposed DGMamba achieves remarkably superior
results to state-of-the-art models. The code will be made publicly available. |
This paper introduces DGMamba, a novel state space model-based framework for domain generalization, aiming to improve the generalizability of models like Mamba on unseen domains while preserving their global receptive fields and linear complexity advantages. |
Existing CNN- or ViT-based domain generalization methods suffer from limitations such as local receptive fields (CNNs) or quadratic complexities (ViTs). Mamba, as a state space model, holds promise but lacks inherent mechanisms to handle domain shifts effectively. |
DGMamba tackles these issues with two core components: 1) Hidden State Suppressing (HSS) mitigates the impact of domain-specific information accumulated in hidden states during propagation. 2) Semantic-aware Patch Refining (SPR), comprising Prior-Free Scanning (PFS) and Domain Context Interchange (DCI), encourages the model to focus on objects rather than domain-specific context by shuffling and interchanging non-semantic patches. |
DGMamba significantly outperforms state-of-the-art domain generalization methods on five benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet).
Ablation studies confirm that HSS, PFS, and DCI all contribute to the performance improvement.
DGMamba achieves superior generalization performance with fewer parameters and lower computational complexity compared to CNN- and ViT-based counterparts. |
Exploration of feature/domain prompts in SSM-based models for enhanced representation learning.
Extension of DGMamba to high-structure tasks like domain-generalized semantic segmentation and object detection. |
domain generalization, state space model, mamba, hidden state suppressing, semantic-aware patch refining |
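The patch-shuffling idea behind Prior-Free Scanning can be sketched as shuffling only the non-semantic (context) patches of a patch sequence while leaving semantic patches in place. How the semantic mask is obtained is outside this sketch, and the names `shuffle_non_semantic` and `is_semantic` are hypothetical.

```python
import random


def shuffle_non_semantic(patches, is_semantic, rng=None):
    """Return a new patch sequence where non-semantic patches are permuted
    among their own positions and semantic patches stay where they are.
    """
    if len(patches) != len(is_semantic):
        raise ValueError("patches and is_semantic must have the same length")
    rng = rng or random.Random()
    background_positions = [i for i, s in enumerate(is_semantic) if not s]
    background_patches = [patches[i] for i in background_positions]
    rng.shuffle(background_patches)
    result = list(patches)
    for pos, patch in zip(background_positions, background_patches):
        result[pos] = patch
    return result


# Toy sequence: strings stand in for patch tensors.
patches = ["bg0", "obj0", "bg1", "obj1", "bg2"]
mask = [False, True, False, True, False]
print(shuffle_non_semantic(patches, mask, rng=random.Random(0)))
```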
2404.07771
Report |
An Overview of Diffusion Models: Applications, Guided Generation, Statistical Rates and Optimization |
Minshuo Chen, Song Mei, Jianqing Fan, Mengdi Wang |
Diffusion models, a powerful and universal generative AI technology, have
achieved tremendous success in computer vision, audio, reinforcement learning,
and computational biology. In these applications, diffusion models provide
flexible high-dimensional data modeling, and act as a sampler for generating
new samples under active guidance towards task-desired properties. Despite the
significant empirical success, the theory of diffusion models is very limited,
potentially slowing down principled methodological innovations for further
harnessing and improving diffusion models. In this paper, we review emerging
applications of diffusion models, understanding their sample generation under
various controls. Next, we overview the existing theories of diffusion models,
covering their statistical properties and sampling capabilities. We adopt a
progressive routine, beginning with unconditional diffusion models and
connecting to conditional counterparts. Further, we review a new avenue in
high-dimensional structured optimization through conditional diffusion models,
where searching for solutions is reformulated as a conditional sampling problem
and solved by diffusion models. Lastly, we discuss future directions about
diffusion models. The purpose of this paper is to provide a well-rounded
theoretical exposure for stimulating forward-looking theories and methods of
diffusion models. |
This paper reviews the theory and applications of diffusion models, a powerful class of generative AI models, focusing on their ability to learn data distributions and generate new samples under various controls. |
Despite the significant empirical success of diffusion models, their theoretical understanding lags behind, potentially hindering further methodological innovations. |
The paper reviews existing theoretical results on diffusion models, covering score function approximation and estimation, sampling guarantees, and distribution learning. It adopts a progressive approach, starting with unconditional models and extending to conditional ones. |
Diffusion models can efficiently learn complex data distributions, achieving minimax optimal rates for distribution estimation.
The sample complexity of score estimation in diffusion models can be significantly reduced when data lie on a low-dimensional subspace, circumventing the curse of dimensionality.
Conditional diffusion models can be used for black-box optimization by formulating it as a conditional sampling problem, generating high-fidelity solutions that optimize a reward function while preserving data latent structures. |
Theoretical understanding of conditional diffusion models, especially regarding guidance design and adaptation to specific tasks, remains limited.
Principled methodologies for tuning the strength of guidance in conditional diffusion models are still lacking. |
diffusion models, generative ai, score matching, sample complexity, black-box optimization |
2404.07724
Report |
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models |
Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen |
Guidance is a crucial technique for extracting the best performance out of
image-generating diffusion models. Traditionally, a constant guidance weight
has been applied throughout the sampling chain of an image. We show that
guidance is clearly harmful toward the beginning of the chain (high noise
levels), largely unnecessary toward the end (low noise levels), and only
beneficial in the middle. We thus restrict it to a specific range of noise
levels, improving both the inference speed and result quality. This limited
guidance interval improves the record FID in ImageNet-512 significantly, from
1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial
across different sampler parameters, network architectures, and datasets,
including the large-scale setting of Stable Diffusion XL. We thus suggest
exposing the guidance interval as a hyperparameter in all diffusion models that
use guidance. |
This paper proposes limiting classifier-free guidance to a specific interval of noise levels during the sampling process of diffusion models, rather than applying it constantly. |
Constant guidance throughout the sampling chain can be detrimental, especially at high and low noise levels. This work shows that restricting guidance to a specific interval improves both image quality and inference speed. |
The authors modify the diffusion ODE to incorporate a piecewise constant guidance weight function, enabling guidance only within a defined interval of noise levels. They evaluate their method using ImageNet with EDM2 and DiT-XL/2 models and qualitatively analyze Stable Diffusion XL outputs. |
Limiting the guidance interval significantly improves FID scores on ImageNet-512, achieving a new state-of-the-art of 1.40 with EDM2-XXL.
The method consistently improves results across different sampler parameters, network architectures (EDM2, DiT, Stable Diffusion XL), and datasets.
Qualitative analysis reveals that the proposed technique leads to better preservation of image composition and more natural color saturation compared to standard CFG. |
The optimal guidance interval is currently determined through grid search or visual inspection, and future work could explore automatic estimation methods.
Further investigation into the interaction between guidance and non-ideal aspects of trained denoisers is needed. |
diffusion models, classifier-free guidance, image generation, sampling techniques, fid |
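The piecewise constant guidance weight described in this entry can be sketched as below. The convention that a weight of 1 means "no guidance", the half-open interval `(sigma_lo, sigma_hi]`, and the function names are assumptions of this sketch; denoiser outputs are flattened to lists of floats.

```python
def guidance_weight(sigma, weight, sigma_lo, sigma_hi):
    """Guidance weight at noise level sigma: `weight` inside the interval,
    1.0 (guidance disabled) outside it.
    """
    return weight if sigma_lo < sigma <= sigma_hi else 1.0


def guided_denoiser_output(cond, uncond, sigma, weight, sigma_lo, sigma_hi):
    """Combine conditional and unconditional predictions with
    classifier-free guidance restricted to a noise-level interval.
    """
    w = guidance_weight(sigma, weight, sigma_lo, sigma_hi)
    if w == 1.0:
        # Outside the interval only the conditional prediction is used.
        return list(cond)
    return [w * c + (1.0 - w) * u for c, u in zip(cond, uncond)]


# Toy values: guidance is active at sigma=1.0 but not at sigma=5.0.
print(guided_denoiser_output([1.0], [0.0], 1.0, 2.0, 0.3, 3.0))
print(guided_denoiser_output([1.0], [0.0], 5.0, 2.0, 0.3, 3.0))
```

In a real sampler the unconditional network evaluation could be skipped entirely outside the interval, which is consistent with the inference speed-up the entry reports.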
2404.07600
Report |
Implicit and Explicit Language Guidance for Diffusion-based Visual Perception |
Hefeng Wang, Jiale Cao, Jin Xie, Aiping Yang, Yanwei Pang |
Text-to-image diffusion models have shown powerful ability on conditional
image synthesis. With large-scale vision-language pre-training, diffusion
models are able to generate high-quality images with rich texture and
reasonable structure under different text prompts. However, it is an open
problem to adapt the pre-trained diffusion model for visual perception. In this
paper, we propose an implicit and explicit language guidance framework for
diffusion-based perception, named IEDP. Our IEDP comprises an implicit language
guidance branch and an explicit language guidance branch. The implicit branch
employs a frozen CLIP image encoder to directly generate implicit text embeddings
that are fed to the diffusion model, without using explicit text prompts. The
explicit branch utilizes the ground-truth labels of corresponding images as
text prompts to condition feature extraction of diffusion model. During
training, we jointly train diffusion model by sharing the model weights of
these two branches. As a result, implicit and explicit branches can jointly
guide feature learning. During inference, we only employ implicit branch for
final prediction, which does not require any ground-truth labels. Experiments
are performed on two typical perception tasks, including semantic segmentation
and depth estimation. Our IEDP achieves promising performance on both tasks.
For semantic segmentation, our IEDP has the mIoU$^\text{ss}$ score of 55.9% on
ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For
depth estimation, our IEDP outperforms the baseline method VPD with a relative
gain of 11.0%. |
This paper proposes IEDP, an implicit and explicit language guidance framework leveraging pre-trained text-to-image diffusion models for visual perception tasks. |
Existing methods for adapting diffusion models to perception tasks either rely on unaligned text prompts or require cumbersome caption generation during inference. This work aims to address these limitations. |
IEDP consists of two branches: 1) Implicit branch: generates image-aligned text embeddings directly from input images using a frozen CLIP image encoder and a learnable adapter. 2) Explicit branch: utilizes ground-truth labels of training images as text prompts to condition feature extraction, jointly training the model with the implicit branch. Only the implicit branch is used during inference. |
IEDP achieves an mIoU score of 55.9% on ADE20K for semantic segmentation, outperforming the baseline VPD by 2.2%.
For depth estimation on NYUv2, IEDP attains an RMSE of 0.226, surpassing VPD by a relative gain of 11.0%.
IEDP demonstrates a favorable trade-off between performance and inference time compared to existing diffusion-based perception methods. |
The explicit branch currently relies on ground-truth labels during training, limiting its applicability to fully unsupervised settings.
Future work could explore alternative approaches for generating implicit text embeddings, potentially incorporating object-level information. |
diffusion models, language guidance, visual perception, semantic segmentation, depth estimation |
2404.07554
Report |
CAT: Contrastive Adapter Training for Personalized Image Generation |
Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song |
The emergence of various adapters, including Low-Rank Adaptation (LoRA)
applied from the field of natural language processing, has allowed diffusion
models to personalize image generation at a low cost. However, due to the
various challenges including limited datasets and shortage of regularization
and computation resources, adapter training often results in unsatisfactory
outcomes, leading to the corruption of the backbone model's prior knowledge.
One of the well known phenomena is the loss of diversity in object generation,
especially within the same class which leads to generating almost identical
objects with minor variations. This poses challenges in generation
capabilities. To solve this issue, we present Contrastive Adapter Training
(CAT), a simple yet effective strategy to enhance adapter training through the
application of CAT loss. Our approach facilitates the preservation of the base
model's original knowledge when the model initiates adapters. Furthermore, we
introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to
keep the former information. We qualitatively and quantitatively compare CAT's
improvement. Finally, we mention the possibility of CAT in the aspects of
multi-concept adapter and optimization. |
This paper introduces Contrastive Adapter Training (CAT), a novel method for personalized image generation using diffusion models. CAT enhances adapter training, particularly LoRA, by preserving the base model's knowledge and preventing overfitting. |
Personalized image generation is crucial for various applications, but existing adapter training methods often lead to knowledge corruption and poor generalization. This paper addresses this by introducing a method that preserves the original model's capabilities while enabling personalized generation. |
CAT leverages a contrastive loss function that minimizes the difference in noise prediction between the original and adapted models without token conditioning. It encourages the adapter to specialize in personalized generation while retaining the base model's general knowledge. |
CAT successfully mitigates underfitting and knowledge corruption problems in consistent generation adaptations.
The paper introduces Knowledge Preservation Score (KPS), a novel metric to quantitatively assess the preservation of original model knowledge after adapter training.
Qualitative and quantitative evaluations demonstrate CAT's effectiveness in preserving knowledge and achieving high-fidelity identity generation compared to existing methods. |
The paper acknowledges the limitations of not including CLIP score-based diversity and fidelity calculation due to its instability.
Future work aims to establish a reliable benchmark for consistent character generation, investigate CAT's impact on different domain knowledge, and expand CAT to support multi-concept training with per-token loss. |
image generation, diffusion models, adapter training, personalization, knowledge preservation |
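The CAT record above describes a loss that keeps the adapted model's noise prediction close to the original model's when the personalization token is absent. A minimal sketch of such a term is given below. The mean-squared-error form, the weighting factor `lam`, and all function names are assumptions for illustration; the record does not specify them.

```python
import numpy as np

def cat_loss(eps_base, eps_adapted):
    """Mean squared difference between base and adapted noise predictions.

    Both inputs are assumed to be predictions for the same noisy latent
    under a prompt WITHOUT the personalization token.
    """
    eps_base = np.asarray(eps_base, dtype=float)
    eps_adapted = np.asarray(eps_adapted, dtype=float)
    return float(np.mean((eps_adapted - eps_base) ** 2))

def total_loss(personalization_loss, eps_base, eps_adapted, lam=1.0):
    """Hypothetical combined objective: personalization term plus weighted CAT term."""
    return personalization_loss + lam * cat_loss(eps_base, eps_adapted)
```

The term is zero when the adapter leaves token-free predictions unchanged, so it only penalizes drift away from the base model's behavior.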
2404.07448
Report |
Transferable and Principled Efficiency for Open-Vocabulary Segmentation |
Jingxuan Xu, Wuyang Chen, Yao Zhao, Yunchao Wei |
Recent success of pre-trained foundation vision-language models makes
Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance,
this approach introduces heavy computational overheads due to two challenges: 1)
large model sizes of the backbone; 2) expensive costs during the fine-tuning.
These challenges hinder this OVS strategy from being widely applicable and
affordable in real-world scenarios. Although traditional methods such as model
compression and efficient fine-tuning can address these challenges, they often
rely on heuristics. This means that their solutions cannot be easily
transferred and necessitate re-training on different models, which comes at a
cost. In the context of efficient OVS, we target achieving performance that is
comparable to or even better than prior OVS works based on large
vision-language foundation models, by utilizing smaller models that incur lower
training costs. The core strategy is to make our efficiency principled and thus
seamlessly transferable from one OVS framework to others without further
customization. Comprehensive experiments on diverse OVS benchmarks demonstrate
our superior trade-off between segmentation accuracy and computation costs over
previous works. Our code is available on https://github.com/Xujxyang/OpenTrans |
This paper proposes OpenTrans, a transferable open-vocabulary segmentation technique that uses smaller models and lower training costs without sacrificing performance. |
Current open-vocabulary segmentation (OVS) methods rely on large vision-language foundation models, leading to heavy computational overheads in model size and training costs, hindering their wider application. |
The authors achieve efficiency through two steps: 1) iteratively prune the CLIP image encoder without semantic supervision to obtain a transferable subnetwork applicable to various OVS frameworks and 2) selectively fine-tune layers based on heavy-tail spectrum analysis of pretrained weights to reduce training costs. |
Transferable subnetworks significantly reduce model size and computational costs (up to 54.4% and 47.2% respectively) while preserving or even improving OVS performance.
Principled layer-selective fine-tuning further reduces training costs by up to 12%, leading to a cumulative reduction of 32.6% when combined with the subnetwork.
OpenTrans achieves a strong balance between OVS accuracy and efficiency, outperforming previous methods in efficiency while maintaining competitive performance on diverse benchmarks. |
The method currently focuses on convolutional backbones and can be extended to larger backbones or ViT architectures.
Future work could explore more fine-grained weight selection for fine-tuning and application to other open-vocabulary tasks like object detection. |
open-vocabulary segmentation, model efficiency, transferable subnetwork, efficient fine-tuning, heavy-tail analysis |
2404.07389
Report |
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models |
Yasi Zhang, Peiyu Yu, Ying Nian Wu |
Text-to-image diffusion models have shown great success in generating
high-quality text-guided images. Yet, these models may still fail to
semantically align generated images with the provided text prompts, leading to
problems like incorrect attribute binding and/or catastrophic object neglect.
Given the pervasive object-oriented structure underlying text prompts, we
introduce a novel object-conditioned Energy-Based Attention Map Alignment
(EBAMA) method to address the aforementioned problems. We show that an
object-centric attribute binding loss naturally emerges by approximately
maximizing the log-likelihood of a $z$-parameterized energy-based model with
the help of the negative sampling technique. We further propose an
object-centric intensity regularizer to prevent excessive shifts of objects'
attention towards their attributes. Extensive qualitative and quantitative
experiments, including human evaluation, on several challenging benchmarks
demonstrate the superior performance of our method over previous strong
counterparts. With better aligned attention maps, our approach shows great
promise in further enhancing the text-controlled image editing ability of
diffusion models. |
This paper proposes Object-Conditioned Energy-Based Attention Map Alignment (EBAMA) to enhance semantic alignment between generated images and text prompts in text-to-image diffusion models. |
Existing text-to-image models often fail to capture the full semantic meaning of text prompts, leading to issues like incorrect attribute binding and object neglect. |
The method leverages object-centric attention loss, derived from maximizing the log-likelihood of an object-conditioned Energy-Based Model (EBM), to align attention maps and an intensity regularizer to ensure object presence. |
EBAMA outperforms previous methods in quantitative metrics (Full Sim., Min. Sim., T-C Sim.) across AnE, DVMP, and ABC-6K datasets.
Human evaluation confirms EBAMA's superiority in text-image alignment, particularly for complex, natural-language prompts.
The method effectively enhances text-controlled attribute editing capabilities compared to methods like PtP. |
The method's effectiveness is limited by the expressive power of the base Stable Diffusion model.
EBAMA degrades to regular diffusion model generation when no objects are present in the text prompt. |
text-to-image synthesis, diffusion models, attention mechanisms, semantic alignment, energy-based models |
2404.07206
Report |
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models |
Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu |
In this paper, we introduce GoodDrag, a novel approach to improve the
stability and image quality of drag editing. Unlike existing methods that
struggle with accumulated perturbations and often result in distortions,
GoodDrag introduces an AlDD framework that alternates between drag and
denoising operations within the diffusion process, effectively improving the
fidelity of the result. We also propose an information-preserving motion
supervision operation that maintains the original features of the starting
point for precise manipulation and artifact reduction. In addition, we
contribute to the benchmarking of drag editing by introducing a new dataset,
Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy
Index and Gemini Score, utilizing Large Multimodal Models. Extensive
experiments demonstrate that the proposed GoodDrag compares favorably against
the state-of-the-art approaches both qualitatively and quantitatively. The
project page is https://gooddrag.github.io. |
This paper introduces GoodDrag, a novel approach for drag editing that improves stability and image quality by alternating drag and denoising operations (AlDD) within the diffusion process and using information-preserving motion supervision. |
Existing drag editing methods suffer from instability, distortions, and inaccurate point control, especially in diffusion-based approaches. GoodDrag addresses these issues, enabling more precise and higher-quality edits. |
GoodDrag alternates drag operations with denoising steps throughout the diffusion process (AlDD) to prevent accumulation of perturbations. It also introduces information-preserving motion supervision to maintain the original features of the starting point during dragging. |
GoodDrag effectively reduces artifacts and improves the accuracy of point movement compared to existing methods.
Quantitative evaluations using the proposed Dragging Accuracy Index (DAI) and Gemini Score (GScore) demonstrate GoodDrag's superior performance.
A user study confirms GoodDrag's ability to achieve more precise and visually appealing drag editing results. |
GoodDrag's reliance on iterative optimization can lead to longer processing times.
Future work includes exploring GoodDrag's integration with other image editing techniques and extending it to video editing. |
drag editing, diffusion models, image manipulation, information-preserving motion supervision, alternating drag and denoising |
2404.07204
Report |
BRAVE: Broadening the visual encoding of vision-language models |
Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari |
Vision-language models (VLMs) are typically composed of a vision encoder,
e.g. CLIP, and a language model (LM) that interprets the encoded features to
solve downstream tasks. Despite remarkable progress, VLMs are subject to
several shortcomings due to the limited capabilities of vision encoders, e.g.
"blindness" to certain image features, visual hallucination, etc. To address
these issues, we study broadening the visual encoding capabilities of VLMs. We
first comprehensively benchmark several vision encoders with different
inductive biases for solving VLM tasks. We observe that there is no single
encoding configuration that consistently achieves top performance across
different tasks, and encoders with different biases can perform surprisingly
similarly. Motivated by this, we introduce a method, named BRAVE, that
consolidates features from multiple frozen encoders into a more versatile
representation that can be directly fed as the input to a frozen LM. BRAVE
achieves state-of-the-art performance on a broad range of captioning and VQA
benchmarks and significantly reduces the aforementioned issues of VLMs, while
requiring a smaller number of trainable parameters than existing methods and
having a more compressed representation. Our results highlight the potential of
incorporating different visual biases for a more broad and contextualized
visual understanding of VLMs. |
This paper introduces BRAVE, a method for enhancing vision-language models (VLMs) by consolidating features from multiple vision encoders with diverse biases. |
Existing VLMs often suffer from limitations due to the restricted capabilities of single vision encoders, such as blindness to specific image features or visual hallucinations. |
BRAVE utilizes a novel multi-encoder querying transformer (MEQT) to efficiently combine features from various frozen vision encoders into a compact visual representation. This representation serves as a soft visual prompt for a frozen language model, requiring minimal trainable parameters. |
BRAVE achieves state-of-the-art performance on various captioning (COCO, NoCaps) and VQA benchmarks (OKVQA, GQA, VizWiz-QA, MMVP, POPE).
It exhibits improved robustness against out-of-distribution inputs and visual hallucinations compared to single-encoder VLMs.
BRAVE maintains efficiency with fewer trainable parameters and lower pre-training data requirements than several existing methods. |
The current design requires forward passes from all encoders, potentially limiting inference speed. Future work could explore adaptive mechanisms for encoder selection.
While BRAVE demonstrates improved sample efficiency, further research is needed to reduce its reliance on large pre-training datasets. |
vision-language models, multi-encoder fusion, visual prompting, image captioning, visual question answering |
2404.07191
Report |
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models |
Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan |
We present InstantMesh, a feed-forward framework for instant 3D mesh
generation from a single image, featuring state-of-the-art generation quality
and significant training scalability. By synergizing the strengths of an
off-the-shelf multiview diffusion model and a sparse-view reconstruction model
based on the LRM architecture, InstantMesh is able to create diverse 3D assets
within 10 seconds. To enhance the training efficiency and exploit more
geometric supervisions, e.g., depths and normals, we integrate a differentiable
iso-surface extraction module into our framework and directly optimize on the
mesh representation. Experimental results on public datasets demonstrate that
InstantMesh significantly outperforms other latest image-to-3D baselines, both
qualitatively and quantitatively. We release all the code, weights, and demo of
InstantMesh, with the intention that it can make substantial contributions to
the community of 3D generative AI and empower both researchers and content
creators. |
InstantMesh is a feed-forward framework for fast, high-quality 3D mesh generation from a single image. |
Creating 3D assets from single-view images is valuable for various applications, including virtual reality, industrial design, and entertainment. InstantMesh addresses limitations in speed and quality of previous methods. |
The framework uses a two-stage approach: (1) Generates multi-view images from a single input image using a fine-tuned Zero123++ diffusion model. (2) Reconstructs a 3D mesh from these images using a sparse-view large reconstruction model, integrating a differentiable iso-surface extraction module for efficiency and geometric supervision. |
Achieves state-of-the-art performance on image-to-3D generation, surpassing existing baselines in quantitative metrics and qualitative comparisons.
Generates plausible novel views with high perceptual quality, as measured by SSIM and LPIPS metrics.
Produces smoother and more reliable 3D geometry compared to methods using alternative representations like triplanes or Gaussians. |
Limited triplane resolution from the transformer decoder might hinder high-definition modeling.
Reliance on the diffusion model's multi-view consistency can impact the final quality; improved architectures are expected to mitigate this. |
3d mesh generation, image-to-3d, diffusion models, large reconstruction models, generative ai |
2404.07178
Report |
Move Anything with Layered Scene Diffusion |
Jiawei Ren, Mengmeng Xu, Jui-Chieh Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul |
Diffusion models generate images with an unprecedented level of quality, but
how can we freely rearrange image layouts? Recent works generate controllable
scenes via learning spatially disentangled latent codes, but these methods do
not apply to diffusion models due to their fixed forward process. In this work,
we propose SceneDiffusion to optimize a layered scene representation during the
diffusion sampling process. Our key insight is that spatial disentanglement can
be obtained by jointly denoising scene renderings at different spatial layouts.
Our generated scenes support a wide range of spatial editing operations,
including moving, resizing, cloning, and layer-wise appearance editing
operations, including object restyling and replacing. Moreover, a scene can be
generated conditioned on a reference image, thus enabling object moving for
in-the-wild images. Notably, this approach is training-free, compatible with
general text-to-image diffusion models, and responsive in less than a second. |
Introduces SceneDiffusion, a training-free approach for controllable scene generation and image editing using pre-trained text-to-image diffusion models. |
Addresses the limitation of existing diffusion models in providing fine-grained spatial control due to their fixed forward noising process. |
Optimizes a layered scene representation during the diffusion sampling process by jointly denoising multiple scene layouts at each step. This disentangles spatial layout from visual appearance. |
Generates scenes where objects can be moved, resized, cloned, and their appearance can be edited independently.
Enables object moving for in-the-wild images by using the sampling trajectory of a reference image as an anchor.
Outperforms prior works on image quality and layout consistency metrics for both controllable scene generation and image editing tasks. |
Object appearance may not perfectly align with the mask in the final rendered image.
High memory consumption for simultaneous denoising of multiple layouts. |
diffusion models, controllable scene generation, image editing, layered scene representation, spatial disentanglement |
2404.07177
Report |
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic |
Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter |
Vision-language models (VLMs) are trained for thousands of GPU hours on
carefully curated web datasets. In recent times, data curation has gained
prominence with several works developing strategies to retain 'high-quality'
subsets of 'raw' scraped data. For instance, the LAION public dataset retained
only 10% of the total crawled data. However, these strategies are typically
developed agnostic of the available compute for training. In this paper, we
first demonstrate that making filtering decisions independent of training
compute is often suboptimal: the limited high-quality data rapidly loses its
utility when repeated, eventually requiring the inclusion of 'unseen' but
'lower-quality' data. To address this quality-quantity tradeoff
($\texttt{QQT}$), we introduce neural scaling laws that account for the
non-homogeneous nature of web data, an angle ignored in existing literature.
Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various
quality subsets of web data; (ii) account for how utility diminishes for a data
point at its 'nth' repetition; and (iii) formulate the mutual interaction of
various data pools when combined, enabling the estimation of model performance
on a combination of multiple data pools without ever jointly training on them.
Our key message is that data curation $\textit{cannot}$ be agnostic of the
total compute that a model will be trained for. Our scaling laws allow us to
curate the best possible pool for achieving top performance on Datacomp at
various compute budgets, carving out a pareto-frontier for data curation. Code
is available at https://github.com/locuslab/scaling_laws_data_filtering. |
The paper introduces the first neural scaling laws that consider data quality and compute budget for vision-language models trained on heterogeneous web data. |
Existing data filtering methods for vision-language model training are agnostic of compute budget, leading to suboptimal performance at larger scales. |
The authors propose scaling laws that model the diminishing utility of data with repetitions and formulate the interaction of data pools of varying quality to estimate model performance on combinations of these pools. |
Data filtering strategies must be compute-aware, as the benefit of high-quality data diminishes with repetitions at large compute budgets.
The proposed scaling laws accurately predict model performance on combinations of data pools without requiring training on these combinations.
The scaling laws enable the identification of pareto-optimal data filtering strategies for different compute budgets, guiding data curation for vision-language models. |
The scaling laws do not account for batch size variations, which can significantly impact contrastive learning performance.
The consistency of scaling parameters across different data pool sizes needs further investigation to enable extrapolation to very large-scale training. |
scaling laws, data filtering, vision-language models, contrastive learning, data curation |
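The quality-quantity tradeoff in the record above can be illustrated with a toy calculation. The sketch below assumes that a sample's utility halves every `tau` repetitions and that utilities simply add up; both are simplifications chosen for this example and should not be read as the paper's fitted scaling law. The pool sizes and utility values are hypothetical.

```python
def total_utility(b, pool_size, budget, tau):
    """Toy aggregate utility of training on one pool for `budget` samples seen.

    b         : utility of a sample on first exposure (hypothetical units)
    pool_size : number of distinct samples in the pool
    budget    : total samples seen during training (the compute budget)
    tau       : assumed half-life of utility, in repetitions
    """
    full_epochs, remainder = divmod(budget, pool_size)
    decay = 0.5 ** (1.0 / tau)  # per-repetition decay factor
    seen_full = sum(pool_size * b * decay**k for k in range(full_epochs))
    seen_partial = remainder * b * decay**full_epochs
    return seen_full + seen_partial

# A small high-quality pool versus a larger, lower-quality pool.
small_budget, large_budget = 10, 1000
hq = dict(b=1.0, pool_size=10, tau=5)
lq = dict(b=0.6, pool_size=100, tau=5)
print(total_utility(budget=small_budget, **hq), total_utility(budget=small_budget, **lq))
print(total_utility(budget=large_budget, **hq), total_utility(budget=large_budget, **lq))
```

Under these assumptions the aggressively filtered pool wins at the small budget, while the larger pool wins once the small pool has been repeated many times, which is the qualitative point the record makes.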
2404.07153
Report |
Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations |
Ofir Shifman, Yair Weiss |
Deep neural networks that achieve remarkable performance in image
classification have previously been shown to be easily fooled by tiny
transformations such as a one pixel translation of the input image. In order to
address this problem, two approaches have been proposed in recent years. The
first approach suggests using huge datasets together with data augmentation in
the hope that a highly varied training set will teach the network to learn to
be invariant. The second approach suggests using architectural modifications
based on sampling theory to deal explicitly with image translations. In this
paper, we show that these approaches still fall short in robustly handling
'natural' image translations that simulate a subtle change in camera
orientation. Our findings reveal that a mere one-pixel translation can result
in a significant change in the predicted image representation for approximately
40% of the test images in state-of-the-art models (e.g. open-CLIP trained on
LAION-2B or DINO-v2), while models that are explicitly constructed to be
robust to cyclic translations can still be fooled with 1 pixel realistic
(non-cyclic) translations 11% of the time. We present Robust Inference by Crop
Selection: a simple method that can be proven to achieve any desired level of
consistency, although with a modest tradeoff with the model's accuracy.
Importantly, we demonstrate how employing this method reduces the ability to
fool state-of-the-art models with a 1 pixel translation to less than 5% while
suffering from only a 1% drop in classification accuracy. Additionally, we show
that our method can be easily adjusted to deal with circular shifts as well. In
that case we achieve 100% robustness to integer shifts with state-of-the-art
accuracy, and with no need for any further training. |
This paper reveals that modern neural networks, including those trained on massive datasets and those designed for translation invariance, are still susceptible to small, realistic image translations, and proposes a method called Robust Inference by Crop Selection (RICS) to address this issue. |
Robustness to small image transformations is crucial for reliable performance in real-world applications, especially as deep neural networks are increasingly used as foundational models for various tasks. |
The RICS method enhances robustness by deterministically selecting a sub-crop from the input image during inference, ensuring consistency in feature representation despite translations. The paper provides theoretical analysis and experimental validation of RICS. |
Even a single-pixel translation can significantly alter the predictions of state-of-the-art models like open-CLIP and DINO-v2.
Methods designed for cyclic translation invariance remain vulnerable to realistic, non-cyclic translations.
RICS significantly improves robustness to realistic translations, achieving over 95% adversarial robustness with minimal impact on accuracy. |
The theoretical guarantee of robustness diminishes with increasing translation size.
The current method only handles integer translations, limiting its applicability to sub-pixel shifts. |
robustness, translation invariance, neural networks, image classification, deep learning |
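The idea of deterministic crop selection in the record above can be sketched as follows. The selection criterion used here (the window with the largest pixel sum) is a stand-in chosen for this example; the record does not state which criterion RICS uses, and `select_crop` is a hypothetical name.

```python
import numpy as np

def select_crop(image, crop):
    """Return the crop x crop window of a 2D image with the largest pixel sum.

    Because the choice depends only on image content, a small shift of the
    camera window moves the selected crop along with the scene, as long as
    the best window stays inside the view.
    """
    h, w = image.shape
    # Integral image: ii[r, c] holds the sum of image[:r, :c].
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    # sums[r, c] is the sum of image[r:r+crop, c:c+crop].
    sums = ii[crop:, crop:] - ii[:-crop, crop:] - ii[crop:, :-crop] + ii[:-crop, :-crop]
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return image[r:r + crop, c:c + crop]

# Two views of the same scene that differ by a one-pixel, non-cyclic shift.
rng = np.random.default_rng(0)
scene = rng.random((40, 40)) * 0.01
scene[15:23, 15:23] += 1.0  # a salient region
view_a = scene[4:36, 4:36]
view_b = scene[4:36, 5:37]
same = np.array_equal(select_crop(view_a, 8), select_crop(view_b, 8))
```

In this toy setup both views yield the same crop, so a downstream model fed the crop would see identical input despite the shift.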
2404.07106
Report |
3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion |
Yixuan Li, Weidong Yang, Ben Fei |
Point cloud completion aims to generate a complete and high-fidelity point
cloud from an initially incomplete and low-quality input. A prevalent strategy
involves leveraging Transformer-based models to encode global features and
facilitate the reconstruction process. However, the adoption of pooling
operations to obtain global feature representations often results in the loss
of local details within the point cloud. Moreover, the attention mechanism
inherent in Transformers introduces additional computational complexity,
rendering it challenging to handle long sequences effectively. To address these
issues, we propose 3DMambaComplete, a point cloud completion network built on
the novel Mamba framework. It comprises three modules: HyperPoint Generation
encodes point cloud features using Mamba's selection mechanism and predicts a
set of HyperPoints. A specific offset is estimated, and the down-sampled points
become HyperPoints. The HyperPoint Spread module disperses these HyperPoints
across different spatial locations to avoid concentration. Finally, a
deformation method transforms the 2D mesh representation of HyperPoints into a
fine-grained 3D structure for point cloud reconstruction. Extensive experiments
conducted on various established benchmarks demonstrate that 3DMambaComplete
surpasses state-of-the-art point cloud completion methods, as confirmed by
qualitative and quantitative analyses. |
This paper proposes 3DMambaComplete, a novel point cloud completion network based on a 3D Mamba architecture, that addresses the limitations of Transformer-based methods by achieving linear complexity and a global receptive field for effective completion of long sequences. |
Existing Transformer-based point cloud completion methods suffer from loss of local details due to pooling operations and quadratic complexity of attention mechanisms, hindering their scalability. |
3DMambaComplete utilizes a HyperPoint Generation module to produce hyperpoints, employs a HyperPoint Spread module to disperse them spatially, and uses a Point Deformation module to transform points into a high-quality 3D structure. |
3DMambaComplete outperforms state-of-the-art methods on the PCN dataset, achieving the lowest chamfer distance in each category.
The method significantly surpasses previous techniques on the real-world KITTI dataset, especially for highly incomplete data.
3DMambaComplete demonstrates superior performance on the ShapeNet55 dataset, particularly under high masking ratios, accurately reconstructing complete shapes with fine-grained details. |
While 3DMambaComplete has slightly more parameters and FLOPs than some methods because it incorporates downsampled points, visualizations suggest that these points contribute to reconstruction effectiveness.
Future work will focus on exploring the impact of different sampling strategies within the 3DMambaComplete framework to further enhance its efficiency. |
point cloud completion, structured state space model, deep learning, mamba, hyperpoint |
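The PCN results for 3DMambaComplete are reported as Chamfer distance. As a rough illustration only, the sketch below computes a symmetric Chamfer distance between two small point sets in pure Python. The averaging convention (mean nearest-neighbor Euclidean distance in each direction, then summed) is an assumption, since benchmarks differ between L1 and squared-L2 variants, and `chamfer_distance` is a hypothetical helper name rather than anything from the paper.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

# Toy data (assumed): a partial shape that misses one point of the complete shape.
partial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
complete = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

print(chamfer_distance(complete, complete))  # identical sets have zero distance
print(chamfer_distance(partial, complete))   # the missing point raises the distance
```

The brute-force nearest-neighbor search is quadratic in the number of points, which is acceptable for a toy example but would normally be replaced by a spatial index or a batched GPU kernel.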
2404.06913
Report |
Sparse Global Matching for Video Frame Interpolation with Large Motion |
Chunxu Liu, Guozhen Zhang, Rui Zhao, Limin Wang |
Large motion poses a critical challenge in the Video Frame Interpolation (VFI)
task. Existing methods are often constrained by limited receptive fields,
resulting in sub-optimal performance when handling scenarios with large motion.
In this paper, we introduce a new pipeline for VFI, which can effectively
integrate global-level information to alleviate issues associated with large
motion. Specifically, we first estimate a pair of initial intermediate flows
using a high-resolution feature map for extracting local details. Then, we
incorporate a sparse global matching branch to compensate for flow estimation,
which consists of identifying flaws in initial flows and generating sparse flow
compensation with a global receptive field. Finally, we adaptively merge the
initial flow estimation with global flow compensation, yielding a more accurate
intermediate flow. To evaluate the effectiveness of our method in handling
large motion, we carefully curate a more challenging subset from commonly used
benchmarks. Our method demonstrates state-of-the-art performance on these
VFI subsets with large motion. |
This paper introduces a novel sparse global matching pipeline for Video Frame Interpolation (VFI) that effectively addresses the challenge of large motion. |
Large motion poses significant difficulties for VFI tasks due to the limitations of local receptive fields in accurately estimating optical flow, leading to sub-optimal performance. |
The proposed method uses a two-step strategy: (1) estimate initial intermediate flows using a high-resolution feature map for capturing local details, and (2) incorporate a sparse global matching branch to compensate for errors in the initial flow estimations, specifically targeting regions with large motion identified through a difference map. |
The method achieves state-of-the-art performance on challenging VFI subsets with large motion, including X-Test-L, Xiph-L, and SNU-FILM-L.
A PSNR improvement of up to 0.66 dB is obtained by correcting errors in the initial flow estimation with the sparse global matching technique.
The approach effectively combines local details with global correlations for accurate intermediate flow estimation, leading to improved visual quality in synthesized frames. |
The primary limitation lies in the computational cost of the global feature extractor used in the sparse global matching branch.
Future work can focus on exploring lighter and more efficient alternatives for global feature extraction or distilling knowledge from pre-trained optical flow models. |
video frame interpolation, large motion, sparse global matching, optical flow, deep learning |
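The two-step strategy of the sparse global matching pipeline (locate flawed flow regions through a difference map, then compensate only at those sparse locations) can be sketched as follows. This is a minimal sketch under stated assumptions: the top-k selection rule, the plain additive merge, and the names `select_sparse_points` and `merge_flows` are all hypothetical, and the paper describes an adaptive merge rather than simple addition.

```python
def select_sparse_points(diff_map, k):
    """Return the (row, col) indices of the k largest entries of a 2D difference map."""
    flat = [(v, r, c) for r, row in enumerate(diff_map) for c, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in flat[:k]]

def merge_flows(initial_flow, compensation, points):
    """Add the sparse compensation to the initial flow at the selected points only."""
    merged = [[tuple(vec) for vec in row] for row in initial_flow]
    for r, c in points:
        (u, v), (du, dv) = initial_flow[r][c], compensation[(r, c)]
        merged[r][c] = (u + du, v + dv)
    return merged

# Toy 2x2 example (assumed values): one pixel has a large warping error.
diff_map = [[0.1, 0.9], [0.2, 0.05]]
initial_flow = [[(1.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]]
points = select_sparse_points(diff_map, k=1)
compensation = {(0, 1): (4.0, -2.0)}  # hypothetical output of a global matching step
merged = merge_flows(initial_flow, compensation, points)
print(points, merged[0][1])
```

Restricting the correction to a few high-error locations is what keeps the global matching step sparse: all other pixels keep their locally estimated flow unchanged.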
2404.06903
Report |
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting |
Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta Kadambi |
The increasing demand for virtual reality applications has highlighted the
significance of crafting immersive 3D assets. We present a text-to-3D
360$^{\circ}$ scene generation pipeline that facilitates the creation of
comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of
minutes. Our approach utilizes the generative power of a 2D diffusion model and
prompt self-refinement to create a high-quality and globally coherent panoramic
image. This image acts as a preliminary "flat" (2D) scene representation.
Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to
enable real-time exploration. To produce consistent 3D geometry, our pipeline
constructs a spatially coherent structure by aligning the 2D monocular depth
into a globally optimized point cloud. This point cloud serves as the initial
state for the centroids of 3D Gaussians. In order to address invisible issues
inherent in single-view inputs, we impose semantic and geometric constraints on
both synthesized and input camera views as regularizations. These guide the
optimization of Gaussians, aiding in the reconstruction of unseen regions. In
summary, our method offers a globally consistent 3D scene within a
360$^{\circ}$ perspective, providing an enhanced immersive experience over
existing techniques. Project website at: http://dreamscene360.github.io/ |
Presents DreamScene360, a method for unconstrained text-to-3D scene generation with panoramic Gaussian splatting, enabling the generation of immersive and geometrically consistent 360-degree 3D scenes from text prompts. |
Addresses the limitations of previous text-to-3D methods that struggle with unbounded scenes, constrained viewpoints, and geometric inconsistencies. Offers a solution for generating immersive 3D experiences from text descriptions. |
Employs a multi-round self-refinement module with GPT-4V for text prompt revision and panoramic image generation. Utilizes a pretrained text-to-360° panoramic image diffusion model and optimizes geometric fields with monocular depth estimation. Leverages panoramic Gaussian splatting for efficient and detailed 3D scene representation. |
Generates high-fidelity and diverse 3D scenes from text prompts of varying specificity.
Demonstrates superior performance compared to baseline methods in terms of geometric consistency and visual quality across different viewpoints.
Successfully handles both bounded indoor and unbounded outdoor scenes, enabling immersive exploration. |
Relies on pretrained models for panoramic image generation and depth estimation, potentially limiting generalization to unseen domains.
Computational cost of optimizing Gaussian splatting representation can be high, particularly for complex scenes. |
text-to-3d, 3d scene generation, panoramic gaussian splatting, gpt-4v, immersive experience |
2404.06851
Report |
UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion |
Junsheng Zhou, Weiqi Zhang, Baorui Ma, Kanle Shi, Yu-Shen Liu, Zhizhong Han |
Diffusion models have shown remarkable results for image generation, editing
and inpainting. Recent works explore diffusion models for 3D shape generation
with neural implicit functions, i.e., signed distance function and occupancy
function. However, they are limited to shapes with closed surfaces, which
prevents them from generating diverse 3D real-world contents containing open
surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned
distance fields (UDFs) which is capable of generating textured 3D shapes with
open surfaces from text conditions or unconditionally. Our key idea is to
generate UDFs in spatial-frequency domain with an optimal wavelet
transformation, which produces a compact representation space for UDF
generation. Specifically, instead of selecting an appropriate wavelet
transformation which requires expensive manual efforts and still leads to large
information loss, we propose a data-driven approach to learn the optimal
wavelet transformation for UDFs. We evaluate UDiFF to show our advantages by
numerical and visual comparisons with the latest methods on widely used
benchmarks. Page: https://weiqi-zhang.github.io/UDiFF. |
This paper presents UDiFF, a 3D diffusion model that generates textured 3D shapes with open surfaces from text conditions or unconditionally, leveraging an optimal wavelet transformation for a compact UDF representation. |
Existing 3D implicit diffusion models are limited to closed shapes, hindering the generation of diverse real-world content with open surfaces. This work addresses this limitation and introduces a novel approach for compact UDF representation. |
UDiFF employs a data-driven approach to learn an optimal wavelet filter for UDF compression and reconstruction, minimizing information loss. It uses a conditional diffusion framework with cross-attention for text-guided generation and a fine predictor for high-fidelity results. Surfaces are extracted using DCUDF and textured with Text2Tex. |
UDiFF outperforms state-of-the-art methods in generating open-surface shapes on DeepFashion3D.
It achieves comparable performance to leading methods on closed-surface shape generation on ShapeNet.
Ablation studies confirm the effectiveness of the optimal wavelet transformation and fine predictor. |
The adapted DCUDF meshing may not be perfectly accurate for complex open surfaces.
Exploring alternative meshing strategies and higher-resolution UDF generation are potential future directions. |
3d shape generation, diffusion models, unsigned distance fields, text-to-3d, open surfaces |
2404.06832
Report |
SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection |
Mathis Kruse, Marco Rudolph, Dominik Woiwode, Bodo Rosenhahn |
Detecting anomalies in images has become a well-explored problem in both
academia and industry. State-of-the-art algorithms are able to detect defects
in increasingly difficult settings and data modalities. However, most current
methods are not suited to address 3D objects captured from differing poses.
While solutions using Neural Radiance Fields (NeRFs) have been proposed, they
suffer from excessive computation requirements, which hinder real-world
usability. For this reason, we propose the novel 3D Gaussian splatting-based
framework SplatPose which, given multi-view images of a 3D object, accurately
estimates the pose of unseen views in a differentiable manner, and detects
anomalies in them. We achieve state-of-the-art results in both training and
inference speed, and detection performance, even when using less training data
than competing methods. We thoroughly evaluate our framework using the recently
proposed Pose-agnostic Anomaly Detection benchmark and its multi-pose anomaly
detection (MAD) data set. |
This paper introduces SplatPose, a novel method for pose-agnostic anomaly detection in images using 3D Gaussian Splatting. |
Existing methods for anomaly detection struggle with varying object poses. While NeRF-based solutions exist, they are computationally expensive. This work aims to solve both challenges. |
SplatPose represents objects as 3D Gaussian clouds learned from multi-view images. During inference, it estimates the pose of a query image by transforming the Gaussian cloud and compares the rendered image to the query to detect anomalies. |
SplatPose achieves state-of-the-art anomaly detection results on the MAD dataset, outperforming NeRF-based methods.
It significantly reduces training time by 55x and inference time by 13x compared to competitors.
The method demonstrates superior pose estimation accuracy compared to iNeRF, contributing to its improved anomaly detection. |
Limitations include reliance on coarse pose estimation and the need for improvements in image feature comparison.
Future work will focus on real-world data adaptation, application to human pose estimation, and integrating 3D information into 2D approaches. |
anomaly detection, pose estimation, 3d gaussian splatting, novel view synthesis, computer vision |
2404.06814
Report |
Zero-shot Point Cloud Completion Via 2D Priors |
Tianxin Huang, Zhiwen Yan, Yuyang Zhao, Gim Hee Lee |
3D point cloud completion is designed to recover complete shapes from
partially observed point clouds. Conventional completion methods typically
depend on extensive point cloud data for training, with their effectiveness
often constrained to object categories similar to those seen during training.
In contrast, we propose a zero-shot framework aimed at completing partially
observed point clouds across any unseen categories. Leveraging point rendering
via Gaussian Splatting, we develop techniques of Point Cloud Colorization and
Zero-shot Fractal Completion that utilize 2D priors from pre-trained diffusion
models to infer missing regions. Experimental results on both synthetic and
real-world scanned point clouds demonstrate that our approach outperforms
existing methods in completing a variety of objects without any requirement for
specific training data. |
This paper introduces a novel zero-shot framework for 3D point cloud completion, leveraging 2D priors from pre-trained diffusion models through Gaussian Splatting. |
Existing completion methods are limited by training data diversity, struggling with unseen object categories. This method utilizes 2D priors to improve robustness and generalizability for unseen categories. |
The method involves Point Cloud Colorization, estimating a reference viewpoint and generating a colorized image. Then, Zero-shot Fractal Completion optimizes 3D Gaussians guided by 2D priors from a diffusion model, conditioned on the reference image, to complete missing regions. |
Outperforms state-of-the-art completion methods on synthetic data.
Demonstrates superior performance on real-world scans, generalizing well to unseen categories.
Successfully completes point clouds derived from LiDAR sensors, showcasing its versatility. |
The optimization process for each point cloud can be time-consuming.
Large gaps in edge regions of the reference view may lead to completion defects. |
point cloud completion, gaussian splatting, diffusion model, zero-shot learning, 3d vision |
2404.06780
Report |
Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior |
Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang |
Text-to-3D generation has achieved remarkable success via large-scale
text-to-image diffusion models. Nevertheless, there is no paradigm for scaling
up the methodology to urban scale. Urban scenes, characterized by numerous
elements, intricate arrangement relationships, and vast scale, present a
formidable barrier to the interpretability of ambiguous textual descriptions
for effective model optimization. In this work, we surmount the limitations by
introducing a compositional 3D layout representation into the text-to-3D paradigm,
serving as an additional prior. It comprises a set of semantic primitives with
simple geometric structures and explicit arrangement relationships,
complementing textual descriptions and enabling steerable generation. Upon
this, we propose two modifications -- (1) We introduce Layout-Guided
Variational Score Distillation to address model optimization inadequacies. It
conditions the score distillation sampling process with geometric and semantic
constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes,
we represent the 3D scene with a Scalable Hash Grid structure, incrementally
adapting to the growing scale of urban scenes. Extensive experiments
substantiate the capability of our framework to scale text-to-3D generation to
large-scale urban scenes that cover over 1000m driving distance for the first
time. We also present various scene editing demonstrations, showing the power
of steerable urban scene generation. Website: https://urbanarchitect.github.io. |
Introduces Urban Architect, a method for steerable 3D urban scene generation leveraging 3D layout priors and text-to-image diffusion models, enabling large-scale, high-quality, and editable scene creation. |
Existing text-to-3D methods struggle to handle the complexity and scale of urban scenes, lacking interpretability for ambiguous textual descriptions and suitable representations for unbounded environments. |
Employs a compositional 3D layout representation with semantic primitives and introduces Layout-Guided Variational Score Distillation (LG-VSD) for layout-constrained optimization and a Scalable Hash Grid (SHG) structure for unbounded scene representation. |
Generates large-scale urban scenes covering over 1000m driving distance with high quality and diversity.
Outperforms previous 3D generative methods in FID and KID metrics on the KITTI-360 dataset.
Supports diverse scene editing effects, including instance manipulation and style transfer, by leveraging the flexibility of the layout representation and diffusion models. |
Current optimization process lacks pixel-level scene control.
Future work explores integrating semantic segmentation into the pipeline for enhanced control. |
text-to-3d generation, urban scene generation, 3d layout prior, score distillation sampling, scalable hash grid |
2404.06773
Report |
Adapting LLaMA Decoder to Vision Transformer |
Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo |
This work examines whether decoder-only Transformers such as LLaMA, which
were originally designed for large language models (LLMs), can be adapted to
the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to
align with LLaMA's architecture, and find that directly applying a causal mask
to the self-attention brings an attention collapse issue, resulting in the
failure of network training. We suggest repositioning the class token
behind the image tokens with a post-sequence class token technique to overcome
this challenge, enabling causal self-attention to efficiently capture the
entire image's information. Additionally, we develop a soft mask strategy that
gradually introduces a causal mask to the self-attention at the onset of
training to facilitate the optimization behavior. The tailored model, dubbed
image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct
supervised learning. Its causal self-attention boosts computational efficiency
and learns complex representation by elevating attention map ranks. iLLaMA
rivals the performance of its encoder-only counterparts, achieving 75.1%
ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M
and pre-training on ImageNet-21K further enhances the accuracy to 86.0%.
Extensive experiments demonstrate iLLaMA's reliable properties: calibration,
shape-texture bias, quantization compatibility, ADE20K segmentation and CIFAR
transfer learning. We hope our study can kindle fresh views to visual model
design in the wave of LLMs. Pre-trained models and code are available. |
This paper investigates the adaptation of decoder-only Transformers, like LLaMA (originally for LLMs), to computer vision tasks by proposing a novel model called image LLaMA (iLLaMA). |
This work aims to bridge the architectural gap between encoder-only visual models and decoder-only textual models, a timely and relevant issue in the era of LLMs. |
The authors progressively modify a standard ViT to align with LLaMA's architecture, proposing techniques like 'post-sequence class token' to address attention collapse and a 'soft mask' strategy to facilitate optimization. |
iLLaMA achieves competitive ImageNet-1K accuracy, reaching 75.1% with only 5.7M parameters and 86.0% with ~310M parameters after ImageNet-21K pre-training.
Causal self-attention in iLLaMA boosts computational efficiency and learns complex representations, as evidenced by higher attention map ranks.
iLLaMA exhibits promising properties such as calibration, shape-texture bias, quantization compatibility, and transfer learning capabilities for semantic segmentation (ADE20K) and image classification (CIFAR). |
iLLaMA's application is currently explored mainly for perception tasks, leaving room for investigating its potential in more complex tasks like reasoning and generation.
The impact of the masking mechanism in iLLaMA's causal attention on high-resolution dense prediction tasks requires further investigation and optimization. |
vision transformer, llama, decoder-only architecture, causal self-attention, image recognition |
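The two iLLaMA techniques, the post-sequence class token and the soft mask, can be illustrated with plain attention masks. The sketch below is an assumption-laden illustration, not the authors' code: the linear blend between a bidirectional and a causal mask, the `alpha` parameter, and the function names are hypothetical, and the report does not specify how the mask is applied inside attention.

```python
def causal_mask(n):
    """Lower-triangular 0/1 mask: token i may attend to tokens 0..i."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

def soft_mask(n, alpha):
    """Blend an all-ones (bidirectional) mask with a causal mask.

    alpha = 1.0 gives a fully bidirectional mask, alpha = 0.0 a fully causal one.
    """
    causal = causal_mask(n)
    return [[alpha + (1.0 - alpha) * causal[i][j] for j in range(n)] for i in range(n)]

num_patches = 3
n = num_patches + 1          # one class token plus the image tokens
mask = causal_mask(n)

first_row = mask[0]          # a class token placed first would see only itself
last_row = mask[-1]          # a class token placed last sees every image token
print(first_row, last_row)

# Assumed schedule: alpha decays from 1.0 to 0.0 over the early training steps.
for alpha in (1.0, 0.5, 0.0):
    print(alpha, soft_mask(n, alpha)[0])
```

Under a causal mask the first position attends only to itself, which gives some intuition for why moving the class token behind the image tokens lets it aggregate information from the whole image.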
2404.06727
Report |
Bayesian NeRF: Quantifying Uncertainty with Volume Density in Neural Radiance Fields |
Sibeak Lee, Kyeongsu Kang, Hyeonwoo Yu |
We present the Bayesian Neural Radiance Field (NeRF), which explicitly
quantifies uncertainty in geometric volume structures without the need for
additional networks, making it adept for challenging observations and
uncontrolled images. NeRF diverges from traditional geometric methods by
offering an enriched scene representation, rendering color and density in 3D
space from various viewpoints. However, NeRF encounters limitations in relaxing
uncertainties by using geometric structure information, leading to inaccuracies
in interpretation under insufficient real-world observations. Recent research
efforts aimed at addressing this issue have primarily relied on empirical
methods or auxiliary networks. To fundamentally address this issue, we propose
a series of formulational extensions to NeRF. By introducing generalized
approximations and defining density-related uncertainty, our method seamlessly
extends to manage uncertainty not only for RGB but also for depth, without the
need for additional networks or empirical assumptions. In experiments we show
that our method significantly enhances performance on RGB and depth images in
the comprehensive dataset, demonstrating the reliability of the Bayesian NeRF
approach to quantifying uncertainty based on the geometric structure. |
This paper presents Bayesian NeRF, a formulational extension of Neural Radiance Fields that explicitly quantifies uncertainty in geometric volume structures without additional networks. |
NeRF has difficulty reducing uncertainty from geometric structure information, which leads to inaccurate interpretation when real-world observations are insufficient, and prior remedies have relied mainly on empirical methods or auxiliary networks. |
The method introduces generalized approximations and defines a density-related uncertainty, extending uncertainty handling from RGB to depth without additional networks or empirical assumptions. |
The method improves performance on both RGB and depth images in the evaluated dataset.
Uncertainty is quantified for depth as well as RGB, based on the geometric structure.
No auxiliary networks or empirical assumptions are required. |
Limitations and future work are not discussed in the abstract. |
neural radiance fields, uncertainty quantification, bayesian nerf, volume density, depth estimation |
2404.06542
Report |
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation |
Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara |
Open-vocabulary semantic segmentation aims at segmenting arbitrary categories
expressed in textual form. Previous works have trained over large amounts of
image-caption pairs to enforce pixel-level multimodal alignments. However,
captions provide global information about the semantics of a given image but
lack direct localization of individual concepts. Further, training on
large-scale datasets inevitably brings significant computational costs. In this
paper, we propose FreeDA, a training-free diffusion-augmented method for
open-vocabulary semantic segmentation, which leverages the ability of diffusion
models to visually localize generated concepts and local-global similarities to
match class-agnostic regions with semantic classes. Our approach involves an
offline stage in which textual-visual reference embeddings are collected,
starting from a large set of captions and leveraging visual and semantic
contexts. At test time, these are queried to support the visual matching
process, which is carried out by jointly considering class-agnostic regions and
global semantic similarities. Extensive analyses demonstrate that FreeDA
achieves state-of-the-art performance on five datasets, surpassing previous
methods by more than 7.0 average points in terms of mIoU and without requiring
any training. |
This paper proposes FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation. This method utilizes diffusion models to localize generated concepts and leverages local-global similarities to match image regions with semantic classes. |
Existing open-vocabulary segmentation methods, while effective, rely on computationally expensive training over large image-caption datasets. This paper offers a training-free alternative using diffusion models, which are known for their ability to visually ground generated concepts. |
FreeDA works in two phases. First, in an offline phase, it generates textual-visual reference embeddings using diffusion models. These embeddings capture semantic instances with their textual and visual context. Second, at inference, these references are used to compute local and global similarities to segment an input image. |
FreeDA achieves state-of-the-art performance on five datasets for open-vocabulary semantic segmentation, surpassing previous methods by a significant margin.
The approach demonstrates the effectiveness of combining local similarities based on self-supervised visual features and global similarities from a multimodal encoder (CLIP).
FreeDA shows robustness even without using superpixels for mask refinement, achieving competitive results and surpassing some PAMR-refined methods. |
The method relies on the quality of the generated prototypes from the diffusion model; inaccurate generations could impact segmentation.
Further research on effectively incorporating complex visual contexts and handling instances where objects are partially obscured could improve performance. |
open-vocabulary semantic segmentation, diffusion models, training-free methods, local-global similarity, zero-shot learning |
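The inference step of FreeDA, matching class-agnostic regions to classes with both local and global similarities, can be sketched as a weighted blend of two cosine similarities. This is a minimal sketch under stated assumptions: the equal-weight blend, the two-dimensional toy embeddings, and the names `cosine` and `assign_class` are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def assign_class(region_emb, image_emb, prototypes, text_embs, weight=0.5):
    """Pick the class with the highest blended local/global similarity."""
    scores = {}
    for name in prototypes:
        local_sim = cosine(region_emb, prototypes[name])   # region vs. visual prototype
        global_sim = cosine(image_emb, text_embs[name])    # whole image vs. class text
        scores[name] = weight * local_sim + (1.0 - weight) * global_sim
    return max(scores, key=scores.get)

# Toy embeddings (assumed values) standing in for the offline reference collection.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
text_embs = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
label = assign_class([0.8, 0.2], [0.7, 0.3], prototypes, text_embs)
print(label)
```

Because the prototypes are collected offline, this matching step involves no gradient updates, which is consistent with the training-free description in the report.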
2404.06451
Report |
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions |
Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo |
Human visual imagination usually begins with analogies or rough sketches. For
example, given an image with a girl playing guitar before a building, one may
analogously imagine how it would look if Iron Man were playing guitar before the
Pyramids in Egypt. Nonetheless, the visual condition may not be precisely aligned with the
imaginary result indicated by the text prompt, and existing layout-controllable
text-to-image (T2I) generation models are prone to producing degraded generated
results with obvious artifacts. To address this issue, we present a novel T2I
generation method dubbed SmartControl, which is designed to modify the rough
visual conditions to adapt to the text prompt. The key idea of our SmartControl
is to relax the visual condition in the areas that conflict with the text
prompt. Specifically, a Control Scale Predictor (CSP) is designed to identify
the conflict regions and predict the local control scales, while a dataset with
text prompts and rough visual conditions is constructed for training CSP. It is
worth noting that, even with a limited number (e.g., 1,000~2,000) of training
samples, our SmartControl can generalize well to unseen objects. Extensive
experiments on four typical visual condition types clearly show the efficacy of
our SmartControl against state-of-the-art methods. Source code, pre-trained models,
and datasets are available at https://github.com/liuxiaoyu1104/SmartControl. |
Introduces SmartControl, a text-to-image generation method that modifies rough visual conditions to better align with user text prompts, enabling photorealistic image synthesis even with unaligned input conditions. |
Existing layout-controllable text-to-image generation models struggle to produce high-quality images when the provided visual conditions (e.g., edges, depth maps) don't precisely match the user's textual description, leading to artifacts. |
SmartControl leverages a Control Scale Predictor (CSP) to identify regions where visual conditions conflict with the text prompt. It then predicts a local control scale map, allowing the model to relax constraints in conflicting areas while preserving structural guidance from the visual condition. |
Achieves superior image-text alignment (measured by CLIP Score) compared to state-of-the-art controllable generation methods on a dataset with rough conditions.
Demonstrates robust generalization, effectively adapting to other text-to-image models like IP-Adapter without requiring retraining.
Maintains high image quality and fidelity to user intent even with a limited training dataset (1,000-2,000 images) for the CSP. |
Evaluation of self-similarity relies on pseudo-ground truths due to the unpaired nature of the dataset.
Future work could explore alternative network architectures for the control scale predictor, potentially improving efficiency and performance. |
text-to-image generation, controlnet, rough conditions, control scale predictor, image synthesis |
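The effect of SmartControl's local control scale map can be shown with a toy feature grid: the control signal is kept where the visual condition agrees with the prompt and relaxed where it conflicts. The sketch below assumes the control branch contributes an additive residual scaled per location. `apply_local_control` and the scale values are hypothetical, and the actual Control Scale Predictor is a learned network that is not reproduced here.

```python
def apply_local_control(features, control_residual, scale_map):
    """Add the control residual to the features, weighted per location by scale_map."""
    return [
        [f + s * c for f, c, s in zip(f_row, c_row, s_row)]
        for f_row, c_row, s_row in zip(features, control_residual, scale_map)
    ]

# Toy 2x2 grids (assumed values).
features = [[1.0, 1.0], [1.0, 1.0]]
control_residual = [[2.0, 2.0], [2.0, 2.0]]
# Hypothetical predictor output: 1.0 keeps full control, 0.0 marks a conflict region.
scale_map = [[1.0, 0.0], [1.0, 0.5]]

out = apply_local_control(features, control_residual, scale_map)
print(out)
```

A single global scale would force a trade-off over the whole image, whereas a spatial map lets structure be preserved in consistent regions while the text prompt takes over in conflicting ones.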
2404.06429
Report |
Magic-Boost: Boost 3D Generation with Mutli-View Conditioned Diffusion |
Fan Yang, Jianfeng Zhang, Yichun Shi, Bowen Chen, Chenxu Zhang, Huichao Zhang, Xiaofeng Yang, Jiashi Feng, Guosheng Lin |
Benefiting from the rapid development of 2D diffusion models, 3D content
creation has made significant progress recently. One promising solution
involves the fine-tuning of pre-trained 2D diffusion models to harness their
capacity for producing multi-view images, which are then lifted into accurate
3D models via methods like fast-NeRFs or large reconstruction models. However,
because inconsistencies remain and the generated resolution is limited, the generation
results of such methods still lack intricate textures and complex geometries.
To solve this problem, we propose Magic-Boost, a multi-view conditioned
diffusion model that significantly refines coarse generative results through a
brief period of SDS optimization ($\sim15$min). Compared to the previous text
or single image based diffusion models, Magic-Boost exhibits a robust
capability to generate images with high consistency from pseudo synthesized
multi-view images. It provides precise SDS guidance that well aligns with the
identity of the input images, enriching the local detail in both geometry and
texture of the initial generative results. Extensive experiments show
Magic-Boost greatly enhances the coarse inputs and generates high-quality 3D
assets with rich geometric and textural details. (Project Page:
https://magic-research.github.io/magic-boost/) |
This paper proposes Magic-Boost, a multi-view conditioned diffusion model that refines coarse 3D generation results through a brief period (about 15 minutes) of SDS optimization. |
Pipelines that fine-tune 2D diffusion models to produce multi-view images and then lift them to 3D still suffer from cross-view inconsistency and limited generated resolution, so their results lack intricate textures and complex geometries. |
Magic-Boost conditions on pseudo-synthesized multi-view images to generate highly consistent images, and uses them to provide precise SDS guidance aligned with the identity of the input images. |
Magic-Boost greatly enhances coarse inputs and generates high-quality 3D assets with rich geometric and textural details.
It generates more consistent images than previous text-based or single-image-based diffusion models.
Its SDS guidance enriches local detail in both the geometry and the texture of the initial results. |
Limitations and future work are not discussed in the abstract. |
3d generation, multi-view conditioned diffusion, score distillation sampling, texture refinement, diffusion models |
2404.06425
Report |
ZeST: Zero-Shot Material Transfer from a Single Image |
Ta-Ying Cheng, Prafull Sharma, Andrew Markham, Niki Trigoni, Varun Jampani |
We propose ZeST, a method for zero-shot material transfer to an object in the
input image given a material exemplar image. ZeST leverages existing diffusion
adapters to extract implicit material representation from the exemplar image.
This representation is used to transfer the material using a pre-trained
inpainting diffusion model on the object in the input image using depth
estimates as a geometry cue and grayscale object shading as illumination cues.
The method works on real images without any training, resulting in a zero-shot
approach. Both qualitative and quantitative results on real and synthetic
datasets demonstrate that ZeST outputs photorealistic images with transferred
materials. We also show the application of ZeST to perform multiple edits and
robust material assignment under different illuminations. Project Page:
https://ttchengab.github.io/zest |
Introduces ZeST, a zero-shot method for transferring materials from a single exemplar image to objects in another image, leveraging pre-trained diffusion models. |
Addresses the challenging and time-consuming task of 2D material editing, eliminating the need for 3D models, explicit material properties, or training data. |
Combines an image encoder (IP-Adapter) to extract material representation, depth-based ControlNet for geometry guidance, and inpainting diffusion with foreground decoloring for illumination cues. |
Achieves high-fidelity material transfer while preserving object geometry and scene illumination.
Outperforms baselines in both qualitative and quantitative comparisons, demonstrating superior material fidelity and photorealism.
Enables applications like multi-object editing, 3D texturing, and lighting-aware material transfer. |
Occasionally exhibits partial material transfer or blends multiple materials from the exemplar.
Future work includes improving material localization within the exemplar and exploring user interaction for finer control. |
material transfer, diffusion models, zero-shot learning, image editing, computer graphics |
2404.06270
Report |
3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis |
Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Min Yang, Xiao Tang, Feng Zhu, Yuchao Dai |
In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting
method for dynamic view synthesis. Existing neural radiance fields (NeRF) based
solutions learn the deformation in an implicit manner, which cannot incorporate
3D scene geometry. Therefore, the learned deformation is not necessarily
geometrically coherent, which results in unsatisfactory dynamic view synthesis
and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new
representation of the 3D scene, building upon which the 3D geometry could be
exploited in learning the complex 3D deformation. Specifically, the scenes are
represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized
to move and rotate over time to model the deformation. To enforce the 3D scene
geometry constraint during deformation, we explicitly extract 3D geometry
features and integrate them in learning the 3D deformation. In this way, our
solution achieves 3D geometry-aware deformation modeling, which enables
improved dynamic view synthesis and 3D dynamic reconstruction. Extensive
experimental results on both synthetic and real datasets prove the superiority
of our solution, which achieves new state-of-the-art performance.
The project is available at https://npucvr.github.io/GaGS/ |
This paper proposes a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis, exploiting 3D scene geometry to learn more coherent deformations. |
Existing NeRF-based solutions lack geometric coherence in deformation, leading to unsatisfactory dynamic view synthesis and reconstruction. |
The method leverages 3D Gaussian Splatting to represent scenes as a collection of 3D Gaussians. It extracts 3D geometry features using sparse convolution on voxelized Gaussian distributions and integrates these features into a deformation field that models Gaussian transformations over time. Continuous 6D rotation representation enhances accurate rotation estimation. |
The method achieves state-of-the-art performance on synthetic and real dynamic scene datasets (D-NeRF and HyperNeRF).
Ablation studies confirm the effectiveness of geometry-aware feature extraction, 6D rotation representation, and density control adaptations.
Visualization results demonstrate accurate 3D reconstruction and temporal interpolation capabilities. |
The method struggles with scenes containing points that abruptly appear or disappear.
Performance is limited in handling complex motions and long video sequences. |
dynamic view synthesis, gaussian splatting, 3d geometry, deformation modeling, neural radiance fields |
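The continuous 6D rotation representation mentioned in the method summary above can be illustrated with a minimal sketch. It assumes the standard construction in which a 6D vector is split into two 3D vectors and orthonormalized with Gram-Schmidt; whether the paper uses exactly this variant is an assumption, and the function name `rotation_from_6d` is hypothetical.

```python
import math


def rotation_from_6d(r6):
    """Map a 6D vector (a1, a2) to a 3x3 rotation matrix via Gram-Schmidt.

    Illustrative sketch only; not taken from the paper's code.
    """
    a1, a2 = r6[:3], r6[3:]

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def cross(u, v):
        return [
            u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0],
        ]

    # First basis vector: normalized a1.
    b1 = normalize(a1)
    # Second basis vector: a2 with its b1 component removed, then normalized.
    proj = dot(b1, a2)
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    # Third basis vector: cross product, which makes the frame right-handed.
    b3 = cross(b1, b2)
    # b1, b2, b3 are the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


# An already orthonormal input maps to the identity rotation.
R_identity = rotation_from_6d([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
# An arbitrary (non-degenerate) input still yields an orthonormal matrix.
R_general = rotation_from_6d([0.3, -1.2, 0.5, 2.0, 0.1, -0.7])
```

Because every non-degenerate 6D input maps to a valid rotation, a network can regress the six numbers directly without a separate normalization constraint.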
2404.06244
Report |
Anchor-based Robust Finetuning of Vision-Language Models |
Jinwei Han, Zhiwen Lin, Zhongyisun Sun, Yingguo Gao, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia |
We aim at finetuning a vision-language model without hurting its
out-of-distribution (OOD) generalization. We address two types of OOD
generalization, i.e., i) domain shift such as natural to sketch images, and ii)
zero-shot capability to recognize the category that was not contained in the
finetune data. Arguably, the diminished OOD generalization after finetuning
stems from the excessively simplified finetuning target, which only provides
the class information, such as "a photo of a [CLASS]". This is distinct from
the process by which CLIP was pretrained, where there is abundant text
supervision with rich semantic information. Therefore, we propose to compensate
for the finetune process using auxiliary supervision with rich semantic
information, which acts as anchors to preserve the OOD generalization.
Specifically, our method uses two types of anchors: i) a
text-compensated anchor, which uses the images from the finetune set but
enriches the text supervision with a pretrained captioner, and ii) an
image-text-pair anchor, which is retrieved from a dataset similar to CLIP's
pretraining data according to the downstream task and is associated with the
original CLIP text with rich semantics. Those anchors are utilized as auxiliary semantic information to
maintain the original feature space of CLIP, thereby preserving the OOD
generalization capabilities. Comprehensive experiments demonstrate that our
method achieves in-distribution performance akin to conventional finetuning
while attaining new state-of-the-art results on domain shift and zero-shot
learning benchmarks. |
This paper proposes Anchor-based Robust Finetuning (ARF) to preserve the out-of-distribution (OOD) generalization of vision-language models during finetuning, addressing both domain shift and zero-shot learning. |
Maintaining OOD generalization (domain shift and zero-shot learning) is crucial for pretrained vision-language models like CLIP, even after finetuning on downstream tasks, to ensure broad applicability. |
ARF utilizes two types of anchors: text-compensated anchors (image paired with a generated caption) and image-text-pair anchors (retrieved from a dataset similar to CLIP's pretraining data) to regularize the finetuning process with auxiliary contrastive supervision. |
ARF achieves state-of-the-art performance on domain shift benchmarks like ImageNet variants and DomainNet, surpassing conventional finetuning and other robust finetuning methods.
ARF excels in zero-shot learning on diverse recognition tasks, maintaining high accuracy on unseen categories while other methods suffer significant degradation.
Ablation studies confirm the effectiveness of both anchor types and the impact of caption quality on ARF's performance. |
The reliance on pretrained captioners and retrieval methods introduces potential limitations in terms of caption quality and retrieval effectiveness.
Exploring the use of larger language models (LLMs) for generating more diverse and informative captions presents a promising direction for future work. |
vision-language models, robust finetuning, out-of-distribution generalization, domain shift, zero-shot learning |
2404.06135
Report |
Mansformer: Efficient Transformer of Mixed Attention for Image Deblurring and Beyond |
Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang |
Transformers have achieved enormous success in natural language processing and
high-level vision over the past few years. However, the complexity of
self-attention is quadratic in the image size, which makes it infeasible for
high-resolution vision tasks. In this paper, we propose the Mansformer, a
Transformer of mixed attention that combines multiple self-attentions, gate,
and multi-layer perceptrons (MLPs), to explore and employ more possibilities of
self-attention. Taking efficiency into account, we design four kinds of
self-attention, whose complexities are all linear. By elaborate adjustment of
the tensor shapes and dimensions for the dot product, we split the typical
self-attention of quadratic complexity into four operations of linear
complexity. To adaptively merge these different kinds of self-attention, we
take advantage of an architecture similar to Squeeze-and-Excitation Networks.
Furthermore, we merge the two-stage Transformer design into one
stage with the proposed gated-dconv MLP. Image deblurring is our main target,
but extensive quantitative and qualitative evaluations show that this method
performs favorably against state-of-the-art methods on tasks well beyond
deblurring. The source code and trained models will be made available to the
public. |
This paper proposes Mansformer, an efficient Transformer model using mixed attention for image deblurring and other image restoration tasks. |
Existing Transformer models for image restoration struggle to balance computational complexity with the need for both global and local context. Mansformer addresses this with a novel mixed attention mechanism and a more efficient network architecture. |
The Mansformer uses a combination of four linear complexity self-attention mechanisms: local spatial, local channel, global spatial, and global channel. It also introduces a gated-dconv MLP to merge the typical two-stage Transformer design into a single stage. |
Mansformer achieves state-of-the-art performance on single-image motion deblurring benchmarks GoPro and HIDE.
It outperforms previous best methods on image deblurring with JPEG artifacts (REDS dataset) and image deraining (Rain13K).
Ablation studies demonstrate the contribution of each attention mechanism and the efficiency gain from the gated-dconv MLP. |
The model's performance gain is less significant on tasks with smaller image sizes, like real image denoising.
Future work could explore further optimization of the simplified channel attention module for better parameter efficiency. |
image deblurring, vision transformer, mixed attention, image restoration, gated-dconv mlp |
2404.06119
Report |
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation |
Junkai Yan, Yipeng Gao, Qize Yang, Xihan Wei, Xuansong Xie, Ancong Wu, Wei-Shi Zheng |
Text-to-3D generation, which synthesizes 3D assets according to an overall
text description, has significantly progressed. However, a challenge arises
when specific appearances need to be customized at designated viewpoints while
only the overall description is used to generate the 3D object. For
instance, ambiguity easily occurs when producing a T-shirt with distinct
patterns on its front and back using a single overall text guidance. In this
work, we propose DreamView, a text-to-image approach enabling multi-view
customization while maintaining overall consistency by adaptively injecting the
view-specific and overall text guidance through a collaborative text guidance
injection module, which can also be lifted to 3D generation via score
distillation sampling. DreamView is trained with large-scale rendered
multi-view images and their corresponding view-specific texts to learn to
balance the separate content manipulation in each view and the global
consistency of the overall object, resulting in a dual achievement of
customization and consistency. Consequently, DreamView empowers artists to
design 3D objects creatively, fostering the creation of more innovative and
diverse 3D assets. Code and model will be released at
https://github.com/iSEE-Laboratory/DreamView. |
The paper introduces DreamView, a text-to-3D generation approach that allows for customized appearances from different viewpoints while maintaining overall 3D consistency. |
Existing text-to-3D methods struggle to customize specific viewpoints based on a single, shared text description, limiting their flexibility and creative potential. |
DreamView employs an adaptive text guidance injection module that balances the influence of overall and view-specific text prompts within a diffusion model. It is first trained for text-to-image generation and then lifted to 3D generation via score distillation sampling. |
DreamView-2D outperforms existing text-to-image models in generating images consistent with both overall and view-specific descriptions.
DreamView-3D successfully generates 3D objects that adhere to detailed text prompts, showcasing unique appearances defined by each viewpoint.
User study confirms that DreamView-3D is preferred for generating 3D assets that align with text descriptions and exhibit high visual quality. |
Generated full-body characters sometimes have blurry faces due to the use of low-resolution training images.
DreamView relies on consistent text descriptions for different viewpoints, meaning it cannot generate different objects from different views. |
text-to-3d generation, view customization, diffusion models, score distillation sampling, text-to-image generation |
2404.06109
Report |
Revising Densification in Gaussian Splatting |
Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder |
In this paper, we address the limitations of Adaptive Density Control (ADC)
in 3D Gaussian Splatting (3DGS), a scene representation method achieving
high-quality, photorealistic results for novel view synthesis. ADC has been
introduced for automatic 3D point primitive management, controlling
densification and pruning, however, with certain limitations in the
densification logic. Our main contribution is a more principled, pixel-error
driven formulation for density control in 3DGS, leveraging an auxiliary,
per-pixel error function as the criterion for densification. We further
introduce a mechanism to control the total number of primitives generated per
scene and correct a bias in the current opacity handling strategy of ADC during
cloning operations. Our approach leads to consistent quality improvements
across a variety of benchmark scenes, without sacrificing the method's
efficiency. |
This paper introduces a novel density control mechanism for 3D Gaussian Splatting (3DGS) that leverages per-pixel errors to guide the densification process, addressing limitations of the original gradient-based approach. |
Adaptive Density Control (ADC) is crucial in 3DGS for determining where to allocate scene representation capacity. However, the existing gradient-based ADC can lead to underfitting, particularly in high-frequency texture areas, and lacks control over the number of primitives. |
The authors propose an error-driven approach where per-pixel errors (e.g., from SSIM) are propagated back to individual Gaussian primitives based on their contribution to the rendered pixel. This per-primitive error guides densification decisions, prioritizing primitives with higher errors for splitting/cloning. The paper also introduces an opacity correction mechanism to address a bias in the cloning process and provides control over the total number of primitives to prevent memory issues. |
The error-driven ADC consistently improves perceptual quality (SSIM, LPIPS) across various benchmark datasets compared to the original 3DGS and its Mip-Splatting variant.
The opacity correction mechanism and primitive growth control further contribute to performance gains and stabilize training.
The proposed approach effectively addresses underfitting in high-frequency texture regions, leading to more perceptually accurate reconstructions. |
The method might still exhibit underfitting in scenes with complex view-dependent effects or significant appearance variations.
Improving the handling of strong view-dependence, appearance variations, and limitations of linear approximation in splatting are potential avenues for future work. |
gaussian splatting, 3d reconstruction, novel view synthesis, adaptive density control, densification |
2404.06091
Report |
Hash3D: Training-free Acceleration for 3D Generation |
Xingyi Yang, Xinchao Wang |
The evolution of 3D generative modeling has been notably propelled by the
adoption of 2D diffusion models. Despite this progress, the cumbersome
optimization process per se presents a critical hurdle to efficiency. In this
paper, we introduce Hash3D, a universal acceleration for 3D generation without
model training. Central to Hash3D is the insight that feature-map redundancy is
prevalent in images rendered from camera positions and diffusion time-steps in
close proximity. By effectively hashing and reusing these feature maps across
neighboring timesteps and camera angles, Hash3D substantially prevents
redundant calculations, thus accelerating the diffusion model's inference in 3D
generation tasks. We achieve this through an adaptive grid-based hashing.
Surprisingly, this feature-sharing mechanism not only speeds up the generation
but also enhances the smoothness and view consistency of the synthesized 3D
objects. Our experiments, covering 5 text-to-3D and 3 image-to-3D models,
demonstrate Hash3D's versatility to speed up optimization, enhancing efficiency
by 1.3 to 4 times. Additionally, Hash3D's integration with 3D Gaussian
splatting largely speeds up 3D model creation, reducing text-to-3D processing
to about 10 minutes and image-to-3D conversion to roughly 30 seconds. The
project page is at https://adamdad.github.io/hash3D/. |
This paper presents Hash3D, a novel training-free method to accelerate diffusion-based 3D generation by reusing features from similar camera views and timesteps through an adaptive grid-based hashing approach. |
Existing 3D generative models based on 2D diffusion models suffer from a lengthy optimization process due to repetitive score function sampling at various camera poses and timesteps. Hash3D addresses this efficiency bottleneck without compromising generation quality. |
Hash3D introduces a memory system with an adaptive grid-based hashing function. This allows for storing and retrieving features extracted from the diffusion model across different camera poses and timesteps. When a new view is similar to a previously computed one, Hash3D reuses the features, avoiding redundant calculations. |
Hash3D demonstrates its versatility by significantly speeding up both text-to-3D and image-to-3D generation processes, achieving a speed improvement of 1.3x to 4x across various baselines.
Beyond efficiency gains, Hash3D also slightly enhances the visual quality of generated 3D models, as evidenced by quantitative metrics and user study results.
The adaptive grid-based hashing mechanism proves effective in balancing computational cost and performance, outperforming methods using constant grid sizes or direct noise hashing. |
The current implementation of adaptive grid sizing uses a brute-force search, which might be sub-optimal.
Future work may explore more sophisticated hashing function learning approaches to further improve the efficiency and accuracy of feature retrieval. |
3d generation, diffusion models, score distillation sampling, hashing, acceleration |
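The feature-reuse idea in the Hash3D method summary above can be sketched as a cache keyed by a quantized camera pose and timestep. This is a simplified illustration with a fixed grid size, not the adaptive grid-based hashing the paper describes; the class name `GridFeatureCache`, the grid sizes, and the stand-in `expensive_features` function are all hypothetical.

```python
import math


class GridFeatureCache:
    """Cache features by the grid cell of (elevation, azimuth, timestep).

    Simplified sketch: the grid sizes are fixed here, whereas the paper
    describes an adaptive choice.
    """

    def __init__(self, angle_cell=10.0, timestep_cell=50):
        self.angle_cell = angle_cell        # assumed cell size in degrees
        self.timestep_cell = timestep_cell  # assumed cell size in timesteps
        self.table = {}
        self.hits = 0
        self.misses = 0

    def key(self, elevation, azimuth, timestep):
        # Nearby poses and timesteps fall into the same cell, so they share a key.
        return (
            math.floor(elevation / self.angle_cell),
            math.floor((azimuth % 360.0) / self.angle_cell),
            timestep // self.timestep_cell,
        )

    def get_or_compute(self, elevation, azimuth, timestep, compute):
        k = self.key(elevation, azimuth, timestep)
        if k in self.table:
            self.hits += 1
            return self.table[k]  # reuse instead of recomputing
        self.misses += 1
        features = compute(elevation, azimuth, timestep)
        self.table[k] = features
        return features


def expensive_features(elevation, azimuth, timestep):
    # Stand-in for a diffusion-model forward pass (hypothetical).
    return [elevation, azimuth, float(timestep)]


cache = GridFeatureCache()
f1 = cache.get_or_compute(12.0, 31.0, 505, expensive_features)
f2 = cache.get_or_compute(14.0, 38.0, 520, expensive_features)   # same cell as f1
f3 = cache.get_or_compute(45.0, 200.0, 100, expensive_features)  # different cell
```

Larger cells raise the hit rate at the cost of reusing features from less similar views, which is the trade-off an adaptive grid size is meant to balance.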
2404.06050
Report |
Incremental Joint Learning of Depth, Pose and Implicit Scene Representation on Monocular Camera in Large-scale Scenes |
Tianchen Deng, Nailin Wang, Chongdi Wang, Shenghai Yuan, Jingchuan Wang, Danwei Wang, Weidong Chen |
Dense scene reconstruction for photo-realistic view synthesis has various
applications, such as VR/AR and autonomous vehicles. However, most existing
methods have difficulties in large-scale scenes due to three core challenges:
(a) inaccurate depth input: accurate depth input is impossible to obtain
in real-world large-scale scenes; (b) inaccurate pose estimation: most
existing approaches rely on accurate pre-estimated camera poses; (c)
insufficient scene representation capability: a single global radiance field
lacks the capacity to effectively scale to large-scale scenes. To this end, we
propose an incremental joint learning framework, which can achieve accurate
depth, pose estimation, and large-scale scene reconstruction. A vision
transformer-based network is adopted as the backbone to enhance performance in
scale information estimation. For pose estimation, a feature-metric bundle
adjustment (FBA) method is designed for accurate and robust camera tracking in
large-scale scenes. In terms of implicit scene representation, we propose an
incremental scene representation method to construct the entire large-scale
scene as multiple local radiance fields to enhance the scalability of 3D scene
representation. Extensive experiments have been conducted to demonstrate the
effectiveness and accuracy of our method in depth estimation, pose estimation,
and large-scale scene reconstruction. |
This paper presents an incremental joint learning framework for accurate depth and pose estimation, enabling large-scale scene reconstruction using a monocular camera. |
Existing methods struggle with large-scale scene reconstruction due to inaccurate depth and pose estimations, and limited scene representation capabilities. |
The framework leverages a vision transformer-based depth network, a feature-metric bundle adjustment (FBA) for pose estimation, and an incremental scene representation method that dynamically creates local radiance fields. |
The proposed method significantly outperforms state-of-the-art methods in novel view synthesis quality on Tanks and Temples, Static Hikes, and a proprietary dataset.
FBA achieves superior pose estimation accuracy compared to existing methods, especially in large-scale scenes.
The incremental scene representation method effectively handles long camera trajectories and large-scale scenes by dynamically creating local radiance fields. |
The method's reliance on a good initialization of the local radiance fields might limit its performance in highly dynamic environments.
Future work could focus on incorporating semantic information into the framework for richer scene understanding. |
scene reconstruction, depth estimation, pose estimation, neural radiance fields, incremental learning |
2404.05979
Report |
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion |
Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, Changsheng Xu |
Story visualization aims to generate a series of realistic and coherent
images based on a storyline. Current models adopt a frame-by-frame architecture
by adapting a pre-trained text-to-image model to generate frames
auto-regressively. Although these models have shown notable progress, three
flaws remain. 1) Unidirectional auto-regressive generation
restricts usability in many scenarios. 2) The additionally introduced story
history encoders incur an extremely high computational cost. 3) The story
visualization and continuation models are trained and run independently,
which is not user-friendly. To this end, we propose a bidirectional, unified,
and efficient framework, namely StoryImager. The StoryImager enhances the
storyboard generative ability inherited from the pre-trained text-to-image
model for a bidirectional generation. Specifically, we introduce a Target Frame
Masking Strategy to extend and unify different story image generation tasks.
Furthermore, we propose a Frame-Story Cross Attention Module that decomposes
the cross attention for local fidelity and global coherence. Moreover, we
design a Contextual Feature Extractor to extract contextual information from
the whole storyline. The extensive experimental results demonstrate the
excellent performance of our StoryImager. The code is available at
https://github.com/tobran/StoryImager. |
This paper proposes StoryImager, a unified and efficient framework for story visualization and completion, capable of generating coherent and high-fidelity story image sequences. |
Existing story visualization models suffer from limitations such as unidirectional generation, high computational cost, and separate training for different tasks. This limits their usability and efficiency. |
StoryImager leverages a Storyboard-based Generation approach with a Target Frame Masking Strategy to unify different story image generation tasks. It introduces a Frame-Story Cross Attention Module for local fidelity and global coherence and uses a Contextual Feature Extractor for global context information. |
StoryImager outperforms previous state-of-the-art models in FID and FSD on both story visualization and continuation tasks.
It demonstrates significant improvements in visual consistency and story relevance based on human evaluation.
StoryImager is computationally more efficient and requires fewer hardware resources than previous models. |
The model currently only supports a fixed number of frames in a storyboard.
The inference speed is still limited by the diffusion model's sampling steps. |
story visualization, story completion, generative model, diffusion model, computer vision |
2404.05961
Report |
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders |
Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy |
Large decoder-only language models (LLMs) are the state-of-the-art models on
most of today's NLP tasks and benchmarks. Yet, the community is only slowly
adopting these models for text embedding tasks, which require rich
contextualized representations. In this work, we introduce LLM2Vec, a simple
unsupervised approach that can transform any decoder-only LLM into a strong
text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional
attention, 2) masked next token prediction, and 3) unsupervised contrastive
learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3
popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed
models on English word- and sequence-level tasks. We outperform encoder-only
models by a large margin on word-level tasks and reach a new unsupervised
state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).
Moreover, when combining LLM2Vec with supervised contrastive learning, we
achieve state-of-the-art performance on MTEB among models that train only on
publicly available data. Our strong empirical results and extensive analysis
demonstrate that LLMs can be effectively transformed into universal text
encoders in a parameter-efficient manner without the need for expensive
adaptation or synthetic GPT-4 generated data. |
The paper introduces LLM2Vec, a simple unsupervised approach for transforming decoder-only LLMs into strong text encoders using bidirectional attention, masked next token prediction, and unsupervised contrastive learning. |
LLMs are powerful text encoders, but their causal attention mechanism limits their ability to generate rich contextual representations. LLM2Vec overcomes this limitation, enabling the use of LLMs for a wider range of NLP tasks. |
LLM2Vec consists of three steps: 1) enabling bidirectional attention, 2) adapting the model to bidirectional attention via masked next token prediction (MNTP) training, and 3) applying unsupervised contrastive learning (SimCSE) for better sequence representations. |
LLM2Vec-transformed models outperform encoder-only models by a large margin on word-level tasks like chunking, NER, and POS tagging.
LLM2Vec achieves state-of-the-art performance among unsupervised models on the Massive Text Embeddings Benchmark (MTEB).
Combining LLM2Vec with supervised contrastive learning achieves state-of-the-art performance on MTEB among models trained only on publicly available data. |
The paper primarily focuses on English text corpora and benchmarks. Extending LLM2Vec to other languages is left for future work.
The large size of decoder-only LLMs presents challenges for training and inference efficiency, especially for encoding large document collections. |
text embedding, language models, decoder-only models, bidirectional attention, contrastive learning |
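The third LLM2Vec step, unsupervised contrastive learning (SimCSE), can be sketched as an in-batch contrastive loss. In SimCSE the two views of a sentence come from encoding it twice with different dropout masks; here the embeddings are given as plain lists, and the function name `info_nce` and the temperature value are assumptions for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.05):
    """Mean contrastive loss: view_a[i] should match view_b[i], not view_b[j].

    Illustrative sketch; the temperature is an assumed value.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


sentences_view_a = [[1.0, 0.0], [0.0, 1.0]]
sentences_view_b = [[0.9, 0.1], [0.1, 0.9]]  # slightly perturbed second views
aligned_loss = info_nce(sentences_view_a, sentences_view_b)
shuffled_loss = info_nce(sentences_view_a, list(reversed(sentences_view_b)))
```

The loss is small when each sentence's two views are closer to each other than to other sentences in the batch, and large when the pairing is broken.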
2404.05729
Report |
Finding Visual Task Vectors |
Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar |
Visual Prompting is a technique for teaching models to perform a visual task
via in-context examples, without any additional training. In this work, we
analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find
task vectors, activations that encode task-specific information. Equipped with
this insight, we demonstrate that it is possible to identify the task vectors
and use them to guide the network towards performing different tasks without
providing any input-output examples. To find task vectors, we compute the
average intermediate activations per task and use the REINFORCE algorithm to
search for the subset of task vectors. The resulting task vectors guide the
model towards performing a task better than the original model without the need
for input-output examples. |
This paper explores visual in-context learning by identifying and leveraging "task vectors", which are activations within computer vision models that encode task-specific information, to enable zero-shot task execution. |
This research provides insights into the mechanisms of visual in-context learning and offers a novel approach to adapt models for specific tasks without requiring input-output examples, potentially improving efficiency and flexibility. |
The authors propose an activation scoring mechanism to rank activations based on their task-specificity and use REINFORCE to identify the optimal subset of task vectors that, when patched into the model, guide it towards performing the desired task. |
The study reveals the existence of visual task vectors in the activation space of vision transformers, particularly within specific attention heads.
The proposed method enables zero-shot visual task execution by patching identified task vectors, achieving comparable or even superior performance to one-shot prompting methods.
The research highlights the distributed nature of task vectors across both the encoder and decoder of the network, emphasizing their complex role in visual in-context learning. |
The study focuses primarily on identifying task vectors, while acknowledging the potential presence of other important vector types, such as those encoding image structure, requiring further investigation.
The optimization process currently relies on evaluating model performance in pixel space for most tasks, leaving room for exploring alternative evaluation metrics in the VQGAN token space for potentially improved accuracy. |
visual in-context learning, task vectors, zero-shot learning, visual prompting, vision transformers |
2404.05726
Report |
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim |
With the success of large language models (LLMs), integrating the vision
model into LLMs to build vision-language foundation models has gained much more
interest recently. However, existing LLM-based large multimodal models (e.g.,
Video-LLaMA, VideoChat) can only take in a limited number of frames for short
video understanding. In this study, we mainly focus on designing an efficient
and effective model for long-term video understanding. Instead of trying to
process more frames simultaneously like most existing work, we propose to
process videos in an online manner and store past video information in a memory
bank. This allows our model to reference historical video content for long-term
analysis without exceeding LLMs' context length constraints or GPU memory
limits. Our memory bank can be seamlessly integrated into current multimodal
LLMs in an off-the-shelf manner. We conduct extensive experiments on various
video understanding tasks, such as long-video understanding, video question
answering, and video captioning, and our model can achieve state-of-the-art
performances across multiple datasets. Code available at
https://boheumd.github.io/MA-LMM/. |
This paper introduces MA-LMM, a memory-augmented large multimodal model designed for efficient and effective long-term video understanding. |
Existing LLM-based multimodal models struggle with long videos due to limited context length and high GPU memory consumption. MA-LMM addresses these issues by processing videos in an online manner and storing historical information in a memory bank. |
MA-LMM processes video frames sequentially. It employs a visual memory bank to store raw visual features and a query memory bank to capture evolving video understanding from a Q-Former. A memory bank compression technique mitigates memory demands by merging similar adjacent features. |
MA-LMM achieves state-of-the-art results on long-term video understanding benchmarks (LVU, Breakfast, COIN).
It outperforms existing methods on video question answering (MSRVTT-QA, MSVD-QA) and video captioning (MSRVTT, MSVD, YouCook2) datasets.
Ablation studies demonstrate the contribution of each memory bank and the effectiveness of the memory bank compression technique. |
Processing long videos can still lead to prolonged inference times.
Future work includes using a video or clip-based visual encoder, pre-training on large-scale video-text datasets, and incorporating a more powerful LLM. |
long-term video understanding, large multimodal models, memory bank, video question answering, video captioning |
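The memory bank compression described for MA-LMM above (merging similar adjacent features) can be illustrated with a minimal sketch. It assumes features are plain lists of floats, that similarity is cosine similarity, and that a merged pair is replaced by its element-wise average; the function names and the max_len parameter are hypothetical and are not taken from the released code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress_memory_bank(bank, max_len):
    """Merge the most similar adjacent pair until the bank fits max_len.

    Assumption: merging means averaging the two neighbours, which keeps
    the temporal order of the remaining entries intact.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    bank = [list(f) for f in bank]  # work on a copy
    while len(bank) > max_len:
        sims = [cosine(bank[i], bank[i + 1]) for i in range(len(bank) - 1)]
        k = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2.0 for x, y in zip(bank[k], bank[k + 1])]
        bank[k:k + 2] = [merged]  # replace the pair with its average
    return bank
```

Calling compress_memory_bank after every appended frame would keep the bank length bounded regardless of video length, which is the property the report attributes to the memory bank.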
2404.05719
Report |
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs |
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan |
Recent advancements in multimodal large language models (MLLMs) have been
noteworthy, yet, these general-domain MLLMs often fall short in their ability
to comprehend and interact effectively with user interface (UI) screens. In
this paper, we present Ferret-UI, a new MLLM tailored for enhanced
understanding of mobile UI screens, equipped with referring, grounding, and
reasoning capabilities. Given that UI screens typically exhibit a more
elongated aspect ratio and contain smaller objects of interest (e.g., icons,
texts) than natural images, we incorporate "any resolution" on top of Ferret to
magnify details and leverage enhanced visual features. Specifically, each
screen is divided into 2 sub-images based on the original aspect ratio (i.e.,
horizontal division for portrait screens and vertical division for landscape
screens). Both sub-images are encoded separately before being sent to LLMs. We
meticulously gather training samples from an extensive range of elementary UI
tasks, such as icon recognition, find text, and widget listing. These samples
are formatted for instruction-following with region annotations to facilitate
precise referring and grounding. To augment the model's reasoning ability, we
further compile a dataset for advanced tasks, including detailed description,
perception/interaction conversations, and function inference. After training on
the curated datasets, Ferret-UI exhibits outstanding comprehension of UI
screens and the capability to execute open-ended instructions. For model
evaluation, we establish a comprehensive benchmark encompassing all the
aforementioned tasks. Ferret-UI excels not only beyond most open-source UI
MLLMs, but also surpasses GPT-4V on all the elementary UI tasks. |
This paper presents Ferret-UI, a new multimodal large language model (MLLM) specifically designed for mobile UI understanding with enhanced referring, grounding, and reasoning capabilities. |
Existing MLLMs often struggle with the unique characteristics of UI screens, such as elongated aspect ratios and small objects of interest, limiting their effectiveness in UI understanding and interaction tasks. |
The authors build upon the Ferret MLLM and introduce several key innovations: (1) Integration of "any resolution" to handle varying screen aspect ratios and enhance detail; (2) Training on a meticulously curated dataset encompassing elementary UI tasks (e.g., icon recognition, widget listing) and advanced tasks (e.g., detailed description, function inference); (3) Development of a comprehensive benchmark for evaluating model performance across various UI understanding tasks. |
Ferret-UI significantly outperforms the base Ferret model and other open-source UI MLLMs on various tasks, highlighting the importance of domain-specific training.
In comparison to GPT-4V, Ferret-UI demonstrates superior performance in elementary UI tasks, especially on Android screens with numerous small widgets.
Ferret-UI exhibits strong performance in advanced UI tasks, including generating detailed descriptions, engaging in grounded conversations, and inferring screen functionality. |
The model's reliance on UI element detection poses a limitation, as it cannot learn aspects of screens not detected, such as colors, design, or missed UI elements.
Future work includes exploring interactions beyond tapping, such as scrolling, long-clicking, and text input. |
ui understanding, multimodal large language models (mllms), referring and grounding, mobile ui, screen understanding |
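The screen division that the Ferret-UI abstract describes (two sub-images, split horizontally for portrait screens and vertically for landscape screens) can be sketched as a small helper. The function name, the (left, top, right, bottom) box convention, and the choice to treat square screens as portrait are assumptions made for illustration.

```python
def split_screen(width, height):
    """Return two crop boxes (left, top, right, bottom) for a UI screenshot.

    Portrait screens are cut along a horizontal line (top and bottom halves);
    landscape screens are cut along a vertical line (left and right halves).
    Assumption: a square screen is handled like a portrait one.
    """
    if height >= width:
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]
```

Each box would then be cropped and encoded separately, so that small icons and text occupy more of the encoder's input resolution than they would in the full screenshot.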
2404.05705
Report |
Learning 3D-Aware GANs from Unposed Images with Template Feature Field |
Xinya Chen, Hanlei Guo, Yanrui Bin, Shangzhan Zhang, Yuanbo Yang, Yue Wang, Yujun Shen, Yiyi Liao |
Collecting accurate camera poses of training images has been shown to well
serve the learning of 3D-aware generative adversarial networks (GANs) yet can
be quite expensive in practice. This work targets learning 3D-aware GANs from
unposed images, for which we propose to perform on-the-fly pose estimation of
training images with a learned template feature field (TeFF). Concretely, in
addition to a generative radiance field as in previous approaches, we ask the
generator to also learn a field from 2D semantic features while sharing the
density from the radiance field. Such a framework allows us to acquire a
canonical 3D feature template leveraging the dataset mean discovered by the
generative model, and further efficiently estimate the pose parameters on real
data. Experimental results on various challenging datasets demonstrate the
superiority of our approach over state-of-the-art alternatives from both the
qualitative and the quantitative perspectives. |
This paper presents TeFF, a novel 3D-aware GAN that learns a 3D semantic template feature field alongside the generative model to estimate camera poses of real-world images on the fly, eliminating the need for known camera pose distribution during training. |
This is important because estimating camera poses for real-world images is difficult and expensive, especially for diverse object categories. |
The method augments a generative radiance field with a semantic feature field, using the mean feature field as a template for camera pose estimation. It discretizes azimuth and elevation angles for grid search, utilizes phase correlation to estimate scale and in-plane rotation, and samples camera poses during training based on a probability distribution function derived from matching errors. |
TeFF generates complete 3D geometry even for complex pose distributions, outperforming baselines on datasets like CompCars, SDIP Elephant, and LSUN Plane.
The method achieves superior pose distribution estimation compared to previous approaches like 3DGP and PoF3D, as demonstrated by lower KL divergence values.
Ablation studies confirm the effectiveness of using template feature fields and incorporating four degrees of freedom in camera pose estimation. |
TeFF currently struggles with images exhibiting significant perspective distortion and does not model object articulation.
Future work could explore using multiple templates for multi-category learning and disentangling geometry information during pose estimation. |
3d-aware gan, camera pose estimation, semantic feature field, unposed image synthesis, generative radiance fields |
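The pose search summarized for TeFF above (a discretized azimuth and elevation grid, then sampling poses from a distribution derived from matching errors) can be sketched as follows. This is only an illustration: the softmax over negative errors, the temperature parameter, the elevation range, and all function names are assumptions, and the phase-correlation step for scale and in-plane rotation is left out.

```python
import math
import random

def angle_grid(n_az, n_el, el_min=-30.0, el_max=30.0):
    """Discretize azimuth over [0, 360) and elevation over [el_min, el_max]."""
    az = [360.0 * i / n_az for i in range(n_az)]
    if n_el == 1:
        el = [(el_min + el_max) / 2.0]
    else:
        el = [el_min + (el_max - el_min) * j / (n_el - 1) for j in range(n_el)]
    return [(a, e) for a in az for e in el]

def pose_distribution(errors, temperature=1.0):
    """Turn per-pose matching errors into probabilities (lower error, higher mass).

    Assumption: a softmax over negative errors; the paper's exact form may differ.
    """
    m = min(errors.values())  # subtract the minimum for numerical stability
    weights = {p: math.exp(-(e - m) / temperature) for p, e in errors.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

def sample_pose(probs, rng):
    """Draw one pose according to the given probabilities."""
    poses = list(probs)
    return rng.choices(poses, weights=[probs[p] for p in poses], k=1)[0]
```

Sampling rather than taking the single best match keeps ambiguous views (for example, near-symmetric objects) from collapsing onto one pose during training.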
2404.05674
Report |
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation |
Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang |
In this paper, we present MoMA: an open-vocabulary, training-free
personalized image model that boasts flexible zero-shot capabilities. As
foundational text-to-image models rapidly evolve, the demand for robust
image-to-image translation grows. Addressing this need, MoMA specializes in
subject-driven personalized image generation. Utilizing an open-source,
Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as
both a feature extractor and a generator. This approach effectively synergizes
reference image and text prompt information to produce valuable image features,
facilitating an image diffusion model. To better leverage the generated
features, we further introduce a novel self-attention shortcut method that
efficiently transfers image features to an image diffusion model, improving the
resemblance of the target object in generated images. Remarkably, as a
tuning-free plug-and-play module, our model requires only a single reference
image and outperforms existing methods in generating images with high detail
fidelity, enhanced identity-preservation and prompt faithfulness. Our work is
open-source, thereby providing universal access to these advancements. |
This paper introduces MoMA, a novel, open-vocabulary, and tuning-free image personalization model for subject-driven image generation that excels in detail fidelity, object identity resemblance, and prompt integration. |
Existing methods for subject-driven image generation require extensive resources for tuning or are limited to specific domains. MoMA addresses these limitations by using a pre-trained multimodal LLM for blending text prompts with visual features, enabling alterations in both background context and object texture. |
MoMA uses a three-part methodology: 1) A generative multimodal decoder (adapted LLaVA-7B) extracts and modifies image features from the reference image based on the target prompt. 2) Self-attention layers extract object features from a white-background version of the reference image. 3) Contextualized image features and object image features are injected into the UNet diffusion model during image generation. |
MoMA demonstrates superior detail accuracy and faithfulness to the target object across varied backgrounds in recontextualization tasks.
MoMA effectively alters the texture of target objects as dictated by text prompts while preserving unmentioned visual features.
Quantitative comparisons show that MoMA outperforms existing tuning-free methods in subject fidelity and prompt faithfulness for both recontextualization and texture editing. |
MoMA may struggle to accurately reproduce details for rare subjects, especially those containing text.
The model could be misused to generate deceptive content, although training excludes person-related subjects to mitigate this risk. |
image generation, multimodal, personalization, llm, diffusion models |
2404.05673
Report |
CoReS: Orchestrating the Dance of Reasoning and Segmentation |
Xiaoyi Bao, Siyang Sun, Shuailei Ma, Kecheng Zheng, Yuxin Guo, Guosheng Zhao, Yun Zheng, Xingang Wang |
The reasoning segmentation task, which demands a nuanced comprehension of
intricate queries to accurately pinpoint object regions, is attracting
increasing attention. However, Multi-modal Large Language Models (MLLM) often
find it difficult to accurately localize the objects described in complex
reasoning contexts. We believe that the act of reasoning segmentation should
mirror the cognitive stages of human visual search, where each step is a
progressive refinement of thought toward the final object. Thus we introduce
the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual
hierarchy indeed enhances the visual search process. Specifically, we propose a
dual-chain structure that generates multi-modal, chain-like outputs to aid the
segmentation process. Furthermore, to steer the MLLM's outputs into this
intended hierarchy, we incorporate in-context inputs as guidance. Extensive
experiments demonstrate the superior performance of our CoReS, which surpasses
the state-of-the-art method by 7.1% on the ReasonSeg dataset. Project:
https://chain-of-reasoning-and-segmentation.github.io/. |
This paper presents CoReS, a novel dual-modal chain-of-thought framework for enhancing fine-grained reasoning tasks in Multi-modal Large Language Models (MLLMs). |
Existing MLLMs struggle to accurately segment objects described using complex reasoning, especially when differentiating visually similar objects. CoReS addresses this by mimicking the top-down visual hierarchy employed by humans during visual search. |
CoReS utilizes a dual-chain structure: a reasoning chain generates multi-modal, hierarchical outputs from the MLLM, and a segmentation chain leverages this information for iterative segmentation refinement. In-context inputs, composed of text-based question-answer pairs, guide the MLLM to produce outputs adhering to the desired hierarchy. |
CoReS outperforms state-of-the-art methods, achieving a 7.1% improvement on the ReasonSeg benchmark.
Ablation studies demonstrate the effectiveness of both the dual-chain structure and the in-context guidance.
CoReS exhibits greater performance gains on tasks involving complex reasoning and fine-grained segmentation, as evident from results on refCOCOg and ReasonPart datasets. |
The quality of the pre-constructed context library for in-context learning could be further improved.
Exploring deeper logical levels in the dual-chain structure, beyond the current two levels, is a potential direction for future research. |
reasoning segmentation, multi-modal learning, chain-of-thought, in-context learning, fine-grained segmentation |
2404.05667
Report |
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation |
Jiannan Ge, Lingxi Xie, Hongtao Xie, Pandeng Li, Xiaopeng Zhang, Yongdong Zhang, Qi Tian |
A serious issue that harms the performance of zero-shot visual recognition is
named objective misalignment, i.e., the learning objective prioritizes
improving the recognition accuracy of seen classes rather than unseen classes,
while the latter is the true target to pursue. This issue becomes more
significant in zero-shot image segmentation because the stronger (i.e.,
pixel-level) supervision brings a larger gap between seen and unseen classes.
To mitigate it, we propose a novel architecture named AlignZeg, which embodies
a comprehensive improvement of the segmentation pipeline, including proposal
extraction, classification, and correction, to better fit the goal of zero-shot
segmentation. (1) Mutually-Refined Proposal Extraction. AlignZeg harnesses a
mutual interaction between mask queries and visual features, facilitating
detailed class-agnostic mask proposal extraction. (2) Generalization-Enhanced
Proposal Classification. AlignZeg introduces synthetic data and incorporates
multiple background prototypes to allocate a more generalizable feature space.
(3) Predictive Bias Correction. During the inference stage, AlignZeg uses a
class indicator to find potential unseen class proposals followed by a
prediction postprocess to correct the prediction bias. Experiments demonstrate
that AlignZeg markedly enhances zero-shot semantic segmentation, as shown by an
average 3.8% increase in hIoU, primarily attributed to a 7.1% improvement in
identifying unseen classes, and we further validate that the improvement comes
from alleviating the objective misalignment issue. |
This supplementary material provides further technical specifications, additional ablation studies, expanded visualization outcomes, and more thorough comparisons with related methodologies for the AlignZeg approach. |
The supplementary materials offer a deeper understanding of the AlignZeg method, its effectiveness in Generalized Zero-Shot Semantic Segmentation, and its advancements over existing techniques. |
The paper elaborates on the technical intricacies of AlignZeg, presents supplementary ablation experiments focusing on parameters like λ3 and M, and provides visual representations of proposal features and segmentation results. Additionally, it delves into comparative analyses with other methodologies like ZegCLIP, SAN, DeOP, MAFT, and PMOSR. |
The effectiveness of the feature expansion strategy is validated through ablation studies on the weight (λ3) for the loss Lvir.
Visualizations of proposal features highlight the improved discriminative capability of AlignZeg compared to baseline methods.
AlignZeg demonstrates superior performance in complex scenarios on datasets such as COCO-Stuff 164K, accurately identifying both seen and unseen class regions across various settings. |
The reliance on fixed category prototypes in AlignZeg might limit the mitigation of feature entanglement for certain closely related categories.
Future research could explore the adaptation of category prototypes to further enhance the model's generalization capabilities. |
zero-shot learning, semantic segmentation, computer vision, deep learning, clip |
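The hIoU metric quoted in the AlignZeg abstract is, in generalized zero-shot segmentation, the harmonic mean of the mIoU on seen classes and the mIoU on unseen classes. A small sketch makes clear why gains on unseen classes dominate it; the function name is hypothetical and the numbers in the usage note are illustrative, not results from the paper.

```python
def harmonic_iou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU (hIoU)."""
    total = miou_seen + miou_unseen
    if total == 0:
        return 0.0
    return 2.0 * miou_seen * miou_unseen / total
```

For example, harmonic_iou(0.8, 0.2) is about 0.32, far below the arithmetic mean of 0.5, so a model that favors seen classes is penalized heavily, which matches the objective-misalignment argument in the abstract.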
2404.05666
Report |
YaART: Yet Another ART Rendering Technology |
Sergey Kastryulin, Artem Konev, Alexander Shishenya, Eugene Lyapustin, Artem Khurshudov, Alexander Tselousov, Nikita Vinokurov, Denis Kuznedelev, Alexander Markovich, Grigoriy Livshits, Alexey Kirillov, Anastasiia Tabisheva, Liubov Chubarova, Marina Kaminskaia, Alexander Ustyuzhanin, Artemii Shvetsov, Daniil Shlenskii, Valerii Startsev, Dmitrii Kornilov, Mikhail Romanov, Artem Babenko, Sergei Ovcharenko, Valentin Khrulkov |
In the rapidly progressing field of generative models, the development of
efficient and high-fidelity text-to-image diffusion systems represents a
significant frontier. This study introduces YaART, a novel production-grade
text-to-image cascaded diffusion model aligned to human preferences using
Reinforcement Learning from Human Feedback (RLHF). During the development of
YaART, we especially focus on the choices of the model and training dataset
sizes, the aspects that were not systematically investigated for text-to-image
cascaded diffusion models before. In particular, we comprehensively analyze how
these choices affect both the efficiency of the training process and the
quality of the generated images, which are highly important in practice.
Furthermore, we demonstrate that models trained on smaller datasets of
higher-quality images can successfully compete with those trained on larger
datasets, establishing a more efficient scenario of diffusion models training.
From the quality perspective, YaART is consistently preferred by users over
many existing state-of-the-art models. |
This paper introduces YaART, a production-grade text-to-image cascaded diffusion model fine-tuned with RLHF, emphasizing efficient data and computational resource usage. |
It addresses the challenge of balancing model scale, data size, and computational cost in achieving high-fidelity text-to-image generation. |
The system employs a cascaded diffusion model architecture with three stages (GEN64, SR256, SR1024), trained using a large, high-quality dataset filtered with a Sample Fidelity Classifier. The model is further enhanced using supervised fine-tuning and RLHF for improved aesthetics and reduced defects. |
Scaling model size improves training efficiency and generation quality in cascaded diffusion models.
Training on a small, high-quality dataset can achieve comparable results to training on a larger dataset.
RLHF significantly improves image aesthetics and reduces defects while preserving image-text relevance. |
Current diffusion models require substantial human supervision for optimal results (prompt engineering, parameter tuning, post-filtering).
Text generation quality is currently insufficient for practical use, leading to the exclusion of text-containing images from the training dataset. |
diffusion models, text-to-image generation, cascaded diffusion, rlhf, dataset scaling |
2404.05662
Report |
BinaryDM: Towards Accurate Binarization of Diffusion Model |
Xingyu Zheng, Haotong Qin, Xudong Ma, Mingyuan Zhang, Haojie Hao, Jiakai Wang, Zixiang Zhao, Jinyang Guo, Xianglong Liu |
With the advancement of diffusion models (DMs) and the substantially
increased computational requirements, quantization emerges as a practical
solution to obtain compact and efficient low-bit DMs. However, the highly
discrete representation leads to severe accuracy degradation, hindering the
quantization of diffusion models to ultra-low bit-widths. In this paper, we
propose BinaryDM, a novel accurate quantization-aware training approach to push
the weights of diffusion models towards the limit of 1-bit. Firstly, we present
a Learnable Multi-basis Binarizer (LMB) to recover the representations
generated by the binarized DM, which improves the information in details of
representations crucial to the DM. Secondly, a Low-rank Representation
Mimicking (LRM) is applied to enhance the binarization-aware optimization of
the DM, alleviating the optimization direction ambiguity caused by fine-grained
alignment. Moreover, a progressive initialization strategy is applied to
training DMs to avoid convergence difficulties. Comprehensive experiments
demonstrate that BinaryDM achieves significant accuracy and efficiency gains
compared to SOTA quantization methods of DMs under ultra-low bit-widths. As the
first binarization method for diffusion models, BinaryDM achieves impressive
16.0 times FLOPs and 27.1 times storage savings with 1-bit weight and 4-bit
activation, showcasing its substantial advantages and potential for deploying
DMs on resource-limited scenarios. |
This paper proposes BinaryDM, a novel quantization-aware training approach to achieve accurate 1-bit weight diffusion models. |
Quantization, especially binarization, is crucial for deploying diffusion models on resource-limited devices by offering compact model size and efficient inference. However, existing methods suffer severe accuracy degradation when applied to diffusion models with ultra-low bit-widths. |
BinaryDM introduces two key components: 1) a Learnable Multi-basis Binarizer (LMB) to recover rich representations from binarized weights, and 2) a Low-rank Representation Mimicking (LRM) strategy to enhance optimization stability and accuracy by aligning binarized models with full-precision counterparts in a low-rank space. Additionally, a progressive initialization strategy is employed to facilitate training convergence. |
BinaryDM achieves significant accuracy and efficiency gains compared to existing quantization methods for diffusion models under ultra-low bit-widths (1-bit weight and 4/8-bit activations).
The proposed method demonstrates strong performance on both unconditional and conditional image generation tasks across various datasets, including CIFAR-10, LSUN, FFHQ, and ImageNet.
BinaryDM achieves impressive compression and acceleration, yielding up to 16.0× FLOPs and 27.1× storage savings. |
The training process of BinaryDM is still computationally intensive compared to post-training quantization methods.
Further research can explore extending BinaryDM to other diffusion model variants and exploring its performance in more downstream applications. |
diffusion models, model quantization, binarization, generative models, model compression |
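The idea behind a multi-basis binarizer, as summarized for BinaryDM above, can be illustrated with a two-basis residual approximation: a first binary basis captures the sign pattern, and a second one binarizes what is left over. This is a sketch under assumptions: the scales here are computed analytically as mean absolute values, whereas the paper describes them as learnable, and the function names are hypothetical.

```python
def sign(x):
    """Binary sign in {-1.0, +1.0}; zero maps to +1.0 by convention."""
    return 1.0 if x >= 0 else -1.0

def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

def multi_basis_binarize(weights):
    """Approximate weights as alpha1 * b1 + alpha2 * b2 with b1, b2 binary.

    Assumption: alpha1 and alpha2 are set to mean absolute values; in a
    trainable binarizer these would only serve as initial values.
    """
    alpha1 = mean_abs(weights)
    b1 = [sign(w) for w in weights]
    residual = [w - alpha1 * b for w, b in zip(weights, b1)]
    alpha2 = mean_abs(residual)
    b2 = [sign(r) for r in residual]
    approx = [alpha1 * x + alpha2 * y for x, y in zip(b1, b2)]
    return approx, (alpha1, alpha2)
```

With weights [1.0, -1.0, 3.0, -3.0], a single basis can only represent the values plus or minus 2.0, while the two-basis form recovers all four weights exactly, which shows how the extra basis restores detail lost to plain sign binarization.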
2404.05657
Report |
MLP Can Be A Good Transformer Learner |
Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang |
Self-attention mechanism is the key of the Transformer but often criticized
for its computation demands. Previous token pruning works motivate their
methods from the view of computation redundancy but still need to load the full
network and require same memory costs. This paper introduces a novel strategy
that simplifies vision transformers and reduces computational load through the
selective removal of non-essential attention layers, guided by entropy
considerations. We identify that regarding the attention layer in bottom
blocks, their subsequent MLP layers, i.e. two feed-forward layers, can elicit
the same entropy quantity. Meanwhile, the accompanied MLPs are under-exploited
since they exhibit smaller feature entropy compared to those MLPs in the top
blocks. Therefore, we propose to integrate the uninformative attention layers
into their subsequent counterparts by degenerating them into identical mapping,
yielding only MLP in certain transformer blocks. Experimental results on
ImageNet-1k show that the proposed method can remove 40% attention layer of
DeiT-B, improving throughput and memory bound without performance compromise.
Code is available at https://github.com/sihaoevery/lambda_vit. |
This paper proposes a novel method for simplifying vision transformers by selectively removing non-essential attention layers based on entropy considerations, leading to reduced computational load and memory footprint without sacrificing performance. |
The self-attention mechanism in transformers, while powerful, is computationally demanding. Existing token pruning methods address computational redundancy but don't reduce memory costs. This work aims to push the memory bound by directly removing uninformative attention layers. |
The method leverages entropy to quantify the information carried by attention layers. It employs a novel Entropy-based Selection Strategy (NOSE) to identify combinations of attention layers with minimal impact on final output. A dilution learning technique then degenerates selected attention layers into identity mappings, effectively integrating them into subsequent MLP layers. |
The method can remove 40% of attention layers in DeiT-B without performance degradation on ImageNet-1k.
It improves throughput by up to 36.5% and memory bound by over 20% compared to existing token pruning methods.
The learned features exhibit superior transferability, outperforming competing methods in linear probing experiments on CIFAR-100. |
The paper primarily focuses on DeiT architecture; exploring other transformer variants could broaden applicability.
The impact of removing attention layers on downstream tasks beyond classification and segmentation requires further investigation. |
vision transformer, attention pruning, entropy, model compression, memory bound |
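The entropy-guided selection summarized above can be approximated with a very small stand-in: estimate each attention layer's feature entropy under a Gaussian assumption and pick the layers with the lowest values as removal candidates. This is not the paper's NOSE procedure, which the report describes as evaluating combinations of layers; the Gaussian proxy, the per-layer ranking, and the function names are assumptions for illustration.

```python
import math

def gaussian_entropy(values, eps=1e-12):
    """Differential entropy of a Gaussian fitted to the values:
    0.5 * ln(2 * pi * e * variance). eps guards against zero variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return 0.5 * math.log(2.0 * math.pi * math.e * (var + eps))

def pick_layers_to_remove(layer_features, num_remove):
    """Return indices of the num_remove layers with the lowest entropy.

    layer_features holds one flat list of feature values per attention layer.
    Simplification: layers are ranked independently rather than scored as
    combinations.
    """
    ranked = sorted(range(len(layer_features)),
                    key=lambda i: gaussian_entropy(layer_features[i]))
    return sorted(ranked[:num_remove])
```

In the method described above, the selected layers would then be gradually degenerated into identity mappings during training, leaving only the MLP in those blocks.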
2404.05626
Report |
Learning a Category-level Object Pose Estimator without Pose Annotations |
Fengrui Tian, Yaoyao Liu, Adam Kortylewski, Yueqi Duan, Shaoyi Du, Alan Yuille, Angtian Wang |
3D object pose estimation is a challenging task. Previous works always
require thousands of object images with annotated poses for learning the 3D
pose correspondence, which is laborious and time-consuming for labeling. In
this paper, we propose to learn a category-level 3D object pose estimator
without pose annotations. Instead of using manually annotated images, we
leverage diffusion models (e.g., Zero-1-to-3) to generate a set of images under
controlled pose differences and propose to learn our object pose estimator with
those images. Directly using the original diffusion model leads to images with
noisy poses and artifacts. To tackle this issue, firstly, we exploit an image
encoder, which is learned from a specially designed contrastive pose learning,
to filter the unreasonable details and extract image feature maps.
Additionally, we propose a novel learning strategy that allows the model to
learn object poses from those generated image sets without knowing the
alignment of their canonical poses. Experimental results show that our method
has the capability of category-level object pose estimation from a single shot
setting (as pose definition), while significantly outperforming other
state-of-the-art methods on the few-shot category-level object pose estimation
benchmarks. |
This paper presents detailed results of a new method for 3D pose estimation on the PASCAL3D+ dataset, particularly focusing on few-shot learning scenarios. |
Accurately estimating 3D object pose from a single image is crucial in various applications but remains challenging, especially with limited training data. |
The methodology leverages a diffusion model (Zero-1-to-3) to generate multiple views of an object with varying poses, which are then used to optimize neural meshes for pose estimation. |
The method achieves promising results even when only one annotated instance per object category is available (e.g., 87.4% accuracy for buses).
Performance improves with more annotated instances, demonstrating the effectiveness of the few-shot learning approach.
Detailed results are provided for seven object categories, including bus, car, boat, motorbike, bicycle, and aeroplane. |
The paper currently lacks details about the neural mesh optimization process.
Future work will focus on publicly releasing the implementation code. |
3d pose estimation, few-shot learning, diffusion models, neural meshes, pascal3d+ |
2404.05621
Report |
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning |
Matteo Farina, Massimiliano Mancini, Elia Cunegatti, Gaowen Liu, Giovanni Iacca, Elisa Ricci |
While excellent in transfer learning, Vision-Language models (VLMs) come with
high computational costs due to their large number of parameters. To address
this issue, removing parameters via model pruning is a viable solution.
However, existing techniques for VLMs are task-specific, and thus require
pruning the network from scratch for each new task of interest. In this work,
we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP).
Given a pretrained VLM, the goal is to find a unique pruned counterpart
transferable to multiple unknown downstream tasks. In this challenging setting,
the transferable representations already encoded in the pretrained model are a
key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a
first, gradient-free, pruning framework for TA-VLP where: (i) the importance of
a parameter is expressed in terms of its magnitude and its information flow, by
incorporating the saliency of the neurons it connects; and (ii) pruning is
driven by the emergent (multimodal) distribution of the VLM parameters after
pretraining. We benchmark eight state-of-the-art pruning algorithms in the
context of TA-VLP, experimenting with two VLMs, three vision-language tasks,
and three pruning ratios. Our experimental results show that MULTIFLOW
outperforms recent sophisticated, combinatorial competitors in the vast
majority of the cases, paving the way towards addressing TA-VLP. The code is
publicly available at https://github.com/FarinaMatteo/multiflow. |
This paper proposes Task-Agnostic Vision-Language Pruning (TA-VLP), aiming to prune a VLM once while maintaining transferability to unknown downstream tasks. |
Current VLM pruning methods are task-specific, demanding costly re-pruning for each new task. TA-VLP addresses this by enabling a single pruning step for multiple unknown downstream tasks. |
The paper introduces Multimodal Flow Pruning (MULTIFLOW), a gradient-free method for TA-VLP. MULTIFLOW models each layer as a bipartite graph, where a parameter's importance is determined by its magnitude and the saliency of the neurons it connects. It also incorporates the multimodal distribution of pretrained VLM parameters to avoid biases. |
MULTIFLOW outperforms or matches state-of-the-art pruning methods on various vision-language tasks (Image-Text Retrieval, Image Captioning, Visual Question Answering) across different pruning ratios.
Different VLMs (BLIP, XVLM) and tasks exhibit varying degrees of 'prunability'.
MULTIFLOW demonstrates robustness even at extreme sparsity (90%) with XVLM, highlighting its effectiveness for aggressive compression. |
The paper focuses on unstructured pruning, limiting its immediate impact on reducing FLOPs and runtime.
Future work could explore extending MULTIFLOW to structured pruning, leveraging its neuron-level importance formulation. |
vision-language models, model pruning, transfer learning, multimodality, information flow |
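The importance criterion in the MULTIFLOW record above (a parameter's magnitude combined with the saliency of the two neurons it connects) can be illustrated with a small sketch. This is not the authors' implementation: the saliency definition (mean absolute weight on a neuron's edges), the product form of the score, and the function names `neuron_saliency`, `importance_scores`, and `prune` are all assumptions made for illustration, and the modality-aware part of the method is omitted.

```python
def neuron_saliency(weights):
    """Return (input_saliency, output_saliency) for one layer.

    weights[i][j] is the edge from input neuron j to output neuron i,
    i.e. the layer is viewed as a bipartite graph.
    Assumption: a neuron's saliency is the mean |weight| over its edges.
    """
    n_out, n_in = len(weights), len(weights[0])
    out_sal = [sum(abs(w) for w in row) / n_in for row in weights]
    in_sal = [sum(abs(weights[i][j]) for i in range(n_out)) / n_out
              for j in range(n_in)]
    return in_sal, out_sal


def importance_scores(weights):
    """Score each weight by |w| times the saliency of both endpoints."""
    in_sal, out_sal = neuron_saliency(weights)
    return [[abs(w) * in_sal[j] * out_sal[i] for j, w in enumerate(row)]
            for i, row in enumerate(weights)]


def prune(weights, sparsity):
    """Zero out the lowest-scoring fraction `sparsity` of the weights."""
    scores = importance_scores(weights)
    ranked = sorted((scores[i][j], i, j)
                    for i in range(len(weights))
                    for j in range(len(weights[0])))
    n_prune = int(len(ranked) * sparsity)
    pruned = [list(row) for row in weights]
    for _, i, j in ranked[:n_prune]:
        pruned[i][j] = 0.0
    return pruned
```

No gradients are needed: the score depends only on the weights themselves, which is what makes a criterion of this kind cheap to apply once before any downstream task is known.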
2404.05607
Report |
A Training-Free Plug-and-Play Watermark Framework for Stable Diffusion |
Guokai Zhang, Lanjun Wang, Yuting Su, An-An Liu |
Nowadays, the family of Stable Diffusion (SD) models has gained prominence
for its high quality outputs and scalability. This has also raised security
concerns on social media, as malicious users can create and disseminate harmful
content. Existing approaches involve training components or entire SDs to embed
a watermark in generated images for traceability and responsibility
attribution. However, in the era of AI-generated content (AIGC), the rapid
iteration of SDs renders retraining with watermark models costly. To address
this, we propose a training-free plug-and-play watermark framework for SDs.
Without modifying any components of SDs, we embed diverse watermarks in the
latent space, adapting to the denoising process. Our experimental findings
reveal that our method effectively harmonizes image quality and watermark
invisibility. Furthermore, it performs robustly under various attacks. We have
also validated that our method generalizes to multiple versions of SDs, even
without retraining the watermark model. |
This paper proposes a training-free plug-and-play watermark framework for Stable Diffusion models, enabling embedding diverse watermarks in the latent space without retraining. |
The rapid evolution of SD models makes retraining watermark models costly and impractical. This framework offers a flexible and efficient alternative for embedding traceable watermarks in generated images. |
The method involves training a watermark encoder-decoder architecture using a frozen VAE encoder-decoder from SD. The compressed watermark is embedded in the latent code after denoising, minimizing impact on image quality. |
Achieves high watermark invisibility, evidenced by high PSNR and SSIM scores.
Maintains good watermark extraction quality, with high NC and low CER values.
Demonstrates generalization across different SD versions without retraining. |
Watermark robustness against high-angle rotations requires further investigation.
Localized pixel variations may occur in specific samples after watermark embedding. |
watermarking, stable diffusion, text-to-image synthesis, training-free, plug-and-play |
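The watermarking record above reports invisibility via PSNR and extraction quality via NC. A minimal sketch of both metrics follows, using the common textbook definitions; the paper's exact formulas (in particular for NC, taken here to be normalized correlation) may differ, and the function names and flattened-list inputs are assumptions.

```python
import math


def psnr(original, watermarked, max_value=255.0):
    """Peak signal-to-noise ratio between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def normalized_correlation(embedded, extracted):
    """Normalized correlation between embedded and extracted watermark values."""
    num = sum(a * b for a, b in zip(embedded, extracted))
    den = (math.sqrt(sum(a * a for a in embedded))
           * math.sqrt(sum(b * b for b in extracted)))
    return num / den if den else 0.0
```

A higher PSNR means the watermarked image is closer to the clean one, and an NC near 1 means the extracted watermark closely matches the embedded one.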
2404.05603
Report |
Self-Explainable Affordance Learning with Embodied Caption |
Zhipeng Zhang, Zhimin Wei, Guolei Sun, Peng Wang, Luc Van Gool |
In the field of visual affordance learning, previous methods mainly used
abundant images or videos that delineate human behavior patterns to identify
action possibility regions for object manipulation, with a variety of
applications in robotic tasks. However, they face the main challenge of
action ambiguity, for example whether to beat or carry a drum, as well as
the complexities involved in processing intricate scenes. Moreover, timely
human intervention is important for rectifying robot errors. To
address these issues, we introduce Self-Explainable Affordance learning (SEA)
with embodied caption. This innovation enables robots to articulate their
intentions and bridge the gap between explainable vision-language caption and
visual affordance learning. Due to the lack of an appropriate dataset, we unveil a
pioneering dataset and metrics tailored for this task, which integrates images,
heatmaps, and embodied captions. Furthermore, we propose a novel model to
effectively combine affordance grounding with self-explanation in a simple but
efficient manner. Extensive quantitative and qualitative experiments
demonstrate our method's effectiveness. |
This paper introduces Self-Explainable Affordance Learning (SEA), a new paradigm for robots to not only learn touchable regions for object manipulation, but also generate embodied captions explaining their intended actions and target objects. |
Existing visual affordance learning methods suffer from action ambiguity in complex scenes, lacking interpretability for human understanding and potential error correction. |
A novel SEA dataset with embodied captions is created based on AGD20K. The proposed SEA model utilizes DINO-ViT and CLIP for visual and multimodal embedding respectively, a Self-Explainable Former for action and object classification, and a Pixel-level Fusion architecture for affordance map localization. |
SEA effectively combines affordance grounding with self-explanation, outperforming baselines on affordance grounding metrics (KLD, SIM, NSS).
The model successfully generates self-explainable captions, achieving high accuracy in object-action identification using top-k metrics.
Qualitative results showcase reduced ambiguity and improved interpretability in predicting touchable regions and associated actions. |
Limitations exist in handling complex, open-world scenarios with intricate sentence structures.
Future work will explore advanced language models and feedback mechanisms for enhanced human-robot interaction. |
embodied caption, visual affordance learning, self-explainable ai, robotics, vision-language |
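The SEA record above evaluates affordance grounding with KLD, SIM, and NSS. As a rough illustration, two of these (KLD and SIM) can be computed between a predicted and a ground-truth heatmap as below. The definitions follow the usual saliency-evaluation conventions and are an assumption here; the heatmaps are taken to be flattened lists of non-negative values with a positive sum.

```python
import math


def normalize(heatmap):
    """Scale a flattened, non-negative heatmap so that it sums to 1."""
    total = sum(heatmap)
    return [v / total for v in heatmap]


def sim(pred, gt):
    """Histogram intersection of the two normalized maps (1.0 = identical)."""
    p, q = normalize(pred), normalize(gt)
    return sum(min(a, b) for a, b in zip(p, q))


def kld(pred, gt, eps=1e-12):
    """KL divergence of the ground truth from the prediction (0.0 = identical)."""
    p, q = normalize(pred), normalize(gt)
    return sum(b * math.log(eps + b / (a + eps)) for a, b in zip(p, q))
```

Lower KLD and higher SIM both indicate that the predicted affordance region overlaps better with the annotated one.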
2404.05595
Report |
UniFL: Improve Stable Diffusion via Unified Feedback Learning |
Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Min Zheng, Lean Fu, Guanbin Li |
Diffusion models have revolutionized the field of image generation, leading
to the proliferation of high-quality models and diverse downstream
applications. However, despite these significant advancements, the current
competitive solutions still suffer from several limitations, including inferior
visual quality, a lack of aesthetic appeal, and inefficient inference, without
a comprehensive solution in sight. To address these challenges, we present
UniFL, a unified framework that leverages feedback learning to enhance
diffusion models comprehensively. UniFL stands out as a universal, effective,
and generalizable solution applicable to various diffusion models, such as
SD1.5 and SDXL. Notably, UniFL incorporates three key components: perceptual
feedback learning, which enhances visual quality; decoupled feedback learning,
which improves aesthetic appeal; and adversarial feedback learning, which
optimizes inference speed. In-depth experiments and extensive user studies
validate the superior performance of our proposed method in enhancing both the
quality of generated models and their acceleration. For instance, UniFL
surpasses ImageReward by 17% user preference in terms of generation quality and
outperforms LCM and SDXL Turbo by 57% and 20% in 4-step inference. Moreover, we
have verified the efficacy of our approach in downstream tasks, including LoRA,
ControlNet, and AnimateDiff. |
This paper presents UniFL, a unified feedback learning framework that improves the visual quality, aesthetics, and inference speed of diffusion models, particularly Stable Diffusion. |
Existing diffusion models suffer from inferior visual quality, lack of aesthetic appeal, and inefficient inference. UniFL offers a comprehensive solution addressing these challenges simultaneously. |
UniFL leverages three key components: 1) Perceptual Feedback Learning (PeFL) using existing perceptual models for visual quality; 2) Decoupled Feedback Learning with dimension-specific reward models for aesthetics; 3) Adversarial Feedback Learning for inference acceleration. |
UniFL significantly enhances generation quality, outperforming ImageReward by 17% in user preference.
UniFL achieves superior acceleration, surpassing LCM and SDXL Turbo by 57% and 20%, respectively, in 4-step inference.
UniFL demonstrates strong generalization across downstream tasks like LoRA, ControlNet, and AnimateDiff. |
Exploring larger visual perception models for enhanced supervision in PeFL.
Investigating extreme acceleration possibilities, particularly towards 1-step inference. |
diffusion models, feedback learning, text-to-image generation, inference acceleration, aesthetic quality |
2404.05580
Report |
Responsible Visual Editing |
Minheng Ni, Yeli Shen, Lei Zhang, Wangmeng Zuo |
With recent advancements in visual synthesis, there is a growing risk of
encountering images with detrimental effects, such as hate, discrimination, or
privacy violations. The research on transforming harmful images into
responsible ones remains unexplored. In this paper, we formulate a new task,
responsible visual editing, which entails modifying specific concepts within an
image to render it more responsible while minimizing changes. However, the
concept that needs to be edited is often abstract, making it challenging to
locate what needs to be modified and plan how to modify it. To tackle these
challenges, we propose a Cognitive Editor (CoEditor) that harnesses the large
multimodal model through a two-stage cognitive process: (1) a perceptual
cognitive process to focus on what needs to be modified and (2) a behavioral
cognitive process to strategize how to modify. To mitigate the negative
implications of harmful images on research, we create a transparent and public
dataset, AltBear, which expresses harmful information using teddy bears instead
of humans. Experiments demonstrate that CoEditor can effectively comprehend
abstract concepts within complex scenes and significantly surpass the
performance of baseline models for responsible visual editing. We find that the
AltBear dataset corresponds well to the harmful content found in real images,
offering a consistent experimental evaluation, thereby providing a safer
benchmark for future research. Moreover, CoEditor also shows great results in
general editing. We release our code and dataset at
https://github.com/kodenii/Responsible-Visual-Editing. |
Introduces 'responsible visual editing', modifying specific concepts in images to make them safer, fairer, or more privacy-conscious. |
Addresses the growing risk of harmful images created by advanced visual synthesis technology. |
Proposes CoEditor, a model leveraging large multimodal models (LMMs) with a two-stage cognitive process: (1) perceptual cognition to identify regions needing modification and (2) behavioral cognition to strategize the modification. |
CoEditor significantly outperforms baseline models in responsible image editing.
The proposed AltBear dataset, using teddy bears to depict harmful content, shows high consistency with real data while mitigating ethical risks.
CoEditor demonstrates strong performance in general image editing as well. |
High computational cost due to reliance on LMMs.
GPT API's non-deterministic nature poses reproducibility challenges. |
responsible visual editing, image editing, large multimodal model, responsible ai, altbear dataset |
2404.05578
Report |
Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning |
Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi |
For a complete comprehension of multi-person scenes, it is essential to go
beyond basic tasks like detection and tracking. Higher-level tasks, such as
understanding the interactions and social activities among individuals, are
also crucial. Progress towards models that can fully understand scenes
involving multiple people is hindered by a lack of sufficient annotated data
for such high-level tasks. To address this challenge, we introduce Social-MAE,
a simple yet effective transformer-based masked autoencoder framework for
multi-person human motion data. The framework uses masked modeling to pre-train
the encoder to reconstruct masked human joint trajectories, enabling it to
learn generalizable and data-efficient representations of motion in crowded
human scenes. Social-MAE comprises a transformer as the MAE encoder and a
lighter-weight transformer as the MAE decoder which operates on multi-person
joints' trajectory in the frequency domain. After the reconstruction task, the
MAE decoder is replaced with a task-specific decoder and the model is
fine-tuned end-to-end for a variety of high-level social tasks. Our proposed
model combined with our pre-training approach achieves state-of-the-art
results on various high-level social tasks, including multi-person pose
forecasting, social grouping, and social action understanding. These
improvements are demonstrated across four popular multi-person datasets
encompassing both human 2D and 3D body pose. |
This paper presents Social-MAE, a transformer-based masked autoencoder framework for learning representations of multi-person motion data by reconstructing masked human joint trajectories. |
Existing methods for high-level social tasks in multi-person scenes suffer from a lack of sufficient annotated data. Social-MAE addresses this challenge using self-supervised pre-training via masked modeling, enabling the learning of generalizable and data-efficient motion representations. |
Social-MAE consists of a transformer encoder and a lighter-weight transformer decoder operating on multi-person joint trajectories in the frequency domain. The model is pre-trained by masking and reconstructing joint trajectories. For specific downstream tasks, the decoder is replaced with a task-specific one and fine-tuned with full supervision. |
Social-MAE achieves state-of-the-art results on the SoMoF benchmark for multi-person pose forecasting.
Social-MAE outperforms previous methods on CMU-Mocap and MuPoTS-3D datasets for multi-person pose forecasting.
Social-MAE sets new state-of-the-art results on JRDB-Act for social grouping and action detection using only pose data as input. |
Social-MAE currently only uses pose data and could benefit from incorporating visual features for richer context.
The model's performance on social grouping could be further improved by incorporating 3D information. |
social masked autoencoder, unsupervised pre-training, multi-person motion representation, pose forecasting, social grouping, action understanding |
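The pre-training recipe in the Social-MAE record above (joint trajectories represented in the frequency domain, a subset masked, and the rest fed to the encoder) can be sketched as follows. The record only says "frequency domain", so the use of an unnormalized DCT-II is an assumption, as are the function names, the random masking scheme, and the mask ratio in the test.

```python
import math
import random


def dct2(signal):
    """Unnormalized DCT-II of a 1D trajectory (one coordinate of one joint)."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def mask_tokens(tokens, mask_ratio, seed=0):
    """Randomly hide a fraction of joint tokens.

    Returns the visible tokens (encoder input) and the sorted indices of the
    masked tokens (the targets a decoder would be asked to reconstruct).
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    visible = [tok for i, tok in enumerate(tokens) if i not in masked]
    return visible, sorted(masked)
```

In this sketch each token would be the DCT coefficients of one joint's trajectory; after pre-training, only the encoder is kept and paired with a task-specific decoder.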
2404.05519
Report |
Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models |
Saman Motamed, Wouter Van Gansbeke, Luc Van Gool |
With recent advances in image and video diffusion models for content
creation, a plethora of techniques have been proposed for customizing their
generated content. In particular, manipulating the cross-attention layers of
Text-to-Image (T2I) diffusion models has shown great promise in controlling the
shape and location of objects in the scene. Transferring image-editing
techniques to the video domain, however, is extremely challenging as object
motion and temporal consistency are difficult to capture accurately. In this
work, we take a first look at the role of cross-attention in Text-to-Video
(T2V) diffusion models for zero-shot video editing. While one-shot models have
shown potential in controlling motion and camera movement, we demonstrate
zero-shot control over object shape, position and movement in T2V models. We
show that despite the limitations of current T2V models, cross-attention
guidance can be a promising approach for editing videos. |
This paper presents an initial exploration of using cross-attention layers in Text-to-Video (T2V) diffusion models for zero-shot video editing, focusing on controlling object size, position, and motion. |
Enabling zero-shot editing in T2V models provides greater flexibility and user control over generated video content without requiring additional training data or resources. |
The authors investigate two approaches: 1) **Forward Guidance**: Directly manipulating cross-attentions during the denoising process (similar to Prompt-to-Prompt), and 2) **Backward Guidance**: Using an energy-based loss function to guide the model towards desired cross-attention maps for specific tokens (inspired by Diffusion self-guidance and Training-Free Layout Control). |
Forward guidance in T2V models faces similar limitations as in image editing, such as size/shape mismatches and cross-attention overlap, hindering its effectiveness.
Backward guidance shows promise for zero-shot editing in T2V models, successfully enabling control over object size and motion by manipulating cross-attention maps.
Current T2V models exhibit limitations like noisy cross-attention maps compared to T2I models, necessitating further research to improve their quality and enable more robust editing techniques. |
Current limitations in T2V models, particularly noisy cross-attention maps, restrict the effectiveness of the proposed editing techniques.
The study focuses on manipulating object size and motion, leaving other editing aspects like background control and maintaining fidelity for future work. |
text-to-video synthesis, video editing, diffusion models, cross-attention, zero-shot learning |
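The backward guidance described in the record above relies on an energy that is low when a token's cross-attention mass falls inside a target region. A minimal sketch of such an energy is given below, in the spirit of training-free layout control; the exact loss used by the authors is not stated in the record, so the squared form, the box convention `(top, left, bottom, right)`, and the per-frame averaging are assumptions. In practice the gradient of this energy with respect to the latent would steer denoising, which is omitted here.

```python
def region_energy(attn_map, box):
    """Energy for one frame: (1 - attention mass inside the box / total mass)^2.

    attn_map is a 2D list of non-negative attention values for one token;
    box is (top, left, bottom, right) with exclusive bottom/right bounds.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn_map)
    if total == 0:
        return 1.0  # no attention anywhere: treat as maximally misplaced
    inside = sum(attn_map[r][c]
                 for r in range(top, bottom)
                 for c in range(left, right))
    return (1.0 - inside / total) ** 2


def video_energy(attn_maps, boxes):
    """Average the per-frame energy; moving the box across frames encodes motion."""
    energies = [region_energy(a, b) for a, b in zip(attn_maps, boxes)]
    return sum(energies) / len(energies)
```

Giving each frame its own box is what lets the same loss express object size (box extent) and object movement (box trajectory).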
2404.05384
Report |
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance |
Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu |
Classifier-Free Guidance (CFG) has been widely used in text-to-image
diffusion models, where the CFG scale is introduced to control the strength of
text guidance on the whole image space. However, we argue that a global CFG
scale results in spatial inconsistency on varying semantic strengths and
suboptimal image quality. To address this problem, we present a novel approach,
Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance
degrees for different semantic units in text-to-image diffusion models.
Specifically, we first design a training-free semantic segmentation method to
partition the latent image into relatively independent semantic regions at each
denoising step. In particular, the cross-attention map in the denoising U-net
backbone is renormalized for assigning each patch to the corresponding token,
while the self-attention map is used to complete the semantic regions. Then, to
balance the amplification of diverse semantic units, we adaptively adjust the
CFG scales across different semantic regions to rescale the text guidance
degrees into a uniform level. Finally, extensive experiments demonstrate the
superiority of S-CFG over the original CFG strategy on various text-to-image
diffusion models, without requiring any extra training cost. Our code is
available at https://github.com/SmilesDZgk/S-CFG. |
This paper introduces Semantic-aware Classifier-Free Guidance (S-CFG), a novel approach to enhance text-to-image diffusion models by customizing guidance degrees for different semantic units within an image. |
Existing Classifier-Free Guidance (CFG) methods apply a global scale, leading to spatial inconsistency in varying semantic strengths and potentially suboptimal image quality. |
S-CFG leverages attention maps from the diffusion model's U-net backbone to segment the latent image into semantic regions. It then adaptively adjusts CFG scales across these regions, aiming to unify the classifier score norm and balance semantic information. |
S-CFG consistently outperforms the original CFG strategy across various text-to-image diffusion models, including Stable Diffusion and DeepFloyd IF, as evidenced by FID-30K and CLIP Score metrics.
Human evaluation confirms the superiority of S-CFG in terms of both image quality and image-text alignment.
Qualitative analysis reveals notable improvements in generated samples, showcasing enhanced semantic expressiveness, entity portrayal, and fine-grained structure completion. |
The assumption of complete independence among different semantic units may not always hold true in practice.
Current evaluation metrics might not fully capture all aspects of image quality improvement achieved by S-CFG. |
text-to-image generation, diffusion models, classifier-free guidance, semantic segmentation, attention mechanisms |
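The contrast between a global CFG scale and the region-adaptive scales of the S-CFG record above can be illustrated with a toy example. Standard CFG combines the two noise predictions as `uncond + scale * (cond - uncond)`. The region-wise rescaling below is only a sketch of the idea of bringing guidance strength to a uniform level: it uses the mean absolute guidance per region as a stand-in for the score norm, takes the region labels as given rather than deriving them from attention maps, and the function names are hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG with one global scale for every patch."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def semantic_aware_cfg(eps_uncond, eps_cond, regions, base_scale):
    """CFG with one scale per semantic region.

    regions[i] is the region label of patch i. Each region's scale is chosen
    so that its mean guidance magnitude matches the global mean (assumption).
    """
    diffs = [c - u for u, c in zip(eps_uncond, eps_cond)]
    global_mean = sum(abs(d) for d in diffs) / len(diffs)
    sums, counts = {}, {}
    for d, r in zip(diffs, regions):
        sums[r] = sums.get(r, 0.0) + abs(d)
        counts[r] = counts.get(r, 0) + 1
    scales = {}
    for r in sums:
        region_mean = sums[r] / counts[r]
        scales[r] = (base_scale * global_mean / region_mean
                     if region_mean > 0 else base_scale)
    guided = [u + scales[r] * d
              for u, d, r in zip(eps_uncond, diffs, regions)]
    return guided, scales
```

With a global scale, a region whose guidance term is already strong is amplified as much as a weak one; the per-region scales damp the former and boost the latter.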
2404.05331
Report |
Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt |
Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li |
Text-to-image generation has witnessed great progress, especially with the
recent advancements in diffusion models. Since texts cannot provide detailed
conditions like object appearance, reference images are usually leveraged for
the control of objects in the generated images. However, existing methods still
suffer from limited accuracy when the relationship between the foreground and
background is complicated. To address this issue, we develop a framework termed
Mask-ControlNet by introducing an additional mask prompt. Specifically, we
first employ large vision models to obtain masks to segment the objects of
interest in the reference image. Then, the object images are employed as
additional prompts to facilitate the diffusion model to better understand the
relationship between foreground and background regions during image generation.
Experiments show that the mask prompts enhance the controllability of the
diffusion model to maintain higher fidelity to the reference image while
achieving better image quality. Comparison with previous text-to-image
generation methods demonstrates our method's superior quantitative and
qualitative performance on the benchmark datasets. |
This paper proposes Mask-ControlNet, a framework for higher-quality image generation by introducing an additional mask prompt to decouple and model the foreground and background relationship in reference images. |
Existing text-to-image generation methods struggle to accurately control object appearance and maintain fidelity to reference images, particularly in complex compositions. |
The method uses large vision models (SAM) to obtain object masks from reference images. These masks, along with the reference image and text prompts, are used as conditional information for a diffusion model during image synthesis. |
Mask-ControlNet generates higher-quality images with fewer artifacts compared to previous methods (DreamBooth, ControlNet+LoRA, Outpainting).
The method shows superior quantitative performance in FID, PSNR, SSIM, LPIPS, and user studies.
Mask prompts effectively decouple foreground and background, leading to better object fidelity, reduced background overfitting, and improved foreground-background harmony. |
The reliance on pre-trained large vision models like SAM might limit generalizability.
Future work includes exploring the impact of mask quality and different mask generation techniques on the generated images. |
image generation, diffusion model, controllable image synthesis, object fidelity, mask prompt |
2404.05268
Report |
MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation |
Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohe Wu, Wangmeng Zuo |
Customized text-to-image generation aims to synthesize instantiations of
user-specified concepts and has achieved unprecedented progress in handling
individual concepts. However, when extending to multiple customized concepts,
existing methods exhibit limitations in terms of flexibility and fidelity, only
accommodating the combination of limited types of models and potentially
resulting in a mix of characteristics from different concepts. In this paper,
we introduce the Multi-concept guidance for Multi-concept customization, termed
MC$^2$, for improved flexibility and fidelity. MC$^2$ decouples the
requirements for model architecture via inference time optimization, allowing
the integration of various heterogeneous single-concept customized models. It
adaptively refines the attention weights between visual and textual tokens,
directing image regions to focus on their associated words while diminishing
the impact of irrelevant ones. Extensive experiments demonstrate that MC$^2$
even surpasses previous methods that require additional training in terms of
consistency with input prompt and reference images. Moreover, MC$^2$ can be
extended to elevate the compositional capabilities of text-to-image generation,
yielding appealing results. Code will be publicly available at
https://github.com/JIANGJiaXiu/MC-2. |
MC$^2$ is proposed as a novel method to synthesize compositions of multiple customized concepts by integrating separately trained single-concept customized models, without joint training, model merging, or extra conditioning information like bounding boxes. |
Existing methods for multi-concept customization in text-to-image generation lack flexibility and fidelity, limiting the types of models that can be combined and potentially leading to mixed concept characteristics. |
MC$^2$ leverages inference time optimization with multi-concept guidance (MCG). It analyzes cross-attention maps during diffusion to identify regions activated by different concepts, then refines attention weights to spatially disentangle them, promoting proper attribute binding. |
MC$^2$ demonstrates higher fidelity to reference images compared to baselines, even surpassing methods requiring additional training.
Quantitative metrics on CustomConcept101 and a compositional generation benchmark show superior performance in subject/prompt fidelity and object/prompt similarity.
User studies confirm MC$^2$'s effectiveness, with users finding its outputs more aligned with prompts and reference concepts. |
The current implementation using parallel diffusion models is memory intensive.
The composed customized models are limited to those trained from the same diffusion model. |
text-to-image generation, customized multi-concept generation, compositional generation, diffusion models, cross-attention |
2404.05236
Report |
Stylizing Sparse-View 3D Scenes with Hierarchical Neural Representation |
Y. Wang, A. Gao, Y. Gong, Y. Zeng |
Recently, a surge of 3D style transfer methods has been proposed that
leverage the scene reconstruction power of a pre-trained neural radiance field
(NeRF). To successfully stylize a scene this way, one must first reconstruct a
photo-realistic radiance field from collected images of the scene. However,
when only sparse input views are available, pre-trained few-shot NeRFs often
suffer from high-frequency artifacts, which are generated as a by-product of
high-frequency details for improving reconstruction quality. Is it possible to
generate more faithful stylized scenes from sparse inputs by directly
optimizing encoding-based scene representation with target style? In this
paper, we consider the stylization of sparse-view scenes in terms of
disentangling content semantics and style textures. We propose a coarse-to-fine
sparse-view scene stylization framework, where a novel hierarchical
encoding-based neural representation is designed to generate high-quality
stylized scenes directly from implicit scene representations. We also propose a
new optimization strategy with content strength annealing to achieve realistic
stylization and better content preservation. Extensive experiments demonstrate
that our method can achieve high-quality stylization of sparse-view scenes and
outperforms fine-tuning-based baselines in terms of stylization quality and
efficiency. |
This paper proposes a novel coarse-to-fine 3D scene stylization framework for generating high-quality stylized scenes from sparse input views. |
Existing style transfer methods struggle to produce high-quality stylized 3D scenes from sparse inputs due to the difficulty in reconstructing accurate high-frequency details. |
The method uses a hierarchical encoding-based neural representation to disentangle content semantics and style textures. It first reconstructs coarse geometry from sparse inputs and then utilizes a multi-resolution feature grid to generate stylized details guided by the coarse geometry. |
The method generates high-quality stylized scenes with multi-view consistency from sparse inputs.
It outperforms state-of-the-art methods both quantitatively and qualitatively in terms of stylization quality and efficiency.
The proposed content strength annealing strategy effectively balances content preservation and style matching. |
The method's performance depends on the quality of the coarse geometry reconstruction.
The current implementation focuses on style transfer of static scenes. |
3d style transfer, neural radiance fields, sparse-view synthesis, hierarchical representation, content annealing |
2404.05225
Report |
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding |
Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, Cong Yao |
Recently, leveraging large language models (LLMs) or multimodal large
language models (MLLMs) for document understanding has proven very
promising. However, previous works that employ LLMs/MLLMs for document
understanding have not fully explored and utilized the document layout
information, which is vital for precise document understanding. In this paper,
we propose LayoutLLM, an LLM/MLLM based method for document understanding. The
core of LayoutLLM is a layout instruction tuning strategy, which is specially
designed to enhance the comprehension and utilization of document layouts. The
proposed layout instruction tuning strategy consists of two components:
Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture
the characteristics of document layout in Layout-aware Pre-training, three
groups of pre-training tasks, corresponding to document-level, region-level and
segment-level information, are introduced. Furthermore, a novel module called
layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on
regions relevant to the question and generate accurate answers. LayoutCoT is
effective for boosting the performance of document understanding. Meanwhile, it
brings a certain degree of interpretability, which could facilitate manual
inspection and correction. Experiments on standard benchmarks show that the
proposed LayoutLLM significantly outperforms existing methods that adopt
open-source 7B LLMs/MLLMs for document understanding. The training data of the
LayoutLLM is publicly available at
https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM |
This paper introduces LayoutLLM, an LLM/MLLM-based document understanding method enhanced with a novel layout instruction tuning strategy. |
Existing LLM/MLLM document understanding approaches fail to effectively utilize crucial document layout information, limiting their accuracy and interpretability. |
LayoutLLM integrates a document pre-trained model encoder and employs a two-stage layout instruction tuning strategy: 1) Layout-aware Pre-training with document, region, and segment-level tasks. 2) Layout-aware Supervised Fine-tuning using a novel LayoutCoT module for interpretable, step-by-step reasoning. |
LayoutLLM significantly outperforms existing zero-shot LLM/MLLM methods on document understanding benchmarks.
Layout-aware pre-training significantly enhances the model's understanding of document layouts at different levels.
The LayoutCoT module effectively boosts performance, particularly for complex tasks, and provides interpretability. |
LayoutLLM currently lacks the ability to refuse false-positive outputs and provide hints.
Despite improvements from layout-aware pre-training, precisely understanding complex region-level relationships remains challenging. |
document understanding, large language models, multimodal learning, document layout analysis, instruction tuning |
2404.05188
Report |
Have You Merged My Model? On The Robustness of Large Language Model IP Protection Methods Against Model Merging |
Tianshuo Cong, Delong Ran, Zesen Liu, Xinlei He, Jinyuan Liu, Yichen Gong, Qi Li, Anyu Wang, Xiaoyun Wang |
Model merging is a promising lightweight model empowerment technique that
does not rely on expensive computing devices (e.g., GPUs) or require the
collection of specific training data. Instead, it involves editing different
upstream model parameters to absorb their downstream task capabilities.
However, uncertified model merging can infringe upon the Intellectual Property
(IP) rights of the original upstream models. In this paper, we conduct the
first study on the robustness of IP protection methods in model merging
scenarios. We investigate two state-of-the-art IP protection techniques:
Quantization Watermarking and Instructional Fingerprint, along with various
advanced model merging techniques, such as Task Arithmetic and TIES-MERGING.
Experimental results indicate that current Large Language Model (LLM)
watermarking techniques cannot survive in the merged models, whereas model
fingerprinting techniques can. Our research aims to highlight that model
merging should be an indispensable consideration in the robustness assessment
of model IP protection techniques, thereby promoting the healthy development of
the open-source LLM community. |
This paper presents the first robustness analysis of Large Language Model (LLM) Intellectual Property (IP) protection methods against model merging attacks. |
Unauthorized model merging can infringe on the IP rights of upstream LLM developers, hindering the open-source LLM community's growth. |
The authors evaluate the robustness of Quantization Watermarking and Instructional Fingerprinting techniques against four model merging algorithms: Model Soups, Task Arithmetic, TIES-MERGING, and DARE. |
Model merging successfully combines the functionalities of different LLMs, creating a composite model with multifunctionality.
Existing LLM watermarking techniques are vulnerable to model merging attacks, with the watermark information being effectively removed during the merging process.
LLM fingerprinting, specifically Instructional Fingerprinting, demonstrates stronger robustness against model merging compared to watermarking, successfully retaining fingerprint information in the merged models. |
The paper focuses on merging only two models, leaving the exploration of more complex merging scenarios involving multiple models as future work.
Further investigation into more advanced merging algorithms and their impact on IP protection methods is needed. |
large language models, model merging, ip protection, watermarking, fingerprinting |
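To make the merging setting in the report above concrete, the following is a minimal sketch of the two simplest schemes it names: Model Soups (plain weight averaging) and Task Arithmetic (adding scaled task vectors to a shared base). It assumes weights are stored as flat lists of floats keyed by parameter name; the function names and the `scale` parameter are illustrative, not taken from the paper's code.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Difference between a fine-tuned model and the base it started from."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def task_arithmetic_merge(base: Weights, finetuned_models: List[Weights],
                          scale: float = 1.0) -> Weights:
    """Add each model's scaled task vector to the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for model in finetuned_models:
        tv = task_vector(model, base)
        for k in merged:
            merged[k] = [m + scale * d for m, d in zip(merged[k], tv[k])]
    return merged


def model_soup(models: List[Weights]) -> Weights:
    """Uniformly average the weights of several models, parameter by parameter."""
    n = len(models)
    return {k: [sum(vals) / n for vals in zip(*(m[k] for m in models))]
            for k in models[0]}
```

Both functions edit parameters only, with no training data or gradient steps. This is why a watermark or fingerprint stored in the weights of one upstream model may or may not survive the merge.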
2404.05072
Report |
Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind |
Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen |
As humans move around, performing their daily tasks, they are able to recall
where they have positioned objects in their environment, even if these objects
are currently out of sight. In this paper, we aim to mimic this spatial
cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind:
3D tracking of active objects using observations captured through an egocentric
camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial
2D observations to 3D world coordinates, matches them over time using visual
appearance, 3D location and interactions to form object tracks, and keeps these
object tracks even when they go out-of-view of the camera - hence keeping in
mind what is out of sight. We test LMK on 100 long videos from EPIC-KITCHENS.
Our results demonstrate that spatial cognition is critical for correctly
locating objects over short and long time scales. E.g., for one long egocentric
video, we estimate the 3D location of 50 active objects. Of these, 60% can be
correctly positioned in 3D after 2 minutes of leaving the camera view. |
Introduces "Out of Sight, Not Out of Mind" (OSNOM) task and the Lift, Match, and Keep (LMK) method for 3D tracking of active objects in egocentric videos, even when out-of-view. |
Spatial cognition, the ability to track objects even when unseen, is crucial for humans and essential for building AI agents that can understand and interact with the world like humans do. |
LMK lifts 2D object observations to 3D using scene geometry and depth, matches these 3D observations over time using appearance and location, and maintains object tracks even when objects are out-of-sight. |
LMK significantly outperforms baselines, demonstrating the importance of 3D tracking and object permanence for this task.
The method achieves 64% accuracy in locating objects after 1 minute and 37% accuracy after 10 minutes of being out-of-view.
Combining visual appearance and 3D location information is crucial for robust tracking, especially in cluttered scenes. |
The current work relies on ground-truth masks; future work will focus on incorporating object detectors for end-to-end learning.
Future research will explore extending OSNOM to multiple videos over longer timescales and investigate its applicability in real-world assistive scenarios. |
egocentric vision, object tracking, 3d understanding, spatial cognition, object permanence |
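The "lift" step in the report above can be illustrated with a standard pinhole unprojection: a pixel with known depth is mapped to camera coordinates and then moved into world coordinates with the camera pose. This is a sketch under assumed conventions (intrinsics `fx, fy, cx, cy`, a camera-to-world rotation `R` and translation `t`), not the authors' implementation.

```python
from typing import Sequence, Tuple


def lift_to_world(u: float, v: float, depth: float,
                  fx: float, fy: float, cx: float, cy: float,
                  R: Sequence[Sequence[float]],
                  t: Sequence[float]) -> Tuple[float, float, float]:
    """Unproject pixel (u, v) at the given depth into world coordinates.

    Assumes a pinhole camera and a camera-to-world pose (R, t).
    """
    # Back-project the pixel into the camera frame.
    p_cam = ((u - cx) / fx * depth,
             (v - cy) / fy * depth,
             depth)
    # Apply the rigid camera-to-world transform: X_world = R @ X_cam + t.
    x, y, z = (sum(R[i][j] * p_cam[j] for j in range(3)) + t[i]
               for i in range(3))
    return (x, y, z)
```

Once observations live in a shared world frame, an object's last 3D position stays meaningful after the camera turns away, which is what allows a track to be kept while the object is out of view.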
2404.05014
Report |
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators |
Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo |
Recent advances in Text-to-Video generation (T2V) have achieved remarkable
success in synthesizing high-quality general videos from textual descriptions.
A largely overlooked problem in T2V is that existing models have not adequately
encoded physical knowledge of the real world, thus generated videos tend to
have limited motion and poor variations. In this paper, we propose
MagicTime, a metamorphic time-lapse video generation model, which
learns real-world physics knowledge from time-lapse videos and implements
metamorphic generation. First, we design a MagicAdapter scheme to decouple
spatial and temporal training, encode more physical knowledge from metamorphic
videos, and transform pre-trained T2V models to generate metamorphic videos.
Second, we introduce a Dynamic Frames Extraction strategy to adapt to
metamorphic time-lapse videos, which have a wider variation range and cover
dramatic object metamorphic processes, thus embodying more physical knowledge
than general videos. Finally, we introduce a Magic Text-Encoder to improve the
understanding of metamorphic video prompts. Furthermore, we create a time-lapse
video-text dataset called ChronoMagic, specifically curated to unlock
the metamorphic video generation ability. Extensive experiments demonstrate the
superiority and effectiveness of MagicTime for generating high-quality and
dynamic metamorphic videos, suggesting time-lapse video generation is a
promising path toward building metamorphic simulators of the physical world. |
This paper introduces MagicTime, a novel framework designed to enhance text-to-video generation models by incorporating physical world knowledge, specifically enabling them to generate metamorphic time-lapse videos. |
Existing T2V models struggle to produce videos that accurately depict complex real-world processes like melting or blooming due to limited encoding of physical knowledge. MagicTime aims to address this limitation by leveraging the characteristics of metamorphic videos, which comprehensively capture object transformation. |
MagicTime utilizes a MagicAdapter scheme for decoupled spatial and temporal training, leveraging time-lapse videos. It introduces Dynamic Frames Extraction to prioritize metamorphic features and a Magic Text-Encoder for better understanding metamorphic prompts. A new time-lapse dataset, ChronoMagic, is also created for training and evaluation. |
MagicTime successfully generates high-quality metamorphic videos demonstrating a clear understanding of physical processes like melting and blooming.
Quantitative analysis shows MagicTime outperforms existing T2V methods in metrics such as FID, FVD, and CLIP Similarity, indicating superior video quality and text-alignment.
Human evaluation confirms MagicTime's superiority in producing visually appealing and semantically accurate metamorphic videos compared to baseline models. |
Current evaluation metrics for T2V models may not fully encapsulate the nuances of metamorphic video generation, necessitating more robust evaluation methods.
Expanding the ChronoMagic dataset with diverse scenarios and complexities can further enhance the generalization capabilities of MagicTime. |
text-to-video generation, metamorphic videos, time-lapse videos, physical knowledge encoding, diffusion models |
2404.04956
Report |
Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models |
Zijin Yang, Kai Zeng, Kejiang Chen, Han Fang, Weiming Zhang, Nenghai Yu |
Ethical concerns surrounding copyright protection and inappropriate content
generation pose challenges for the practical implementation of diffusion
models. One effective solution involves watermarking the generated images.
However, existing methods often compromise the model performance or require
additional training, which is undesirable for operators and users. To address
this issue, we propose Gaussian Shading, a diffusion model watermarking
technique that is both performance-lossless and training-free, while serving
the dual purpose of copyright protection and tracing of offending content. Our
watermark embedding is free of model parameter modifications and thus is
plug-and-play. We map the watermark to latent representations following a
standard Gaussian distribution, which is indistinguishable from latent
representations obtained from the non-watermarked diffusion model. Therefore we
can achieve watermark embedding with lossless performance, for which we also
provide theoretical proof. Furthermore, since the watermark is intricately
linked with image semantics, it exhibits resilience to lossy processing and
erasure attempts. The watermark can be extracted by Denoising Diffusion
Implicit Models (DDIM) inversion and inverse sampling. We evaluate Gaussian
Shading on multiple versions of Stable Diffusion, and the results demonstrate
that Gaussian Shading not only is performance-lossless but also outperforms
existing methods in terms of robustness. |
This paper introduces Gaussian Shading, a novel watermarking technique for diffusion models that is both performance-lossless and training-free. |
This technique addresses the critical need for copyright protection and content authentication of AI-generated images without compromising the quality of the generated output. |
The method maps a watermark to latent representations following a standard Gaussian distribution, ensuring it remains indistinguishable from non-watermarked representations. This watermark is then diffused throughout the image semantics during the generation process, making it robust to alterations. |
Gaussian Shading maintains high true positive rates (over 99%) in watermark detection even under significant noise perturbation.
The method exhibits superior robustness in traceability tasks, achieving over 97% bit accuracy under various attacks.
Unlike existing methods, Gaussian Shading demonstrates statistically insignificant impact on the visual quality and image-text similarity of generated images, effectively preserving model performance. |
The current implementation relies on DDIM inversion, which restricts it to diffusion models that use continuous-time samplers based on ODE solvers.
The method's reliance on stream ciphers necessitates secure key management and assumes the model is not publicly accessible to prevent forgery attacks. |
watermarking, diffusion models, copyright protection, content authentication, ai-generated images |
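The idea in the report above of mapping a watermark to a latent that still follows a standard Gaussian can be illustrated with a deliberately simplified sketch: each (already randomized) watermark bit selects one half of the Gaussian, and a uniform draw picks a point inside that half via the inverse CDF. The one-bit-per-dimension layout and the function names are assumptions made for illustration; the stream cipher, any bit repetition, and the DDIM inversion used for extraction are omitted.

```python
import random
from statistics import NormalDist
from typing import List

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def embed_bits(bits: List[int], rng: random.Random) -> List[float]:
    """Map each bit to a latent value drawn from one half of N(0, 1).

    If the bits are uniformly random (e.g. after encryption), the resulting
    values are distributed as a standard Gaussian.
    """
    latent = []
    for b in bits:
        # Keep u strictly inside (0, 1) so inv_cdf never sees 0, 0.5 or 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # bit 0 -> probability mass (0, 0.5); bit 1 -> (0.5, 1).
        latent.append(_STD_NORMAL.inv_cdf((b + u) / 2.0))
    return latent


def extract_bits(latent: List[float]) -> List[int]:
    """Recover the bits from the sign of each latent value."""
    return [1 if z > 0 else 0 for z in latent]
```

Because only the sign carries the bit in this sketch, moderate perturbations of the recovered latent leave most bits intact, which gives some intuition for the robustness reported above.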
2404.04946
Report |
AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment |
Yuanfeng Xu, Yuhao Chen, Zhongzhan Huang, Zijian He, Guangrun Wang, Philip Torr, Liang Lin |
Recent video editing advancements rely on accurate pose sequences to animate
subjects. However, these efforts are not suitable for cross-species animation
due to pose misalignment between species (for example, the pose of a cat
differs greatly from that of a pig due to differences in body structure). In
this paper, we present AnimateZoo, a zero-shot diffusion-based video generator
to address this challenging cross-species animation issue, aiming to accurately
produce animal animations while preserving the background. The key technique
used in our AnimateZoo is subject alignment, which includes two steps. First,
we improve appearance feature extraction by integrating a Laplacian detail
booster and a prompt-tuning identity extractor. These components are
specifically designed to capture essential appearance information, including
identity and fine details. Second, we align shape features and address
conflicts from differing subjects by introducing a scale-information remover.
This ensures accurate cross-species animation. Moreover, we introduce two
high-quality animal video datasets featuring a wide variety of species. Trained
on these extensive datasets, our model is capable of generating videos
characterized by accurate movements, consistent appearance, and high-fidelity
frames, without the need for the pre-inference fine-tuning that prior arts
required. Extensive experiments showcase the outstanding performance of our
method in cross-species action following tasks, demonstrating exceptional shape
adaptation capability. The project page is available at
https://justinxu0.github.io/AnimateZoo/. |
AnimateZoo, a zero-shot diffusion-based video generator for cross-species animation using pose sequences from different animals, preserving background and enabling accurate action inheritance. |
Addresses the limitations of existing intra-species animation methods that struggle with pose misalignment between species, enabling cross-species animation with accurate movements, consistent appearance, and high-fidelity frames. |
Employs subject alignment through: 1) Laplacian detail booster and prompt-tuning identity extractor for appearance feature extraction, 2) Scale-information remover to align shape features and address conflicts from differing subjects. |
Generates videos with accurate movements and consistent appearance across different animal species.
Preserves background information from the source video while animating the target subject.
Outperforms existing methods in cross-species animation tasks, demonstrating superior shape adaptability. |
Struggles with accurately depicting interactions between multiple objects, particularly in cases of occlusion.
Reliance on accurate segmentation of the reference subject, which may be challenging in complex scenes. |
video editing, cross-species animation, subject alignment, diffusion models, zero-shot learning |
2404.04913
Report |
CodecNeRF: Toward Fast Encoding and Decoding, Compact, and High-quality Novel-view Synthesis |
Gyeongjin Kang, Younggeun Lee, Eunbyung Park |
Neural Radiance Fields (NeRF) have achieved huge success in effectively
capturing and representing 3D objects and scenes. However, several factors have
impeded its further proliferation as next-generation 3D media. To establish a
ubiquitous presence in everyday media formats, such as images and videos, it is
imperative to devise a solution that effectively fulfills three key objectives:
fast encoding and decoding time, compact model sizes, and high-quality
renderings. Despite significant advancements, a comprehensive algorithm that
adequately addresses all objectives has yet to be fully realized. In this work,
we present CodecNeRF, a neural codec for NeRF representations, consisting of a
novel encoder and decoder architecture that can generate a NeRF representation
in a single forward pass. Furthermore, inspired by the recent
parameter-efficient finetuning approaches, we develop a novel finetuning method
to efficiently adapt the generated NeRF representations to a new test instance,
leading to high-quality image renderings and compact code sizes. The proposed
CodecNeRF, a newly suggested encoding-decoding-finetuning pipeline for NeRF,
achieved unprecedented compression performance of more than 150x and 20x
reduction in encoding time while maintaining (or improving) the image quality
on widely used 3D object datasets, such as ShapeNet and Objaverse. |
CodecNeRF, a neural codec for NeRF representation with novel encoder and decoder architectures for fast encoding/decoding and compact model size, and a parameter-efficient finetuning method for high-quality rendering. |
To enable ubiquitous presence of NeRF in everyday media formats, addressing the need for fast encoding/decoding, compact model sizes, and high-quality renderings. |
An encoder-decoder architecture generates a NeRF representation in a single forward pass. Parameter-efficient finetuning adapts the representation to new instances using low-rank adaptation and tensor factorization. Entropy coding further compresses finetuned parameters. |
Achieved over 150x compression and 20x encoding speedup compared to per-scene optimization.
Maintains or improves image quality on ShapeNet and Objaverse datasets.
Demonstrated superior generalization performance and fast convergence compared to existing methods. |
Extending to complex scenes like large-scale environments requires further exploration.
Supporting other NeRF representations, such as instant NGP, necessitates architectural modifications. |
neural radiance fields, nerf compression, neural codec, parameter-efficient finetuning, novel view synthesis |
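The low-rank adaptation mentioned in the report above can be sketched as follows: instead of storing a full update to a weight matrix W, only two thin factors A and B are stored, and the adapted matrix is W + AB. The shapes and helper names here are illustrative assumptions and do not reflect CodecNeRF's actual layer sizes or its tensor factorization.

```python
from typing import List

Matrix = List[List[float]]


def apply_low_rank_update(W: Matrix, A: Matrix, B: Matrix) -> Matrix:
    """Return W + A @ B, where W is d_out x d_in, A is d_out x r, B is r x d_in."""
    rank = len(B)
    return [[W[i][j] + sum(A[i][k] * B[k][j] for k in range(rank))
             for j in range(len(W[0]))]
            for i in range(len(W))]


def low_rank_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Number of values stored for the factors A and B."""
    return rank * (d_out + d_in)


def full_param_count(d_out: int, d_in: int) -> int:
    """Number of values stored for a dense update of the same matrix."""
    return d_out * d_in
```

For a hypothetical 256 x 256 layer and rank 4, the factors hold 2,048 values against 65,536 for a dense update, a 32x reduction before any entropy coding is applied.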
2404.04908
Report |
Dual-Camera Smooth Zoom on Mobile Phones |
Renlong Wu, Zhilu Zhang, Yu Yang, Wangmeng Zuo |
When zooming between dual cameras on a mobile phone, noticeable jumps in geometric
content and image color occur in the preview, inevitably affecting the user's
zoom experience. In this work, we introduce a new task, i.e., dual-camera smooth
zoom (DCSZ) to achieve a smooth zoom preview. The frame interpolation (FI)
technique is a potential solution but struggles with ground-truth collection.
To address the issue, we suggest a data factory solution where continuous
virtual cameras are assembled to generate DCSZ data by rendering reconstructed
3D models of the scene. In particular, we propose a novel dual-camera smooth
zoom Gaussian Splatting (ZoomGS), where a camera-specific encoding is
introduced to construct a specific 3D model for each virtual camera. With the
proposed data factory, we construct a synthetic dataset for DCSZ, and we
utilize it to fine-tune FI models. In addition, we collect real-world dual-zoom
images without ground-truth for evaluation. Extensive experiments are conducted
with multiple FI methods. The results show that the fine-tuned FI models
achieve a significant performance improvement over the original ones on the DCSZ
task. The datasets, codes, and pre-trained models will be publicly available. |
This paper proposes Dual-Camera Smooth Zoom (DCSZ) to generate a fluid zoom preview on mobile phones, addressing the issue of jumps in geometric content and color when switching between dual cameras. |
The jumps that occur during zoom preview on smartphones with dual cameras significantly impact user experience. This work provides a solution to create a smoother and more visually appealing zoom transition. |
The proposed method uses a data factory approach. It leverages 3D reconstruction to generate virtual camera views between the actual ultra-wide and wide cameras. This synthetic data is then used to fine-tune existing frame interpolation models for smoother zoom transitions. |
Fine-tuned FI models show significant performance improvement over pre-trained models on DCSZ.
The proposed ZoomGS method for constructing camera-specific 3D models outperforms standard 3DGS.
The synthetic data generated by the data factory effectively improves FI model performance in real-world scenarios. |
The generalization of the fine-tuned FI model to other mobile devices with different dual-camera setups needs further investigation.
Future work can explore optimizing the data factory for real-time performance on mobile devices. |
dual-camera zoom, frame interpolation, 3d reconstruction, gaussian splatting, smooth zoom |
2404.04875
Report |
NeRF2Points: Large-Scale Point Cloud Generation From Street Views' Radiance Field Optimization |
Peng Tu, Xun Zhou, Mingming Wang, Xiaojun Yang, Bo Peng, Ping Chen, Xiu Su, Yawen Huang, Yefeng Zheng, Chang Xu |
Neural Radiance Fields (NeRF) have emerged as a paradigm-shifting methodology
for the photorealistic rendering of objects and environments, enabling the
synthesis of novel viewpoints with remarkable fidelity. This is accomplished
through the strategic utilization of object-centric camera poses characterized
by significant inter-frame overlap. This paper explores a compelling,
alternative utility of NeRF: the derivation of point clouds from aggregated
urban landscape imagery. The transmutation of street-view data into point
clouds is fraught with complexities, attributable to a nexus of interdependent
variables. First, high-quality point cloud generation hinges on precise camera
poses, yet many datasets suffer from inaccuracies in pose metadata. Also, the
standard approach of NeRF is ill-suited for the distinct characteristics of
street-view data from autonomous vehicles in vast, open settings. Autonomous
vehicle cameras often record with limited overlap, leading to blurring,
artifacts, and compromised pavement representation in NeRF-based point clouds.
In this paper, we present NeRF2Points, a tailored NeRF variant for urban point
cloud synthesis, notable for its high-quality output from RGB inputs alone. Our
paper is supported by a bespoke, high-resolution 20-kilometer urban street
dataset, designed for point cloud generation and evaluation. NeRF2Points
adeptly navigates the inherent challenges of NeRF-based point cloud synthesis
through the implementation of the following strategic innovations: (1)
Integration of Weighted Iterative Geometric Optimization (WIGO) and Structure
from Motion (SfM) for enhanced camera pose accuracy, elevating street-view data
precision. (2) Layered Perception and Integrated Modeling (LPiM) is designed
for distinct radiance field modeling in urban environments, resulting in
coherent point cloud representations. |
This paper presents NeRF2Points, a novel NeRF-based framework designed for generating high-quality, dense point clouds from street-view imagery, offering a cost-effective alternative to lidar systems. |
Generating point clouds from street-view imagery is crucial for autonomous navigation, enhancing driving recognition algorithms, and improving simulation and data annotation. Existing methods struggle with inaccuracies in camera poses and the unique characteristics of street-view data. |
NeRF2Points addresses challenges by: (1) Using a combination of WIGO and SfM for enhanced camera pose accuracy. (2) Implementing Layered Perception and Integrated Modeling (LPiM) to model road and street scene point clouds separately and then merge them. (3) Introducing geometric-aware consistency regularization (spatial dynamic and temporal invariant consistency) to address artifacts caused by sparse viewpoints. |
NeRF2Points outperforms state-of-the-art NeRF methods in terms of point cloud accuracy (Chamfer Distance) and image quality (PSNR and SSIM) on a new 20km street-view dataset.
The LPiM strategy effectively addresses pavement collapse, a common issue when generating point clouds from street-view data.
Geometric-aware consistency regularization significantly reduces artifacts like floaters, blurriness, and geometric inconsistencies. |
The impact of temporal invariant consistency regularization, while positive, is relatively small compared to other components.
Future work will explore 4D point cloud reconstruction using NeRF2Points. |
neural radiance fields, point cloud generation, street views, self-driving, 3d reconstruction |
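The report above scores point cloud accuracy with Chamfer Distance. Definitions vary in the literature (squared or unsquared distances, summed or averaged), so the sketch below fixes one common convention as an assumption: the mean nearest-neighbour Euclidean distance from each cloud to the other, with the two directions added. It is a brute-force illustration, not the evaluation code used in the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance between two non-empty point clouds.

    Uses unsquared Euclidean distances, averaged per direction, then summed.
    Runs in O(len(a) * len(b)) time.
    """
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Average distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)
```

For clouds of the size produced from a multi-kilometre street dataset, a spatial index such as a k-d tree would normally replace the inner `min` loop.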
2404.04860
Report |
ByteEdit: Boost, Comply and Accelerate Generative Image Editing |
Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, Xuefeng Xiao, Yitong Wang, Min Zheng, Lean Fu |
Recent advancements in diffusion-based generative image editing have sparked
a profound revolution, reshaping the landscape of image outpainting and
inpainting tasks. Despite these strides, the field grapples with inherent
challenges, including: i) inferior quality; ii) poor consistency; iii)
insufficient instruction adherence; iv) suboptimal generation efficiency. To
address these obstacles, we present ByteEdit, an innovative feedback learning
framework meticulously designed to Boost, Comply, and Accelerate Generative
Image Editing tasks. ByteEdit seamlessly integrates image reward models
dedicated to enhancing aesthetics and image-text alignment, while also
introducing a dense, pixel-level reward model tailored to foster coherence in
the output. Furthermore, we propose a pioneering adversarial and progressive
feedback learning strategy to expedite the model's inference speed. Through
extensive large-scale user evaluations, we demonstrate that ByteEdit surpasses
leading generative image editing products, including Adobe, Canva, and MeiTu,
in both generation quality and consistency. ByteEdit-Outpainting exhibits a
remarkable enhancement of 388% and 135% in quality and consistency,
respectively, when compared to the baseline model. Experiments also verified
that our acceleration models maintain excellent performance results in terms
of quality and consistency. |
Introduces ByteEdit, a novel feedback learning framework to enhance diffusion-based generative image editing in terms of generation quality, consistency, instruction adherence, and speed. |
Addresses the limitations of existing diffusion-based image editing methods, which often suffer from inferior quality, poor consistency, weak instruction adherence, and slow generation speed. |
Utilizes perceptual feedback learning (PeFL) with aesthetic, alignment, and coherent reward models trained on large datasets with human feedback. It also introduces adversarial and progressive training strategies to accelerate the generation process. |
Significantly outperforms existing state-of-the-art generative image editing products like Adobe, Canva, and MeiTu in terms of generation quality and consistency.
Demonstrates a remarkable improvement of 388% and 135% in quality and consistency for outpainting compared to the baseline model.
Achieves acceleration in inference speed while maintaining excellent performance in terms of quality and consistency. |
Exploring more targeted reward models tailored to specific editing tasks to enhance performance.
Investigating further integration with advanced techniques like LCM and SDXL-turbo for even faster processing speeds. |
image editing, generative models, diffusion models, feedback learning, image outpainting, image inpainting |
2404.04828
Report |
Strictly-ID-Preserved and Controllable Accessory Advertising Image Generation |
Youze Xue, Binghui Chen, Yifeng Geng, Xuansong Xie, Jiansheng Chen, Hongbing Ma |
Customized generative text-to-image models have the ability to produce images
that closely resemble a given subject. However, in the context of generating
advertising images for e-commerce scenarios, it is crucial that the generated
subject's identity aligns perfectly with the product being advertised. In order
to address the need for strictly-ID preserved advertising image generation, we
have developed a Control-Net based customized image generation pipeline and
have taken earring model advertising as an example. Our approach facilitates a
seamless interaction between the earrings and the model's face, while ensuring
that the identity of the earrings remains intact. Furthermore, to achieve a
diverse and controllable display, we have proposed a multi-branch
cross-attention architecture, which allows for control over the scale, pose,
and appearance of the model, going beyond the limitations of text prompts. Our
method manages to achieve fine-grained control of the generated model's face,
resulting in controllable and captivating advertising effects. |
This paper proposes a Control-Net based pipeline for generating advertising images of accessories (specifically earrings) that strictly preserves the product's identity while offering fine-grained control over the model wearing them. |
Existing customized text-to-image models struggle to strictly maintain product identity, crucial for e-commerce advertising. Current methods either fail to perfectly preserve product appearance or lack control over the model's pose, scale, and appearance for optimal advertising impact. |
The pipeline leverages Control-Net with the earring image as conditioning, training on earring-model images to generate a contextually appropriate model face. It employs a multi-branch cross-attention architecture to control the model's scale, pose, and appearance independently. A standard-deviation based normalization (STD-Norm) mechanism and a time-dependent weighting (TDW) strategy balance the influence of different control branches. |
The method generates strictly ID-preserved earring-model images, accurately retaining earring shape, size, and appearance.
It achieves fine-grained control over the model's scale, pose, and appearance, surpassing textual control methods in accuracy and flexibility.
Quantitative experiments and user studies confirm the superiority of the method in ID preservation, control effectiveness, and overall image quality compared to existing alternatives. |
Current implementation relies on copying the earring image for strict ID-preservation, limiting automatic adjustments to earring rotation and lighting.
Future work will explore techniques to enable automatic adaptation of the earring image while maintaining strict ID-preservation. |
generative models, control-net, strictly-id-preservation, advertising image generation, e-commerce |
2404.04650
Report |
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization |
Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang |
Recent strides in the development of diffusion models, exemplified by
advancements such as Stable Diffusion, have underscored their remarkable
prowess in generating visually compelling images. However, the imperative of
achieving a seamless alignment between the generated image and the provided
prompt persists as a formidable challenge. This paper traces the root of these
difficulties to invalid initial noise, and proposes a solution in the form of
Initial Noise Optimization (InitNO), a paradigm that refines this noise.
Considering text prompts, not all random noises are effective in synthesizing
semantically-faithful images. We design the cross-attention response score and
the self-attention conflict score to evaluate the initial noise, bifurcating
the initial latent space into valid and invalid sectors. A strategically
crafted noise optimization pipeline is developed to guide the initial noise
towards valid regions. Our method, validated through rigorous experimentation,
shows a commendable proficiency in generating images in strict accordance with
text prompts. Our code is available at https://github.com/xiefan-guo/initno. |
This paper proposes Initial Noise Optimization (InitNO) to improve the semantic fidelity of text-to-image diffusion models by optimizing the initial latent noise. |
Existing diffusion models often produce images misaligned with text prompts, exhibiting subject neglect, mixing, and incorrect attribute binding. |
InitNO partitions the initial latent space into valid/invalid regions based on cross-attention response and self-attention conflict scores. Then, it optimizes the noise distribution to steer it into the valid region while maintaining consistency with the standard Gaussian distribution. |
InitNO significantly improves image-text alignment compared to state-of-the-art methods as measured by CLIP similarity scores.
User studies confirm that InitNO generates more visually appealing and semantically accurate images.
InitNO is a plug-and-play method, seamlessly integrating into existing diffusion models for training-free controllable generation tasks like layout-to-image synthesis. |
InitNO incurs higher computational cost than the baseline Stable Diffusion model due to the noise optimization procedure.
The selection of target tokens currently relies on manual input or external language models, which could be automated in future work. |
text-to-image synthesis, diffusion models, latent space optimization, semantic fidelity, attention mechanisms |
2404.04617
Report |
Empowering Image Recovery: A Multi-Attention Approach |
Juan Wen, Yawei Li, Chao Zhang, Weiyan Hou, Radu Timofte, Luc Van Gool |
We propose Diverse Restormer (DART), a novel image restoration method that
effectively integrates information from various sources (long sequences, local
and global regions, feature dimensions, and positional dimensions) to address
restoration challenges. While Transformer models have demonstrated excellent
performance in image restoration due to their self-attention mechanism, they
face limitations in complex scenarios. Leveraging recent advancements in
Transformers and various attention mechanisms, our method utilizes customized
attention mechanisms to enhance overall performance. DART, our novel network
architecture, employs windowed attention to mimic the selective focusing
mechanism of human eyes. By dynamically adjusting receptive fields, it
optimally captures the fundamental features crucial for image resolution
reconstruction. Efficiency and performance balance are achieved through the
LongIR attention mechanism for long sequence image restoration. Integration of
attention mechanisms across feature and positional dimensions further enhances
the recovery of fine details. Evaluation across five restoration tasks
consistently positions DART at the forefront. Upon acceptance, we commit to
providing publicly accessible code and models to ensure reproducibility and
facilitate further research. |
The paper presents Diverse Restormer (DART), a novel image restoration method using a multi-attention transformer to integrate information from various sources. |
Existing transformer models, while effective, face limitations in handling complex scenarios and high-resolution images. This method aims to enhance image restoration by leveraging customized attention mechanisms. |
DART employs a SwinIR backbone with key additions: 1) LongIR attention for efficient long sequence processing, 2) feature dimension attention for emphasizing relevant features, and 3) positional dimension attention for focusing on specific image regions. |
DART achieves state-of-the-art results on synthetic data benchmarks for denoising and super-resolution, outperforming existing methods with fewer parameters.
On real image restoration tasks like motion and defocus deblurring, DART surpasses current best-performing methods, showcasing its effectiveness on real-world challenges.
Ablation studies confirm the contribution of each attention mechanism, and analysis highlights DART's ability to utilize a wider pixel range for superior reconstruction. |
The paper acknowledges the potential for further efficiency improvements in the DART model.
Future work may explore extending DART to other image restoration tasks beyond the ones evaluated. |
image restoration, transformer, attention mechanism, deep learning, computer vision |
2404.04562
Report |
Diffusion Time-step Curriculum for One Image to 3D Generation |
Xuanyu Yi, Zike Wu, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Hanwang Zhang |
Score distillation sampling (SDS) has been widely adopted to overcome the
absence of unseen views in reconstructing 3D objects from a single
image. It leverages pre-trained 2D diffusion models as teacher to guide the
reconstruction of student 3D models. Despite their remarkable success,
SDS-based methods often encounter geometric artifacts and texture saturation.
We find that the crux is the overlooked indiscriminate treatment of diffusion
time-steps during optimization: it unreasonably treats the student-teacher
knowledge distillation to be equal at all time-steps and thus entangles
coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion
Time-step Curriculum one-image-to-3D pipeline (DTC123), which involves both the
teacher and student models collaborating with the time-step curriculum in a
coarse-to-fine manner. Extensive experiments on the NeRF4, RealFusion15, GSO, and
Level50 benchmarks demonstrate that DTC123 can produce multi-view consistent,
high-quality, and diverse 3D assets. Code and more generation demos will be
released at https://github.com/yxymessi/DTC123. |
This paper proposes DTC123, a novel diffusion time-step curriculum-based pipeline for enhancing the quality and consistency of single-image 3D generation using Score Distillation Sampling. |
Existing SDS-based methods for single-image 3D generation suffer from geometric artifacts and texture saturation due to the indiscriminate treatment of diffusion time-steps during optimization. |
DTC123 implements a coarse-to-fine optimization strategy guided by a diffusion time-step curriculum. This includes an annealed time-step schedule, progressive student representation (using NeRF and DMTet), and coarse-to-fine teacher guidance (combining Zero-1-to-3 and Stable Diffusion). |
DTC123 generates multi-view consistent, high-fidelity 3D assets, outperforming state-of-the-art methods on benchmarks like NeRF4, RealFusion15, and GSO.
The proposed time-step curriculum significantly improves the robustness of the generation process, reducing failures like Janus faces and geometric distortions.
DTC123 enables multi-instance generation and 3D editing through user-specified prompts. |
The current implementation relies on a two-stage approach with different 3D representations for efficiency, which can be further explored for end-to-end generation.
Exploring more advanced teacher diffusion models and student 3D representations could further improve generation quality. |
3d generation, score distillation sampling, diffusion models, single-image 3d reconstruction, time-step curriculum |
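The annealed time-step schedule mentioned in the DTC123 method summary above can be illustrated with a small sketch. It decays the diffusion time-step linearly from a large value (coarse structure) to a small one (fine detail) as optimization proceeds. The linear form, the bounds `t_max` and `t_min`, and the function name are assumptions made for illustration; they are not taken from the paper.

```python
def annealed_timestep(step: int, total_steps: int,
                      t_max: int = 980, t_min: int = 20) -> int:
    """Return a diffusion time-step that shrinks linearly with progress.

    Hypothetical helper: large time-steps early (coarse geometry),
    small time-steps late (fine texture).
    """
    if total_steps <= 1:
        return t_min
    # Clamp progress to [0, 1] so out-of-range steps stay within bounds.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(t_max - progress * (t_max - t_min))


# Five optimization steps, from coarse to fine.
schedule = [annealed_timestep(s, 5) for s in range(5)]
```

A real pipeline would probably sample around this value rather than use it deterministically; the sketch only shows the coarse-to-fine ordering.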
2404.04544
Report |
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion |
Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun |
Generating higher-resolution human-centric scenes with details and controls
remains a challenge for existing text-to-image diffusion models. This challenge
stems from limited training image size, text encoder capacity (limited tokens),
and the inherent difficulty of generating complex scenes involving multiple
humans. While current methods attempted to address training size limit only,
they often yielded human-centric scenes with severe artifacts. We propose
BeyondScene, a novel framework that overcomes prior limitations, generating
exquisite higher-resolution (over 8K) human-centric scenes with exceptional
text-image correspondence and naturalness using existing pretrained diffusion
models. BeyondScene employs a staged and hierarchical approach to initially
generate a detailed base image focusing on crucial elements in instance
creation for multiple humans and detailed descriptions beyond token limit of
diffusion model, and then to seamlessly convert the base image to a
higher-resolution output, exceeding training image size and incorporating
details aware of text and instances via our novel instance-aware hierarchical
enlargement process that consists of our proposed high-frequency injected
forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing
methods in terms of correspondence with detailed text descriptions and
naturalness, paving the way for advanced applications in higher-resolution
human-centric scene creation beyond the capacity of pretrained diffusion models
without costly retraining. Project page:
https://janeyeon.github.io/beyond-scene. |
This paper presents BeyondScene, a novel framework for generating high-resolution, human-centric scenes from text descriptions and poses. The method excels in capturing fine details and ensuring text-image correspondence. |
Existing methods for generating large-scale human-centric scenes struggle with limitations in capturing fine details, maintaining text-image consistency, and controlling human instance generation. BeyondScene addresses these challenges, presenting a solution for producing high-fidelity images that accurately represent complex scenes. |
BeyondScene employs a two-stage process: (1) Detailed Base Image Generation: Human instances are generated based on text and pose using SDXL-ControlNet-Openpose, segmented, and then seamlessly integrated into a background. (2) Instance-Aware Hierarchical Enlargement: The base image is progressively upsampled using High Frequency-Injected Forward Diffusion and Adaptive Joint Diffusion, employing adaptive stride and conditioning for detailed refinement. |
BeyondScene outperforms baselines in generating high-resolution (up to 8K) human-centric scenes with superior text-image correspondence and naturalness, as evidenced by user studies and MLLM-based evaluations.
The method demonstrates superior performance in capturing fine details and anatomical accuracy compared to combining ControlNet with super-resolution techniques.
BeyondScene achieves comparable or greater efficiency in terms of GPU memory usage and FLOPs compared to existing joint diffusion methods. |
While BeyondScene shows promising results, it currently relies on pretrained models like SDXL, potentially inheriting some of their limitations.
Further research could explore expanding the framework to incorporate diverse human appearances, encompassing different ethnicities and body types. |
image generation, human-centric scene synthesis, high-resolution images, text-to-image generation, diffusion models |
2404.04526
Report |
DATENeRF: Depth-Aware Text-based Editing of NeRFs |
Sara Rojas, Julien Philip, Kai Zhang, Sai Bi, Fujun Luan, Bernard Ghanem, Kalyan Sunkavalli |
Recent advancements in diffusion models have shown remarkable proficiency in
editing 2D images based on text prompts. However, extending these techniques to
edit scenes in Neural Radiance Fields (NeRF) is complex, as editing individual
2D frames can result in inconsistencies across multiple views. Our crucial
insight is that a NeRF scene's geometry can serve as a bridge to integrate
these 2D edits. Utilizing this geometry, we employ a depth-conditioned
ControlNet to enhance the coherence of each 2D image modification. Moreover, we
introduce an inpainting approach that leverages the depth information of NeRF
scenes to distribute 2D edits across different images, ensuring robustness
against errors and resampling challenges. Our results reveal that this
methodology achieves more consistent, lifelike, and detailed edits than
existing leading methods for text-driven NeRF scene editing. |
Presents DATENeRF, a method for consistent text-driven editing of NeRF scenes, leveraging depth-aware ControlNet and a novel projection inpainting scheme. |
Existing methods struggle to maintain consistency and quality when editing NeRFs using text prompts, leading to blurry textures and geometric artifacts. DATENeRF addresses these challenges by explicitly using depth information for consistent 2D edits, ultimately leading to higher-quality NeRF editing. |
DATENeRF utilizes a depth-conditioned ControlNet for inpainting masked regions of individual input images. To ensure consistency, the method projects edited pixels from a reference view to other views using the NeRF depth, employing a hybrid inpainting scheme to refine the results and mitigate reprojection artifacts. Finally, an edited NeRF is optimized using the consistent 2D edits. |
DATENeRF generates more realistic and detailed edits than previous methods, accurately reflecting the input text prompts.
The use of depth information significantly improves the consistency of edits across multiple views.
The method demonstrates faster convergence compared to state-of-the-art techniques like Instruct-NeRF2NeRF. |
Limited to edits that do not involve significant geometric changes in the scene.
Performance depends on the accuracy of the NeRF geometry and the editing model's ability to generate content consistent with the depth maps, particularly in complex, large-scale scenes. |
nerf editing, text-based editing, diffusion models, controlnet, 3d scene editing |
2404.04478
Report |
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models |
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang |
Transformers have catalyzed advancements in computer vision and natural
language processing (NLP) fields. However, substantial computational complexity
poses limitations for their application in long-context tasks, such as
high-resolution image generation. This paper introduces a series of
architectures adapted from the RWKV model used in NLP, with requisite
modifications tailored for diffusion models applied to image generation tasks,
referred to as Diffusion-RWKV. Similar to the diffusion with Transformers, our
model is designed to efficiently handle patchnified inputs in a sequence with
extra conditions, while also scaling up effectively, accommodating both
large-scale parameters and extensive datasets. Its distinctive advantage
manifests in its reduced spatial aggregation complexity, rendering it
exceptionally adept at processing high-resolution images, thereby eliminating
the necessity for windowing or group cached operations. Experimental results on
both conditional and unconditional image generation tasks demonstrate that
Diffusion-RWKV performs on par with or better than existing CNN or
Transformer-based diffusion models in FID and IS metrics while significantly
reducing total computation FLOP usage. |
This paper introduces Diffusion-RWKV, adapting the RWKV architecture from NLP for image generation tasks using diffusion models. Diffusion-RWKV efficiently handles long-range dependencies in image data while maintaining linear computational complexity, making it a computationally efficient alternative to Transformer-based diffusion models. |
Transformers, while powerful, face limitations in high-resolution image generation due to their quadratic computational complexity, especially with long sequences. This necessitates exploring alternative architectures that offer comparable performance with reduced computational demands. |
Diffusion-RWKV leverages a bidirectional RWKV (Bi-RWKV) backbone for sequential image data processing. It incorporates modifications like image patchnification, skip connections between Bi-RWKV blocks, and different conditional information incorporation techniques (in-context, adaLN, adaLN-Zero). The study analyzes the computational complexity of Diffusion-RWKV and explores various model configurations and scaling options. |
Diffusion-RWKV achieves FID scores comparable to Transformer-based diffusion models (like DiT and U-ViT) on CIFAR-10 and CelebA datasets while using fewer parameters.
Ablation studies demonstrate the impact of patch size, skip connections, and conditioning methods on model performance, with smaller patch sizes and the adaLN-Zero block proving beneficial.
On ImageNet, Diffusion-RWKV exhibits strong performance for class-conditional image generation at resolutions of 256x256 and 512x512, achieving competitive FID scores with reduced computational cost compared to DiT. |
Future work could explore integrating advanced strategies from transformer-based models (e.g., from SiT) into the Diffusion-RWKV backbone.
Further investigation into optimizing the model for even higher-resolution image generation is warranted. |
image generation, diffusion models, rwkv, linear complexity, transformer alternative |
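The complexity argument in the Diffusion-RWKV summary above can be made concrete with a back-of-the-envelope calculation. The sketch compares how the token-mixing cost grows for a quadratic (self-attention-like) operator and a linear (RWKV-like) operator when the patch grid doubles in each dimension. The cost formulas ignore constants and all other layers, and the grid sizes and channel width are assumed values chosen for illustration.

```python
def num_patches(height: int, width: int, patch_size: int) -> int:
    """Number of tokens after splitting a grid into square patches."""
    return (height // patch_size) * (width // patch_size)


def quadratic_cost(n_tokens: int, dim: int) -> int:
    """Order-of-magnitude cost of pairwise token mixing: O(N^2 * d)."""
    return n_tokens * n_tokens * dim


def linear_cost(n_tokens: int, dim: int) -> int:
    """Order-of-magnitude cost of recurrent-style token mixing: O(N * d)."""
    return n_tokens * dim


dim = 768  # assumed channel width
small = num_patches(32, 32, patch_size=2)   # 256 tokens
large = num_patches(64, 64, patch_size=2)   # 1024 tokens

quadratic_growth = quadratic_cost(large, dim) / quadratic_cost(small, dim)
linear_growth = linear_cost(large, dim) / linear_cost(small, dim)
```

Under these assumptions, quadrupling the token count multiplies the quadratic term by 16 but the linear term only by 4, which is the gap the summary refers to.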
2404.04474
Report |
RoNet: Rotation-oriented Continuous Image Translation |
Yi Li, Xin Xie, Lina Lei, Haiyan Fu, Yanqing Guo |
The generation of smooth and continuous images between domains has recently
drawn much attention in image-to-image (I2I) translation. A linear relationship
is the basic assumption in most existing approaches, although it is applied to
different aspects, including features, models, or labels. However, the linear
assumption becomes hard to satisfy as the element dimension increases, and it suffers
from the limitation of having to obtain both ends of the line. In this paper, we
propose a novel rotation-oriented solution and model the continuous generation
with an in-plane rotation over the style representation of an image, achieving
a network named RoNet. A rotation module is implanted in the generation network
to automatically learn the proper plane while disentangling the content and the
style of an image. To encourage realistic texture, we also design a patch-based
semantic style loss that learns the different styles of similar objects in
different domains. We conduct experiments on forest scenes (where the complex
texture makes the generation very challenging), faces, streetscapes and the
iphone2dslr task. The results validate the superiority of our method in terms
of visual quality and continuity. |
This paper proposes RoNet, a novel rotation-oriented network for continuous image-to-image translation that overcomes limitations of linear interpolation methods. |
Continuous image translation with smooth transitions between domains is challenging, and existing linear interpolation methods suffer from limitations like requiring both source and target domain data and struggling with high-dimensional data. |
RoNet uses a rotation module to learn a rotation plane for style representations, enabling continuous generation by rotating the representation within this plane. It also introduces a patch-based semantic style loss to improve texture realism. |
RoNet generates realistic and continuous image translations across various domains like seasons, time of day, and artistic styles.
It outperforms existing methods in both visual quality and quantitative metrics like LPIPS, FID, and KID.
Ablation studies demonstrate the effectiveness of each component, particularly the rotation module and the semantic style loss. |
The paper mainly focuses on cyclic translations and could explore more complex manifold representations.
Further research on automatically handling imbalanced datasets in continuous image translation is promising. |
image-to-image translation, continuous generation, style representation, rotation module, semantic style loss |
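The in-plane rotation idea in the RoNet summary above can be sketched with plain vector arithmetic. Given a style vector and two orthonormal vectors spanning a plane, the sketch rotates only the in-plane component by an angle and leaves the rest untouched. In RoNet the plane is learned by the rotation module; here `u` and `v` are fixed by hand, and all names are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def rotate_in_plane(s, u, v, theta):
    """Rotate the component of s lying in span(u, v) by angle theta.

    Assumes u and v are orthonormal. The component of s orthogonal to
    the plane is left unchanged.
    """
    a, b = dot(s, u), dot(s, v)            # in-plane coordinates of s
    a_rot = a * math.cos(theta) - b * math.sin(theta)
    b_rot = a * math.sin(theta) + b * math.cos(theta)
    # Swap the old in-plane coordinates for the rotated ones.
    return [si + (a_rot - a) * ui + (b_rot - b) * vi
            for si, ui, vi in zip(s, u, v)]


style = [1.0, 0.0, 2.0]
u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]

# Sweeping theta from 0 to pi/2 gives a continuous path between two styles.
quarter_turn = rotate_in_plane(style, u, v, math.pi / 2)
```

Because the operation is a rotation, the norm of the style vector is preserved along the whole path, unlike linear interpolation between two endpoints.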
2404.04469
Report |
Mixed-Query Transformer: A Unified Image Segmentation Architecture |
Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto |
Existing unified image segmentation models either employ a unified
architecture across multiple tasks but use separate weights tailored to each
dataset, or apply a single set of weights to multiple datasets but are limited
to a single task. In this paper, we introduce the Mixed-Query Transformer
(MQ-Former), a unified architecture for multi-task and multi-dataset image
segmentation using a single set of weights. To enable this, we propose a mixed
query strategy, which can effectively and dynamically accommodate different
types of objects without heuristic designs. In addition, the unified
architecture allows us to use data augmentation with synthetic masks and
captions to further improve model generalization. Experiments demonstrate that
MQ-Former can not only effectively handle multiple segmentation datasets and
tasks compared to specialized state-of-the-art models with competitive
performance, but also generalize better to open-set segmentation tasks,
evidenced by over 7 points higher performance than the prior art on the
open-vocabulary SeginW benchmark. |
This paper introduces MQ-Former, a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights, enabled by a novel mixed query strategy. |
Existing unified segmentation models are limited to either separate weights for each dataset or a single task, hindering their ability to leverage diverse information across tasks and datasets for real-world open-world applications. |
MQ-Former employs a mixed query strategy combining learnable and conditional queries, dynamically matched to objects via Hungarian matching, eliminating the need for heuristic thing/stuff class distinction. It is trained jointly on multiple datasets and tasks, further enhanced by incorporating synthetic masks and captions. |
MQ-Former effectively handles multiple segmentation tasks and datasets with competitive performance compared to specialized models.
It demonstrates superior generalization to open-set segmentation, outperforming the state-of-the-art by over 7 points on the SeginW benchmark.
The use of synthetic data significantly improves performance, highlighting its potential for addressing data scarcity in segmentation. |
MQ-Former currently lacks explicit support for reasoning segmentation tasks requiring complex reasoning abilities.
The paper doesn't explore cross-modality feature fusion, which could further enhance performance but at the cost of increased computational resources. |
image segmentation, unified architecture, multi-task learning, multi-dataset training, synthetic data |
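The query-to-object matching step in the MQ-Former summary above can be illustrated on a toy cost matrix. The sketch below finds the minimum-cost one-to-one assignment by brute force over permutations; this is a stand-in for the Hungarian algorithm, which returns the same optimal assignment far more efficiently. The cost values and the function name are made up for illustration.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Match each object to a distinct query with minimal total cost.

    cost[q][o] is the cost of assigning query q to object o.
    Requires at least as many queries as objects. Brute force, so only
    suitable for tiny examples.
    """
    n_queries, n_objects = len(cost), len(cost[0])
    best_queries, best_total = None, float("inf")
    for queries in permutations(range(n_queries), n_objects):
        total = sum(cost[q][o] for o, q in enumerate(queries))
        if total < best_total:
            best_total, best_queries = total, queries
    pairs = [(q, o) for o, q in enumerate(best_queries)]
    return pairs, best_total


# Three queries, two ground-truth objects (hypothetical matching costs).
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
pairs, total = min_cost_matching(cost)
```

Here query 1 is matched to object 0 and query 0 to object 1, while query 2 stays unmatched. Nothing in the matching itself refers to a thing/stuff distinction.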
2404.04465
Report |
Aligning Diffusion Models by Optimizing Human Utility |
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka |
We present Diffusion-KTO, a novel approach for aligning text-to-image
diffusion models by formulating the alignment objective as the maximization of
expected human utility. Since this objective applies to each generation
independently, Diffusion-KTO does not require collecting costly pairwise
preference data nor training a complex reward model. Instead, our objective
requires simple per-image binary feedback signals, e.g. likes or dislikes,
which are abundantly available. After fine-tuning using Diffusion-KTO,
text-to-image diffusion models exhibit superior performance compared to
existing techniques, including supervised fine-tuning and Diffusion-DPO, both
in terms of human judgment and automatic evaluation metrics such as PickScore
and ImageReward. Overall, Diffusion-KTO unlocks the potential of leveraging
readily available per-image binary signals and broadens the applicability of
aligning text-to-image diffusion models with human preferences. |
Presents Diffusion-KTO, a novel approach for aligning text-to-image diffusion models with human preference using per-sample binary feedback (likes or dislikes) and without training a reward model. |
Addresses the limitations of existing alignment methods that require expensive pairwise preference data or complex reward model training. Leverages readily available binary feedback to improve the alignment and applicability of text-to-image models. |
Extends the human utility maximization framework to diffusion models by optimizing a utility function based on the implicit reward of each step in the denoising process. Explores various utility functions, finding the Kahneman & Tversky model most effective. |
Diffusion-KTO significantly improves image quality and alignment with human preferences, as judged by both human evaluation and automated metrics.
Outperforms existing alignment methods, including supervised fine-tuning, Diffusion-DPO, and AlignProp, across various metrics like PickScore, HPS v2, and ImageReward.
Demonstrates potential for aligning models with individual user preferences through synthetic experiments simulating custom heuristics. |
Inherits potential biases and limitations present in the training data and the base text-to-image model.
Exploration of alternative utility functions and their impact on alignment remains an open question. |
text-to-image synthesis, diffusion models, human preference learning, utility maximization, binary feedback |
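The per-image utility described in the Diffusion-KTO summary above can be sketched as a scalar function. The sketch applies an S-shaped utility to a signed implicit reward, with the sign set by the binary like/dislike label. The sigmoid is used here as one S-shaped choice standing in for the Kahneman & Tversky-style utility the summary mentions; the exact functional form, `beta`, the `reference` point, and the function names are assumptions, and computing the implicit reward from a diffusion model is out of scope.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used here as an S-shaped utility."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_style_utility(implicit_reward: float, liked: bool,
                      beta: float = 1.0, reference: float = 0.0) -> float:
    """Utility of one generation given binary feedback.

    Liked samples gain utility as their implicit reward rises above the
    reference point; disliked samples gain utility as it falls below.
    """
    sign = 1.0 if liked else -1.0
    return sigmoid(sign * (beta * implicit_reward - reference))


liked_utility = kto_style_utility(0.8, liked=True)
disliked_utility = kto_style_utility(0.8, liked=False)
```

Training would maximize the expected utility over the dataset, so no pairwise comparison between two images of the same prompt is needed.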
2404.04421
Report |
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations |
Yang Zheng, Qingqing Zhao, Guandao Yang, Wang Yifan, Donglai Xiang, Florian Dubost, Dmitry Lagun, Thabo Beeler, Federico Tombari, Leonidas Guibas, Gordon Wetzstein |
Modeling and rendering photorealistic avatars is of crucial importance in
many applications. Existing methods that build a 3D avatar from visual
observations, however, struggle to reconstruct clothed humans. We introduce
PhysAvatar, a novel framework that combines inverse rendering with inverse
physics to automatically estimate the shape and appearance of a human from
multi-view video data along with the physical parameters of the fabric of their
clothes. For this purpose, we adopt a mesh-aligned 4D Gaussian technique for
spatio-temporal mesh tracking as well as a physically based inverse renderer to
estimate the intrinsic material properties. PhysAvatar integrates a physics
simulator to estimate the physical parameters of the garments using
gradient-based optimization in a principled manner. These novel capabilities
enable PhysAvatar to create high-quality novel-view renderings of avatars
dressed in loose-fitting clothes under motions and lighting conditions not seen
in the training data. This marks a significant advancement towards modeling
photorealistic digital humans using physically based inverse rendering with
physics in the loop. Our project website is at:
https://qingqing-zhao.github.io/PhysAvatar |
Introduces PhysAvatar, a novel framework that reconstructs dressed 3D avatars from multi-view video, accurately modeling garment physics and appearance. |
Existing methods struggle to realistically model loose-fitting clothes, neglecting physically accurate garment dynamics. |
Combines mesh tracking with 4D Gaussians, physics-based parameter optimization using a simulator, and appearance refinement via inverse rendering. |
Outperforms state-of-the-art methods in geometry accuracy, capturing fine wrinkle details.
Achieves superior appearance quality, particularly in capturing high-frequency details.
Enables animation, relighting, and redressing, and is compatible with traditional graphics pipelines. |
Relies on manual garment segmentation and mesh UV unwrapping.
Limited by the accuracy of the SMPL-X body model used as a collider. |
neural rendering, physics simulation, 3d avatar, inverse rendering, garment modeling |
2404.04376
Report |
ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing |
Alec Helbling, Seongmin Lee, Polo Chau |
Recently, researchers have proposed powerful systems for generating and
manipulating images using natural language instructions. However, it is
difficult to precisely specify many common classes of image transformations
with text alone. For example, a user may wish to change the location and breed
of a particular dog in an image with several similar dogs. This task is quite
difficult with natural language alone, and would require a user to write a
laboriously complex prompt that both disambiguates the target dog and describes
the destination. We propose ClickDiffusion, a system for precise image
manipulation and generation that combines natural language instructions with
visual feedback provided by the user through a direct manipulation interface.
We demonstrate that by serializing both an image and a multi-modal instruction
into a textual representation it is possible to leverage LLMs to perform
precise transformations of the layout and appearance of an image. Code
available at https://github.com/poloclub/ClickDiffusion. |
Presents ClickDiffusion, an interactive image editing system that combines natural language instructions with visual feedback through direct manipulation for precise image editing. |
Existing text-based image editing methods lack precision for complex tasks that require object disambiguation and specific location editing. Direct manipulation alone is inflexible and limited to predefined operations. |
The system serializes multi-modal instructions (text + bounding boxes) into a textual format, processes them using an LLM with in-context learning, manipulates an intermediate image layout, and generates the final edited image using layout-based image generation (GLIGEN). |
Enables precise object manipulation and appearance changes within complex scenes.
Simplifies complex editing tasks compared to text-only methods.
Leverages LLMs' few-shot learning abilities for generalization to diverse editing operations. |
Limited user study and quantitative evaluation.
Reliance on layout-based image generation may inherit limitations of such methods. |
image editing, direct manipulation, natural language processing, large language models, human-computer interaction |
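The serialization step in the ClickDiffusion summary above can be sketched as follows: an image layout (labeled bounding boxes), a text instruction, and the user's selection are flattened into a single string that could be placed in an LLM prompt. The JSON schema, field names, and normalized box convention are hypothetical and are not the format used by the ClickDiffusion code.

```python
import json


def serialize_edit_request(objects, instruction, selected_ids):
    """Flatten a layout plus a multi-modal instruction into one text string.

    objects: list of dicts with "id", "label", and "box" (x0, y0, x1, y1),
    with coordinates assumed normalized to [0, 1].
    selected_ids: ids of the objects the user clicked or boxed.
    """
    payload = {
        "layout": [
            {"id": obj["id"], "label": obj["label"], "box": list(obj["box"])}
            for obj in objects
        ],
        "instruction": instruction,
        "selected": sorted(selected_ids),
    }
    return json.dumps(payload, sort_keys=True)


# Two similar dogs; the click disambiguates which one the text refers to.
objects = [
    {"id": 0, "label": "dog", "box": (0.10, 0.50, 0.30, 0.90)},
    {"id": 1, "label": "dog", "box": (0.60, 0.50, 0.85, 0.90)},
]
prompt_text = serialize_edit_request(
    objects, "make this dog a corgi and move it left", selected_ids=[1]
)
```

The LLM would be asked to return an edited layout in the same textual form, which a layout-conditioned generator can then render.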
2404.04363
Report |
Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs |
Junhao Chen, Xiang Li, Xiaojun Ye, Chao Li, Zhaoxin Fan, Hao Zhao |
In this paper, we pursue a novel 3D AIGC setting: generating 3D content from
IDEAs. The definition of an IDEA is the composition of multimodal inputs
including text, image, and 3D models. To our knowledge, this challenging and
appealing 3D AIGC setting has not been studied before. We propose the novel
framework called Idea-2-3D to achieve this goal, which consists of three agents
based upon large multimodal models (LMMs) and several existing algorithmic
tools for them to invoke. Specifically, these three LMM-based agents are
prompted to do the jobs of prompt generation, model selection and feedback
reflection. They work in a cycle that involves both mutual collaboration and
criticism. Note that this cycle is done in a fully automatic manner, without
any human intervention. The framework then outputs a text prompt to generate 3D
models that align well with input IDEAs. We show impressive 3D AIGC results
that are beyond what any previous methods can achieve. For quantitative comparisons,
we construct caption-based baselines using a whole bunch of state-of-the-art 3D
AIGC models and demonstrate that Idea-2-3D significantly outperforms them. In 94.2% of
cases, Idea-2-3D meets users' requirements, marking a degree of match between
IDEA and 3D models that is 2.3 times higher than baselines. Moreover, in 93.5%
of the cases, users agreed that Idea-2-3D was better than baselines. Code,
data, and models will be made publicly available. |
This paper introduces Idea-2-3D, a novel framework employing LMM (Large Multimodal Model) agents to automatically generate 3D models from interleaved multimodal inputs called IDEAs (combinations of text, images, and 3D models). |
Existing 3D AIGC models primarily rely on single-modality inputs (text or image) and struggle to capture the complexity of human creative ideas often expressed through a blend of modalities. Idea-2-3D bridges this gap, enabling a more natural and expressive way to design in 3D. |
Idea-2-3D leverages three LMM agents powered by GPT-4V for prompt generation, 3D model selection, and feedback reflection. It iteratively refines the generated 3D model by converting it into multi-view images, feeding them back to the LMM agents, and leveraging a memory module to learn from previous iterations. |
Idea-2-3D significantly outperforms caption-based T-2-3D baselines in user preference studies, demonstrating higher alignment with user IDEAs.
The iterative self-refinement process in Idea-2-3D leads to incremental improvements in the generated 3D models, effectively capturing and translating complex multimodal design concepts.
Ablation studies highlight the importance of the memory module, feedback mechanism, and storage of previous models in achieving high-quality and convergent 3D model generation. |
The reliance on closed-source LMMs like GPT-4V poses limitations in terms of accessibility and reproducibility.
Future work could explore alternative open-source LMMs or investigate methods to reduce the dependency on proprietary models while maintaining performance. |
lmm agents, 3d aigc, automated 3d design, multimodal learning, generative ai |
2404.04346
Report |
Koala: Key frame-conditioned long video-LLM |
Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko |
Long video question answering is a challenging task that involves recognizing
short-term activities and reasoning about their fine-grained relationships.
State-of-the-art video Large Language Models (vLLMs) hold promise as a viable
solution due to their demonstrated emergent capabilities on new tasks. However,
despite being trained on millions of short seconds-long videos, vLLMs are
unable to understand minutes-long videos and accurately answer questions about
them. To address this limitation, we propose a lightweight and self-supervised
approach, Key frame-conditioned long video-LLM (Koala), that introduces
learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to
longer videos. Our approach introduces two new tokenizers that condition on
visual tokens computed from sparse video key frames for understanding short and
long video moments. We train our proposed approach on HowTo100M and demonstrate
its effectiveness on zero-shot long video understanding benchmarks, where it
outperforms state-of-the-art large models by 3-6% in absolute accuracy across
all tasks. Surprisingly, we also empirically show that our approach not only
helps a pretrained vLLM to understand long videos but also improves its
accuracy on short-term action recognition. |
Introduces Koala, a lightweight approach to adapt pretrained short video-LLMs to understand and answer questions about minutes-long videos by conditioning on sparsely sampled key frames. |
Existing video-LLMs, despite being trained on millions of short videos, struggle to understand and answer questions about minutes-long videos. |
Introduces Conditioned Segment (CS) and Conditioned Video (CV) tokenizer functions that leverage global video context from coarsely sampled key frames to aggregate fine-grained spatiotemporal information from local video segments. |
Outperforms state-of-the-art large models by 3-6% in absolute accuracy on zero-shot long video understanding benchmarks (EgoSchema and Seed-Bench).
Demonstrates improved accuracy on short-term action recognition, suggesting benefits for both short and long-term video understanding.
Empirical analysis highlights the importance of global context conditioning and spatiotemporal queries in the proposed tokenizer functions. |
Limited scalability to extremely long videos (e.g., movies) due to the maximum input token limit of pretrained LLMs.
Potential for further improvement by incorporating curated descriptive annotations during the final finetuning stage. |
video-llm, long-form video understanding, spatiotemporal reasoning, key frame conditioning, multimodal learning |
2404.04319
Report |
SpatialTracker: Tracking Any 2D Pixels in 3D Space |
Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, Xiaowei Zhou |
Recovering dense and long-range pixel motion in videos is a challenging
problem. Part of the difficulty arises from the 3D-to-2D projection process,
leading to occlusions and discontinuities in the 2D motion domain. While 2D
motion can be intricate, we posit that the underlying 3D motion can often be
simple and low-dimensional. In this work, we propose to estimate point
trajectories in 3D space to mitigate the issues caused by image projection. Our
method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth
estimators, represents the 3D content of each frame efficiently using a
triplane representation, and performs iterative updates using a transformer to
estimate 3D trajectories. Tracking in 3D allows us to leverage
as-rigid-as-possible (ARAP) constraints while simultaneously learning a
rigidity embedding that clusters pixels into different rigid parts. Extensive
evaluation shows that our approach achieves state-of-the-art tracking
performance both qualitatively and quantitatively, particularly in challenging
scenarios such as out-of-plane rotation. |
Presents SpatialTracker, a novel method for dense, long-range 2D motion tracking in videos by lifting pixels to 3D and performing tracking with a triplane representation and a learned as-rigid-as-possible (ARAP) constraint. |
Existing 2D tracking methods struggle with occlusions and complex deformations, issues which are mitigated by leveraging the inherent 3D nature of motion and 3D motion priors like the ARAP constraint. |
Uses monocular depth estimation to lift 2D pixels to 3D, represents each frame's 3D content with triplane feature maps, iteratively predicts 3D trajectories using a transformer, and enforces ARAP regularization with a learned rigidity embedding. |
Achieves state-of-the-art 2D tracking performance on TAP-Vid, BADJA, and PointOdyssey benchmarks.
Shows superior qualitative results in handling complex motion and occlusions on challenging videos.
Demonstrates accurate 3D trajectory estimation when ground truth depth is available. |
Performance relies on the accuracy of off-the-shelf monocular depth estimators.
Future work can explore joint learning of depth and motion to further enhance tracking. |
motion tracking, 3d reconstruction, triplane representation, arap constraint, monocular depth estimation |
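The lifting step described for SpatialTracker (2D pixels plus monocular depth to 3D points) can be illustrated with a minimal sketch. It assumes a standard pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy`; these names, the helper functions, and the sample values are hypothetical and not taken from the paper's code.

```python
import numpy as np

def lift_pixels_to_3d(uv, depth, fx, fy, cx, cy):
    """Back-project pixels (u, v) with per-pixel depth z into camera-space 3D points.

    Assumes a pinhole camera model; the intrinsics are illustrative inputs.
    """
    uv = np.asarray(uv, dtype=np.float64)
    z = np.asarray(depth, dtype=np.float64)
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def project_to_2d(points, fx, fy, cx, cy):
    """Project camera-space 3D points back to pixel coordinates."""
    u = fx * points[:, 0] / points[:, 2] + cx
    v = fy * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=1)

# Toy example: two pixels with depths that would come from a depth estimator.
pixels = np.array([[320.0, 240.0], [100.0, 50.0]])
depths = np.array([2.0, 4.0])
points_3d = lift_pixels_to_3d(pixels, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
reprojected = project_to_2d(points_3d, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Projecting the lifted points back recovers the original pixels, which is why errors in the estimated depth translate directly into errors in the 3D trajectories.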
2404.04256
Report |
Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation |
Zifu Wan, Yuhao Wang, Silong Yong, Pingping Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie |
Multi-modal semantic segmentation significantly enhances AI agents'
perception and scene understanding, especially under adverse conditions like
low-light or overexposed environments. Leveraging additional modalities
(X-modality) like thermal and depth alongside traditional RGB provides
complementary information, enabling more robust and reliable segmentation. In
this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic
segmentation, utilizing the Selective Structured State Space Model, Mamba.
Unlike conventional methods that rely on CNNs, with their limited local
receptive fields, or Vision Transformers (ViTs), which offer global receptive
fields at the cost of quadratic complexity, our model achieves global receptive
field coverage with linear complexity. By employing a Siamese encoder and
innovating a Mamba fusion mechanism, we effectively select essential
information from different modalities. A decoder is then developed to enhance
the channel-wise modeling ability of the model. Our method, Sigma, is
rigorously evaluated on both RGB-Thermal and RGB-Depth segmentation tasks,
demonstrating its superiority and marking the first successful application of
State Space Models (SSMs) in multi-modal perception tasks. Code is available at
https://github.com/zifuwan/Sigma. |
Introduces Sigma, a Siamese Mamba network for multi-modal semantic segmentation using the Selective Structured State Space Model (Mamba) for efficient global receptive field coverage with linear complexity. |
Multi-modal semantic segmentation is crucial for AI agents in challenging conditions, but existing CNN and ViT-based methods have limitations in receptive field size and complexity. Mamba offers a solution with global receptive fields and linear complexity. |
Sigma uses a Siamese encoder with cascaded Visual State Space (VSS) Blocks for feature extraction. A fusion module with Cross and Concat Mamba Blocks aggregates information from different modalities. A channel-aware Mamba decoder refines the features for segmentation. |
Sigma outperforms state-of-the-art models on RGB-Thermal and RGB-Depth semantic segmentation tasks in terms of accuracy and efficiency.
The proposed Mamba fusion mechanism effectively integrates multi-modal information while significantly reducing computational demand compared to Transformer-based fusion.
Ablation studies validate the contribution of each component, especially the fusion module and the channel-aware decoder. |
Current implementation focuses on two modalities, potentially underutilizing Mamba's capacity for longer sequences.
Memory consumption in the Mamba encoder, specifically the four-directional scanning, poses challenges for deployment on lightweight edge devices. |
multi-modal learning, semantic segmentation, state space models, vision mamba, siamese networks |
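The linear-complexity claim for state space models in the Sigma entry can be made concrete with a simplified sketch of a discrete state space recurrence. This is not Sigma's or Mamba's implementation: the matrices `A`, `B`, `C` are fixed here, whereas a selective SSM such as Mamba makes its parameters depend on the input. All names and values are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a 1D input sequence.

    One pass over the sequence, so the cost grows linearly with its length,
    while every output still depends on all earlier inputs through the state h.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t   # state update
        ys.append(C @ h)      # readout
    return np.array(ys)

# Toy parameters: a 2-dimensional hidden state with fixed decay rates.
A = np.diag([0.5, 0.9])
B = np.array([1.0, 1.0])
C = np.array([1.0, 0.0])
y = ssm_scan(np.array([1.0, 0.0, 0.0]), A, B, C)
```

An impulse at the first step decays through the state rather than being recomputed against every other position, which is the contrast with the quadratic cost of self-attention.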
2404.04242
Report |
Physical Property Understanding from Language-Embedded Feature Fields |
Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, Shenlong Wang |
Can computers perceive the physical properties of objects solely through
vision? Research in cognitive science and vision science has shown that humans
excel at identifying materials and estimating their physical properties based
purely on visual appearance. In this paper, we present a novel approach for
dense prediction of the physical properties of objects using a collection of
images. Inspired by how humans reason about physics through vision, we leverage
large language models to propose candidate materials for each object. We then
construct a language-embedded point cloud and estimate the physical properties
of each 3D point using a zero-shot kernel regression approach. Our method is
accurate, annotation-free, and applicable to any object in the open world.
Experiments demonstrate the effectiveness of the proposed approach in various
physical property reasoning tasks, such as estimating the mass of common
objects, as well as other properties like friction and hardness. |
Presents NeRF2Physics, a training-free approach for uncertainty-aware dense prediction of physical properties from images. |
Perceiving physical properties from visual data is crucial for applications such as robotics, agriculture, urban planning, and graphics. |
1. Extracts a language-embedded point cloud from a neural radiance field fused with CLIP features. 2. Prompts an LLM to propose candidate materials and their properties. 3. Employs zero-shot CLIP-based kernel regression for per-point property estimation. 4. Aggregates properties for object-level estimates like mass using LLM-based thickness estimations. |
Outperforms baselines on mass estimation using the ABO dataset.
Produces reasonable predictions for diverse physical properties like friction and hardness on a real-world dataset.
Enables creation of physically realistic digital twins for various applications. |
Limited ability to reason about occluded object parts.
Potential for material recognition errors when local appearances are ambiguous. |
physical property estimation, vision-language models, neural radiance fields, zero-shot learning, digital twins |
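The per-point kernel regression step in the NeRF2Physics entry can be sketched as a similarity-weighted average over candidate materials. The softmax kernel, the `temperature` value, the toy 2D feature vectors (standing in for CLIP embeddings), and the density values are all assumptions made for illustration; the paper's exact kernel may differ.

```python
import numpy as np

def kernel_regress(point_feats, material_feats, material_values, temperature=0.1):
    """Estimate a property per point as a similarity-weighted mean of material values.

    point_feats: (N, D) point features; material_feats: (M, D) material text features;
    material_values: (M,) property value proposed for each candidate material.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    m = material_feats / np.linalg.norm(material_feats, axis=1, keepdims=True)
    logits = (p @ m.T) / temperature              # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ material_values

# Toy setup: two candidate materials with illustrative densities in kg/m^3.
material_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
material_values = np.array([700.0, 7800.0])
point_feats = np.array([[1.0, 0.0], [1.0, 1.0]])
estimates = kernel_regress(point_feats, material_feats, material_values)
```

A point whose feature matches one material gets nearly that material's value, while an ambiguous point falls between the candidates.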
2404.04211
Report |
Robust Gaussian Splatting |
François Darmon, Lorenzo Porzi, Samuel Rota-Bulò, Peter Kontschieder |
In this paper, we address common error sources for 3D Gaussian Splatting
(3DGS) including blur, imperfect camera poses, and color inconsistencies, with
the goal of improving its robustness for practical applications like
reconstructions from handheld phone captures. Our main contribution involves
modeling motion blur as a Gaussian distribution over camera poses, allowing us
to address both camera pose refinement and motion blur correction in a unified
way. Additionally, we propose mechanisms for defocus blur compensation and for
addressing color inconsistencies caused by ambient light, shadows, or
camera-related factors like varying white balancing settings. Our proposed
solutions integrate in a seamless way with the 3DGS formulation while
maintaining its benefits in terms of training efficiency and rendering speed.
We experimentally validate our contributions on relevant benchmark datasets
including Scannet++ and Deblur-NeRF, obtaining state-of-the-art results and
thus consistent improvements over relevant baselines. |
The paper proposes a robust Gaussian Splatting (3DGS) method resilient to blur, imperfect camera poses, and color inconsistencies common in real-world captures. |
Existing neural rendering methods, including 3DGS, often falter with real-world data exhibiting blur, inaccurate camera poses, and color inconsistencies, limiting their practicality. |
The method models motion blur as a Gaussian distribution over camera poses, introduces a defocus blur compensation mechanism using an offset correction to the 2D Gaussian covariances, and addresses color inconsistencies via an RGB decoder function with per-image affine color transformations. |
The proposed approach surpasses baselines in perceptual metrics (SSIM, LPIPS) on a real-world benchmark derived from the ScanNet++ dataset.
Ablation studies validate the individual contributions of color transformation, pose optimization, and blur modeling.
While achieving competitive performance on the synthetic DeblurNeRF dataset, the method lags behind state-of-the-art NeRF-based methods, potentially due to their stronger regularization. |
The method doesn't address dynamic blur from non-static objects.
The issue of poor 3DGS generalization to viewpoints far from the training trajectory remains unaddressed. |
neural rendering, 3d gaussian splatting, motion blur, defocus blur, color consistency |
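The per-image affine color transformation mentioned in the Robust Gaussian Splatting entry can be illustrated with a small sketch. It assumes the transform is a 3x3 matrix plus an offset applied to every rendered RGB pixel; how the paper parameterizes and optimizes these values is not shown, and the function name and numbers are hypothetical.

```python
import numpy as np

def apply_affine_color(rendered_rgb, A, b):
    """Apply c' = A c + b to every pixel of an (H, W, 3) rendered image.

    A: (3, 3) color matrix and b: (3,) offset, one pair per training image.
    """
    return rendered_rgb @ A.T + b

# Toy example: a uniform gray render with a per-image white-balance shift.
image = np.full((2, 2, 3), 0.5)
A = np.diag([1.2, 1.0, 0.8])
b = np.array([0.05, 0.0, -0.05])
corrected = apply_affine_color(image, A, b)
unchanged = apply_affine_color(image, np.eye(3), np.zeros(3))
```

With an identity matrix and zero offset the render is left untouched, so the correction only absorbs appearance differences between captures.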
2404.04057
Report |
Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation |
Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, Hai Huang |
We introduce Score identity Distillation (SiD), an innovative data-free
method that distills the generative capabilities of pretrained diffusion models
into a single-step generator. SiD not only facilitates an exponentially fast
reduction in Fréchet inception distance (FID) during distillation but also
approaches or even exceeds the FID performance of the original teacher
diffusion models. By reformulating forward diffusion processes as semi-implicit
distributions, we leverage three score-related identities to create an
innovative loss mechanism. This mechanism achieves rapid FID reduction by
training the generator using its own synthesized images, eliminating the need
for real data or reverse-diffusion-based generation, all accomplished within
significantly shortened generation time. Upon evaluation across four benchmark
datasets, the SiD algorithm demonstrates high iteration efficiency during
distillation and surpasses competing distillation approaches, whether they are
one-step or few-step, data-free, or dependent on training data, in terms of
generation quality. This achievement not only redefines the benchmarks for
efficiency and effectiveness in diffusion distillation but also in the broader
field of diffusion-based generation. The PyTorch implementation is available at
https://github.com/mingyuanzhou/SiD |
This paper introduces Score identity Distillation (SiD), a novel data-free method for distilling pretrained diffusion models into single-step generators, achieving fast distillation and generation quality that approaches or even exceeds that of the teacher model. |
Diffusion models, while powerful, suffer from slow multi-step generation. SiD addresses this by enabling single-step generation while maintaining or improving upon the original model's quality, offering significant efficiency gains. |
SiD leverages a novel perspective of forward diffusion as semi-implicit distributions. It introduces three score-related identities to formulate a loss mechanism, approximating a model-based score-matching loss using score estimation and Monte Carlo methods. |
SiD achieves exponentially fast reduction in FID during distillation, surpassing competing methods in efficiency.
The SiD-trained single-step generator approaches or surpasses the FID performance of the original multi-step teacher diffusion model.
Evaluations on four benchmark datasets show SiD's superior performance over existing single and multi-step generators, both data-free and data-dependent. |
SiD requires managing three networks during distillation, leading to higher memory demands than traditional diffusion model training.
While SiD outperforms competitors on most benchmarks, further investigation is needed for scenarios like ImageNet 64x64, where it currently lags behind the teacher model. |
diffusion distillation, score matching, deep generative models, single-step generation, semi-implicit distributions |
2404.04037
Report |
InstructHumans: Editing Animated 3D Human Textures with Instructions |
Jiayin Zhu, Linlin Yang, Angela Yao |
We present InstructHumans, a novel framework for instruction-driven 3D human
texture editing. Existing text-based editing methods use Score Distillation
Sampling (SDS) to distill guidance from generative models. This work shows that
naively using such scores is harmful to editing as they destroy consistency
with the source avatar. Instead, we propose an alternate SDS for Editing
(SDS-E) that selectively incorporates subterms of SDS across diffusion
timesteps. We further enhance SDS-E with spatial smoothness regularization and
gradient-based viewpoint sampling to achieve high-quality edits with sharp and
high-fidelity detailing. InstructHumans significantly outperforms existing 3D
editing methods, consistent with the initial avatar while faithful to the
textual instructions. Project page: https://jyzhu.top/instruct-humans . |
This paper introduces InstructHumans, a novel framework for instruction-driven 3D human texture editing, enabling users to modify animatable human avatars using text instructions. |
Existing text-driven 3D human editing methods are limited to non-animatable avatars or suffer from poor texture quality, failing to balance consistency with the original avatar and adherence to textual instructions. This work addresses these limitations by proposing a new method specifically tailored for editing animatable 3D humans with high fidelity and faithfulness to instructions. |
The paper proposes SDS for Editing (SDS-E), a customized score distillation sampling method for 3D editing. SDS-E selectively incorporates subterms of SDS across different diffusion timesteps, addressing the limitations of naive SDS application. The framework further enhances editing quality and efficiency using a Laplacian smoothness regularizer to maintain texture coherence and a gradient-aware viewpoint sampling strategy to optimize editing efforts. |
InstructHumans effectively edits 3D human textures based on text instructions while preserving the original avatar's identity and animation capability.
The proposed SDS-E method successfully distills editing guidance by selectively applying SDS terms across timesteps, outperforming naive SDS in editing quality.
The Laplacian smoothness regularizer and gradient-aware viewpoint sampling further improve the editing outcome and efficiency, respectively. |
The framework depends on a hybrid human representation that might limit capturing high-frequency details, leading to potential artifacts. Adopting higher-resolution mesh templates and training with larger datasets are suggested as solutions.
The disentanglement of textural and geometric changes with 2D guidance remains a challenge. Future research could focus on addressing the texture-geometry ambiguity for more comprehensive 3D human editing. |
3d human texture editing, text-guided editing, score distillation sampling, animatable avatars, diffusion models |
2404.03836
Report |
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model |
Amrin Kareem, Jean Lahoud, Hisham Cholakkal |
Recent advancements in 3D perception systems have significantly improved
their ability to perform visual recognition tasks such as segmentation.
However, these systems still heavily rely on explicit human instruction to
identify target objects or categories, lacking the capability to actively
reason and comprehend implicit user intentions. We introduce a novel
segmentation task known as reasoning part segmentation for 3D objects, aiming
to output a segmentation mask based on complex and implicit textual queries
about specific parts of a 3D object. To facilitate evaluation and benchmarking,
we present a large 3D dataset comprising over 60k instructions paired with
corresponding ground-truth part segmentation annotations specifically curated
for reasoning-based 3D part segmentation. We propose a model that is capable of
segmenting parts of 3D objects based on implicit textual queries and generating
natural language explanations corresponding to 3D object segmentation requests.
Experiments show that our method achieves competitive performance to models
that use explicit queries, with the additional abilities to identify part
concepts, reason about them, and complement them with world knowledge. Our
source code, dataset, and trained models are available at
https://github.com/AmrinKareem/PARIS3D. |
This paper introduces a new 3D segmentation task called reasoning part segmentation, demanding models to understand implicit textual queries and reason about 3D object parts for segmentation. |
This task is important for developing intelligent perception systems capable of understanding implicit user intentions and reasoning in 3D contexts, crucial for applications like robotics and human-robot interaction. |
A new dataset, RPSeg3D, is created with reasoning-based instructions and corresponding part segmentation masks for 3D objects. PARIS3D, a multimodal LLM-based model, is proposed, which takes multi-view images of a 3D object and a textual query as input, leverages a vision backbone and LLM for reasoning and explanation generation, and outputs a segmentation mask and textual explanation. |
PARIS3D achieves competitive performance on RPSeg3D, demonstrating its ability to reason and segment based on implicit queries.
The model outperforms baselines in 3D semantic segmentation, particularly when provided with 3D information in the queries.
PARIS3D generalizes to real-world point clouds captured with smartphone LiDAR sensors, showing its practical applicability. |
The current model does not handle instance segmentation, presenting a direction for future work.
Expanding the dataset to include more complex scenes and object interactions would further enhance the model's capabilities. |
3d vision-language models, reasoning, 3d part segmentation, multimodal learning, dataset |
2404.03799
Report |
Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation |
Elham Amin Mansour, Ozan Unal, Suman Saha, Benjamin Bejar, Luc Van Gool |
The increasing relevance of panoptic segmentation is tied to the advancements
in autonomous driving and AR/VR applications. However, the deployment of such
models has been limited due to the expensive nature of dense data annotation,
giving rise to unsupervised domain adaptation (UDA). A key challenge in
panoptic UDA is reducing the domain gap between a labeled source and an
unlabeled target domain while harmonizing the subtasks of semantic and instance
segmentation to limit catastrophic interference. While considerable progress
has been achieved, existing approaches mainly focus on the adaptation of
semantic segmentation. In this work, we focus on incorporating instance-level
adaptation via a novel instance-aware cross-domain mixing strategy IMix. IMix
significantly enhances the panoptic quality by improving instance segmentation
performance. Specifically, we propose inserting high-confidence predicted
instances from the target domain onto source images, retaining the
exhaustiveness of the resulting pseudo-labels while reducing the injected
confirmation bias. Nevertheless, such an enhancement comes at the cost of
degraded semantic performance, attributed to catastrophic forgetting. To
mitigate this issue, we regularize our semantic branch by employing CLIP-based
domain alignment (CDA), exploiting the domain-robustness of natural language
prompts. Finally, we present an end-to-end model incorporating these two
mechanisms called LIDAPS, achieving state-of-the-art results on all popular
panoptic UDA benchmarks. |
This paper proposes LIDAPS, a novel language-guided instance-aware domain-adaptive panoptic segmentation model. It introduces two key components: IMix, an instance-aware cross-domain mixing strategy, and CDA, a CLIP-based domain alignment mechanism, to enhance panoptic segmentation performance in unsupervised domain adaptation. |
The deployment of panoptic segmentation models is limited by the expensive nature of dense data annotation and the domain gap between datasets. This work addresses these challenges by proposing a novel approach to adapt both semantic and instance segmentation tasks, improving the model's ability to generalize to new domains. |
The study introduces IMix, which pastes high-confidence predicted instances from the target domain onto source images, improving instance segmentation while reducing confirmation bias. To mitigate catastrophic forgetting in the semantic branch, CDA aligns both domains with a pre-trained CLIP model using per-pixel text similarity maps. |
LIDAPS achieves state-of-the-art results on popular panoptic UDA benchmarks, surpassing previous methods by up to +3.6 mPQ.
IMix significantly improves instance segmentation performance by simplifying the recognition of target objects and ensuring exhaustive pseudo-label generation.
CDA effectively mitigates catastrophic forgetting in the semantic branch by aligning source and target domains with the CLIP embedding space. |
The confidence threshold for pseudo-mask filtering in IMix needs to be manually determined for different source-target domain pairs.
The refinement phase with IMix introduces additional computational overhead during training. |
unsupervised domain adaptation, panoptic segmentation, instance segmentation, semantic segmentation, clip |
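The IMix step in the LIDAPS entry (pasting high-confidence predicted target instances onto source images) can be sketched as follows. This is a simplified illustration: labels are plain class IDs rather than full panoptic labels, and the function name, the `threshold` value, and the toy data are assumptions.

```python
import numpy as np

def imix_paste(source_img, source_label, target_img, target_masks,
               target_classes, scores, threshold=0.9):
    """Paste confident target-domain instances onto a source image and its label map.

    target_masks: list of (H, W) boolean masks predicted on the target image.
    Only instances whose confidence score reaches the threshold are pasted.
    """
    mixed_img = source_img.copy()
    mixed_label = source_label.copy()
    for mask, cls, score in zip(target_masks, target_classes, scores):
        if score >= threshold:
            mixed_img[mask] = target_img[mask]   # copy instance pixels
            mixed_label[mask] = cls              # overwrite label under the instance
    return mixed_img, mixed_label

# Toy data: a black source image, a white target image, two predicted instances.
source_img = np.zeros((4, 4, 3))
source_label = np.zeros((4, 4), dtype=np.int64)
target_img = np.ones((4, 4, 3))
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[0:2, 0:2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[2:4, 2:4] = True
mixed_img, mixed_label = imix_paste(
    source_img, source_label, target_img,
    [mask_a, mask_b], target_classes=[11, 13], scores=[0.95, 0.4],
)
```

Because the rest of the mixed image keeps its ground-truth source labels, the resulting label map stays exhaustive even though only the confident instance is pasted.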
2404.03736
Report |
SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer |
Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Fan Wang, Xiang Bai |
Recent advances in 2D/3D generative models enable the generation of dynamic
3D objects from a single-view video. Existing approaches utilize score
distillation sampling to form the dynamic scene as dynamic NeRF or dense 3D
Gaussians. However, these methods struggle to strike a balance among reference
view alignment, spatio-temporal consistency, and motion fidelity under
single-view conditions due to the implicit nature of NeRF or the intricate
dense Gaussian motion prediction. To address these issues, this paper proposes
an efficient, sparse-controlled video-to-4D framework named SC4D, that
decouples motion and appearance to achieve superior video-to-4D generation.
Moreover, we introduce Adaptive Gaussian (AG) initialization and Gaussian
Alignment (GA) loss to mitigate shape degeneration issue, ensuring the fidelity
of the learned motion and shape. Comprehensive experimental results demonstrate
that our method surpasses existing methods in both quality and efficiency. In
addition, facilitated by the disentangled modeling of motion and appearance of
SC4D, we devise a novel application that seamlessly transfers the learned
motion onto a diverse array of 4D entities according to textual descriptions. |
This paper introduces SC4D, a novel video-to-4D generation framework that leverages sparse control points and dense 3D Gaussians to disentangle motion and appearance for improved 4D object generation from single-view videos. |
Existing video-to-4D methods struggle to balance reference view alignment, spatio-temporal consistency, and motion fidelity due to limitations in representing dynamic 3D objects. |
SC4D employs a two-stage approach: a coarse stage to initialize sparse control points and their motion, followed by a fine stage that optimizes dense Gaussians guided by the control points using Linear Blend Skinning. Adaptive Gaussian initialization and Gaussian Alignment loss are introduced to mitigate shape degeneration during training. |
SC4D outperforms state-of-the-art methods in both qualitative and quantitative evaluations, demonstrating superior reference view alignment, spatio-temporal consistency, and motion fidelity.
SC4D exhibits efficiency in training, requiring significantly less time compared to existing approaches.
The disentangled motion representation enables a novel application for motion transfer, allowing the generation of diverse 4D entities with consistent motion based on text descriptions. |
SC4D's reliance on novel view synthesis models like Zero123 limits its performance on complex objects where such models may struggle.
The current framework is restricted to static camera scenarios and does not account for moving camera viewpoints, presenting an area for future exploration. |
video-to-4d generation, dynamic gaussian splatting, motion transfer, sparse control points, shape degeneration |
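The skinning step in the SC4D entry, where sparse control points drive dense Gaussians, can be sketched with generic linear blend skinning. The sketch uses the textbook form (a weighted sum of per-control-point rigid transforms); SC4D's exact formulation, weight computation, and variable names are not reproduced here and everything below is an illustrative assumption.

```python
import numpy as np

def linear_blend_skinning(points, weights, rotations, translations):
    """Deform dense points as a weighted blend of K rigid transforms.

    points: (N, 3); weights: (N, K) with rows summing to 1;
    rotations: (K, 3, 3); translations: (K, 3), one transform per control point.
    """
    # transformed[k, n] = R_k @ p_n + t_k
    transformed = np.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    # Blend the K candidate positions of each point with its skinning weights.
    return np.einsum("nk,kni->ni", weights, transformed)

# Toy example: one point influenced equally by two control points.
points = np.array([[1.0, 1.0, 1.0]])
weights = np.array([[0.5, 0.5]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
deformed = linear_blend_skinning(points, weights, rotations, translations)
```

Only the control points carry motion parameters, so the dense points inherit a smooth deformation without each needing its own trajectory.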
2404.03658
Report |
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning |
Rui Li, Tobias Fischer, Mattia Segu, Marc Pollefeys, Luc Van Gool, Federico Tombari |
Recovering the 3D scene geometry from a single view is a fundamental yet
ill-posed problem in computer vision. While classical depth estimation methods
infer only a 2.5D scene representation limited to the image plane, recent
approaches based on radiance fields reconstruct a full 3D representation.
However, these methods still struggle with occluded regions since inferring
geometry without visual observation requires (i) semantic knowledge of the
surroundings, and (ii) reasoning about spatial context. We propose KYN, a novel
method for single-view scene reconstruction that reasons about semantic and
spatial context to predict each point's density. We introduce a vision-language
modulation module to enrich point features with fine-grained semantic
information. We aggregate point representations across the scene through a
language-guided spatial attention mechanism to yield per-point density
predictions aware of the 3D semantic context. We show that KYN improves 3D
shape recovery compared to predicting density for each 3D point in isolation.
We achieve state-of-the-art results in scene and object reconstruction on
KITTI-360, and show improved zero-shot generalization compared to prior work.
Project page: https://ruili3.github.io/kyn. |
Proposes KYN, a single-view scene reconstruction method that leverages semantic and spatial context to predict 3D point density. |
Existing methods struggle to accurately reconstruct occluded regions due to a lack of semantic understanding and spatial context. |
Introduces a vision-language modulation module to enrich point features with semantic information and a vision-language spatial attention mechanism to aggregate these features across the scene. |
KYN achieves state-of-the-art scene and object reconstruction results on KITTI-360.
KYN exhibits superior performance in reconstructing occluded regions compared to previous methods.
KYN demonstrates improved zero-shot generalization on the DDAD dataset. |
The performance of KYN is limited by the quality of the semantic segmentation.
The memory footprint of the spatial attention mechanism can be further optimized. |
single-view reconstruction, 3d scene understanding, vision-language learning, spatial attention, semantic segmentation |
2404.03657
Report |
OW-VISCap: Open-World Video Instance Segmentation and Captioning |
Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing |
Open-world video instance segmentation is an important video understanding
task. Yet most methods either operate in a closed-world setting, require an
additional user-input, or use classic region-based proposals to identify never
before seen objects. Further, these methods only assign a one-word label to
detected objects, and don't generate rich object-centric descriptions. They
also often suffer from highly overlapping predictions. To address these issues,
we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap),
an approach to jointly segment, track, and caption previously seen or unseen
objects in a video. For this, we introduce open-world object queries to
discover never before seen objects without additional user-input. We generate
rich and descriptive object-centric captions for each detected object via a
masked attention augmented LLM input. We introduce an inter-query contrastive
loss to ensure that the object queries differ from one another. Our generalized
approach matches or surpasses state-of-the-art on three tasks: open-world video
instance segmentation on the BURST dataset, dense video object captioning on
the VidSTG dataset, and closed-world video instance segmentation on the OVIS
dataset. |
Proposes OW-VISCap, an approach to jointly segment, track, and caption both previously seen and unseen objects in a video, addressing limitations of existing open-world video instance segmentation methods. |
Open-world video instance segmentation is crucial for applications like autonomous systems and AR/VR, but existing methods have limitations in handling unseen objects and generating rich descriptions. |
Introduces open-world object queries to discover unseen objects, uses a masked attention augmented LLM for object-centric captioning, and implements an inter-query contrastive loss to ensure diverse object queries. |
Achieves state-of-the-art performance on uncommon categories in the BURST dataset for open-world video instance segmentation.
Shows significant improvement in captioning accuracy for detected objects on the VidSTG dataset for dense video object captioning.
Performs competitively on closed-world video instance segmentation on the OVIS dataset. |
Object detection and caption generation struggle with small objects or objects under prolonged occlusion.
Future work includes exploring stronger object discovery strategies, improved caption generators, and integrating more robust object trackers. |
open-world video instance segmentation, object-centric captioning, open-world object queries, masked attention, contrastive loss |
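The OW-VISCap entry above describes the inter-query contrastive loss only at a high level (it should "ensure that the object queries differ from one another"). As a minimal sketch, assuming a hinge penalty on the pairwise cosine similarity of query embeddings, the idea could look like the following. The function names, the hinge form, and the `margin` parameter are illustrative assumptions, not the paper's formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def inter_query_repulsion(queries, margin=0.0):
    """Mean hinge penalty on similarity between distinct queries.

    Hypothetical form: max(0, cos(q_i, q_j) - margin), averaged over i < j.
    """
    n = len(queries)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, cosine(queries[i], queries[j]) - margin)
            pairs += 1
    return total / pairs

# Toy check: duplicated queries are penalized, orthogonal ones are not.
identical = [[1.0, 0.0], [1.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
loss_identical = inter_query_repulsion(identical)
loss_orthogonal = inter_query_repulsion(orthogonal)
```

Minimizing a term of this kind pushes queries apart in embedding space, which is one plausible way to discourage the highly overlapping predictions mentioned in the abstract.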
2404.03654
Report |
RaFE: Generative Radiance Fields Restoration |
Zhongkai Wu, Ziyu Wan, Jing Zhang, Jing Liao, Dong Xu |
NeRF (Neural Radiance Fields) has demonstrated tremendous potential in novel
view synthesis and 3D reconstruction, but its performance is sensitive to input
image quality, and it struggles to achieve high-fidelity rendering when provided
with low-quality sparse input viewpoints. Previous methods for NeRF restoration
are tailored to a specific degradation type, ignoring the generality of
restoration. To overcome this limitation, we propose a generic radiance fields
restoration pipeline, named RaFE, which applies to various types of
degradations, such as low resolution, blurriness, noise, compression artifacts,
or their combinations. Our approach leverages the success of off-the-shelf 2D
restoration methods to recover the multi-view images individually. Instead of
reconstructing a blurred NeRF by averaging inconsistencies, we introduce a
novel approach using Generative Adversarial Networks (GANs) for NeRF generation
to better accommodate the geometric and appearance inconsistencies present in
the multi-view images. Specifically, we adopt a two-level tri-plane
architecture, where the coarse level remains fixed to represent the low-quality
NeRF, and a fine-level residual tri-plane to be added to the coarse level is
modeled as a distribution with GAN to capture potential variations in
restoration. We validate RaFE on both synthetic and real cases for various
restoration tasks, demonstrating superior performance in both quantitative and
qualitative evaluations, surpassing other 3D restoration methods specific to a
single task. Please see our project website
https://zkaiwu.github.io/RaFE-Project/. |
This supplementary material provides additional training details and experimental results for the paper 'RaFE: Generative Radiance Fields Restoration'. |
The main paper proposes a novel approach for restoring degraded images within a generative radiance field framework. This supplementary material aims to enhance the understanding and validity of the proposed method. |
The supplementary material details the training process for both the overall pipeline and the pre-trained coarse NeRF. It also presents ablation studies on the impact of the residual coarse NeRF, view direction conditioning, and patch sampling strategy. Additionally, the material analyzes the method's performance on NeRF-like degradation and elaborates on the calculation of the diversity score. |
Residual coarse NeRF aids in better geometry awareness and detailed rendering.
View direction conditioning contributes to realistic reflections on non-Lambertian surfaces.
The proposed Beta-based patch sampling strategy leads to more stable training and improved rendering quality compared to uniform sampling. |
The paper doesn't discuss the computational cost of the proposed method.
The paper mainly focuses on visual quality; a quantitative analysis across different degradation levels could be beneficial.
Exploring the generalization of RaFE to more complex real-world scenarios with severe degradations presents an exciting avenue for future work. |
generative radiance fields, image restoration, nerf, deep learning, computer vision |
2404.03653
Report |
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching |
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li |
Diffusion models have demonstrated great success in the field of
text-to-image generation. However, alleviating the misalignment between the
text prompts and images is still challenging. The root reason behind the
misalignment has not been extensively investigated. We observe that the
misalignment is caused by inadequate token attention activation. We further
attribute this phenomenon to the diffusion model's insufficient condition
utilization, which is caused by its training paradigm. To address the issue, we
propose CoMat, an end-to-end diffusion model fine-tuning strategy with an
image-to-text concept matching mechanism. We leverage an image captioning model
to measure image-to-text alignment and guide the diffusion model to revisit
ignored tokens. A novel attribute concentration module is also proposed to
address the attribute binding problem. Without any image or human preference
data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL.
Extensive experiments show that CoMat-SDXL significantly outperforms the
baseline model SDXL in two text-to-image alignment benchmarks and achieves
state-of-the-art performance. |
CoMat, an end-to-end fine-tuning strategy for text-to-image diffusion models, improves text-image alignment by incorporating an image-to-text concept matching mechanism. |
Existing diffusion models struggle to fully utilize text prompts, leading to misalignment between generated images and complex prompts. |
CoMat leverages a pre-trained image captioning model to guide the diffusion model during training. It identifies missing concepts and encourages the model to revisit ignored text tokens, improving alignment. It also includes an attribute concentration module to enhance attribute binding and a fidelity preservation component to maintain image quality. |
CoMat-SDXL significantly outperforms the SDXL baseline and commercial models in text-image alignment benchmarks.
Both concept matching and attribute concentration modules contribute to performance gains.
Using a pre-trained UNet as a discriminator effectively preserves generation fidelity during fine-tuning. |
Effectively incorporating Multimodal Large Language Models (MLLMs) for finer-grained alignment and fidelity remains under-explored.
Adapting CoMat to the 3D domain for improved text-to-3D generation alignment is a potential future direction. |
text-to-image generation, diffusion model, text-image alignment, concept matching, attribute binding |
2404.03652
Report |
The More You See in 2D, the More You Perceive in 3D |
Xinyang Han, Zelin Gao, Angjoo Kanazawa, Shubham Goel, Yossi Gandelsman |
Humans can infer 3D structure from 2D images of an object based on past
experience and improve their 3D understanding as they see more images. Inspired
by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel
view synthesis from an arbitrary number of unposed images. Given a few unposed
images of an object, we adapt a pre-trained view-conditioned diffusion model
together with the camera poses of the images via test-time fine-tuning. The
adapted diffusion model and the obtained camera poses are then utilized as
instance-specific priors for 3D reconstruction and novel view synthesis. We
show that as the number of input images increases, the performance of our
approach improves, bridging the gap between optimization-based prior-less 3D
reconstruction methods and single-image-to-3D diffusion-based methods. We
demonstrate our system on real images as well as standard synthetic benchmarks.
Our ablation studies confirm that this adaptation behavior is key for more
accurate 3D understanding. |
SAP3D: a system for 3D object reconstruction and novel view synthesis from an arbitrary number of unposed images, improving its performance as the number of input images increases. |
Existing methods struggle to reconstruct accurate 3D from a few unposed images and cannot leverage additional views to improve. This system aims to bridge the gap between single-view and multi-view reconstruction by effectively utilizing any number of input images. |
The system uses a pre-trained view-conditioned diffusion model and camera pose estimator. It first estimates coarse camera poses, then jointly fine-tunes the diffusion model on input images and refines camera poses. Finally, it performs 3D reconstruction via a NeRF and novel view synthesis by sampling the adapted diffusion model. |
3D reconstruction quality (geometry and appearance) improves as the number of input views grows.
Novel view synthesis accuracy increases with more input images, demonstrating improved consistency and detail.
Test-time adaptation and using a large-scale dataset for camera pose estimation are crucial for performance. |
Camera pose parametrization is limited by the pre-trained diffusion model.
The system relies on an optimization stage when multiple input images are provided, hindering real-time applicability. |
3d reconstruction, novel view synthesis, test-time adaptation, diffusion models, camera pose estimation |
2404.03650
Report |
OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views |
Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari |
Large visual-language models (VLMs), like CLIP, enable open-set image
segmentation to segment arbitrary concepts from an image in a zero-shot manner.
This goes beyond the traditional closed-set assumption, i.e., where models can
only segment classes from a pre-defined training set. More recently, first
works on open-set segmentation in 3D scenes have appeared in the literature.
These methods are heavily influenced by closed-set 3D convolutional approaches
that process point clouds or polygon meshes. However, these 3D scene
representations do not align well with the image-based nature of the
visual-language models. Indeed, point clouds and 3D meshes typically have a
lower resolution than images and the reconstructed 3D scene geometry might not
project well to the underlying 2D image sequences used to compute pixel-aligned
CLIP features. To address these challenges, we propose OpenNeRF which naturally
operates on posed images and directly encodes the VLM features within the NeRF.
This is similar in spirit to LERF, however our work shows that using pixel-wise
VLM features (instead of global CLIP features) results in an overall less
complex architecture without the need for additional DINO regularization. Our
OpenNeRF further leverages NeRF's ability to render novel views and extract
open-set VLM features from areas that are not well observed in the initial
posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF
outperforms recent open-vocabulary methods such as LERF and OpenScene by at
least +4.9 mIoU. |
This paper presents OpenNeRF, a novel neural radiance field (NeRF) based approach for open-set 3D scene segmentation that distills pixel-aligned CLIP features into a NeRF and leverages the NeRF's view synthesis capabilities to extract additional visual-language features from novel views. |
Open-set 3D scene segmentation is crucial for robots interacting with unseen environments or AR/VR applications where training labels are scarce, as it allows segmentation of arbitrary concepts beyond pre-defined classes. |
OpenNeRF utilizes a NeRF architecture to encode open-set features, trained on posed RGB images and supervised by pre-computed 2D open-set feature maps. It estimates confidence in existing features and leverages NeRF's view synthesis to generate novel views of low-confidence regions, extracting additional features. |
OpenNeRF achieves state-of-the-art performance on open-vocabulary 3D segmentation, outperforming mesh-based OpenScene and NeRF-based LERF by +4.5 mIoU on the Replica dataset.
The study shows NeRF-based representations are better at detecting small, long-tail objects compared to mesh-based methods.
Incorporating pixel-aligned CLIP features from novel views significantly improves segmentation performance. |
Many long-tail classes remain undetected, highlighting the difficulty of open-scene segmentation, especially for less frequent categories.
Future work could explore alternative confidence estimation techniques and novel view selection strategies for further improvement. |
open-set 3d scene segmentation, neural radiance fields (nerf), vision-language models (vlms), clip, novel view synthesis |
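The OpenNeRF entry above relies on comparing per-point visual-language features with text embeddings of arbitrary class names. A minimal sketch of that zero-shot assignment step, assuming each 3D point already carries a feature vector in the same space as the text embeddings, is shown below. The toy vectors and the function name `assign_labels` are hypothetical; in practice the features would come from the trained field and a CLIP-style text encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + 1e-12) for x in v]

def assign_labels(point_features, text_embeddings):
    """Return, for each point feature, the index of the most similar text embedding."""
    texts = [normalize(t) for t in text_embeddings]
    labels = []
    for feature in point_features:
        f = normalize(feature)
        scores = [sum(a * b for a, b in zip(f, t)) for t in texts]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Toy 3-dimensional embeddings standing in for two class prompts.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
point_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [2.0, 0.5, 0.0]]
labels = assign_labels(point_features, text_embeddings)
```

Because the class list is just a set of text embeddings, it can be changed at query time without retraining, which is what makes the segmentation open-set.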
2404.03645
Report |
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation |
Shuting He, Henghui Ding |
Referring video segmentation relies on natural language expressions to
identify and segment objects, often emphasizing motion clues. Previous works
treat a sentence as a whole and directly perform identification at the
video-level, mixing up static image-level cues with temporal motion cues.
However, image-level features cannot well comprehend motion cues in sentences,
and static cues are not crucial for temporal perception. In fact, static cues
can sometimes interfere with temporal perception by overshadowing motion cues.
In this work, we propose to decouple video-level referring expression
understanding into static and motion perception, with a specific emphasis on
enhancing temporal comprehension. Firstly, we introduce an
expression-decoupling module to make static cues and motion cues perform their
distinct role, alleviating the issue of sentence embeddings overlooking motion
cues. Secondly, we propose a hierarchical motion perception module to capture
temporal information effectively across varying timescales. Furthermore, we
employ contrastive learning to distinguish the motions of visually similar
objects. These contributions yield state-of-the-art performance across five
datasets, including a remarkable 9.2% J&F improvement
on the challenging MeViS dataset. Code is available at
https://github.com/heshuting555/DsHmp. |
This paper presents a novel approach for referring video segmentation that decouples static and motion perception to enhance the understanding of temporal information. |
Existing methods struggle to accurately segment objects in videos based on natural language descriptions, especially when motion cues are crucial for identification. |
The approach decouples the input sentence into static and motion cues. It uses a hierarchical motion perception module to capture short-term and long-term motion patterns. It also employs contrastive learning to distinguish visually similar objects based on their motion. |
Achieves state-of-the-art performance on five referring video segmentation datasets (MeViS, Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences).
Shows significant improvement on the challenging MeViS dataset, outperforming previous methods by a large margin (9.2% in J&F).
Demonstrates the effectiveness of decoupling static and motion cues, hierarchical motion perception, and contrastive learning through ablation studies. |
The performance gain on datasets with less emphasis on motion is relatively small.
Future work could explore incorporating more sophisticated language models for richer representation of motion cues. |
referring video segmentation, motion understanding, contrastive learning, hierarchical motion perception, computer vision |
2404.03635
Report |
WorDepth: Variational Language Prior for Monocular Depth Estimation |
Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong |
Three-dimensional (3D) reconstruction from a single image is an ill-posed
problem with inherent ambiguities, e.g., scale. Predicting a 3D scene from text
description(s) is similarly ill-posed, e.g., spatial arrangements of objects
described. We investigate the question of whether two inherently ambiguous
modalities can be used in conjunction to produce metric-scaled reconstructions.
To test this, we focus on monocular depth estimation, the problem of predicting
a dense depth map from a single image, but with an additional text caption
describing the scene. To this end, we begin by encoding the text caption as a
mean and standard deviation; using a variational framework, we learn the
distribution of the plausible metric reconstructions of 3D scenes corresponding
to the text captions as a prior. To "select" a specific reconstruction or depth
map, we encode the given image through a conditional sampler that samples from
the latent space of the variational text encoder, which is then decoded to the
output depth map. Our approach is trained alternatingly between the text and
image branches: in one optimization step, we predict the mean and standard
deviation from the text description and sample from a standard Gaussian, and in
the other, we sample using a (image) conditional sampler. Once trained, we
directly predict depth from the encoded text using the conditional sampler. We
demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where
we show that language can consistently improve performance in both. |
WorDepth is a novel variational framework for monocular depth estimation that leverages language as a prior to improve metric-scaled depth prediction from single images. |
Monocular depth estimation suffers from inherent scale ambiguity. Language descriptions, while also ambiguous in spatial layout, provide strong priors on object scales, offering complementary information to resolve this ambiguity. |
The method uses a text-VAE to learn a latent distribution of plausible depth maps from text captions. An image-based conditional sampler then samples from this distribution, conditioned on the input image, to predict the most probable depth map. |
WorDepth achieves state-of-the-art results on NYU Depth V2 and KITTI datasets.
The method shows significant improvement in threshold accuracy (δ<1.25), indicating better scale estimation.
Qualitative results demonstrate improved depth prediction accuracy for objects mentioned in text descriptions. |
The method's performance relies on the accuracy and specificity of text captions, making it susceptible to inaccuracies from the image captioner.
Vague or incorrect captions can lead to suboptimal depth predictions. |
monocular depth estimation, language prior, variational autoencoder (vae), conditional sampler, scale ambiguity |
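The alternating training described in the WorDepth entry above can be illustrated with the standard reparameterization step. This is a minimal sketch, assuming the text encoder outputs a per-dimension mean and standard deviation and that the image-conditioned sampler supplies the noise term in place of a Gaussian draw. The function names and toy values are hypothetical, and the depth decoder is omitted.

```python
import random

def reparameterize(mean, std, eps):
    """Latent code z = mean + std * eps, elementwise."""
    return [m + s * e for m, s, e in zip(mean, std, eps)]

def sample_text_branch(mean, std, rng):
    """Text branch: the noise is drawn from a standard Gaussian."""
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    return reparameterize(mean, std, eps)

def sample_image_branch(mean, std, image_eps):
    """Image branch: the noise comes from a (hypothetical) image-conditioned sampler."""
    return reparameterize(mean, std, image_eps)

rng = random.Random(0)
mean, std = [0.5, -1.0], [0.1, 2.0]  # toy stand-ins for the text encoder output
z_text = sample_text_branch(mean, std, rng)
z_image = sample_image_branch(mean, std, [0.0, 1.0])
```

In this reading, the text fixes the distribution of plausible reconstructions, and the image selects one point in it; both latent codes would then be passed to the same decoder.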
2404.03620
Report |
LCM-Lookahead for Encoder-based Text-to-Image Personalization |
Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or |
Recent advancements in diffusion models have introduced fast sampling methods
that can effectively produce high-quality images in just one or a few denoising
steps. Interestingly, when these are distilled from existing diffusion models,
they often maintain alignment with the original model, retaining similar
outputs for similar prompts and seeds. These properties present opportunities
to leverage fast sampling methods as a shortcut-mechanism, using them to create
a preview of denoised outputs through which we can backpropagate image-space
losses. In this work, we explore the potential of using such
shortcut-mechanisms to guide the personalization of text-to-image models to
specific facial identities. We focus on encoder-based personalization
approaches, and demonstrate that by tuning them with a lookahead identity loss,
we can achieve higher identity fidelity, without sacrificing layout diversity
or prompt alignment. We further explore the use of attention sharing mechanisms
and consistent data generation for the task of personalization, and find that
encoder training can benefit from both. |
This paper introduces LCM-Lookahead, a novel mechanism using fast sampling models to apply image-space losses to diffusion models, leading to improved identity preservation in personalized text-to-image generation. |
Existing encoder-based text-to-image personalization methods struggle to balance prompt alignment and identity preservation, especially in challenging scenarios like stylization. |
The authors leverage a pretrained LCM model as a shortcut to create high-quality previews of denoised images. This allows them to apply an identity loss during encoder training, improving identity fidelity without sacrificing layout diversity or prompt alignment. Additionally, they propose using a consistent, synthetic dataset generated with SDXL-Turbo and integrating an extended self-attention mechanism into the encoder. |
The proposed method achieves superior identity preservation and prompt alignment compared to the baseline IP-Adapter.
Using an LCM-based shortcut for identity loss leads to noticeable improvements over direct approximations.
A novel consistent dataset generated with SDXL-Turbo significantly improves prompt alignment. |
The model may still exhibit biases present in the backbone model or the diffusion model itself.
The performance of tuning-free encoders, while improved, still falls short of optimization-based methods, especially with out-of-domain inputs. |
text-to-image personalization, face generation, diffusion models, lcm, consistency models |
2404.03613
Report |
Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting |
Jeongmin Bae, Seoha Kim, Youngsik Yun, Hahyun Lee, Gun Bang, Youngjung Uh |
As 3D Gaussian Splatting (3DGS) provides fast and high-quality novel view
synthesis, it is a natural extension to deform a canonical 3DGS to multiple
frames. However, previous works fail to accurately reconstruct dynamic scenes;
in particular, 1) static parts move along with nearby dynamic parts, and 2) some
dynamic areas are blurry. We attribute the failure to the wrong design of the
deformation field, which is built as a coordinate-based function. This approach
is problematic because 3DGS is a mixture of multiple fields centered at the
Gaussians, not just a single coordinate-based framework. To resolve this
problem, we define the deformation as a function of per-Gaussian embeddings and
temporal embeddings. Moreover, we decompose deformations as coarse and fine
deformations to model slow and fast movements, respectively. Also, we introduce
an efficient training strategy for faster convergence and higher quality.
Project page: https://jeongminb.github.io/e-d3dgs/ |
This paper introduces a novel per-Gaussian embedding-based deformation method for deformable 3D Gaussian Splatting, improving dynamic scene reconstruction. |
Existing field-based deformable 3D Gaussian Splatting methods suffer from entanglement of Gaussian deformations, leading to inaccurate reconstructions of dynamic scenes. |
The method defines deformation as a function of per-Gaussian and temporal embeddings, uses coarse-fine deformation to model different motion scales, and employs an efficient training strategy for faster convergence and better quality. |
The approach improves deformation quality and captures fine details in dynamic regions.
It outperforms baselines on datasets like Neural 3D Video and Technicolor Light Field.
The method demonstrates superior reconstruction quality even under challenging camera settings, as shown with the HyperNeRF dataset. |
The method may exhibit blurriness in areas with significant motion between frames.
Rendering speed can be slower compared to existing Gaussian Splatting methods. |
3d gaussian splatting, dynamic scene reconstruction, novel view synthesis, per-gaussian deformation, deformable neural radiance fields |
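The per-Gaussian deformation entry above defines the deformation as a function of per-Gaussian and temporal embeddings, split into coarse and fine parts. A minimal sketch of that structure follows, assuming each part is a small network over the concatenated embeddings. Here a single linear map stands in for each network, and the use of separate coarse and fine temporal embeddings, the weight matrices, and all names are illustrative assumptions.

```python
def linear(weights, x):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def deform(gaussian_emb, coarse_time_emb, fine_time_emb, w_coarse, w_fine):
    """Offset = coarse(z_g, z_t_coarse) + fine(z_g, z_t_fine).

    Inputs are embeddings only; no 3D coordinate is fed to either branch.
    """
    coarse = linear(w_coarse, gaussian_emb + coarse_time_emb)  # list concatenation
    fine = linear(w_fine, gaussian_emb + fine_time_emb)
    return [c + f for c, f in zip(coarse, fine)]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
small = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]

# Two Gaussians at the same time step but with different embeddings.
offset_a = deform([1.0, 2.0], [3.0], [10.0], identity, small)
offset_b = deform([2.0, 1.0], [3.0], [10.0], identity, small)
```

Since the offset depends on the Gaussian's own embedding rather than its position, two spatially adjacent Gaussians can move differently, which is the property the abstract contrasts with coordinate-based deformation fields.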
2404.03575
Report |
DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling |
Haoran Li, Haolin Shi, Wenli Zhang, Wenjun Wu, Yong Liao, Lin Wang, Lik-hang Lee, Pengyuan Zhou |
Text-to-3D scene generation holds immense potential for the gaming, film, and
architecture sectors. Despite significant progress, existing methods struggle
with maintaining high quality, consistency, and editing flexibility. In this
paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene
generation framework, to tackle the aforementioned three challenges mainly via
two strategies. First, DreamScene employs Formation Pattern Sampling (FPS), a
multi-timestep sampling strategy guided by the formation patterns of 3D
objects, to form fast, semantically rich, and high-quality representations. FPS
uses 3D Gaussian filtering for optimization stability, and leverages
reconstruction techniques to generate plausible textures. Second, DreamScene
employs a progressive three-stage camera sampling strategy, specifically
designed for both indoor and outdoor settings, to effectively ensure
object-environment integration and scene-wide 3D consistency. Last, DreamScene
enhances scene editing flexibility by integrating objects and environments,
enabling targeted adjustments. Extensive experiments validate DreamScene's
superiority over current state-of-the-art techniques, heralding its
wide-ranging potential for diverse applications. Code and demos will be
released at https://dreamscene-project.github.io . |
DreamScene, a novel text-to-3D scene generation framework, leverages 3D Gaussians and Formation Pattern Sampling (FPS) to achieve high quality, consistency, and editing flexibility. |
Existing methods struggle with maintaining quality and consistency across viewpoints and lack flexibility in editing generated scenes. |
DreamScene uses FPS, a multi-timestep sampling method with 3D Gaussian filtering, to create semantically rich representations and plausible textures. A three-stage camera sampling strategy ensures scene-wide consistency. Objects and environments are integrated for flexible editing. |
DreamScene generates high-quality 3D scenes and objects comparable to or exceeding state-of-the-art methods.
It achieves superior scene-wide 3D consistency compared to methods like Text2NeRF and Text2Room.
DreamScene allows for flexible editing of object placement and style within the generated scene. |
Outdoor scene realism is currently limited compared to some inpainting-based methods.
Future work will explore depth supervision to enhance realism in outdoor scene generation. |
text-to-3d, scene generation, 3d gaussian, formation pattern sampling, scene editing |
2404.03566
Report |
PointInfinity: Resolution-Invariant Point Diffusion Models |
Zixuan Huang, Justin Johnson, Shoubhik Debnath, James M. Rehg, Chao-Yuan Wu |
We present PointInfinity, an efficient family of point cloud diffusion
models. Our core idea is to use a transformer-based architecture with a
fixed-size, resolution-invariant latent representation. This enables efficient
training with low-resolution point clouds, while allowing high-resolution point
clouds to be generated during inference. More importantly, we show that scaling
the test-time resolution beyond the training resolution improves the fidelity
of generated point clouds and surfaces. We analyze this phenomenon and draw a
link to classifier-free guidance commonly used in diffusion models,
demonstrating that both allow trading off fidelity and variability during
inference. Experiments on CO3D show that PointInfinity can efficiently generate
high-resolution point clouds (up to 131k points, 31 times more than Point-E)
with state-of-the-art quality. |
This paper introduces PointInfinity, a resolution-invariant point cloud diffusion model that can be trained on low-resolution point clouds but generate high-resolution point clouds during inference. |
Current 3D point cloud diffusion models struggle to achieve the realism and diversity seen in 2D image generation due to the computational challenges posed by the large size of point cloud data. |
PointInfinity uses a two-stream transformer architecture with a fixed-size latent representation for modeling the underlying 3D shape and a variable-sized data representation for handling point clouds of different resolutions. This allows for efficient training on low-resolution data while enabling high-resolution generation. |
Scaling the test-time resolution beyond the training resolution improves the fidelity of the generated point clouds.
PointInfinity outperforms previous state-of-the-art methods in terms of surface generation fidelity and texture quality.
PointInfinity demonstrates significant computational efficiency, scaling linearly with input resolution during inference compared to the quadratic scaling of previous methods. |
The paper focuses on generating point clouds up to a specific resolution and does not explore generation at arbitrarily high resolutions.
Future work could explore incorporating other conditioning signals, such as text, to enable more controllable point cloud generation. |
point cloud generation, diffusion models, resolution invariance, transformer, 3d deep learning |
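The PointInfinity entry above attributes resolution invariance to a fixed-size latent that interacts with a variable-sized set of point tokens. A minimal sketch of one such interaction, assuming the latent reads from the points through cross-attention (an assumption about the mechanism, with keys doubling as values for brevity), shows that the output size depends only on the latent size and that the work grows linearly with the number of points.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, keys_values):
    """Each query attends over all tokens; cost is len(queries) * len(keys_values)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(d)])
    return out

latent = [[1.0, 0.0], [0.0, 1.0]]               # fixed-size latent (M = 2)
low_res = [[0.1 * i, 1.0] for i in range(4)]    # N = 4 point tokens
high_res = [[0.1 * i, 1.0] for i in range(64)]  # N = 64 point tokens
read_low = cross_attend(latent, low_res)
read_high = cross_attend(latent, high_res)
```

Both reads return a 2 x 2 latent update, so the rest of the network can be trained at low resolution and evaluated with more points at test time.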
2404.03531
Report |
COMO: Compact Mapping and Odometry |
Eric Dexheimer, Andrew J. Davison |
We present COMO, a real-time monocular mapping and odometry system that
encodes dense geometry via a compact set of 3D anchor points. Decoding anchor
point projections into dense geometry via per-keyframe depth covariance
functions guarantees that depth maps are joined together at visible anchor
points. The representation enables joint optimization of camera poses and dense
geometry, intrinsic 3D consistency, and efficient second-order inference. To
maintain a compact yet expressive map, we introduce a frontend that leverages
the covariance function for tracking and initializing potentially visually
indistinct 3D points across frames. Altogether, we introduce a real-time system
capable of estimating accurate poses and consistent geometry. |
COMO is a real-time monocular SLAM system that encodes dense geometry via a compact set of 3D anchor points and uses per-keyframe depth covariance functions for 3D consistency. |
Provides accurate and consistent poses and dense geometry for robotics and AR using the simplicity and efficiency of monocular cameras. |
A compact set of 3D points is projected into keyframes. Depth covariance functions generate dense depth maps conditioned on the sparse 3D points. Poses and 3D points are jointly optimized by minimizing photometric error. |
Outperforms state-of-the-art dense SLAM methods on TUM RGBD in terms of trajectory error.
Achieves the lowest ATE and highest AUC on ScanNet trajectory estimation among both sparse and dense methods.
Produces the most accurate depth maps on Replica and ScanNet while also demonstrating strong depth consistency across neighboring frames. |
Reliance on photometric error can limit accuracy in scenes with low texture, specularities, and dynamic lighting.
The current depth covariance model was trained on ScanNet, and may not generalize to different environments. |
slam, dense geometry, depth covariance, monocular vision, 3d reconstruction |
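The COMO entry above states that dense depth is decoded from sparse anchor points through a depth covariance function, so that depth maps agree at the anchors. A minimal sketch of that conditioning step, assuming a zero-mean Gaussian process over a 1D row of pixels with a fixed RBF kernel, is given below. The paper's covariance function is learned per keyframe; the RBF kernel, the 1D setting, and all names here are stand-in assumptions.

```python
import math

def rbf(x, y, length=1.0):
    """Stand-in covariance between two pixel coordinates."""
    return math.exp(-((x - y) ** 2) / (2.0 * length ** 2))

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def conditional_depth(anchor_px, anchor_depth, query_px, jitter=1e-9):
    """Conditional mean K_qa (K_aa + jitter * I)^-1 d_a at the query pixels."""
    k_aa = [[rbf(p, q) + (jitter if i == j else 0.0) for j, q in enumerate(anchor_px)]
            for i, p in enumerate(anchor_px)]
    alpha = solve(k_aa, anchor_depth)
    return [sum(rbf(x, p) * a for p, a in zip(anchor_px, alpha)) for x in query_px]

anchor_px = [0.0, 2.0, 5.0]
anchor_depth = [1.0, 2.0, 1.5]
dense = conditional_depth(anchor_px, anchor_depth, [0.0, 1.0, 2.0, 5.0])
```

With negligible observation noise the conditional mean passes through the anchor depths, which mirrors the abstract's statement that depth maps are joined together at visible anchor points.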
2404.03477
Report |
Towards Automated Movie Trailer Generation |
Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem |
Movie trailers are an essential tool for promoting films and attracting
audiences. However, the process of creating trailers can be time-consuming and
expensive. To streamline this process, we propose an automatic trailer
generation framework that generates plausible trailers from a full movie by
automating shot selection and composition. Our approach draws inspiration from
machine translation techniques and models the movies and trailers as sequences
of shots, thus formulating the trailer generation problem as a
sequence-to-sequence task. We introduce Trailer Generation Transformer (TGT), a
deep-learning framework utilizing an encoder-decoder architecture. TGT's movie
encoder is tasked with contextualizing each movie shot representation via
self-attention, while the autoregressive trailer decoder predicts the feature
representation of the next trailer shot, accounting for the relevance of shots'
temporal order in trailers. Our TGT significantly outperforms previous methods
on a comprehensive suite of metrics. |
This paper proposes Trailer Generation Transformer (TGT), a novel deep-learning framework for automatic movie trailer generation by formulating it as a sequence-to-sequence learning problem, addressing limitations of prior shot classification or ranking based approaches. |
Movie trailers are crucial for marketing but costly and time-consuming to create manually. Automating trailer generation can significantly streamline this process, benefiting both studios and audiences. |
TGT utilizes an encoder-decoder architecture. The encoder contextualizes movie shots using a trailerness encoder and a context encoder. The decoder autoregressively predicts trailer shot features, learning shot composition. A greedy algorithm then selects optimal shots from the movie based on predicted features. |
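The final greedy selection step described above can be sketched as nearest-neighbour matching between predicted trailer-shot features and movie-shot features. This is only an illustration under stated assumptions: the function name `greedy_select_shots`, the use of cosine similarity, and the rule that each movie shot is used at most once are hypothetical choices, not details confirmed by the summary. It also assumes the trailer has no more shots than the movie.

```python
import numpy as np

def greedy_select_shots(movie_feats, predicted_feats):
    """For each predicted trailer-shot feature, pick the most similar unused movie shot.

    movie_feats:     (num_movie_shots, dim) array of movie-shot features.
    predicted_feats: (num_trailer_shots, dim) array predicted by a decoder.
    Returns a list of movie-shot indices in trailer order.
    """
    # Normalise rows so that dot products become cosine similarities (assumed metric).
    m = movie_feats / np.linalg.norm(movie_feats, axis=1, keepdims=True)
    p = predicted_feats / np.linalg.norm(predicted_feats, axis=1, keepdims=True)
    sim = p @ m.T  # shape: (num_trailer_shots, num_movie_shots)

    used = np.zeros(movie_feats.shape[0], dtype=bool)
    selected = []
    for row in sim:
        # Mask shots that were already chosen, then take the best remaining one.
        masked = np.where(used, -np.inf, row)
        idx = int(np.argmax(masked))
        used[idx] = True
        selected.append(idx)
    return selected

# Toy example: three movie shots, two predicted trailer shots.
movie = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
predicted = np.array([[0.9, 0.1], [0.1, 0.9]])
order = greedy_select_shots(movie, predicted)
```

Because the selection is made one trailer position at a time, the temporal order of the predicted features carries over directly to the order of the selected shots.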
TGT significantly outperforms baselines on two new benchmarks built upon MAD and MovieNet datasets, using metrics like precision, recall, F1-score, Levenshtein distance, and sequence length difference.
Ablation studies demonstrate the importance of both trailerness and context encoders, as well as the chosen loss functions (reconstruction, trailerness, and KL divergence).
Adding text-based movie plot summaries as contextual input further improves the trailer generation performance. |
The current TGT model does not incorporate dialogue and sound modeling, which are important for fine-grained trailer editing.
Future work can explore incorporating these elements and further enhance TGT's capabilities. |
trailer generation, sequence-to-sequence learning, transformer, video understanding, deep learning |
2404.03421
Report |
Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View |
Andreea Dogaru, Mert Özer, Bernhard Egger |
Single-view 3D reconstruction is currently approached from two dominant
perspectives: reconstruction of scenes with limited diversity using 3D data
supervision or reconstruction of diverse singular objects using large image
priors. However, real-world scenarios are far more complex and exceed the
capabilities of these methods. We therefore propose a hybrid method following a
divide-and-conquer strategy. We first process the scene holistically,
extracting depth and semantic information, and then leverage a single-shot
object-level method for the detailed reconstruction of individual components.
By following a compositional processing approach, the overall framework
achieves full reconstruction of complex 3D scenes from a single image. We
purposely design our pipeline to be highly modular by carefully integrating
specific procedures for each processing step, without requiring an end-to-end
training of the whole system. This enables the pipeline to naturally improve as
future methods can replace the individual modules. We demonstrate the
reconstruction performance of our approach on both synthetic and real-world
scenes, comparing favorably against prior works. Project page:
https://andreeadogaru.github.io/Gen3DSR. |
This paper proposes a novel modular framework for generalizable 3D scene reconstruction from a single RGB image, leveraging a divide-and-conquer strategy with off-the-shelf models for depth estimation, entity segmentation, and object reconstruction. |
Reconstructing complex real-world scenes from a single view is a challenging task with significant implications for various applications. Existing methods are often limited in scope or generalization ability. This work addresses these limitations by proposing a more versatile and robust approach. |
The proposed method analyzes the input scene holistically, extracting depth, camera parameters, and segmenting entities. Foreground objects are processed individually, undergoing amodal completion and single-view reconstruction before being integrated into the final scene using depth guides. The background is modeled separately. |
The method achieves state-of-the-art results on benchmark datasets, outperforming existing approaches in terms of accuracy and visual quality.
The modular design allows for incremental improvements as individual components can be easily replaced with more advanced models in the future.
The framework exhibits strong generalization ability, effectively reconstructing scenes with diverse objects and layouts, even on real-world images. |
The method relies on accurate depth estimation, and errors in depth can propagate to the final reconstruction.
The current implementation relies on a separate object reconstruction model that may not generalize to unseen object categories. |
3d scene reconstruction, single-view, compositional, amodal completion, depth estimation |
2404.03413
Report |
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens |
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny |
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM)
designed specifically for video understanding. The model is capable of
processing both temporal visual and textual data, making it adept at
understanding the complexities of videos. Building upon the success of
MiniGPT-v2, which excelled in translating visual features into the LLM space
for single images and achieved impressive results on various image-text
benchmarks, this paper extends the model's capabilities to process a sequence
of frames, enabling it to comprehend videos. MiniGPT4-Video not only
considers visual content but also incorporates textual conversations, allowing
the model to effectively answer queries involving both visual and text
components. The proposed model outperforms existing state-of-the-art methods,
registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF,
and TVQA benchmarks respectively. Our models and code have been made publicly
available at https://vision-cair.github.io/MiniGPT4-video/ |
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding, capable of processing both temporal visual and textual data to understand and answer queries about videos. |
Existing LLMs struggle to effectively capture temporal information from videos, limiting their ability to understand dynamic visual content. This work aims to adapt LLMs for comprehending the complexities of video sequences, combining visual and textual information for enhanced understanding. |
The methodology involves subsampling video frames, aligning them with textual descriptions using a pretrained EVA-CLIP model, and mapping them into the LLM space. By concatenating visual and text tokens for each frame, the LLM gains a comprehensive understanding of the video's content. The model is trained in three stages: image-text pair pretraining, video-text pair pretraining, and video question answering instruction finetuning. |
MiniGPT4-Video outperforms previous state-of-the-art methods on the Video-ChatGPT benchmark across all five evaluation dimensions when subtitles are provided.
The model achieves significant improvements in zero-shot evaluations for open-ended questions on MSVD, MSRVTT, TGIF, and ActivityNet datasets, demonstrating its ability to effectively answer questions based on visual content.
Integrating subtitle information significantly boosts performance on the TVQA benchmark for multiple-choice questions, highlighting the model's capacity to leverage both visual and textual cues for enhanced video understanding. |
MiniGPT4-Video currently faces limitations in processing long videos due to the LLM's context window constraint.
Future work will focus on extending the model to handle longer video sequences, addressing this limitation. |
large language models, video understanding, multimodal learning, video question answering, temporal information processing |
2404.03407
Report |
AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment |
Chunyi Li, Tengchuan Kou, Yixuan Gao, Yuqin Cao, Wei Sun, Zicheng Zhang, Yingjie Zhou, Zhichao Zhang, Weixia Zhang, Haoning Wu, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai |
With the rapid advancements in AI-Generated Content (AIGC), AI-Generated
Images (AIGIs) have been widely applied in entertainment, education, and social
media. However, due to the significant variance in quality among different
AIGIs, there is an urgent need for models that consistently match human
subjective ratings. To address this issue, we organized a challenge towards
AIGC quality assessment on NTIRE 2024 that extensively considers 15 popular
generative models, utilizing dynamic hyper-parameters (including
classifier-free guidance, iteration epochs, and output image resolution), and
gather subjective scores from 21 subjects that jointly consider perceptual
quality and text-to-image alignment. This approach
culminates in the creation of the largest fine-grained AIGI subjective quality
database to date with 20,000 AIGIs and 420,000 subjective ratings, known as
AIGIQA-20K. Furthermore, we conduct benchmark experiments on this database to
assess the correspondence between 16 mainstream AIGI quality models and human
perception. We anticipate that this large-scale quality database will inspire
robust quality indicators for AIGIs and propel the evolution of AIGC for
vision. The database is released on
https://www.modelscope.cn/datasets/lcysyzxdxc/AIGCQA-30K-Image. |
This paper introduces AIGIQA-20K, the largest fine-grained database for AI-Generated Image Quality Assessment. |
A standardized quality assessment metric for AI-generated images is crucial due to the increasing prevalence of AIGIs and the variability in their quality. |
The database was constructed by generating 20,000 AIGIs using 15 T2I models with varying hyperparameters (CFG, iterations, resolution). 21 subjects then rated the perceptual quality and text-to-image alignment of each image, resulting in 420,000 subjective scores. |
AIGI quality is significantly impacted by the T2I model, prompt, and hyperparameters.
Fine-tuning traditional IQA metrics on AIGIQA-20K considerably improves their performance, with some surpassing zero-shot alignment metrics.
Zero-shot quality assessment models for AIGIs still require further development, as indicated by the superior performance of fine-tuned models and the leading accuracy of qalign despite its disregard for text-image alignment. |
The study primarily focuses on image quality and does not encompass other AIGC modalities like video, text, or audio.
Future work could investigate the development of more robust zero-shot quality assessment models for AIGIs that can generalize across different generation models and hyperparameters without requiring fine-tuning. |
ai-generated images, image quality assessment, text-to-image synthesis, subjective quality evaluation, aigiqa-20k |
2404.03392
Report |
Two Tricks to Improve Unsupervised Segmentation Learning |
Alp Eren Sari, Francesco Locatello, Paolo Favaro |
We present two practical improvement techniques for unsupervised segmentation
learning. These techniques address limitations in the resolution and accuracy
of predicted segmentation maps of recent state-of-the-art methods. Firstly, we
leverage image post-processing techniques such as guided filtering to refine
the output masks, improving accuracy while avoiding substantial computational
costs. Secondly, we introduce a multi-scale consistency criterion, based on a
teacher-student training scheme. This criterion matches segmentation masks
predicted from regions of the input image extracted at different resolutions to
each other. Experimental results on several benchmarks used in unsupervised
segmentation learning demonstrate the effectiveness of our proposed techniques. |
This paper introduces two practical tricks to enhance the resolution and accuracy of predicted segmentation maps in unsupervised segmentation learning, specifically addressing limitations in recent state-of-the-art methods. |
Unsupervised segmentation learning has the potential to scale to very large datasets and multiple imaging modalities with limited human effort, but current methods either are limited in resolution or rely on complex training schemes. |
1. **Guided Filtering Post-Processing**: Refine output masks using guided filtering with input image luminance as guidance, improving accuracy without significant computational overhead.
2. **Multi-Scale Consistency Criterion**: Employ a teacher-student training scheme where the teacher network operates on zoomed-in image regions. The student network processes the whole image, and its predictions for corresponding regions are matched with the teacher's output, enhancing detail. |
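The first trick, guided filtering of a predicted mask, can be sketched with the classic gray-scale guided filter. This is a minimal NumPy version under stated assumptions: the guide is a single-channel luminance image in [0, 1], and the helper names `box_mean` and `guided_filter`, the `radius` and `eps` defaults, and the edge-padding choice are illustrative rather than the authors' settings.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, with edge padding, via an integral image."""
    k = 2 * r + 1
    h, w = x.shape
    padded = np.pad(x, r, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    total = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
             - integral[k:k + h, :w] + integral[:h, :w])
    return total / (k * k)

def guided_filter(guide, mask, radius=2, eps=1e-3):
    """Refine `mask` so that its edges follow the edges of `guide`."""
    mean_g = box_mean(guide, radius)
    mean_m = box_mean(mask, radius)
    var_g = box_mean(guide * guide, radius) - mean_g ** 2
    cov_gm = box_mean(guide * mask, radius) - mean_g * mean_m
    # Per-window linear model: mask ~ a * guide + b.
    a = cov_gm / (var_g + eps)
    b = mean_m - a * mean_g
    # Average the coefficients of all windows covering each pixel.
    return box_mean(a, radius) * guide + box_mean(b, radius)

# Toy example: a random luminance image and a coarse binary mask.
rng = np.random.default_rng(0)
luminance = rng.random((8, 8))
coarse_mask = (luminance > 0.5).astype(float)
refined = guided_filter(luminance, coarse_mask)
```

The filter only needs a handful of box averages, which is consistent with the summary's point that the refinement adds little computational overhead.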
Achieved state-of-the-art (SotA) results in unsupervised saliency segmentation on DUT-OMRON, DUTS-TE, and ECSSD datasets.
Introduced two novel and general techniques to enhance resolution of segmentation masks, demonstrating computational efficiency.
Showed consistent improvement in segmentation performance across different backbones and when combined with other recent methods. |
Challenges arise in scenarios with ambiguous saliency or multiple objects, particularly when the background and foreground share visual similarities.
Future work could explore extensions for multi-object segmentation and refine object selection mechanisms in complex scenes. |
unsupervised learning, segmentation, self-supervised learning, guided filtering, multi-scale consistency |
2404.03349
Report |
VF-NeRF: Viewshed Fields for Rigid NeRF Registration |
Leo Segre, Shai Avidan |
3D scene registration is a fundamental problem in computer vision that seeks
the best 6-DoF alignment between two scenes. This problem was extensively
investigated in the case of point clouds and meshes, but there has been
relatively limited work regarding Neural Radiance Fields (NeRF). In this paper,
we consider the problem of rigid registration between two NeRFs when the
position of the original cameras is not given. Our key novelty is the
introduction of Viewshed Fields (VF), an implicit function that determines, for
each 3D point, how likely it is to be viewed by the original cameras. We
demonstrate how VF can help in the various stages of NeRF registration, with an
extensive evaluation showing that VF-NeRF achieves SOTA results on various
datasets with different capturing approaches such as LLFF and Objaverse. |
This paper introduces Viewshed Fields (VF), a novel implicit function that identifies 3D points likely to be seen by the original cameras, for rigid registration of Neural Radiance Fields (NeRFs) without known camera positions. |
NeRF registration is crucial for various applications like 3D scene understanding and reconstruction, but existing methods struggle with finding good camera viewpoints for alignment. |
The method uses Normalizing Flows to learn a mapping between oriented points (location and viewing direction) and a latent Gaussian distribution during NeRF training. This enables sampling high-visibility points to generate novel views and guide registration optimization. |
VF-NeRF achieves state-of-the-art results on LLFF, casually captured scenes, and Objaverse datasets, outperforming point cloud and other NeRF registration methods.
VF-based initialization using point clouds or photometric scores significantly improves registration accuracy.
VF-NeRF demonstrates robustness to noise in oriented point positions, indicating its ability to generalize well for novel view generation. |
VF-NeRF's reliance on photometric loss can pose challenges in textureless or symmetric scenes.
Future work could explore the application of VF-NeRF to dynamic scenes and improve handling of partial overlaps. |
neural radiance fields, 3d registration, normalizing flows, novel view synthesis, viewshed fields |
2404.03242
Report |
Would Deep Generative Models Amplify Bias in Future Models? |
Tianwei Chen, Yusuke Hirota, Mayu Otani, Noa Garcia, Yuta Nakashima |
We investigate the impact of deep generative models on potential social
biases in upcoming computer vision models. As the internet witnesses an
increasing influx of AI-generated images, concerns arise regarding inherent
biases that may accompany them, potentially leading to the dissemination of
harmful content. This paper explores whether a detrimental feedback loop,
resulting in bias amplification, would occur if generated images were used as
the training data for future models. We conduct simulations by progressively
substituting original images in COCO and CC3M datasets with images generated
through Stable Diffusion. The modified datasets are used to train OpenCLIP and
image captioning models, which we evaluate in terms of quality and bias.
Contrary to expectations, our findings indicate that introducing generated
images during training does not uniformly amplify bias. Instead, instances of
bias mitigation across specific tasks are observed. We further explore the
factors that may influence these phenomena, such as artifacts in image
generation (e.g., blurry faces) or pre-existing biases in the original
datasets. |
This paper investigates the impact of using synthetic images generated by deep generative models, specifically Stable Diffusion, on the potential for social bias amplification in future computer vision models. |
With the increasing prevalence of AI-generated images online, it's crucial to understand how their inherent biases might affect the training of future models and potentially lead to a harmful feedback loop. |
The authors progressively replaced original images in the COCO and CC3M datasets with Stable Diffusion generated images. They then trained OpenCLIP and image captioning models on these modified datasets and evaluated their performance and bias. |
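The progressive substitution described above can be sketched as building a mixed dataset at a chosen replacement ratio. This is an illustrative sketch, not the authors' pipeline: the function name `mix_dataset`, the assumption that every original item has one paired generated counterpart at the same index, and the random choice of which items to replace are all hypothetical.

```python
import random

def mix_dataset(original, generated, ratio, seed=0):
    """Return a copy of `original` in which a fraction `ratio` of items
    is replaced by the generated item at the same index."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(original) != len(generated):
        raise ValueError("original and generated must be paired one-to-one")
    n = len(original)
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    replace_idx = set(rng.sample(range(n), round(ratio * n)))
    return [generated[i] if i in replace_idx else original[i] for i in range(n)]

# Toy example: file names stand in for images.
originals = ["orig_0.jpg", "orig_1.jpg", "orig_2.jpg", "orig_3.jpg"]
synthetic = ["gen_0.jpg", "gen_1.jpg", "gen_2.jpg", "gen_3.jpg"]
half_mixed = mix_dataset(originals, synthetic, ratio=0.5)
```

Sweeping `ratio` from 0 to 1 and retraining at each step would give the kind of progressive simulation the summary describes.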
Introducing generated images during training did not consistently amplify bias as initially hypothesized.
Bias mitigation was observed in specific tasks, particularly related to gender.
The impact of generated images on bias varied, showing amplification, mitigation, no effect, or ambiguous trends depending on the task and the type of bias. |
The experiments were limited to moderately sized datasets (COCO and CC3M) due to computational constraints, leaving the impact on larger datasets uncertain.
Only Stable Diffusion was used for image generation, potentially overlooking insights from other generative models. |
social bias, deep generative models, dataset contamination, computer vision, stable diffusion |
2404.03214
Report |
LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity |
Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne |
Vision Transformers (ViTs), with their ability to model long-range
dependencies through self-attention mechanisms, have become a standard
architecture in computer vision. However, the interpretability of these models
remains a challenge. To address this, we propose LeGrad, an explainability
method specifically designed for ViTs. LeGrad computes the gradient with
respect to the attention maps of ViT layers, considering the gradient itself as
the explainability signal. We aggregate the signal over all layers, combining
the activations of the last as well as intermediate tokens to produce the
merged explainability map. This makes LeGrad a conceptually simple and
easy-to-implement tool for enhancing the transparency of ViTs. We evaluate
LeGrad in challenging segmentation, perturbation, and open-vocabulary settings,
showcasing its versatility compared to other SotA explainability methods and
demonstrating its superior spatial fidelity and robustness to perturbations. A
demo and the code are available at https://github.com/WalBouss/LeGrad. |
LeGrad, a novel gradient-based explainability method specifically designed for Vision Transformers (ViTs), leverages the gradient information with respect to the attention maps to provide insights into the model's decision-making process. |
Existing explainability methods designed for convolutional or feed-forward neural networks are not directly applicable to ViTs due to their architectural differences, making it crucial to develop methods specifically tailored for ViTs. |
LeGrad computes the gradient of the target class activation with respect to the attention maps of each ViT layer. It then aggregates these layer-wise gradients, discarding negative contributions and normalizing the result, to generate a final explainability heatmap. |
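The aggregation step described above can be sketched on precomputed gradients. The sketch assumes the gradients of the target activation with respect to each layer's attention map are already available as arrays of shape (heads, tokens, tokens), with the first token being a [CLS] token. The function name `aggregate_attention_gradients`, the averaging over heads and query tokens, and the min-max normalisation are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def aggregate_attention_gradients(grads, grid_size):
    """Combine per-layer attention gradients into one heatmap.

    grads:     list of arrays, each (heads, tokens, tokens),
               where tokens = 1 + grid_size ** 2 (a [CLS] token plus patches).
    grid_size: number of patches per image side.
    """
    total = np.zeros(grid_size * grid_size)
    for g in grads:
        positive = np.clip(g, 0.0, None)        # discard negative contributions
        per_token = positive.mean(axis=(0, 1))  # average over heads and query tokens
        total += per_token[1:]                  # drop the [CLS] column
    total /= len(grads)                         # average over layers

    # Min-max normalise to [0, 1]; a flat map becomes all zeros.
    span = total.max() - total.min()
    heat = (total - total.min()) / span if span > 0 else np.zeros_like(total)
    return heat.reshape(grid_size, grid_size)

# Toy example: 3 layers, 2 heads, a 2 x 2 patch grid (5 tokens including [CLS]).
rng = np.random.default_rng(0)
layer_grads = [rng.standard_normal((2, 5, 5)) for _ in range(3)]
heatmap = aggregate_attention_gradients(layer_grads, grid_size=2)
```

In practice the gradients would come from an autograd framework; they are random here only to keep the example self-contained.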
LeGrad outperforms state-of-the-art explainability methods in object segmentation tasks, achieving a mean Intersection over Union (mIoU) of 58.7% on the ImageNet-Segmentation dataset.
In open-vocabulary scenarios, LeGrad demonstrates superior performance on the OpenImagesV7 dataset, with performance gains ranging from 2x to 5x compared to other methods.
LeGrad exhibits robustness and adaptability across different ViT architectures, including those with attentional poolers like SigLIP, effectively highlighting relevant image regions for a wide range of object categories and concepts. |
The authors acknowledge the potential presence of biases and sensitive content within the datasets used to train the ViTs, emphasizing the need for ethical considerations in future research.
Future work could explore extensions of LeGrad to other transformer-based architectures beyond ViTs, further broadening its applicability in the field of explainable AI. |
explainable ai, vision transformers, attention mechanisms, image segmentation, open-vocabulary object detection |
2404.03202
Report |
OmniGS: Omnidirectional Gaussian Splatting for Fast Radiance Field Reconstruction using Omnidirectional Images |
Longwei Li, Huajian Huang, Sai-Kit Yeung, Hui Cheng |
Photorealistic reconstruction relying on 3D Gaussian Splatting has shown
promising potential in robotics. However, the current 3D Gaussian Splatting
system only supports radiance field reconstruction using undistorted
perspective images. In this paper, we present OmniGS, a novel omnidirectional
Gaussian splatting system, to take advantage of omnidirectional images for fast
radiance field reconstruction. Specifically, we conduct a theoretical analysis
of spherical camera model derivatives in 3D Gaussian Splatting. According to
the derivatives, we then implement a new GPU-accelerated omnidirectional
rasterizer that directly splats 3D Gaussians onto the equirectangular screen
space for omnidirectional image rendering. As a result, we realize
differentiable optimization of the radiance field without the requirement of
cube-map rectification or tangent-plane approximation. Extensive experiments
conducted in egocentric and roaming scenarios demonstrate that our method
achieves state-of-the-art reconstruction quality and high rendering speed using
omnidirectional images. To benefit the research community, the code will be
made publicly available once the paper is published. |
This paper introduces OmniGS, a novel system leveraging omnidirectional Gaussian Splatting for fast and efficient radiance field reconstruction from omnidirectional images. |
Current 3D Gaussian Splatting methods are limited to perspective images. This work enables the use of information-rich omnidirectional images for real-time, high-fidelity 3D scene reconstruction, which is crucial for robotics applications. |
The authors theoretically analyze spherical camera model derivatives for 3D Gaussian splatting. Based on this, they develop a GPU-accelerated rasterizer directly splatting 3D Gaussians onto equirectangular images, enabling differentiable optimization without cube-map rectification or tangent-plane approximations. |
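To make the equirectangular screen space concrete, the sketch below projects a 3D point in the camera frame to equirectangular pixel coordinates. The axis convention (x right, y down, z forward) and the function name `project_equirectangular` are assumptions made for illustration and may differ from the convention OmniGS uses. The sketch covers only the point projection, not the splatting of Gaussian covariances.

```python
import math

def project_equirectangular(point, width, height):
    """Map a camera-frame 3D point to (u, v) on an equirectangular image.

    Assumed convention: x right, y down, z forward.
    Longitude spans the image width, latitude spans the image height.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("a point at the camera centre has no direction")
    lon = math.atan2(x, z)                         # in (-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y / r)))    # in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    return u, v

# A point straight ahead lands at the image centre.
centre = project_equirectangular((0.0, 0.0, 1.0), width=200, height=100)
# A point directly to the right lands three quarters of the way across.
right = project_equirectangular((1.0, 0.0, 0.0), width=200, height=100)
```

Differentiating `u` and `v` with respect to the point coordinates gives the kind of projection Jacobian that a differentiable rasterizer needs.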
OmniGS achieves state-of-the-art photorealistic reconstruction quality on both roaming and egocentric scenes, outperforming NeRF-based methods.
OmniGS exhibits superior rendering speed, significantly faster than NeRF-based approaches and previous 3D Gaussian Splatting methods.
The method demonstrates strong scalability by producing high-quality perspective views from rendered omnidirectional images, surpassing perspective 3DGS in quality. |
The current implementation ignores the periodicity of trigonometric functions for speed, potentially limiting quality, which can be addressed in future work.
Future work can explore integrating OmniGS with omnidirectional SLAM systems for real-time simultaneous localization and photorealistic mapping in robotics. |
omnidirectional vision, photorealistic mapping, 3d reconstruction, view synthesis, gaussian splatting |
2404.03109
Report |
Many-to-many Image Generation with Auto-regressive Diffusion Models |
Ying Shen, Yizhe Zhang, Shuangfei Zhai, Lifu Huang, Joshua M. Susskind, Jiatao Gu |
Recent advancements in image generation have made significant progress, yet
existing models present limitations in perceiving and generating an arbitrary
number of interrelated images within a broad context. This limitation becomes
increasingly critical as the demand for multi-image scenarios, such as
multi-view images and visual narratives, grows with the expansion of multimedia
platforms. This paper introduces a domain-general framework for many-to-many
image generation, capable of producing interrelated image series from a given
set of images, offering a scalable solution that obviates the need for
task-specific solutions across different multi-image scenarios. To facilitate
this, we present MIS, a novel large-scale multi-image dataset, containing 12M
synthetic multi-image samples, each with 25 interconnected images. Utilizing
Stable Diffusion with varied latent noises, our method produces a set of
interconnected images from a single caption. Leveraging MIS, we learn M2M, an
autoregressive model for many-to-many generation, where each image is modeled
within a diffusion framework. Throughout training on the synthetic MIS, the
model excels in capturing style and content from preceding images - synthetic
or real - and generates novel images following the captured patterns.
Furthermore, through task-specific fine-tuning, our model demonstrates its
adaptability to various multi-image generation tasks, including Novel View
Synthesis and Visual Procedure Generation. |
This paper introduces a novel domain-general framework for many-to-many image generation, enabling the production of interrelated image series from a given set of images. |
Existing image generation models often struggle to perceive and generate an arbitrary number of interconnected images within a broader context, limiting their application in multi-image scenarios like multi-view synthesis and visual narratives. This paper addresses this limitation by proposing a scalable solution applicable across different multi-image generation tasks. |
The authors propose Many-to-Many Diffusion Model (M2M-DM), an autoregressive model trained on a new large-scale synthetic multi-image dataset named M2M-ImageSet. They explore two architectural variants: SD-M2M, which processes images using a shared U-Net, and DINO-M2M, which leverages DINOv2 for enhanced encoding of preceding images. |
M2M-DM demonstrates the ability to capture and maintain both content and style consistency across generated image sequences.
The model exhibits strong zero-shot generalization capabilities, effectively generating coherent images even when conditioned on real-world images from the MSCOCO dataset, despite being trained solely on synthetic data.
Through task-specific fine-tuning, M2M-DM adapts to diverse multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation, highlighting its versatility and potential for broader application. |
The model currently faces limitations in generating high-fidelity human faces, potentially due to limitations in the training dataset.
A decline in image quality is observed during the generation of prolonged image sequences, suggesting an area for future optimization. |
image generation, diffusion models, multi-image generation, autoregressive models, novel view synthesis |
2404.03042
Report |
AWOL: Analysis WithOut synthesis using Language |
Silvia Zuffi, Michael J. Black |
Many classical parametric 3D shape models exist, but creating novel shapes
with such models requires expert knowledge of their parameters. For example,
imagine creating a specific type of tree using procedural graphics or a new
kind of animal from a statistical shape model. Our key idea is to leverage
language to control such existing models to produce novel shapes. This involves
learning a mapping between the latent space of a vision-language model and the
parameter space of the 3D model, which we do using a small set of shape and
text pairs. Our hypothesis is that mapping from language to parameters allows
us to generate parameters for objects that were never seen during training. If
the mapping between language and parameters is sufficiently smooth, then
interpolation or generalization in language should translate appropriately into
novel 3D shapes. We test our approach with two very different types of
parametric shape models (quadrupeds and arboreal trees). We use a learned
statistical shape model of quadrupeds and show that we can use text to generate
new animals not present during training. In particular, we demonstrate
state-of-the-art shape estimation of 3D dogs. This work also constitutes the
first language-driven method for generating 3D trees. Finally, embedding images
in the CLIP latent space enables us to generate animals and trees directly from
images. |
Presents AWOL, a method leveraging language to generate novel 3D shapes from existing parametric models, achieving generalization beyond training data by mapping between vision-language models and model parameters. |
Addresses the limitations of classical parametric 3D shape models requiring expert knowledge for novel shape creation, enabling easy generation of new shapes (e.g., specific tree types, animal breeds) using language. |
Learns a mapping between CLIP's latent space and parameters of 3D models (e.g., SMAL for animals, TreeGen for trees) using a small dataset of shape and text pairs. Employs a RealNVP model with learned binary masks and a reconstruction loss for training. |
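The RealNVP model mentioned above is built from invertible affine coupling layers. The sketch below shows one such layer with a binary mask, under stated assumptions: the mask is fixed rather than learned, the scale and shift networks are single random linear maps, and the class name `AffineCoupling` is hypothetical. It illustrates only the invertibility of the building block, not AWOL's trained mapping.

```python
import numpy as np

class AffineCoupling:
    """One RealNVP-style coupling layer: masked entries pass through unchanged
    and parameterise an affine transform of the remaining entries."""

    def __init__(self, dim, mask, seed=0):
        rng = np.random.default_rng(seed)
        self.mask = np.asarray(mask, dtype=float)  # binary: 1 = keep, 0 = transform
        # Stand-ins for the scale and shift networks (assumed linear here).
        self.w_s = 0.1 * rng.standard_normal((dim, dim))
        self.w_t = 0.1 * rng.standard_normal((dim, dim))

    def _scale_shift(self, kept):
        s = np.tanh(kept @ self.w_s)  # bounded log-scale for numerical stability
        t = kept @ self.w_t
        return s, t

    def forward(self, x):
        kept = self.mask * x
        s, t = self._scale_shift(kept)
        return kept + (1.0 - self.mask) * (x * np.exp(s) + t)

    def inverse(self, y):
        kept = self.mask * y  # equals mask * x, because kept entries were not changed
        s, t = self._scale_shift(kept)
        return kept + (1.0 - self.mask) * ((y - t) * np.exp(-s))

# Toy example: a 4-dimensional vector, with the first two entries kept.
layer = AffineCoupling(dim=4, mask=[1, 1, 0, 0])
x = np.array([0.5, -1.0, 2.0, 0.25])
y = layer.forward(x)
x_back = layer.inverse(y)
```

Stacking several such layers with different masks lets every dimension be transformed while keeping the whole mapping exactly invertible.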
Generates new dog breeds by interpolating in the shape space, demonstrating realistic variations in size and age.
Produces novel animal and tree species not present in the training data, showcasing generalization capabilities.
Enables 3D shape generation from both text prompts and images, providing flexible control over the models. |
Limited to the diversity of the initial training data for the 3D models.
Primarily explores qualitative evaluation for generated shapes, lacking extensive quantitative metrics. |
text-to-3d, 3d shape generation, vision-language models, parametric models, clip |
2404.02948
Report |
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models |
Fanxu Meng, Zhaohui Wang, Muhan Zhang |
As the parameters of LLMs expand, the computational cost of fine-tuning the
entire model becomes prohibitive. To address this challenge, we introduce a
PEFT method, Principal Singular values and Singular vectors Adaptation (PiSSA),
which optimizes a significantly reduced parameter space while achieving or
surpassing the performance of full-parameter fine-tuning. PiSSA is inspired by
Intrinsic SAID, which suggests that pre-trained, over-parametrized models
inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a
matrix W within the model by the product of two trainable matrices A and B,
plus a residual matrix $W^{res}$ for error correction. SVD is employed to
factorize W, and the principal singular values and vectors of W are utilized to
initialize A and B. The residual singular values and vectors initialize the
residual matrix $W^{res}$, which is kept frozen during fine-tuning. Notably,
PiSSA shares the same architecture with LoRA. However, LoRA approximates Delta
W through the product of two matrices, A, initialized with Gaussian noise, and
B, initialized with zeros, while PiSSA initializes A and B with principal
singular values and vectors of the original matrix W. PiSSA can better
approximate the outcomes of full-parameter fine-tuning at the beginning by
changing the essential parts while freezing the "noisy" parts. In comparison,
LoRA freezes the original matrix and updates the "noise". This distinction
enables PiSSA to converge much faster than LoRA and also achieve better
performance in the end. Due to the same architecture, PiSSA inherits many of
LoRA's advantages, such as parameter efficiency and compatibility with
quantization. Leveraging a fast SVD method, the initialization of PiSSA takes
only a few seconds, inducing negligible cost of switching LoRA to PiSSA. |
This paper introduces PiSSA, a novel parameter-efficient fine-tuning (PEFT) method that leverages principal singular values and vectors for adapter initialization, outperforming existing techniques like LoRA. |
Fine-tuning large language models (LLMs) is computationally expensive. PiSSA addresses this by significantly reducing the number of trainable parameters while maintaining or exceeding the performance of full-parameter fine-tuning. |
PiSSA utilizes singular value decomposition (SVD) to extract principal components from pre-trained weight matrices. These components initialize adapters, effectively capturing essential model capabilities for efficient fine-tuning. |
PiSSA consistently outperforms LoRA in fine-tuning across various tasks, models (LLaMA 2-7B, Mistral-7B, Gemma-7B), and datasets (MetaMathQA, CodeFeedback, WizardLM).
When combined with quantization techniques, PiSSA significantly reduces quantization errors compared to QLoRA and LoftQ, further improving efficiency.
Experiments with different ranks demonstrate that PiSSA converges faster and achieves better performance with fewer trainable parameters than LoRA, highlighting its superior efficiency. |
Further investigation is needed to assess PiSSA's performance on a broader range of tasks and larger models.
Future work includes exploring the combination of PiSSA with LoRA successors and providing a theoretical explanation for its advantages. |
parameter-efficient fine-tuning, large language models, singular value decomposition, low-rank adaptation, quantization |
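The initialization described in the PiSSA record above (principal singular components go into the trainable A and B, the rest into a frozen residual) can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function name `pissa_init`, the matrix sizes, the rank, and the choice to split each singular value as a square root between A and B are all assumptions made for the example.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into a trainable rank-r product A @ B and a frozen residual."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    # Assumption: each principal singular value is shared equally (as a
    # square root) between the two trainable factors.
    A = U[:, :r] * sqrt_s                  # shape (m, r)
    B = sqrt_s[:, None] * Vt[:r]           # shape (r, n)
    # The remaining singular values and vectors form the frozen residual.
    W_res = (U[:, r:] * S[r:]) @ Vt[r:]    # shape (m, n)
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))          # hypothetical weight matrix
A, B, W_res = pissa_init(W, r=8)

# At initialization the decomposition reproduces W up to rounding error.
recon_error = np.abs(A @ B + W_res - W).max()
```

Because A @ B + W_res equals W at initialization, the adapted layer starts from the pre-trained function, and only the principal part is updated during fine-tuning.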
2404.02905
Report |
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction |
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang |
We present Visual AutoRegressive modeling (VAR), a new generation paradigm
that redefines the autoregressive learning on images as coarse-to-fine
"next-scale prediction" or "next-resolution prediction", diverging from the
standard raster-scan "next-token prediction". This simple, intuitive
methodology allows autoregressive (AR) transformers to learn visual
distributions fast and generalize well: VAR, for the first time, makes AR
models surpass diffusion transformers in image generation. On ImageNet 256x256
benchmark, VAR significantly improves the AR baseline, improving Frechet inception
distance (FID) from 18.65 to 1.80, inception score (IS) from 80.4 to 356.4,
with around 20x faster inference speed. It is also empirically verified that
VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions
including image quality, inference speed, data efficiency, and scalability.
Scaling up VAR models exhibits clear power-law scaling laws similar to those
observed in LLMs, with linear correlation coefficients near -0.998 as solid
evidence. VAR further showcases zero-shot generalization ability in downstream
tasks including image in-painting, out-painting, and editing. These results
suggest VAR has initially emulated the two important properties of LLMs:
Scaling Laws and zero-shot task generalization. We have released all models and
codes to promote the exploration of AR/VAR models for visual generation and
unified learning. |
This paper introduces Visual AutoRegressive (VAR) modeling, a novel image generation paradigm that redefines autoregressive learning on images as “next-scale prediction” or “next-resolution prediction”, departing from the conventional “next-token prediction” approach. |
Existing autoregressive image generation models, while theoretically sound, suffer from limitations such as violation of unidirectional dependencies, disruption of spatial locality, and inefficiency. This hinders their performance and scalability compared to diffusion models. VAR addresses these limitations, enabling autoregressive models to surpass diffusion models in image generation for the first time. |
VAR employs a multi-scale approach. It first encodes an image into multi-scale token maps using a novel multi-scale VQVAE. Then, a GPT-style transformer generates images autoregressively from coarse to fine scales, predicting the next higher-resolution token map at each step, conditioned on all previous scales. |
VAR achieves state-of-the-art FID/IS scores on ImageNet 256x256 and 512x512 benchmarks, outperforming both traditional autoregressive models and diffusion models, including DiT.
VAR exhibits strong power-law scaling laws similar to LLMs, demonstrating consistent performance improvements with increased model size and computational budget.
VAR shows promising zero-shot generalization capabilities for downstream tasks such as image in-painting, out-painting, and editing. |
The study primarily focuses on the learning paradigm, with the VQVAE architecture and training adopted from a baseline. Exploring advanced VQVAE architectures could further enhance VAR's performance.
While demonstrating promise in zero-shot generalization, the current study doesn't delve into text-prompt-based generation. Extending VAR for text-to-image synthesis is a high priority. |
image generation, autoregressive models, visual transformers, scaling laws, zero-shot learning |
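The coarse-to-fine loop described in the VAR record above can be sketched as follows. This is only a control-flow sketch under stated assumptions: `make_random_predictor` is a hypothetical stand-in for the transformer (it draws random token ids), and the side-length schedule and vocabulary size are illustrative values, not taken from the paper.

```python
import random

def generate_coarse_to_fine(predict_scale, side_lengths):
    """Generate token maps from coarse to fine, one whole map per step."""
    maps = []
    for side in side_lengths:
        # Each step is conditioned on every previously generated, coarser map.
        maps.append(predict_scale(maps, side))
    return maps

def make_random_predictor(vocab_size, seed):
    """Hypothetical stand-in for the transformer: returns random token ids."""
    rng = random.Random(seed)

    def predict_scale(previous_maps, side):
        return [[rng.randrange(vocab_size) for _ in range(side)]
                for _ in range(side)]

    return predict_scale

side_lengths = [1, 2, 4, 8, 16]   # assumed schedule, for illustration only
token_maps = generate_coarse_to_fine(make_random_predictor(4096, seed=0),
                                     side_lengths)

num_steps = len(token_maps)                                # one step per scale
total_tokens = sum(side * side for side in side_lengths)   # tokens over all scales
raster_steps = side_lengths[-1] ** 2                       # steps for next-token order
```

With this schedule the loop takes 5 autoregressive steps, whereas raster-scan next-token prediction over the finest 16x16 map alone would take 256 steps.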
2404.02889
Report |
Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining |
Qi Cui, Ruohan Meng, Chaohui Xu, Chip-Hong Chang |
Ensuring the legal usage of deep models is crucial to promoting trustable,
accountable, and responsible artificial intelligence innovation. Current
passport-based methods that obfuscate model functionality for license-to-use
and ownership verifications suffer from capacity and quality constraints, as
they require retraining the owner model for new users. They are also vulnerable
to advanced Expanded Residual Block ambiguity attacks. We propose
Steganographic Passport, which uses an invertible steganographic network to
decouple license-to-use from ownership verification by hiding the user's
identity images into the owner-side passport and recovering them from their
respective user-side passports. An irreversible and collision-resistant hash
function is used to avoid exposing the owner-side passport from the derived
user-side passports and increase the uniqueness of the model signature. To
safeguard both the passport and model's weights against advanced ambiguity
attacks, an activation-level obfuscation is proposed for the verification
branch of the owner's model. By jointly training the verification and
deployment branches, their weights become tightly coupled. The proposed method
supports agile licensing of deep models by providing a strong ownership proof
and license accountability without requiring a separate model retraining for
the admission of every new user. Experiment results show that our
Steganographic Passport outperforms other passport-based deep model protection
methods in robustness against various known attacks. |
This paper proposes Steganographic Passport, a novel method for protecting deep model intellectual property (IP) that allows verification of both model ownership and individual user licenses without retraining. |
Current passport-based methods for deep model protection require retraining for each new user, limiting their scalability and practicality for licensing scenarios. |
The method uses an invertible steganographic network to hide user IDs in user-side passports, decoupling license verification from ownership verification. It also employs activation-level obfuscation and a balance loss function to enhance security against attacks. |
The method achieves high accuracy in both ownership and license verification.
It exhibits strong robustness against various attacks, including ownership ambiguity attacks, license ambiguity attacks, and removal attacks.
Experimental results demonstrate its superior performance compared to existing passport-based methods. |
The impact of the choice of activation function on the method's security requires further investigation.
Exploring more advanced steganography techniques to enhance the hiding capacity and imperceptibility of user IDs in passports is a promising direction. |
deep model protection, intellectual property, steganography, passport-based verification, license verification |
2404.02883
Report |
On the Scalability of Diffusion-based Text-to-Image Generation |
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto |
Scaling up model and data size has been quite successful for the evolution of
LLMs. However, the scaling law for the diffusion based text-to-image (T2I)
models is not fully explored. It is also unclear how to efficiently scale the
model for better performance at reduced cost. The different training settings
and expensive training cost make a fair model comparison extremely difficult.
In this work, we empirically study the scaling properties of diffusion based
T2I models by performing extensive and rigorous ablations on scaling both
denoising backbones and training set, including training scaled UNet and
Transformer variants ranging from 0.4B to 4B parameters on datasets up to 600M
images. For model scaling, we find the location and amount of cross attention
distinguish the performance of existing UNet designs. Moreover, increasing the
transformer blocks is more parameter-efficient for improving text-image
alignment than increasing channel numbers. We then identify an efficient UNet
variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data
scaling side, we show the quality and diversity of the training set matter
more than simply dataset size. Increasing caption density and diversity
improves text-image alignment performance and the learning efficiency. Finally,
we provide scaling functions to predict the text-image alignment performance as
functions of the scale of model size, compute and dataset size. |
This paper investigates the scaling properties of diffusion-based text-to-image models, focusing on the impact of scaling denoising backbones (UNet and Transformer) and training datasets on model performance. |
Understanding how to effectively scale these models is crucial for improving image generation quality, text-image alignment, and training efficiency. |
The authors conducted extensive, controlled experiments, training various UNet and Transformer architectures (ranging from 0.4B to 4B parameters) on datasets of up to 600M images. They rigorously ablated model architectures and dataset properties, evaluating performance with metrics like TIFA, ImageReward, CLIP score, FID, and HPSv2. |
The design of the denoising backbone significantly influences the performance, with SDXL's UNet outperforming others. Increasing transformer blocks in UNet is more parameter-efficient than increasing channel numbers for text-image alignment.
Scaling the training data with synthetic captions improves image quality and speeds up convergence. Larger, well-designed models benefit more from increased dataset size.
The study provides scaling functions that predict text-image alignment performance based on model size, compute, and dataset size, demonstrating power-law relationships similar to those observed in LLMs. |
Training Transformers from scratch for image generation is challenging due to the lack of inductive bias compared to UNets, suggesting further research in this area.
While the study focuses on scaling existing architectures, exploring novel architectural designs could further improve scaling efficiency. |
text-to-image synthesis, diffusion models, unet, transformer, scaling laws |
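A scaling function of the power-law form mentioned in the record above is commonly fitted as a straight line in log-log space. The sketch below shows that fitting step only. The function name `fit_power_law` and all data points are synthetic assumptions for illustration; they are not results or code from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    syy = sum((v - mean_y) ** 2 for v in log_y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                       # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # prefactor from the intercept
    r = sxy / math.sqrt(sxx * syy)      # linear correlation of the logs
    return a, b, r

# Synthetic example: hypothetical model sizes (billions of parameters) and a
# made-up metric that follows y = 2.0 * x**0.1 exactly.
model_sizes = [0.4, 0.9, 2.0, 4.0]
metric = [2.0 * size ** 0.1 for size in model_sizes]
a, b, r = fit_power_law(model_sizes, metric)
```

On real measurements the correlation coefficient `r` indicates how closely the points follow a power law; the same routine applies when compute or dataset size is the independent variable.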
2404.02790
Report |
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation |
Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, Sarah Parisot |
Text-to-image generation has achieved astonishing results, yet precise
spatial controllability and prompt fidelity remain highly challenging. This
limitation is typically addressed through cumbersome prompt engineering, scene
layout conditioning, or image editing techniques which often require hand drawn
masks. Nonetheless, pre-existing works struggle to take advantage of the
natural instance-level compositionality of scenes due to the typically flat
nature of rasterized RGB output images. Towards addressing this challenge, we
introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of
RGB images as multilayer, instance-wise RGBA decompositions, and over 100K
instance images. To build MuLAn, we developed a training free pipeline which
decomposes a monocular RGB image into a stack of RGBA layers comprising a
background and isolated instances. We achieve this through the use of
pretrained general-purpose models, and by developing three modules: image
decomposition for instance discovery and extraction, instance completion to
reconstruct occluded areas, and image re-assembly. We use our pipeline to
create MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image
decompositions in terms of style, composition and complexity. With MuLAn, we
provide the first photorealistic resource providing instance decomposition and
occlusion information for high quality images, opening up new avenues for
text-to-image generative AI research. With this, we aim to encourage the
development of novel generation and editing technology, in particular
layer-wise solutions. MuLAn data resources are available at
https://MuLAn-dataset.github.io/. |
This paper introduces MuLAn, a novel dataset of over 44K images with multi-layer RGBA decompositions, designed to facilitate research in compositional text-to-image generation. |
Precise controllability and prompt fidelity in text-to-image generation remain challenging. The flat nature of RGB images hinders leveraging the natural instance-level compositionality of scenes. MuLAn addresses this by providing instance decomposition and occlusion information. |
A novel, three-module pipeline decomposes RGB images into instance-wise RGBA stacks. The modules are: 1) Decomposition: Instance discovery and extraction using object detection, segmentation, and depth estimation. 2) Instance Completion: Reconstruction of occluded areas by leveraging depth, relative occlusion, and text-to-image inpainting. 3) Image Reassembly: Generation of occlusion-aware Alpha layers to build the final RGBA stack. |
MuLAn is the first dataset of its kind, providing instance decomposition and occlusion information for a large variety of photorealistic scenes and object types.
A robust, modular, and training-free pipeline is developed, capable of decomposing single RGB images into instance-wise RGBA stacks.
MuLAn's potential is showcased through two applications: RGBA image generation and instance addition image editing, demonstrating superior performance compared to existing methods. |
The pipeline's performance is limited by the accuracy of current object detection, segmentation, and inpainting models.
Future work will focus on improving pipeline performance, increasing MuLAn's size, and exploring human-in-the-loop extensions. |
text-to-image generation, image decomposition, rgba images, instance segmentation, image editing |
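To illustrate how an instance-wise RGBA stack of the kind described in the MuLAn record above relates to a flat RGB image, the sketch below flattens a stack with the standard alpha "over" operator. It assumes float layers in [0, 1] with non-premultiplied alpha, ordered back to front with an opaque background first; the function name `composite_layers` and the tiny 2x2 example are hypothetical and not part of the MuLAn pipeline.

```python
import numpy as np

def composite_layers(layers):
    """Flatten a back-to-front stack of RGBA layers into one RGB image.

    Each layer is a float array of shape (H, W, 4) with values in [0, 1]
    and non-premultiplied alpha.
    """
    height, width, _ = layers[0].shape
    out = np.zeros((height, width, 3))
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # "Over" operator: the new layer covers what is already there
        # in proportion to its alpha.
        out = alpha * rgb + (1.0 - alpha) * out
    return out

# Hypothetical 2x2 scene: an opaque red background and one blue instance
# that occupies only the top-left pixel.
background = np.zeros((2, 2, 4))
background[..., 0] = 1.0   # red channel
background[..., 3] = 1.0   # fully opaque
instance = np.zeros((2, 2, 4))
instance[0, 0] = [0.0, 0.0, 1.0, 1.0]

flat = composite_layers([background, instance])
```

Removing or reordering an entry in the list before compositing corresponds to the layer-wise editing that such a decomposition makes possible.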
2404.02788
Report |
GenN2N: Generative NeRF2NeRF Translation |
Xiangyue Liu, Han Xue, Kunming Luo, Ping Tan, Li Yi |
We present GenN2N, a unified NeRF-to-NeRF translation framework for various
NeRF translation tasks such as text-driven NeRF editing, colorization,
super-resolution, inpainting, etc. Unlike previous methods designed for
individual translation tasks with task-specific schemes, GenN2N achieves all
these NeRF editing tasks by employing a plug-and-play image-to-image translator
to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF
space. Since the 3D consistency of 2D edits may not be assured, we propose to
model the distribution of the underlying 3D edits through a generative model
that can cover all possible edited NeRFs. To model the distribution of 3D
edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes
images while decoding NeRFs. The latent space is trained to align with a
Gaussian distribution and the NeRFs are supervised through an adversarial loss
on its renderings. To ensure the latent code does not depend on 2D viewpoints
but truly reflects the 3D edits, we also regularize the latent code through a
contrastive learning scheme. Extensive experiments on various editing tasks
show GenN2N, as a universal framework, performs as well or better than
task-specific specialists while possessing flexible generative power. More
results on our project page: https://xiangyueliu.github.io/GenN2N/ |
This paper presents GenN2N, a unified NeRF-to-NeRF translation framework for diverse NeRF editing tasks (text-driven editing, colorization, super-resolution, inpainting). |
Existing NeRF editing methods are task-specific and lack flexibility. This work proposes a universal editing framework leveraging 2D image editing tools while maintaining 3D consistency. |
The method uses a plug-and-play 2D image-to-image translator for editing. It then trains a 3D VAE-GAN model to capture the distribution of possible 3D edits from the inconsistent 2D results. Contrastive learning is used to disentangle viewpoint from editing. |
GenN2N achieves state-of-the-art performance on various editing tasks, surpassing task-specific methods.
The framework demonstrates good 3D consistency, generating plausible edits across different viewpoints.
It shows strong generative capability, enabling diverse editing outcomes from a single input NeRF. |
The reliance on 2D image editing tools might limit the complexity of achievable 3D edits.
The method's performance heavily depends on the quality and consistency of the 2D translator. |
nerf, nerf editing, 3d scene editing, generative models, image-to-image translation |
2404.02747
Report |
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models |
Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber |
This study explores the role of cross-attention during inference in
text-conditional diffusion models. We find that cross-attention outputs
converge to a fixed point after a few inference steps. Accordingly, the time
point of convergence naturally divides the entire inference process into two
stages: an initial semantics-planning stage, during which the model relies on
cross-attention to plan text-oriented visual semantics, and a subsequent
fidelity-improving stage, during which the model tries to generate images from
previously planned semantics. Surprisingly, ignoring text conditions in the
fidelity-improving stage not only reduces computation complexity, but also
maintains model performance. This yields a simple and training-free method
called TGATE for efficient generation, which caches the cross-attention output
once it converges and keeps it fixed during the remaining inference steps. Our
empirical study on the MS-COCO validation set confirms its effectiveness. The
source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE. |
This paper presents TGATE, a training-free method that caches and reuses cross-attention outputs in text-to-image diffusion models, significantly reducing computational cost without sacrificing generation quality. |
Cross-attention in diffusion models, while crucial, is computationally expensive. This work identifies redundancy in cross-attention during later inference stages, enabling substantial efficiency improvements. |
The authors empirically analyze the convergence of cross-attention maps during inference. They propose TGATE, which caches these maps after convergence and reuses them, bypassing redundant computations. |
Cross-attention maps converge to a fixed point after a few inference steps, indicating diminishing influence in later stages.
TGATE reduces the number of Multiply–Accumulate Operations (MACs) by up to 50% and parameters by 25%, resulting in up to 2x speedup on a commercial GPU.
TGATE maintains or even slightly improves FID scores compared to base models, demonstrating effectiveness without sacrificing generation quality. |
While TGATE improves efficiency and FID scores, visual differences in generated images compared to baselines might be subtle.
Future work could explore the optimal gate step for diverse models and prompts, potentially through adaptive mechanisms. |
diffusion models, text-to-image synthesis, cross-attention, inference efficiency, computational cost |
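The caching idea in the TGATE record above (compute cross-attention during the early steps, then keep its output fixed) can be sketched as a small wrapper. This is a toy illustration under assumptions: the class name `CachedCrossAttention`, the `gate_step` value, and the arithmetic stand-in for attention are hypothetical and do not reflect the released TGATE code.

```python
class CachedCrossAttention:
    """Recompute cross-attention before gate_step, then reuse the cached output."""

    def __init__(self, cross_attention, gate_step):
        self.cross_attention = cross_attention
        self.gate_step = gate_step
        self.cached = None
        self.calls = 0   # how many times the real module was evaluated

    def __call__(self, step, hidden, text):
        if step < self.gate_step or self.cached is None:
            self.cached = self.cross_attention(hidden, text)
            self.calls += 1
        return self.cached

def toy_cross_attention(hidden, text):
    """Stand-in for a real cross-attention module (illustrative arithmetic only)."""
    return hidden + sum(text)

gated = CachedCrossAttention(toy_cross_attention, gate_step=10)
text_embedding = [1.0, 2.0, 3.0]   # hypothetical text features
outputs = [gated(step, float(step), text_embedding) for step in range(25)]
```

In this toy run the wrapped module is evaluated for the first 10 of 25 steps only; every later step returns the output cached at step 9, regardless of how the hidden state changes.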
2404.02733
Report |
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation |
Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui Qin, Anthony Chen |
Tuning-free diffusion-based models have demonstrated significant potential in
the realm of image personalization and customization. However, despite this
notable progress, current models continue to grapple with several complex
challenges in producing style-consistent image generation. Firstly, the concept
of style is inherently underdetermined, encompassing a multitude of elements
such as color, material, atmosphere, design, and structure, among others.
Secondly, inversion-based methods are prone to style degradation, often
resulting in the loss of fine-grained details. Lastly, adapter-based approaches
frequently require meticulous weight tuning for each reference image to achieve
a balance between style intensity and text controllability. In this paper, we
commence by examining several compelling yet frequently overlooked
observations. We then proceed to introduce InstantStyle, a framework designed
to address these issues through the implementation of two key strategies: 1) A
straightforward mechanism that decouples style and content from reference
images within the feature space, predicated on the assumption that features
within the same space can be either added to or subtracted from one another. 2)
The injection of reference image features exclusively into style-specific
blocks, thereby preventing style leaks and eschewing the need for cumbersome
weight tuning, which often characterizes more parameter-heavy designs. Our work
demonstrates superior visual stylization outcomes, striking an optimal balance
between the intensity of style and the controllability of textual elements. Our
codes will be available at https://github.com/InstantStyle/InstantStyle. |
InstantStyle is a novel tuning-free framework for diffusion-based text-to-image models that disentangles style and content in reference images for superior style transfer, enhancing existing adapter-based methods. |
Existing tuning-free methods for style transfer struggle with style degradation during inversion, content leakage, and laborious weight tuning. This work aims to address these challenges by simplifying style and content decoupling. |
InstantStyle utilizes two key strategies: 1) Subtracting content text features from reference image features in CLIP space for explicit content removal. 2) Injecting image features solely into style-specific attention blocks within the diffusion model for implicit content-style disentanglement. |
InstantStyle achieves visually superior style transfer with reduced content leakage compared to state-of-the-art methods.
The subtraction strategy effectively mitigates content leakage but may still require manual weight tuning.
Injecting features only into style blocks yields the most elegant and effective style transfer, enhancing text controllability by reducing adapter parameters. |
While effective, content subtraction may still require manual weight tuning.
The definition of 'style' can be subjective, necessitating further exploration for a more comprehensive representation. |
style transfer, text-to-image generation, diffusion models, content-style disentanglement, clip |
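The first strategy in the InstantStyle record above relies on the assumption that features in the same space can be added and subtracted. The toy sketch below makes that arithmetic concrete with hand-made 4-dimensional vectors standing in for CLIP embeddings; the vectors, the function names, and the exact additive decomposition are illustrative assumptions, and real embeddings would only approximately behave this way.

```python
import math

def subtract_content(image_embedding, content_text_embedding):
    """Remove the content direction from a reference-image embedding."""
    return [i - c for i, c in zip(image_embedding, content_text_embedding)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumptions): the image embedding is modelled as style + content.
style = [0.0, 1.0, 0.0, 0.5]
content = [1.0, 0.0, 0.2, 0.0]
image = [s + c for s, c in zip(style, content)]

style_only = subtract_content(image, content)
leak_before = cosine_similarity(image, content)       # content present
leak_after = cosine_similarity(style_only, content)   # content removed
```

In this idealized setting the subtraction recovers the style vector exactly, and its similarity to the content embedding drops to zero, which is the intended reduction in content leakage.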
2404.02686
Report |
Design2Cloth: 3D Cloth Generation from 2D Masks |
Jiali Zheng, Rolandos Alexandros Potamias, Stefanos Zafeiriou |
In recent years, there has been a significant shift in the field of digital
avatar research, towards modeling, animating and reconstructing clothed human
representations, as a key step towards creating realistic avatars. However,
current 3D cloth generation methods are garment specific or trained completely
on synthetic data, hence lacking fine details and realism. In this work, we
make a step towards automatic realistic garment design and propose
Design2Cloth, a high fidelity 3D generative model trained on a real world
dataset from more than 2000 subject scans. To provide a vital contribution to the
fashion industry, we developed a user-friendly adversarial model capable of
generating diverse and detailed clothes simply by drawing a 2D cloth mask.
Under a series of both qualitative and quantitative experiments, we showcase
that Design2Cloth outperforms current state-of-the-art cloth generative models
by a large margin. In addition to the generative properties of our network, we
showcase that the proposed method can be used to achieve high quality
reconstructions from single in-the-wild images and 3D scans. Dataset, code and
pre-trained model will become publicly available. |
This paper introduces Design2Cloth, a high-fidelity 3D garment generative model trained on a large-scale real-world dataset (DigitalMe) of over 2,000 garments from 2,010 subjects. |
Current 3D cloth generation methods lack realism due to being garment-specific or trained on synthetic data. Design2Cloth addresses this by using real-world data and a user-friendly approach to generate diverse and detailed clothes. |
The method leverages a mask encoder and a shape encoder to learn a compact latent space for cloth representation. It employs a triplane generator to decode latent codes into unsigned distance functions, generating 3D clothes. A dual-resolution discriminator enhances detail and realism. |
Design2Cloth outperforms state-of-the-art methods in generating realistic garments with high-frequency details.
It enables smooth interpolation between diverse garment styles and shapes.
The model allows 3D garment reconstruction from in-the-wild images and scans, outperforming baselines in accuracy and realism. |
The reliance on accurate SMPL pose and shape estimation for in-the-wild reconstruction.
Potential limitations in capturing the full diversity of real-world garment designs and textures. |
3d garment generation, implicit neural representation, real-world cloth dataset, user-friendly design, 3d garment reconstruction |
2404.02634
Report |
3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization |
SeungJeh Chung, JooHyun Park, Hyewon Kan, HyeongYeop Kang |
3D stylization, which entails the application of specific styles to
three-dimensional objects, holds significant commercial potential as it enables
the creation of diverse 3D objects with distinct moods and styles, tailored to
specific demands of different scenes. With recent advancements in text-driven
methods and artificial intelligence, the stylization process is increasingly
intuitive and automated, thereby diminishing the reliance on manual labor and
expertise. However, existing methods have predominantly focused on holistic
stylization, thereby leaving the application of styles to individual components
of a 3D object unexplored. In response, we introduce 3DStyleGLIP, a novel
framework specifically designed for text-driven, part-tailored 3D stylization.
Given a 3D mesh and a text prompt, 3DStyleGLIP leverages the vision-language
embedding space of the Grounded Language-Image Pre-training (GLIP) model to
localize the individual parts of the 3D mesh and modify their colors and local
geometries to align them with the desired styles specified in the text prompt.
3DStyleGLIP is effectively trained for 3D stylization tasks through a
part-level style loss working in GLIP's embedding space, supplemented by two
complementary learning techniques. Extensive experimental validation confirms
that our method achieves significant part-wise stylization capabilities,
demonstrating promising potential in advancing the field of 3D stylization. |
Introduces 3DStyleGLIP, a novel framework for text-driven, part-tailored 3D neural stylization, allowing users to apply distinct styles to different parts of a 3D mesh based on text prompts. |
Existing 3D stylization methods mainly focus on holistic stylization, limiting the ability to apply different styles to individual object components. 3DStyleGLIP addresses this limitation by enabling part-tailored stylization. |
Leverages the GLIP model's vision-language embedding space to localize individual mesh parts. Trains a Neural Style Field (NSF) to modify the mesh's colors and local geometries to match the style phrases in the text prompt. |
Achieves superior part-tailored stylization compared to existing 3D generation and editing methods.
Demonstrates consistent and stable stylization outcomes across different random seeds.
Outperforms baseline methods in user studies, showcasing better alignment with text descriptions and higher-quality stylization. |
Currently limited in synthesizing parts based on abstract concepts or emotions (e.g., "delicious hamburger").
Faces challenges with stylizing objects with more than five parts or highly detailed semantic parts. |
3d stylization, part-tailored stylization, text-driven manipulation, vision-language model, glip |
2404.02617
Report |
Neural Radiance Fields with Torch Units |
Bingnan Ni, Huanyu Wang, Dongfeng Bai, Minghe Weng, Dexin Qi, Weichao Qiu, Bingbing Liu |
Neural Radiance Fields (NeRF) give rise to learning-based 3D reconstruction
methods widely used in industrial applications. Although prevalent methods
achieve considerable improvements in small-scale scenes, accomplishing
reconstruction in complex and large-scale scenes is still challenging. First,
the background in complex scenes shows a large variance among different views.
Second, the current inference pattern, i.e., a pixel only relies on an
individual camera ray, fails to capture contextual information. To solve these
problems, we propose to enlarge the ray perception field and build up
interactions among sample points. In this paper, we design a novel inference pattern
that encourages a single camera ray to possess more contextual information, and
models the relationship among sample points on each camera ray. To hold
contextual information, a camera ray in our proposed method can render a patch
of pixels simultaneously. Moreover, we replace the MLP in neural radiance field
models with distance-aware convolutions to enhance the feature propagation
among sample points from the same camera ray. To summarize, like a torchlight, a
ray in our proposed method renders a patch of the image. Thus, we call
the proposed method Torch-NeRF. Extensive experiments on KITTI-360 and LLFF
show that the Torch-NeRF exhibits excellent performance. |
This paper proposes Torch-NeRF, a novel neural radiance field method that enhances contextual information aggregation and sample point interaction for improved 3D reconstruction in complex and large-scale scenes. |
Existing NeRF methods struggle to capture contextual information and handle background variance in complex scenes, particularly in autonomous driving scenarios where accurate 3D reconstruction is crucial. |
Torch-NeRF employs a novel inference pattern where each camera ray renders a patch of pixels, enlarging the ray perception field. It also introduces distance-aware convolutions along rays to model relationships between sample points and improve volume smoothness. |
Torch-NeRF outperforms previous methods on KITTI-360 and LLFF datasets in terms of PSNR and SSIM, demonstrating its effectiveness in complex scenes.
The method effectively handles noisy colors and preserves object shapes at scene edges, as shown in qualitative comparisons.
Ablation studies validate the contribution of each proposed module, including enlarged ray perception field, distance-aware convolutions, and structural similarity loss. |
The current implementation discards rendered pixels in a patch except for the center, impacting rendering time. Future work aims to improve the rendering quality of all patch pixels.
Further research will focus on enhancing rendering efficiency while maintaining high visual quality. |
neural radiance fields, 3d reconstruction, autonomous driving, distance-aware convolutions, ray perception field |
2404.02514
Report |
Freditor: High-Fidelity and Transferable NeRF Editing by Frequency Decomposition |
Yisheng He, Weihao Yuan, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang |
This paper enables high-fidelity, transferable NeRF editing by frequency
decomposition. Recent NeRF editing pipelines lift 2D stylization results to 3D
scenes but suffer from blurry results and fail to capture detailed
structures because of the inconsistency between 2D edits. Our critical
insight is that low-frequency components of images are more
multiview-consistent after editing compared with their high-frequency parts.
Moreover, the appearance style is mainly exhibited on the low-frequency
components, and the content details especially reside in high-frequency parts.
This motivates us to perform editing on low-frequency components, which results
in high-fidelity edited scenes. In addition, the editing is performed in the
low-frequency feature space, enabling stable intensity control and novel scene
transfer. Comprehensive experiments conducted on photorealistic datasets
demonstrate the superior performance of high-fidelity and transferable NeRF
editing. The project page is at https://aigc3d.github.io/freditor. |
This paper proposes Freditor, a novel approach for high-fidelity and transferable NeRF editing that leverages frequency decomposition. |
Existing NeRF editing methods often produce blurry results or lack transferability, limiting their practical applications. Freditor addresses these limitations by decomposing appearance into low and high-frequency components and performing editing in the feature space. |
Freditor uses a two-branch architecture: a high-frequency branch reconstructs detailed scenes with standard NeRF, while a low-frequency branch performs style editing in the feature space. The method utilizes low-pass filtering, feature-space stylization, and a shared decoder to combine edited low-frequency components with original high-frequency details. |
Freditor achieves high-fidelity editing by preserving details through frequency decomposition, surpassing previous methods in visual quality.
The feature-space editing allows for controllable stylization intensity during inference, enabling dynamic adjustments without retraining.
The trained stylization modules are transferable to new scenes without retraining, enabling efficient editing of diverse 3D content. |
The blending of high-frequency details may sometimes conflict with the target style, requiring more intelligent blending strategies.
Further exploration of different low-frequency filter levels and their impact on editing effectiveness and artifact generation is warranted. |
nerf editing, frequency decomposition, style transfer, 3d scene manipulation, generative models |
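The frequency split described in the Freditor entry can be illustrated on a 1D signal. This is only a minimal sketch, assuming a box filter as the low-pass step and a constant offset as a stand-in for a style edit; the paper works on images and feature maps, and the names `box_blur`, `decompose`, `stylize_low`, and `recombine` are hypothetical. It shows that an edit applied only to the low-frequency part, then recombined with the untouched high-frequency residual, leaves the fine detail intact, and that a `strength` factor scales the edit at inference time.

```python
def box_blur(signal, radius=1):
    """Low-pass filter: average each sample with its neighbours (window clipped at edges)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def decompose(signal, radius=1):
    """Split a signal into a low-frequency part and a high-frequency residual."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def stylize_low(low, bias, strength=1.0):
    """Stand-in for a style edit (assumption): shift the low frequencies by strength * bias."""
    return [v + strength * bias for v in low]


def recombine(low, high):
    """Add the high-frequency residual back onto the (possibly edited) low frequencies."""
    return [l + h for l, h in zip(low, high)]


signal = [0.0, 1.0, 0.0, 1.0, 4.0, 5.0, 4.0, 5.0]
low, high = decompose(signal)
edited_full = recombine(stylize_low(low, bias=2.0, strength=1.0), high)
edited_half = recombine(stylize_low(low, bias=2.0, strength=0.5), high)
```

Because the residual `high` is never modified, the sample-to-sample detail of `signal` survives in both edited versions; only the overall level changes.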
2404.02410
Report |
TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Surrounding Autonomous Driving Scenes |
Cheng Zhao, Su Sun, Ruoyu Wang, Yuliang Guo, Jun-Jun Wan, Zhou Huang, Xinyu Huang, Yingjie Victor Chen, Liu Ren |
Most 3D Gaussian Splatting (3D-GS) based methods for urban scenes initialize
3D Gaussians directly with 3D LiDAR points, which not only underutilizes LiDAR
data capabilities but also overlooks the potential advantages of fusing LiDAR
with camera data. In this paper, we design a novel tightly coupled LiDAR-Camera
Gaussian Splatting (TCLC-GS) to fully leverage the combined strengths of both
LiDAR and camera sensors, enabling rapid, high-quality 3D reconstruction and
novel view RGB/depth synthesis. TCLC-GS designs a hybrid explicit (colorized 3D
mesh) and implicit (hierarchical octree feature) 3D representation derived from
LiDAR-camera data, to enrich the properties of 3D Gaussians for splatting. The
properties of the 3D Gaussians are not only initialized in alignment with the 3D mesh,
which provides more complete 3D shape and color information, but are also
endowed with broader contextual information through retrieved octree implicit
features. During the Gaussian Splatting optimization process, the 3D mesh
offers dense depth information as supervision, which enhances the training
process by learning a robust geometry. Comprehensive evaluations conducted
on the Waymo Open Dataset and nuScenes Dataset validate our method's
state-of-the-art (SOTA) performance. Utilizing a single NVIDIA RTX 3090 Ti, our
method demonstrates fast training and achieves real-time RGB and depth
rendering at 90 FPS in resolution of 1920x1280 (Waymo), and 120 FPS in
resolution of 1600x900 (nuScenes) in urban scenarios. |
This paper presents TCLC-GS, a novel tightly coupled LiDAR-Camera Gaussian Splatting method for rapid and high-quality 3D reconstruction and novel view synthesis in autonomous driving scenes. |
Existing 3D Gaussian Splatting methods underutilize LiDAR data and the potential of LiDAR-camera fusion, limiting their accuracy and quality in complex urban environments. |
TCLC-GS leverages a hybrid 3D representation with explicit (colorized 3D mesh) and implicit (hierarchical octree feature) information derived from LiDAR-camera data to enhance the initialization and optimization of 3D Gaussians. |
TCLC-GS achieves state-of-the-art performance on the Waymo Open Dataset and nuScenes Dataset, surpassing baselines in image and depth synthesis quality.
The method demonstrates fast training and enables real-time RGB and depth rendering at around 90 FPS (1920x1280) for Waymo and 120 FPS (1600x900) for nuScenes on a single NVIDIA RTX 3090 Ti.
Ablation studies validate the effectiveness of the colorized 3D mesh, octree implicit representation, and dense depth supervision in improving performance. |
The depth synthesis performance depends on the density of LiDAR data, showing relatively lower accuracy on the sparser nuScenes dataset compared to the Waymo dataset.
Future work could explore the integration of temporal information and dynamic object modeling within the TCLC-GS framework. |
lidar-camera fusion, gaussian splatting, 3d reconstruction, novel view synthesis, autonomous driving |
2404.02241
Report |
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better |
Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |
Diffusion Models (DM) and Consistency Models (CM) are two types of popular
generative models with good generation quality on various tasks. When training
DM and CM, intermediate weight checkpoints are not fully utilized and only the
last converged checkpoint is used. In this work, we find that high-quality
model weights often lie in a basin which cannot be reached by SGD but can be
obtained by proper checkpoint averaging. Based on these observations, we
propose LCSC, a simple but effective and efficient method to enhance the
performance of DM and CM, by combining checkpoints along the training
trajectory with coefficients deduced from evolutionary search. We demonstrate
the value of LCSC through two use cases: (a) Reducing training cost.
With LCSC, we only need to train DM/CM with fewer iterations and/or
lower batch sizes to obtain comparable sample quality with the fully trained
model. For example, LCSC achieves considerable training speedups for CM
(23x on CIFAR-10 and 15x on ImageNet-64). (b) Enhancing
pre-trained models. Assuming full training is already done, LCSC can further
improve the generation quality or speed of the final converged models. For
example, LCSC achieves better performance with the number of function evaluations
(NFE) set to 1 than the base model with 2 NFE on consistency distillation, and decreases
the NFE of DM from 15 to 9 while maintaining the generation quality on
CIFAR-10. Our code is available at
https://github.com/imagination-research/LCSC. |
This paper proposes LCSC, a method that enhances Diffusion Models (DM) and Consistency Models (CM) by linearly combining saved checkpoints along the training trajectory using coefficients determined by evolutionary search. |
DM and CM training often under-utilizes intermediate checkpoints. This paper shows high-quality models often lie in basins reachable not by SGD but by proper checkpoint averaging, which LCSC enables. |
Given saved checkpoints, LCSC employs an evolutionary algorithm to find optimal linear combination coefficients that minimize metrics like FID. |
LCSC reduces training cost, achieving similar sample quality with fewer iterations/smaller batch sizes (e.g., 23x speedup for CM on CIFAR-10).
LCSC enhances pre-trained models, improving generation quality/speed (e.g., better CM performance with 1 NFE than baseline with 2 NFE).
Analysis suggests the optimal combination often involves negative coefficients, highlighting the limitations of traditional averaging like EMA. |
Current search relies on evolutionary methods, limiting efficiency and potentially finding local optima. Exploring better optimization is needed.
LCSC applies uniform coefficients across the model. Finer-grained partitioning (per-layer, per-timestep) might yield further gains. |
diffusion models, consistency models, weight averaging, evolutionary search, generative models |
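The LCSC idea of linearly combining saved checkpoints with searched coefficients can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: checkpoints are plain dictionaries of float lists, the search is a simple mutation-only loop, and `toy_score` is a hypothetical stand-in for a real quality metric such as FID. The coefficients are left unconstrained, so negative values are allowed.

```python
import random


def combine(checkpoints, coeffs):
    """Return sum_i coeffs[i] * checkpoints[i], parameter by parameter."""
    return {
        key: [
            sum(c * ckpt[key][j] for c, ckpt in zip(coeffs, checkpoints))
            for j in range(len(checkpoints[0][key]))
        ]
        for key in checkpoints[0]
    }


def evolutionary_search(checkpoints, score, generations=200, pop_size=8,
                        sigma=0.1, seed=0):
    """Toy mutation-only search over coefficients; a lower score is better."""
    rng = random.Random(seed)
    n = len(checkpoints)
    best = [0.0] * (n - 1) + [1.0]  # start from the last checkpoint alone
    best_score = score(combine(checkpoints, best))
    for _ in range(generations):
        for _ in range(pop_size):
            child = [c + rng.gauss(0.0, sigma) for c in best]
            child_score = score(combine(checkpoints, child))
            if child_score < best_score:  # keep a child only if it improves
                best, best_score = child, child_score
    return best, best_score


# Hypothetical data: three "checkpoints" and a target the metric prefers.
ckpts = [{"w": [0.0, 0.0]}, {"w": [1.0, 0.0]}, {"w": [1.0, 1.0]}]
target = [2.0, 0.5]


def toy_score(params):
    return sum((p - t) ** 2 for p, t in zip(params["w"], target))


initial_score = toy_score(ckpts[-1])
coeffs, final_score = evolutionary_search(ckpts, toy_score)
```

In this toy setup the target lies outside the convex hull of the checkpoints, so reaching it needs coefficients that a plain average could not produce.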
2404.02155
Report |
Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields |
Joshua Ahn, Haochen Wang, Raymond A. Yeh, Greg Shakhnarovich |
Scale-ambiguity in 3D scene dimensions leads to magnitude-ambiguity of
volumetric densities in neural radiance fields, i.e., the densities double when
scene size is halved, and vice versa. We call this property alpha invariance.
For NeRFs to better maintain alpha invariance, we recommend 1) parameterizing
both distance and volume densities in log space, and 2) a
discretization-agnostic initialization strategy to guarantee high ray
transmittance. We revisit a few popular radiance field models and find that
these systems use various heuristics to deal with issues arising from scene
scaling. We test their behaviors and show our recipe to be more robust. |
This paper investigates the issue of alpha invariance in neural radiance fields (NeRFs), where the scale ambiguity of 3D scenes leads to magnitude ambiguity of volumetric densities. |
A robust NeRF algorithm should perform consistently across different scene scales. This paper aims to address this challenge by proposing solutions for alpha invariance in NeRFs. |
The authors analyze and ablate several popular NeRF architectures, including Vanilla NeRF, TensoRF, DVGO, Plenoxels, and Nerfacto, to study their alpha invariance properties. They propose two key modifications: 1) parameterizing both distance and volume densities in log space using a GumbelCDF activation and 2) a discretization-agnostic initialization strategy to guarantee high ray transmittance. |
Empirically, volume density (σ) changes by a factor close to 1/k when scene size changes by k.
Vanilla NeRF's MLPs with ReLU activation can produce large σ values but are prone to converging to poor local minima.
Voxel variants (DVGO, Plenoxels, TensoRF) fail to converge without hardcoded heuristics to handle scene scaling. |
The assumption of i.i.d. sampled density values during initialization, while simplifying, is imperfect.
Further investigation is needed to match the default Plenoxels performance with the proposed modifications. |
neural radiance fields, nerf, alpha invariance, volume rendering, scene scaling |
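The inverse scaling between distance and density in the alpha invariance entry follows from the standard volume rendering opacity of a ray segment, alpha = 1 - exp(-sigma * delta). The short check below is a sketch with made-up numbers: scaling the scene by k multiplies the segment length delta by k, and alpha stays the same only if sigma is divided by k. The log-space form shows why the paper's first recommendation is natural: the two scale factors cancel as an additive shift.

```python
import math


def alpha(sigma, delta):
    """Opacity of a ray segment with density sigma and length delta."""
    return 1.0 - math.exp(-sigma * delta)


def alpha_log(log_sigma, log_delta):
    """Same opacity, computed from log-density and log-distance."""
    return 1.0 - math.exp(-math.exp(log_sigma + log_delta))


sigma, delta = 3.0, 0.05  # illustrative values, not from the paper
k = 0.5                   # halve the scene size

a_original = alpha(sigma, delta)
a_rescaled = alpha(sigma / k, delta * k)  # density doubles, distance halves

# In log space the rescaling is a shift of -log(k) and +log(k) that cancels.
a_log = alpha_log(math.log(sigma) - math.log(k), math.log(delta) + math.log(k))
```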
2404.02154
Report |
Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration |
Akshay Dudhane, Omkar Thawakar, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang |
All-in-one image restoration tackles different types of degradations with a
unified model instead of having task-specific, non-generic models for each
degradation. The requirement to tackle multiple degradations using the same
model can lead to high-complexity designs with fixed configuration that lack
the adaptability to more efficient alternatives. We propose DyNet, a dynamic
family of networks designed in an encoder-decoder style for all-in-one image
restoration tasks. Our DyNet can seamlessly switch between its bulkier and
lightweight variants, thereby offering flexibility for efficient model
deployment with a single round of training. This seamless switching is enabled
by our weights-sharing mechanism, forming the core of our architecture and
facilitating the reuse of initialized module weights. Further, to establish
robust weights initialization, we introduce a dynamic pre-training strategy
that trains variants of the proposed DyNet concurrently, thereby achieving a
50% reduction in GPU hours. To tackle the unavailability of the large-scale datasets
required for pre-training, we curate a high-quality, high-resolution image
dataset named Million-IRD having 2M image samples. We validate our DyNet for
image denoising, deraining, and dehazing in an all-in-one setting, achieving
state-of-the-art results with 31.34% reduction in GFlops and a 56.75% reduction
in parameters compared to baseline models. The source codes and trained models
are available at https://github.com/akshaydudhane16/DyNet. |
This paper presents DyNet, a dynamic network architecture for efficient all-in-one image restoration, incorporating a novel weight-sharing mechanism to reduce parameters and improve computational efficiency. |
Existing all-in-one image restoration methods have high computational costs and lack flexibility in model depth during training. DyNet addresses this by allowing for seamless switching between bulkier and lightweight variants while maintaining high accuracy. |
DyNet utilizes a weight-sharing mechanism in an encoder-decoder architecture. Module weights are shared across subsequent modules at each level, controlled by a reuse frequency. A dynamic pre-training strategy is introduced to train both bulky and lightweight variants concurrently, using a new million-scale dataset, Million-IRD. |
DyNet-L outperforms the baseline PromptIR by 0.82 dB on average across denoising, deraining, and dehazing tasks.
DyNet-S, a lightweight variant, achieves a 0.59 dB average improvement over PromptIR with 31.34% fewer GFlops and 56.75% fewer parameters.
The proposed dynamic pre-training strategy reduces training time by 50% compared to traditional methods. |
The paper explores the performance of DyNet on a limited set of image restoration tasks.
Further investigation into the impact of varying module weight reuse frequencies on model performance is left for future work. |
image restoration, all-in-one restoration, dynamic network, weight sharing, large-scale pre-training |
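The weight-sharing mechanism in the DyNet entry can be pictured with a toy model. The sketch below rests on an assumption about how the reuse frequency works: each initialized block is applied several times in a row, so a bulkier variant runs more block applications than a lightweight one while both share the same weights. `Block`, `forward`, and `num_parameters` are hypothetical names, and the affine "block" is a placeholder for a real network module.

```python
class Block:
    """Toy stand-in for a network block with two parameters."""

    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def __call__(self, x):
        return self.scale * x + self.shift


def forward(x, blocks, reuse):
    """Apply each block `reuse` times in a row before moving to the next one."""
    for block in blocks:
        for _ in range(reuse):
            x = block(x)
    return x


def num_parameters(blocks):
    # Two scalars per block; the count does not depend on the reuse frequency.
    return 2 * len(blocks)


blocks = [Block(2.0, 0.0), Block(1.0, 1.0)]
light = forward(1.0, blocks, reuse=1)  # 2 block applications
bulky = forward(1.0, blocks, reuse=2)  # 4 block applications, same weights
```

Switching between the two variants changes only the `reuse` argument, which is the sense in which a single set of trained weights could serve both depths.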
2404.02152
Report |
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image |
Chong Bao, Yinda Zhang, Yuan Li, Xiyu Zhang, Bangbang Yang, Hujun Bao, Marc Pollefeys, Guofeng Zhang, Zhaopeng Cui |
Recently, we have witnessed the explosive growth of various volumetric
representations in modeling animatable head avatars. However, due to the
diversity of frameworks, there is no practical method to support high-level
applications like 3D head avatar editing across different representations. In
this paper, we propose a generic avatar editing approach that can be
universally applied to various 3DMM-driven volumetric head avatars. To achieve
this goal, we design a novel expression-aware modification generative model,
which lifts 2D editing from a single image to a consistent 3D
modification field. To ensure the effectiveness of the generative modification
process, we develop several techniques, including an expression-dependent
modification distillation scheme to draw knowledge from the large-scale head
avatar model and 2D facial texture editing tools, implicit latent space
guidance to enhance model convergence, and a segmentation-based loss reweight
strategy for fine-grained texture inversion. Extensive experiments demonstrate
that our method delivers high-quality and consistent results across multiple
expressions and viewpoints. Project page: https://zju3dv.github.io/geneavatar/ |
GeneAvatar enables fine-grained 3D head avatar editing in various volumetric representations from a single-view image. |
Existing 3D avatar editing methods lack adaptability across representations, user-friendliness, or fidelity across expressions and viewpoints. |
The method utilizes an expression-aware modification generative model. It learns expression-dependent 3D modifications from a single edited image and applies them consistently across different expressions and viewpoints. |
The method generates consistent editing results across viewpoints and expressions.
It is adaptable to various 3DMM-driven volumetric avatar representations.
It supports both global and local editing using off-the-shelf 2D editing tools. |
The method currently cannot handle adding new objects or changing hairstyles.
Improving editing speed for real-time applications is a future direction. |
3d avatar editing, volumetric representation, neural radiance fields, 3dmm, single-view editing |
2404.02148
Report |
Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models |
Zeyu Yang, Zijie Pan, Chun Gu, Li Zhang |
Recent advancements in 3D generation are predominantly propelled by
improvements in 3D-aware image diffusion models which are pretrained on
Internet-scale image data and fine-tuned on massive 3D data, offering the
capability of producing highly consistent multi-view images. However, due to
the scarcity of synchronized multi-view video data, it is impractical to adapt
this paradigm to 4D generation directly. Despite that, the available video and
3D data are adequate for training video and multi-view diffusion models
separately that can provide satisfactory dynamic and geometric priors
respectively. To take advantage of both, this paper presents Diffusion$^2$, a
novel framework for dynamic 3D content creation that reconciles the knowledge
about geometric consistency and temporal smoothness from these models to
directly sample dense multi-view multi-frame images which can be employed to
optimize continuous 4D representation. Specifically, we design a simple yet
effective denoising strategy via score composition of pretrained video and
multi-view diffusion models based on the probability structure of the target
image array. Owing to the high parallelism of the proposed image generation
process and the efficiency of the modern 4D reconstruction pipeline, our
framework can generate 4D content within a few minutes. Additionally, our method
circumvents the reliance on 4D data, thereby having the potential to benefit
from the scaling of the foundation video and multi-view diffusion models.
Extensive experiments demonstrate the efficacy of our proposed framework and
its ability to flexibly handle various types of prompts. |
This paper presents Diffusion$^2$, a novel framework for dynamic 3D content creation that combines pretrained video and multi-view diffusion models to directly sample dense multi-view multi-frame images for efficient 4D content generation. |
Existing 4D generation methods rely on scarce synchronized multi-view video data or suffer from slow optimization. This framework leverages vast available monocular video and static multi-view data to achieve efficient 4D generation. |
The method leverages the conditional independence between geometry and dynamics in multi-view video frames. By blending scores from pretrained video and multi-view diffusion models, it directly samples image arrays, which are then used for 4D reconstruction. |
Achieves comparable quality to state-of-the-art optimization-based methods in image-to-4D generation.
Generates higher-fidelity and more consistent results than existing methods in video-to-4D generation.
Successfully animates static 3D models with realistic and diverse dynamics. |
Performance is limited by the quality of foundation diffusion models, especially for challenging viewpoints and thin structures.
The assumption of conditional independence may not hold in cases with extreme rotations, although the method still works well in practice. |
4d generation, diffusion models, multi-view synthesis, video generation, 3d reconstruction |
2404.02145
Report |
Iterated Learning Improves Compositionality in Large Vision-Language Models |
Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna |
A fundamental characteristic common to both human vision and natural language
is their compositional nature. Yet, despite the performance gains contributed
by large vision and language pretraining, recent investigations find that
most, if not all, of our state-of-the-art vision-language models struggle at
compositionality. They are unable to distinguish between images of "a girl in
white facing a man in black" and "a girl in black facing a man in white".
Moreover, prior work suggests that compositionality doesn't arise with scale:
larger model sizes or training data don't help. This paper develops a new
iterated training algorithm that incentivizes compositionality. We draw on
decades of cognitive science research that identifies cultural transmission (the
need to teach a new generation) as a necessary inductive prior that incentivizes
humans to develop compositional languages. Specifically, we reframe
vision-language contrastive learning as the Lewis Signaling Game between a
vision agent and a language agent, and operationalize cultural transmission by
iteratively resetting one of the agents' weights during training. After every
iteration, this training paradigm induces representations that become "easier
to learn", a property of compositional languages: e.g. our model trained on
CC3M and CC12M improves standard CLIP by 4.7% and 4.0%, respectively, in the
SugarCrepe benchmark. |
This paper proposes an iterated learning algorithm for vision-language models, inspired by cultural transmission in humans, to improve compositionality in representation learning. |
Current vision-language models struggle with compositionality, failing to generalize understandings from individual concepts to complex scenes, limiting their ability to understand novel compositions. |
The method reframes contrastive learning as a Lewis Signaling Game, incorporating a shared codebook as a communication bottleneck, and iteratively resetting the language agent to simulate cultural transmission. |
Iterated learning leads to significantly improved performance on compositionality benchmarks (CREPE, SugarCrepe, Cola, Winoground) compared to standard CLIP and other baselines.
The learned representations are empirically shown to be "easier to learn" for new language agents, supporting the hypothesis drawn from cognitive science.
Iterated learning maintains comparable performance to standard training on image recognition tasks, indicating no sacrifice in recognition ability for improved compositionality. |
The learning process can be unstable due to randomness introduced when resetting language agents.
Future work could explore more stable training strategies and investigate the applicability of iterated learning to other domains beyond vision and language. |
compositionality, vision-language models, iterated learning, cultural transmission, contrastive learning |
2404.02125
Report |
3D Congealing: 3D-Aware Image Alignment in the Wild |
Yunzhi Zhang, Zizhang Li, Amit Raj, Andreas Engelhardt, Yuanzhen Li, Tingbo Hou, Jiajun Wu, Varun Jampani |
We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images
capturing semantically similar objects. Given a collection of unlabeled
Internet images, our goal is to associate the shared semantic parts from the
inputs and aggregate the knowledge from 2D images to a shared 3D canonical
space. We introduce a general framework that tackles the task without assuming
shape templates, poses, or any camera parameters. At its core is a canonical 3D
representation that encapsulates geometric and semantic information. The
framework optimizes for the canonical representation together with the pose for
each input image, and a per-image coordinate map that warps 2D pixel
coordinates to the 3D canonical frame to account for the shape matching. The
optimization procedure fuses prior knowledge from a pre-trained image
generative model and semantic information from input images. The former
provides strong knowledge guidance for this under-constrained task, while the
latter provides the necessary information to mitigate the training data bias
from the pre-trained model. Our framework can be used for various tasks such as
correspondence matching, pose estimation, and image editing, achieving strong
results on real-world image datasets under challenging illumination conditions
and on in-the-wild online image collections. |
Introduces 3D Congealing, a framework for 3D-aware alignment of images of semantically similar objects in a shared 3D space, without relying on shape templates, poses, or camera parameters. |
Enables various downstream tasks like 6-DoF object pose estimation, pose-aware image filtering, and image editing by establishing 2D-3D correspondence between input images and a canonical 3D representation. |
Fuses prior 3D knowledge from a pre-trained text-to-image generative model with semantic information from input images using pre-trained semantic feature extractors (DINO). Optimizes for a canonical 3D shape, individual image poses, and dense 2D-3D correspondence maps. |
Achieves comparable pose estimation accuracy to state-of-the-art methods requiring pose priors on a challenging multi-illumination dataset.
Successfully aligns diverse internet images of objects and landmarks, demonstrating robustness to variations in appearance, viewpoint, and illumination.
Enables applications like image editing by establishing dense 2D-2D correspondences through the shared 3D space, outperforming direct feature matching. |
Performance depends on the accuracy of the initial shape generated by the pre-trained model.
Feature ambiguity in objects with high symmetry can lead to incorrect pose estimations. |
3d alignment, image congealing, pose estimation, generative models, semantic features |
2404.02101
Report |
CameraCtrl: Enabling Camera Control for Text-to-Video Generation |
Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, Ceyuan Yang |
Controllability plays a crucial role in video generation since it allows
users to create desired content. However, existing models largely overlooked
the precise control of camera pose that serves as a cinematic language to
express deeper narrative nuances. To alleviate this issue, we introduce
CameraCtrl, enabling accurate camera pose control for text-to-video (T2V)
models. After precisely parameterizing the camera trajectory, a plug-and-play
camera module is then trained on a T2V model, leaving others untouched.
Additionally, a comprehensive study on the effect of various datasets is also
conducted, suggesting that videos with diverse camera distribution and similar
appearances indeed enhance controllability and generalization. Experimental
results demonstrate the effectiveness of CameraCtrl in achieving precise and
domain-adaptive camera control, marking a step forward in the pursuit of
dynamic and customized video storytelling from textual and camera pose inputs.
Our project website is at: https://hehao13.github.io/projects-CameraCtrl/. |
Introduces CameraCtrl, a plug-and-play camera control module for text-to-video (T2V) generation, enabling precise control over camera viewpoints. |
Existing T2V models lack precise control over camera viewpoints, crucial for realism and user engagement. |
Utilizes Plücker embeddings to represent camera parameters and incorporates a camera encoder trained on a dataset with diverse camera poses and similar appearance to the base T2V model. |
Achieves more precise camera control compared to AnimateDiff and MotionCtrl.
Demonstrates generalizability by effectively controlling camera viewpoints in various video domains and integrating with other video control methods like SparseCtrl.
A comprehensive study on training datasets reveals that data with similar appearance and diverse camera poses, like RealEstate10K, yields the best results. |
Generalization relies on the diversity of training data; future work could focus on collecting more diverse videos.
Current work evaluates CameraCtrl primarily on U-Net based T2V models; future work could explore compatibility with transformer-based generators like Sora. |
camera control, text-to-video generation, diffusion models, plücker embeddings, controllable video generation |
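The Plücker representation mentioned in the CameraCtrl entry can be sketched for a single camera ray. This is a minimal illustration, assuming the common convention of a 6-vector (o x d, d) built from the ray origin o and unit direction d; how the per-pixel directions are derived from camera intrinsics and extrinsics is not shown, and the function names are hypothetical. A useful property is that the moment o x d does not depend on which point along the ray is used as the origin.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def plucker_embedding(origin, direction):
    """Return the 6-vector (o x d, d) for a ray with the given origin and direction."""
    d = normalize(direction)
    moment = cross(origin, d)
    return (*moment, *d)


origin = (1.0, 2.0, 3.0)
direction = (0.0, 0.0, 2.0)
embedding = plucker_embedding(origin, direction)

# Sliding the origin along the ray leaves the embedding unchanged.
shifted_origin = (1.0, 2.0, 3.0 + 2.5)
embedding_shifted = plucker_embedding(shifted_origin, direction)
```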
2404.01984
Report |
Fashion Style Editing with Generative Human Prior |
Chaerin Kong, Seungyong Lee, Soohyeok Im, Wonsuk Yang |
Image editing has been a long-standing challenge in the research community
with its far-reaching impact on numerous applications. Recently, text-driven
methods started to deliver promising results in domains like human faces, but
their applications to more complex domains have been relatively limited. In
this work, we explore the task of fashion style editing, where we aim to
manipulate the fashion style of human imagery using text descriptions.
Specifically, we leverage a generative human prior and achieve fashion style
editing by navigating its learned latent space. We first verify that the
existing text-driven editing methods fall short for our problem due to their
overly simplified guidance signal, and propose two directions to reinforce the
guidance: textual augmentation and visual referencing. Combined with our
empirical findings on the latent space structure, our Fashion Style Editing
framework (FaSE) successfully projects abstract fashion concepts onto human
images and introduces exciting new applications to the field. |
This paper presents FaSE, a framework for fashion style editing of human images using text descriptions, addressing the limitations of existing methods in handling complex domains like fashion. |
Fashion style editing with text descriptions is a challenging task due to the complexity of human imagery and the subjective nature of fashion concepts. Existing text-driven methods fall short in providing sufficient guidance for this task. |
FaSE leverages a generative human prior (StyleGAN-Human) and enhances text guidance using two methods: 1) textual augmentation with a large language model and 2) visual referencing by retrieving similar images from a fashion database and guiding the model in the latent space. |
FaSE successfully edits human images according to fashion style prompts, outperforming baseline methods.
The authors found that both textual augmentation and visual referencing significantly improve editing performance.
Empirical analysis of the StyleGAN-Human latent space reveals a hierarchical structure where mid-level features control garment shape and fine-level features control texture. |
The reference database is limited in size and diversity.
The retrieval mechanism for reference images could be further improved. |
image editing, fashion style editing, text-driven image manipulation, generative adversarial networks, vision-language models |
2404.01843
Report |
Sketch3D: Style-Consistent Guidance for Sketch-to-3D Generation |
Wangguandong Zheng, Haifeng Xia, Rui Chen, Ming Shao, Siyu Xia, Zhengming Ding |
Recently, image-to-3D approaches have achieved significant results with a
natural image as input. However, it is not always possible to access these
enriched color input samples in practical applications, where only sketches are
available. Existing sketch-to-3D researches suffer from limitations in broad
applications due to the challenges of lacking color information and multi-view
content. To overcome them, this paper proposes a novel generation paradigm
Sketch3D to generate realistic 3D assets with shape aligned with the input
sketch and color matching the textual description. Concretely, Sketch3D first
instantiates the given sketch in the reference image through the
shape-preserving generation process. Second, the reference image is leveraged
to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance
images are generated based on the renderings of the 3D Gaussians. Finally,
three strategies are designed to optimize 3D Gaussians, i.e., structural
optimization via a distribution transfer mechanism, color optimization with a
straightforward MSE loss and sketch similarity optimization with a CLIP-based
geometric similarity loss. Extensive visual comparisons and quantitative
analysis illustrate the advantage of our Sketch3D in generating realistic 3D
assets while preserving consistency with the input. |
This paper presents Sketch3D, a novel framework for generating realistic 3D assets from sketches, aligning shape with the input sketch and color with textual descriptions. |
Existing sketch-to-3D methods struggle with limited color information, single-category generation, and lack of realism. Sketch3D addresses these limitations by leveraging both sketch and text prompts for realistic and customizable 3D asset creation. |
1. **Reference Image Generation:** Create a color image from the sketch and text prompt using ControlNet. 2. **3D Prior Initialization:** Generate a coarse 3D Gaussian representation from the reference image using a 3D diffusion model. 3. **Style-Consistent Optimization:** Generate multi-view guidance images with IP-Adapter and optimize the 3D Gaussian representation for structure, color, and sketch similarity using a distribution transfer mechanism, MSE loss, and CLIP-based geometric similarity loss, respectively. |
Sketch3D outperforms baselines in generating realistic 3D assets with consistent shapes and colors.
Quantitative analysis using CLIP similarity and SSIM demonstrates Sketch3D's superior alignment with input sketches and text prompts.
Ablation studies validate the effectiveness of the proposed distribution transfer mechanism, MSE loss, and CLIP geometric similarity loss. |
Generation quality is limited by the performance of ControlNet in generating the reference image.
Achieving fine-grained control over details in complex sketches remains challenging. |
sketch-to-3d generation, 3d gaussian splatting, text-guided synthesis, style-consistent guidance, controllable image synthesis |
2404.01810
Report |
Surface Reconstruction from Gaussian Splatting via Novel Stereo Views |
Yaniv Wolf, Amit Bracha, Ron Kimmel |
The Gaussian splatting for radiance field rendering method has recently
emerged as an efficient approach for accurate scene representation. It
optimizes the location, size, color, and shape of a cloud of 3D Gaussian
elements to visually match, after projection, or splatting, a set of given
images taken from various viewing directions. And yet, despite the proximity of
Gaussian elements to the shape boundaries, direct surface reconstruction of
objects in the scene is a challenge.
We propose a novel approach for surface reconstruction from Gaussian
splatting models. Rather than relying on the Gaussian elements' locations as a
prior for surface reconstruction, we leverage the superior novel-view synthesis
capabilities of 3DGS. To that end, we use the Gaussian splatting model to
render pairs of stereo-calibrated novel views from which we extract depth
profiles using a stereo matching method. We then combine the extracted RGB-D
images into a geometrically consistent surface. The resulting reconstruction is
more accurate and shows finer details when compared to other methods for
surface reconstruction from Gaussian splatting models, while requiring
significantly less compute time compared to other surface reconstruction
methods.
We performed extensive testing of the proposed method on in-the-wild scenes,
taken by a smartphone, showcasing its superior reconstruction abilities.
Additionally, we tested the proposed method on the Tanks and Temples benchmark,
and it has surpassed the current leading method for surface reconstruction from
Gaussian splatting models. Project page: https://gs2mesh.github.io/. |
This paper introduces a novel method for surface reconstruction from 3D Gaussian Splatting (3DGS) models by leveraging the generation of stereo-calibrated novel views and applying a stereo matching algorithm. |
Directly reconstructing surfaces from 3DGS models is challenging due to the misalignment between Gaussian element locations and the actual surface geometry. Existing methods either produce noisy results or require extensive computational time. |
The pipeline involves capturing a scene with 3DGS, rendering stereo-calibrated novel views, extracting depth maps using a stereo matching algorithm (DLNR), and fusing the depth data using the Truncated Signed Distance Function (TSDF) algorithm to generate a smooth and consistent mesh. |
Outperforms SuGaR, the current state-of-the-art method for surface reconstruction from 3DGS, on the Tanks and Temples benchmark.
Achieves comparable visual quality to neural reconstruction methods like BakedSDF on the Mip-NeRF360 dataset while requiring significantly less processing time.
Demonstrates superior performance in reconstructing accurate and noise-free meshes from in-the-wild scenes captured using smartphones. |
Reconstruction quality depends on the accuracy of the initial 3DGS scene capture.
The stereo matching algorithm used is inherently susceptible to issues with transparent surfaces, potentially affecting reconstruction accuracy in those areas. |
surface reconstruction, gaussian splatting, 3dgs, stereo matching, novel view synthesis |
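The depth extraction step in this report relies on the standard relation for rectified stereo pairs: depth = focal length × baseline / disparity. The sketch below shows only that conversion. It is not the authors' code, it assumes the disparity map comes from some stereo matcher, and the function and parameter names are hypothetical.

```python
def disparity_to_depth(disparity, focal_px, baseline, min_disparity=1e-6):
    """Convert a rectified-stereo disparity map to a depth map.

    disparity: 2D list of disparities in pixels.
    focal_px:  focal length in pixels (assumed equal for both views).
    baseline:  distance between the two camera centers, in scene units.
    Pixels whose disparity is not above min_disparity get None (no valid depth).
    """
    return [
        [focal_px * baseline / d if d > min_disparity else None for d in row]
        for row in disparity
    ]


# Example: a 1x2 disparity map, one valid pixel and one invalid pixel.
depth = disparity_to_depth([[10.0, 0.0]], focal_px=500.0, baseline=0.1)
```

Since depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant surfaces, which is one reason the rendered baseline matters in such a pipeline.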
2404.01717
Report |
AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation |
Rui Xie, Ying Tai, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Xiaoqian Ye, Qian Wang, Jian Yang |
Blind super-resolution methods based on stable diffusion showcase formidable
generative capabilities in reconstructing clear high-resolution images with
intricate details from low-resolution inputs. However, their practical
applicability is often hampered by poor efficiency, stemming from the
requirement of thousands or hundreds of sampling steps. Inspired by the
efficient adversarial diffusion distillation (ADD), we design AddSR to address
this issue by incorporating the ideas of both distillation and ControlNet.
Specifically, we first propose a prediction-based self-refinement strategy to
provide high-frequency information in the student model output with marginal
additional time cost. Furthermore, we refine the training process by employing
HR images, rather than LR images, to regulate the teacher model, providing a
more robust constraint for distillation. Second, we introduce a
timestep-adaptive ADD to address the perception-distortion imbalance problem
introduced by original ADD. Extensive experiments demonstrate
our AddSR generates better restoration results, while achieving faster speed
than previous SD-based state-of-the-art models (e.g., $7\times$ faster than
SeeSR). |
Proposes AddSR, an efficient and effective Stable Diffusion based model for blind super-resolution, achieving high perceptual quality within a few sampling steps by incorporating distillation and ControlNet. |
Existing blind super-resolution methods based on stable diffusion, while powerful, suffer from poor efficiency due to the need for hundreds or thousands of sampling steps, hindering their practical use. |
AddSR utilizes a teacher-student distillation framework with several key innovations: a prediction-based self-refinement (PSR) strategy to provide high-frequency details, training the teacher model on HR images for better guidance, and a timestep-adaptive adversarial diffusion distillation (TA-ADD) to balance perception and distortion. |
AddSR-4 achieves state-of-the-art results on perceptual quality metrics (MANIQA, MUSIQ, CLIPIQA) across various degradation levels and real-world images.
AddSR significantly reduces inference steps compared to other SD-based methods, achieving comparable results to SeeSR in just 1-4 steps and being 7 times faster.
The effectiveness of PSR and TA-ADD is validated through ablation studies, showing improvements in perceptual quality, fidelity, and reduced hallucinations. |
Despite speed improvements, AddSR's inference time still lags behind GAN-based methods due to the complexity of SD and ControlNet.
Future work will focus on streamlining network architecture for greater efficiency. |
blind super-resolution, stable diffusion, knowledge distillation, controlnet, perception-distortion trade-off |
2404.01709
Report |
Upsample Guidance: Scale Up Diffusion Models without Training |
Juno Hwang, Yong-Hyun Park, Junghyo Jo |
Diffusion models have demonstrated superior performance across various
generative tasks including images, videos, and audio. However, they encounter
difficulties in directly generating high-resolution samples. Previously
proposed solutions to this issue involve modifying the architecture, further
training, or partitioning the sampling process into multiple stages. These
methods have the limitation of not being able to directly utilize pre-trained
models as-is, requiring additional work. In this paper, we introduce upsample
guidance, a technique that adapts a pretrained diffusion model (e.g., $512^2$) to
generate higher-resolution images (e.g., $1536^2$) by adding only a single term
in the sampling process. Remarkably, this technique does not necessitate any
additional training or relying on external models. We demonstrate that upsample
guidance can be applied to various models, such as pixel-space, latent space,
and video diffusion models. We also observed that the proper selection of
guidance scale can improve image quality, fidelity, and prompt alignment. |
This paper introduces "upsample guidance (UG)", a novel technique to adapt pre-trained diffusion models to generate higher-resolution images without additional training or external models. |
Generating high-resolution images with diffusion models is challenging. Existing solutions require modifications to architecture, training from scratch, or using external models, leading to increased computational costs. |
UG adds a single term to the sampling process, derived from signal-to-noise ratio (SNR) matching, which guides the model towards consistency with the trained low-resolution component. |
UG successfully generates high-resolution images across various diffusion models, including pixel-space, latent-space, and video diffusion models.
The method effectively resolves artifacts and improves image quality, fidelity, and prompt alignment by adjusting the guidance scale.
UG incurs minimal computational overhead, especially with recent advancements in fast sampling techniques. |
The current implementation relies on a simple guidance scale design, which could be further improved.
While spatial upsampling is well-explored, further research is needed for optimal temporal upsampling in video and audio models. |
diffusion models, high-resolution image generation, upsampling, signal-to-noise ratio matching, guidance |
2404.01543
Report |
Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes |
Ziqian Bai, Feitong Tan, Sean Fanello, Rohit Pandey, Mingsong Dou, Shichen Liu, Ping Tan, Yinda Zhang |
3D head avatars built with neural implicit volumetric representations have
achieved unprecedented levels of photorealism. However, the computational cost
of these methods remains a significant barrier to their widespread adoption,
particularly in real-time applications such as virtual reality and
teleconferencing. While attempts have been made to develop fast neural
rendering approaches for static scenes, these methods cannot be simply employed
to support realistic facial expressions, such as in the case of a dynamic
facial performance. To address these challenges, we propose a novel fast 3D
neural implicit head avatar model that achieves real-time rendering while
maintaining fine-grained controllability and high rendering quality. Our key
idea lies in the introduction of local hash table blendshapes, which are
learned and attached to the vertices of an underlying face parametric model.
These per-vertex hash-tables are linearly merged with weights predicted via a
CNN, resulting in expression dependent embeddings. Our novel representation
enables efficient density and color predictions using a lightweight MLP, which
is further accelerated by a hierarchical nearest neighbor search method.
Extensive experiments show that our approach runs in real-time while achieving
comparable rendering quality to state-of-the-arts and decent results on
challenging expressions. |
This paper introduces a novel 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. |
Current state-of-the-art 3D head avatars, while photorealistic, are computationally expensive and impractical for real-time applications such as VR and teleconferencing. |
The paper introduces “local hash table blendshapes”, small hash tables attached to vertices of an underlying face parametric model. These are linearly merged with weights predicted by a CNN, resulting in expression-dependent embeddings for efficient density and color predictions using a lightweight MLP, further accelerated by a hierarchical nearest neighbor search. |
The model achieves real-time rendering (over 30 FPS at 512x512 resolution).
It maintains comparable rendering quality to state-of-the-art methods like MonoAvatar.
It produces significantly better results on challenging expressions compared to existing efficient avatars like NeRFBlendshape and INSTA. |
The model exhibits floaters under viewpoints and expressions far from the training distribution.
Performance is less stable around the mouth interior due to tracking limitations.
Future work involves exploring more expensive training strategies like adversarial loss or joint face fitting refinement to mitigate limitations and enhance quality. |
3d head avatar, neural implicit representation, real-time rendering, hash encoding, facial expression |
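The linear merging of hash table blendshapes described in this report amounts to a weighted sum of tables. The sketch below illustrates that single step for one vertex, assuming each table is a small list of feature rows and that the expression weights are already available (in the paper they are predicted by a CNN). All names are hypothetical.

```python
def merge_hash_tables(tables, weights):
    """Blend K hash tables into one with a weighted sum.

    tables:  list of K tables, each a list of rows of floats (same shape).
    weights: list of K blend weights, one per table.
    Returns one table with the same shape as each input table.
    """
    if len(tables) != len(weights):
        raise ValueError("need exactly one weight per table")
    rows, cols = len(tables[0]), len(tables[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for table, w in zip(tables, weights):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += w * table[i][j]
    return merged


# Two 2x2 tables blended with equal weights.
blended = merge_hash_tables(
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]],
    [0.5, 0.5],
)
```

The merged table has the same size as a single blendshape, so the lookup cost at render time does not grow with the number of blendshapes.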
2404.01424
Report |
DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery |
Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu |
The recovery of occluded human meshes presents challenges for current methods
due to the difficulty in extracting effective image features under severe
occlusion. In this paper, we introduce DPMesh, an innovative framework for
occluded human mesh recovery that capitalizes on the profound diffusion prior
about object structure and spatial relationships embedded in a pre-trained
text-to-image diffusion model. Unlike previous methods reliant on conventional
backbones for vanilla feature extraction, DPMesh seamlessly integrates the
pre-trained denoising U-Net with potent knowledge as its image backbone and
performs a single-step inference to provide occlusion-aware information. To
enhance the perception capability for occluded poses, DPMesh incorporates
well-designed guidance via condition injection, which produces effective
controls from 2D observations for the denoising U-Net. Furthermore, we explore
a dedicated noisy key-point reasoning approach to mitigate disturbances arising
from occlusion and crowded scenarios. This strategy fully unleashes the
perceptual capability of the diffusion prior, thereby enhancing accuracy.
Extensive experiments affirm the efficacy of our framework, as we outperform
state-of-the-art methods on both occlusion-specific and standard datasets. The
persuasive results underscore its ability to achieve precise and robust 3D
human mesh recovery, particularly in challenging scenarios involving occlusion
and crowded scenes. |
This paper proposes DPMesh, a novel framework for recovering 3D human mesh from images, especially under severe occlusion, by leveraging the structure and spatial relationship knowledge from pre-trained text-to-image diffusion models. |
Recovering occluded human mesh from images remains a significant challenge for existing methods due to the difficulty in extracting effective features under severe occlusion. Diffusion models offer a promising alternative with their rich prior knowledge of object structure and spatial relationships. |
DPMesh utilizes a pre-trained text-to-image diffusion model as the backbone for single-step feature extraction. It injects refined 2D keypoint information as conditions to guide the denoising U-Net. Moreover, a noisy key-point reasoning approach is introduced to enhance robustness against noisy 2D observations. |
DPMesh outperforms state-of-the-art methods on various occlusion benchmarks, including 3DPW-OC, 3DPW-PC, 3DOH, and 3DPW-Crowd.
The diffusion-based backbone effectively captures occlusion-aware information, as visualized in the cross-attention maps.
Ablation studies validate the contribution of the diffusion-based backbone, condition injection, and noisy key-point reasoning to the overall performance. |
The reliance on an off-the-shelf 2D key-point detector introduces sensitivity to the detector's performance.
Future work could explore extending DPMesh to handle multi-view images or video sequences for enhanced accuracy and temporal consistency. |
human mesh recovery, occlusion handling, diffusion models, computer vision, pose estimation |
2404.01367
Report |
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models |
Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar |
We study the scaling properties of latent diffusion models (LDMs) with an
emphasis on their sampling efficiency. While improved network architecture and
inference algorithms have been shown to effectively boost sampling efficiency of
diffusion models, the role of model size -- a critical determinant of sampling
efficiency -- has not been thoroughly examined. Through empirical analysis of
established text-to-image diffusion models, we conduct an in-depth
investigation into how model size influences sampling efficiency across varying
sampling steps. Our findings unveil a surprising trend: when operating under a
given inference budget, smaller models frequently outperform their larger
equivalents in generating high-quality results. Moreover, we extend our study
to demonstrate the generalizability of these findings by applying various
diffusion samplers, exploring diverse downstream tasks, evaluating
post-distilled models, as well as comparing performance relative to training
compute. These findings open up new pathways for the development of LDM scaling
strategies which can be employed to enhance generative capabilities within
limited inference budgets. |
This paper investigates the scaling properties of Latent Diffusion Models (LDMs) for image generation, focusing on the relationship between model size and sampling efficiency. |
LDMs are powerful but computationally expensive. Understanding how model size affects efficiency is crucial for optimizing their performance under real-world constraints. |
The authors trained a suite of LDMs ranging from 39 million to 5 billion parameters, evaluating their performance on text-to-image generation and downstream tasks like super-resolution and Dreambooth. |
Pretraining performance scales with training compute, but smaller models can be more efficient under limited sampling budgets.
The efficiency trends hold across different diffusion samplers (DDIM, DDPM, DPM-Solver++) and are also observed in distilled LDMs.
Larger models generally show better downstream performance after fine-tuning, highlighting the importance of pretraining quality. |
The evaluation relies on FID and CLIP scores, which might not perfectly correlate with human perception of visual quality.
The study focuses on a specific LDM architecture. Further research is needed to generalize the findings to other LDM families, especially transformer-based ones. |
latent diffusion models, sampling efficiency, scaling laws, text-to-image generation, diffusion distillation |
2404.01300
Report |
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields |
Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus |
Neural fields excel in computer vision and robotics due to their ability to
understand the 3D visual world such as inferring semantics, geometry, and
dynamics. Given the capabilities of neural fields in densely representing a 3D
scene from 2D images, we ask the question: Can we scale their self-supervised
pretraining, specifically using masked autoencoders, to generate effective 3D
representations from posed RGB images? Owing to the astounding success of
extending transformers to novel data modalities, we employ standard 3D Vision
Transformers to suit the unique formulation of NeRFs. We leverage NeRF's
volumetric grid as a dense input to the transformer, contrasting it with other
3D representations such as pointclouds where the information density can be
uneven, and the representation is irregular. Due to the difficulty of applying
masked autoencoders to an implicit representation, such as NeRF, we opt for
extracting an explicit representation that canonicalizes scenes across domains
by employing the camera trajectory for sampling. Our goal is made possible by
masking random patches from NeRF's radiance and density grid and employing a
standard 3D Swin Transformer to reconstruct the masked patches. In doing so,
the model can learn the semantic and spatial structure of complete scenes. We
pretrain this representation at scale on our proposed curated posed-RGB data,
totaling over 1.6 million images. Once pretrained, the encoder is used for
effective 3D transfer learning. Our novel self-supervised pretraining for
NeRFs, NeRF-MAE, scales remarkably well and improves performance on various
challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining,
NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF
scene understanding baselines on Front3D and ScanNet datasets with an absolute
performance improvement of over 20% AP50 and 8% AP25 for 3D object detection. |
Introduces NeRF-MAE, the first self-supervised 3D pre-training method for Neural Radiance Fields using a masked autoencoder approach. |
Leverages the dense and regular structure of NeRF's radiance and density grid to learn effective 3D representations from readily available posed RGB images, overcoming limitations of sparse and irregular 3D representations like point clouds. |
Extracts an explicit 4D radiance and density grid from a trained NeRF model. Employs a masked autoencoder architecture with a 3D Swin Transformer encoder and a voxel decoder to reconstruct masked patches of the grid, learning semantic and spatial relationships within 3D scenes. |
Significantly outperforms state-of-the-art self-supervised 3D pre-training methods and NeRF-based scene understanding baselines on tasks like 3D object detection and semantic voxel labeling.
Demonstrates strong generalization capabilities, achieving superior performance on cross-dataset transfer tasks.
Showcases scalability, with performance improving as the amount and quality of pre-training data increase. |
Training efficiency can be further improved to handle larger and more diverse datasets.
Future work could explore integrating neural rendering and masking for enhanced representation learning. |
neural radiance fields, 3d representation learning, self-supervised learning, masked autoencoders, 3d vision transformers |
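The random patch masking of a voxel grid described in this report can be sketched as follows. This is a minimal illustration: it assumes cubic, non-overlapping patches, a uniform random choice of masked patches, and a scalar fill value. The function names, the patch size, and the mask ratio used here are hypothetical and not taken from the paper.

```python
import random


def sample_masked_patches(grid_shape, patch_size, mask_ratio, seed=0):
    """Pick a random subset of non-overlapping cubic patches to mask.

    grid_shape: (depth, height, width) of the voxel grid.
    Returns (set of masked patch coordinates, total number of patches).
    """
    if any(size % patch_size for size in grid_shape):
        raise ValueError("grid dimensions must be divisible by patch_size")
    d, h, w = (size // patch_size for size in grid_shape)
    coords = [(z, y, x) for z in range(d) for y in range(h) for x in range(w)]
    rng = random.Random(seed)
    num_masked = round(mask_ratio * len(coords))
    return set(rng.sample(coords, num_masked)), len(coords)


def apply_mask(grid, masked, patch_size, fill=0.0):
    """Return a copy of grid[z][y][x] with every masked patch set to fill."""
    out = [[list(row) for row in plane] for plane in grid]
    for pz, py, px in masked:
        for z in range(pz * patch_size, (pz + 1) * patch_size):
            for y in range(py * patch_size, (py + 1) * patch_size):
                for x in range(px * patch_size, (px + 1) * patch_size):
                    out[z][y][x] = fill
    return out


# A 4x4x4 grid of ones, split into eight 2x2x2 patches, 75% of them masked.
grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
masked, total = sample_masked_patches((4, 4, 4), patch_size=2, mask_ratio=0.75)
masked_grid = apply_mask(grid, masked, patch_size=2)
```

In a masked autoencoder, the reconstruction loss would then be computed on the masked patches, so the encoder has to infer them from the visible context.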
2404.01297
Report |
Streaming Dense Video Captioning |
Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid |
An ideal model for dense video captioning -- predicting captions localized
temporally in a video -- should be able to handle long input videos, predict
rich, detailed textual descriptions, and be able to produce outputs before
processing the entire video. Current state-of-the-art models, however, process
a fixed number of downsampled frames, and make a single full prediction after
seeing the whole video. We propose a streaming dense video captioning model
that consists of two novel components: First, we propose a new memory module,
based on clustering incoming tokens, which can handle arbitrarily long videos
as the memory is of a fixed size. Second, we develop a streaming decoding
algorithm that enables our model to make predictions before the entire video
has been processed. Our model achieves this streaming ability, and
significantly improves the state-of-the-art on three dense video captioning
benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at
https://github.com/google-research/scenic. |
The paper introduces a novel streaming model for dense video captioning, aiming to address the limitations of existing models in handling long videos and producing detailed descriptions. |
Existing dense video captioning models struggle with long videos due to computational constraints and often produce limited descriptions. This work proposes a streaming approach to overcome these limitations, enabling real-time processing and richer event descriptions. |
The proposed model employs two key components: 1) a memory module based on K-means clustering to efficiently process long video inputs with a fixed computational budget, and 2) a streaming decoding algorithm that predicts event captions sequentially at intermediate timestamps (decoding points) using the memory features and previously predicted captions. |
The streaming model significantly outperforms state-of-the-art methods on three dense video captioning benchmarks (ActivityNet, YouCook2, ViTT) by up to 11.0 CIDEr points.
The clustering-based memory module proves effective in capturing diverse video information, outperforming alternative memory mechanisms like EMA and token merging.
Increasing the number of decoding points during training enhances performance by providing more supervision and aligning memory features better with target captions. |
The model occasionally produces duplicate predictions, even with prefix context, suggesting a need to explore non-maximum suppression techniques in future work.
Future work could explore a dedicated benchmark for dense video captioning of long videos to evaluate the model's performance more comprehensively. |
dense video captioning, streaming models, memory modules, k-means clustering, decoding points |
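The fixed-size, clustering-based memory described in this report can be sketched with a weighted K-means update. This is an illustration under stated assumptions: the current memory is used to initialize the cluster centers, each memory token is weighted by how many tokens it has absorbed so far, and only a few iterations are run. The function names are hypothetical and this is not the released implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, centers, iters=2):
    """Run a few weighted K-means iterations from the given initial centers."""
    centers = [list(c) for c in centers]
    counts = [0.0] * len(centers)
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        counts = [0.0] * len(centers)
        for p, w in zip(points, weights):
            # Assign each point to its nearest center.
            k = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            counts[k] += w
            for j, x in enumerate(p):
                sums[k][j] += w * x
        for k, total in enumerate(counts):
            if total > 0:  # an empty cluster keeps its previous center
                centers[k] = [s / total for s in sums[k]]
    return centers, counts


def update_memory(memory, counts, new_tokens, iters=2):
    """Fold new frame tokens into a memory that always keeps len(memory) tokens."""
    points = memory + new_tokens
    weights = list(counts) + [1.0] * len(new_tokens)
    return weighted_kmeans(points, weights, memory, iters)


# A memory of two 1D tokens absorbs two new tokens and stays at size two.
memory, counts = update_memory([[0.0], [10.0]], [1, 1], [[1.0], [9.0]])
```

Because the memory size is fixed, the cost of each update depends on the number of new tokens per step rather than on the total video length.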
2404.01296
Report |
MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space |
Armand Comas-Massagué, Di Qiu, Menglei Chai, Marcel Bühler, Amit Raj, Ruiqi Gao, Qiangeng Xu, Mark Matthews, Paulo Gotardo, Octavia Camps, Sergio Orts-Escolano, Thabo Beeler |
We introduce a novel framework for 3D human avatar generation and
personalization, leveraging text prompts to enhance user engagement and
customization. Central to our approach are key innovations aimed at overcoming
the challenges in photo-realistic avatar synthesis. Firstly, we utilize a
conditional Neural Radiance Fields (NeRF) model, trained on a large-scale
unannotated multi-view dataset, to create a versatile initial solution space
that accelerates and diversifies avatar generation. Secondly, we develop a
geometric prior, leveraging the capabilities of Text-to-Image Diffusion Models,
to ensure superior view invariance and enable direct optimization of avatar
geometry. These foundational ideas are complemented by our optimization
pipeline built on Variational Score Distillation (VSD), which mitigates texture
loss and over-saturation issues. As supported by our extensive experiments,
these strategies collectively enable the creation of custom avatars with
unparalleled visual quality and better adherence to input text prompts. You can
find more results and videos in our website:
https://syntec-research.github.io/MagicMirror |
MagicMirror is a novel framework for fast, text-guided 3D avatar head generation and personalization, leveraging text-to-image diffusion models and conditional Neural Radiance Fields (NeRFs). |
Existing methods for text-guided 3D avatar generation struggle with photorealism, multi-view consistency, and limited customization options. MagicMirror addresses these limitations to achieve higher quality and faithfulness to text prompts. |
MagicMirror employs a conditional NeRF model trained on a diverse multi-view dataset to create a constrained solution space for efficient optimization. It utilizes text-to-image diffusion models as geometry and texture priors for high-quality stylization. A variational score distillation (VSD) objective guides the optimization, improving realism and detail. |
MagicMirror generates high-quality, personalized 3D avatars with detailed geometry and textures, outperforming existing methods in visual fidelity and text alignment.
The framework allows for intuitive customization through text prompts, enabling modifications to facial features, expressions, accessories, and styles.
MagicMirror effectively leverages personalized and generic diffusion priors, enabling a balance between identity preservation and creative exploration. |
Generating undefined shapes, like hair, remains challenging, particularly outside the facial region.
Creating new, detached volumes from scratch, such as hands, is not always successful due to limitations in the initial model's training data. |
3d avatar generation, text-guided synthesis, neural radiance fields (nerfs), text-to-image diffusion models, avatar personalization |
2404.01294
Report |
CosmicMan: A Text-to-Image Foundation Model for Humans |
Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu |
We present CosmicMan, a text-to-image foundation model specialized for
generating high-fidelity human images. Unlike current general-purpose
foundation models that are stuck in the dilemma of inferior quality and
text-image misalignment for humans, CosmicMan enables generating
photo-realistic human images with meticulous appearance, reasonable structure,
and precise text-image alignment with detailed dense descriptions. At the heart
of CosmicMan's success are the new reflections and perspectives on data and
models: (1) We found that data quality and a scalable data production flow are
essential for the final results from trained models. Hence, we propose a new
data production paradigm, Annotate Anyone, which serves as a perpetual data
flywheel to produce high-quality data with accurate yet cost-effective
annotations over time. Based on this, we constructed a large-scale dataset,
CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean
resolution of 1488x1255, and attached with precise text annotations deriving
from 115 Million attributes in diverse granularities. (2) We argue that a
text-to-image foundation model specialized for humans must be pragmatic -- easy
to integrate into downstream tasks while effective in producing
high-quality human images. Hence, we propose to model the relationship between
dense text descriptions and image pixels in a decomposed manner, and present
Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly
decomposes the cross-attention features in an existing text-to-image diffusion
model and enforces attention refocusing without adding extra modules. Through
Daring, we show that explicitly discretizing continuous text space into several
basic groups that align with human body structure is the key to tackling the
misalignment problem in a breeze. |
This paper introduces CosmicMan, a specialized text-to-image foundation model for generating high-fidelity human images with meticulous appearance, reasonable structure, and precise text-image alignment, addressing the limitations of general-purpose models in human-centric content generation. |
Current general-purpose text-to-image models struggle with generating realistic and diverse human images, particularly in capturing nuanced details of human anatomy and attire, hindering downstream human-centric content generation tasks. |
The authors propose Annotate Anyone, a human-AI cooperative data production paradigm, to build a large-scale, high-quality dataset called CosmicMan-HQ. They also introduce Daring, a training framework that decomposes text descriptions into groups aligned with human body structure, enforcing attention refocusing in the model to improve text-image alignment. |
CosmicMan outperforms state-of-the-art text-to-image models in generating high-fidelity human images, exhibiting superior performance in both quantitative metrics (FID, Semantic Acc) and human preference evaluations.
Annotate Anyone proves effective in constructing a large-scale, high-quality human-centric dataset, CosmicMan-HQ, which contributes significantly to the model's performance.
The Daring training framework, specifically the HOLA loss and data discretization, effectively enhances the model's ability to accurately generate images aligned with detailed descriptions, particularly for dense concepts related to human appearance. |
The authors acknowledge the need for continuous operation of Annotate Anyone to produce subsequent versions of CosmicMan-HQ, dynamically aligning with evolving real-world data.
Future work includes providing up-to-date human-specialized foundation models trained on new versions of their dataset to support long-term research in human-centric content generation. |
text-to-image generation, foundation models, human-centric content generation, data production, text-image alignment |
2404.01292
Report |
Measuring Style Similarity in Diffusion Models |
Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, Tom Goldstein |
Generative models are now widely used by graphic designers and artists. Prior
works have shown that these models remember and often replicate content from
their training data during generation. Hence as their proliferation increases,
it has become important to perform a database search to determine whether the
properties of the image are attributable to specific training data, every time
before a generated image is used for professional purposes. Existing tools for
this purpose focus on retrieving images of similar semantic content. Meanwhile,
many artists are concerned with style replication in text-to-image models. We
present a framework for understanding and extracting style descriptors from
images. Our framework comprises a new dataset curated using the insight that
style is a subjective property of an image that captures complex yet meaningful
interactions of factors including but not limited to colors, textures, shapes,
etc. We also propose a method to extract style descriptors that can be used to
attribute style of a generated image to the images used in the training dataset
of a text-to-image model. We showcase promising results in various style
retrieval tasks. We also quantitatively and qualitatively analyze style
attribution and matching in the Stable Diffusion model. Code and artifacts are
available at https://github.com/learn2phoenix/CSD. |
This paper presents a new method for extracting style descriptors from images, enabling style-based image retrieval and analysis of style replication in text-to-image models like Stable Diffusion. |
As generative models become increasingly used, it's crucial to understand how they replicate style from training data, both for copyright concerns and for understanding the model's capabilities. |
The authors curate a new dataset, LAION-Styles, from LAION-Aesthetics, and train a Vision Transformer model with a combination of self-supervised and multi-label contrastive learning objectives tailored for style representation. |
The proposed model, CSD, outperforms existing style attribution models and pre-trained feature extractors on style-based image retrieval tasks across DomainNet, WikiArt, and LAION-Styles datasets.
Analysis of Stable Diffusion reveals a correlation between prompt complexity and the degree of style copying, with more complex prompts leading to increased style replication.
The model can be used to identify which artists' styles are more likely to be replicated by Stable Diffusion, and to explore how styles generalize to out-of-distribution content. |
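The retrieval use of style descriptors can be illustrated with a minimal sketch. It assumes each image has already been mapped to a descriptor vector and ranks database images by cosine similarity to a query. The vectors, the image names, and the function `retrieve_by_style` are hypothetical and are not taken from the CSD codebase.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_by_style(query, database, top_k=2):
    """Return the names of the top_k database entries most similar to query.

    `database` maps an image name to its (hypothetical) style descriptor.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Toy descriptors standing in for the output of a style encoder.
database = {
    "train_img_a": [0.9, 0.1, 0.0],
    "train_img_b": [0.0, 1.0, 0.2],
    "train_img_c": [0.7, 0.3, 0.1],
}
matches = retrieve_by_style([1.0, 0.2, 0.0], database)
```

In this toy setup the query is closest to `train_img_a`, then `train_img_c`. A real pipeline would replace the hand-written vectors with descriptors from a trained model and would need to guard against zero-norm vectors.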
The LAION-Styles dataset, while curated, still contains noise in the form of missing or incorrect tags.
The evaluation assumes strict adherence of the generative model to the prompts, which may not always hold true. |
style representation, style retrieval, text-to-image generation, stable diffusion, style copying |
2404.01291
Report |
Evaluating Text-to-Visual Generation with Image-to-Text Generation |
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan |
Despite significant progress in generative AI, comprehensive evaluation
remains challenging because of the lack of effective metrics and standardized
benchmarks. For instance, the widely-used CLIPScore measures the alignment
between a (generated) image and text prompt, but it fails to produce reliable
scores for complex prompts involving compositions of objects, attributes, and
relations. One reason is that text encoders of CLIP can notoriously act as a
"bag of words", conflating prompts such as "the horse is eating the grass" with
"the grass is eating the horse". To address this, we introduce the VQAScore,
which uses a visual-question-answering (VQA) model to produce an alignment
score by computing the probability of a "Yes" answer to a simple "Does this
figure show '{text}'?" question. Though simpler than prior art, VQAScore
computed with off-the-shelf models produces state-of-the-art results across
many (8) image-text alignment benchmarks. We also compute VQAScore with an
in-house model that follows best practices in the literature. For example, we
use a bidirectional image-question encoder that allows image embeddings to
depend on the question being asked (and vice versa). Our in-house model,
CLIP-FlanT5, outperforms even the strongest baselines that make use of the
proprietary GPT-4V. Interestingly, although we train with only images, VQAScore
can also align text with video and 3D models. VQAScore allows researchers to
benchmark text-to-visual generation using complex texts that capture the
compositional structure of real-world prompts. We introduce GenAI-Bench, a more
challenging benchmark with 1,600 compositional text prompts that require
parsing scenes, objects, attributes, relationships, and high-order reasoning
like comparison and logic. GenAI-Bench also offers over 15,000 human ratings
for leading image and video generation models such as Stable Diffusion, DALL-E
3, and Gen2. |
This paper introduces VQAScore, a simple yet effective metric for evaluating text-to-visual generation models that surpasses current metrics and doesn't rely on expensive human feedback or proprietary models. |
Comprehensive and reliable evaluation of text-to-visual generative AI remains challenging due to a lack of effective metrics and standardized benchmarks, particularly for complex prompts involving compositions. |
VQAScore leverages visual question answering (VQA) by calculating the probability of a "Yes" answer to a question like "Does this figure show {text}?". It also introduces a new bidirectional VQA model, CLIP-FlanT5, and a challenging benchmark, GenAI-Bench, featuring compositional prompts and human ratings. |
VQAScore outperforms prior art on challenging compositional image-text matching benchmarks (Winoground and EqBen).
VQAScore achieves state-of-the-art correlation with human judgments on alignment benchmarks.
VQAScore can be extended to evaluate text-to-video and text-to-3D models by averaging scores across sampled frames or rendered views. |
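The scoring idea described above can be sketched in a few lines, under stated assumptions. The question template comes from the abstract. Everything else is hypothetical: the answer logits are a toy dictionary rather than a real VQA model's vocabulary distribution, and `yes_probability` and `score_views` are illustrative names, not the authors' API.

```python
import math

QUESTION_TEMPLATE = "Does this figure show '{text}'?"

def build_question(text):
    """Fill the prompt text into the question template."""
    return QUESTION_TEMPLATE.format(text=text)

def yes_probability(answer_logits):
    """Softmax over candidate answer logits, returning P("Yes").

    `answer_logits` maps answer tokens to logits and must contain "Yes".
    A real model would normalize over its full vocabulary.
    """
    max_logit = max(answer_logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())

def score_views(per_view_logits):
    """Average P("Yes") over sampled video frames or rendered 3D views."""
    probs = [yes_probability(logits) for logits in per_view_logits]
    return sum(probs) / len(probs)
```

For a single image, `yes_probability` alone plays the role of the alignment score. `score_views` mirrors the averaging over frames or views mentioned for video and 3D.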
VQAScore currently does not evaluate aspects like toxicity, bias, aesthetics, video motion, and 3D physics.
Future work could fine-tune VQAScore with relevant data to address these limitations. |
generative ai, text-to-visual generation, evaluation metrics, vqascore, genai-bench |
2404.01284
Report |
Large Motion Model for Unified Multi-Modal Motion Generation |
Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu |
Human motion generation, a cornerstone technique in animation and video
production, has widespread applications in various tasks like text-to-motion
and music-to-dance. Previous works focus on developing specialist models
tailored for each task without scalability. In this work, we present Large
Motion Model (LMM), a motion-centric, multi-modal framework that unifies
mainstream motion generation tasks into a generalist model. A unified motion
model is appealing since it can leverage a wide range of motion data to achieve
broad generalization beyond a single task. However, it is also challenging due
to the heterogeneous nature of substantially different motion data and tasks.
LMM tackles these challenges from three principled aspects: 1) Data: We
consolidate datasets with different modalities, formats and tasks into a
comprehensive yet unified motion generation dataset, MotionVerse, comprising 10
tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2)
Architecture: We design an articulated attention mechanism ArtAttention that
incorporates body part-aware modeling into Diffusion Transformer backbone. 3)
Pre-Training: We propose a novel pre-training strategy for LMM, which employs
variable frame rates and masking forms, to better exploit knowledge from
diverse training data. Extensive experiments demonstrate that our generalist
LMM achieves competitive performance across various standard motion generation
tasks over state-of-the-art specialist models. Notably, LMM exhibits strong
generalization capabilities and emerging properties across many unseen tasks.
Additionally, our ablation studies reveal valuable insights about training and
scaling up large motion models for future research. |
This paper introduces LMM (Large Motion Model), a generalist, multi-modal framework that unifies various motion generation tasks into a single model, leveraging a comprehensive dataset called MotionVerse. |
Existing motion generation models are often specialist models limited by data quantity and domain, resulting in poor generalization. LMM aims to overcome these limitations by leveraging diverse motion data for broader generalization. |
The authors consolidate 16 motion datasets into MotionVerse, addressing inconsistencies in pose representation, keypoints, and frame rates. LMM, built on a transformer-based diffusion model with a novel attention mechanism (ArtAttention), is pretrained with random frame rates and masking techniques before fine-tuning on specific tasks. |
LMM achieves state-of-the-art results on text-to-motion generation tasks, outperforming specialist models in accuracy and fidelity.
In motion prediction, LMM demonstrates superior performance, particularly in long-distance prediction, attributed to its robust motion prior learned from large-scale data.
LMM shows competitive performance in music-to-dance generation, with significant advantages in diversity metrics, highlighting its ability to leverage multi-modal data. |
The current intermediate representation cannot handle missing individual keypoints within a body part, limiting its flexibility.
The use of motion translators introduces noise, decreasing motion quality. Future work will focus on more flexible motion representation and modeling. |
motion generation, unified model, multi-modality, diffusion model, large motion model |
2404.01247
Report |
An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance |
Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig |
Given the rise of multimedia content, human translators increasingly focus on
culturally adapting not only words but also other modalities such as images to
convey the same meaning. While several applications stand to benefit from this,
machine translation systems remain confined to dealing with language in speech
and text. In this work, we take a first step towards translating images to make
them culturally relevant. First, we build three pipelines comprising
state-of-the-art generative models to do the task. Next, we build a two-part
evaluation dataset: i) concept: comprising 600 images that are cross-culturally
coherent, focusing on a single concept per image, and ii) application:
comprising 100 images curated from real-world applications. We conduct a
multi-faceted human evaluation of translated images to assess for cultural
relevance and meaning preservation. We find that as of today, image-editing
models fail at this task, but can be improved by leveraging LLMs and retrievers
in the loop. Best pipelines can only translate 5% of images for some countries
in the easier concept dataset and no translation is successful for some
countries in the application dataset, highlighting the challenging nature of
the task. Our code and data are released here:
https://github.com/simran-khanuja/image-transcreation. |
This paper introduces the task of "image transcreation", aiming to culturally adapt images using machine learning for diverse audiences. |
With the rise of multimedia content, translating visual elements like images for cultural relevance is crucial alongside text, yet remains unaddressed. |
The authors build three pipelines using generative models: 1) direct instruction-based editing, 2) caption-edit-image edit, and 3) caption-edit-image retrieval. They also create a two-part evaluation dataset ("concept" and "application") with images from 7 countries. |
Image-editing models struggle to grasp cultural context, but improve with LLMs and retrieval methods.
The best pipeline achieves only 5% successful translation for certain countries in the simpler "concept" dataset.
No successful translations are found for some countries in the harder "application" dataset, highlighting the task's difficulty. |
Cultural categorization solely based on country is a limitation acknowledged by the authors.
Limited language and country coverage due to resource constraints. |
image transcreation, cultural adaptation, multimodal translation, generative models, human evaluation |
2404.01241
Report |
StructLDM: Structured Latent Diffusion for 3D Human Generation |
Tao Hu, Fangzhou Hong, Ziwei Liu |
Recent 3D human generative models have achieved remarkable progress by
learning 3D-aware GANs from 2D images. However, existing 3D human generative
methods model humans in a compact 1D latent space, ignoring the articulated
structure and semantics of human body topology. In this paper, we explore more
expressive and higher-dimensional latent space for 3D human modeling and
propose StructLDM, a diffusion-based unconditional 3D human generative model,
which is learned from 2D images. StructLDM solves the challenges imposed due to
the high-dimensional growth of latent space with three key designs: 1) A
semantic structured latent space defined on the dense surface manifold of a
statistical human body template. 2) A structured 3D-aware auto-decoder that
factorizes the global latent space into several semantic body parts
parameterized by a set of conditional structured local NeRFs anchored to the
body template, which embeds the properties learned from the 2D training data
and can be decoded to render view-consistent humans under different poses and
clothing styles. 3) A structured latent diffusion model for generative human
appearance sampling. Extensive experiments validate StructLDM's
state-of-the-art generation performance and illustrate the expressiveness of
the structured latent space over the well-adopted 1D latent space. Notably,
StructLDM enables different levels of controllable 3D human generation and
editing, including pose/view/shape control, and high-level tasks including
compositional generations, part-aware clothing editing, 3D virtual try-on, etc.
Our project page is at: https://taohuumd.github.io/projects/StructLDM/. |
This paper presents StructLDM, a novel diffusion-based 3D human generative model that utilizes a structured 2D latent space representing the human body surface. |
Existing 3D human generative methods employ limited 1D latent spaces, hindering controllability and realism. StructLDM addresses these limitations by leveraging a higher-dimensional, semantically meaningful representation. |
StructLDM employs a two-stage approach: 1) training a structured auto-decoder to embed human subjects into a 2D latent space aligned with a human body mesh, and 2) training a latent diffusion model in this structured space to facilitate diverse and realistic human generation. |
Achieves state-of-the-art generation quality on three datasets, outperforming existing 3D-aware GANs in terms of FID and user-study evaluations.
Enables controllable generation by manipulating pose, view, and shape, as well as editing capabilities like compositional generation and part-aware modifications (e.g., 3D virtual try-on).
Demonstrates the superiority of the structured 2D latent space over traditional 1D representations for capturing fine details and enabling local editing. |
Limited diversity due to reliance on training from scratch and the lack of a large-scale, accurate 3D human dataset.
Challenges in learning from single-view images, though promising results are shown on the DeepFashion dataset. |
3d human generation, latent diffusion model, structured latent representation, controllable generation, 3d virtual try-on |
2404.01203
Report |
Video Interpolation with Diffusion Models |
Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, Janne Kontkanen |
We present VIDIM, a generative model for video interpolation, which creates
short videos given a start and end frame. In order to achieve high fidelity and
generate motions unseen in the input data, VIDIM uses cascaded diffusion models
to first generate the target video at low resolution, and then generate the
high-resolution video conditioned on the low-resolution generated video. We
compare VIDIM to previous state-of-the-art methods on video interpolation, and
demonstrate how such works fail in most settings where the underlying motion is
complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We
additionally demonstrate how classifier-free guidance on the start and end
frame and conditioning the super-resolution model on the original
high-resolution frames without additional parameters unlocks high-fidelity
results. VIDIM is fast to sample from as it jointly denoises all the frames to
be generated, requires less than a billion parameters per diffusion model to
produce compelling results, and still enjoys scalability and improved quality
at larger parameter counts. |
VIDIM, a cascaded diffusion model for video interpolation, generates high-quality videos between two input frames, particularly excelling in scenarios with complex, nonlinear, or ambiguous motion. |
Existing video interpolation methods struggle with complex or ambiguous motion. VIDIM addresses this limitation by leveraging the generative capabilities of diffusion models to produce plausible interpolations even in challenging cases. |
VIDIM uses a two-stage diffusion model: a base model generates low-resolution interpolating frames, and a super-resolution model enhances their resolution conditioned on the original high-resolution input frames. Both models share parameters across frames and employ classifier-free guidance for enhanced quality. |
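Classifier-free guidance, mentioned above, is commonly written as an extrapolation from the unconditional noise prediction toward the conditional one. The sketch below shows that generic form on plain lists of floats. It is not VIDIM's exact formulation: the weighting convention and the function name are assumptions, and some papers parameterize the weight differently.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_weight):
    """Combine conditional and unconditional noise predictions.

    Uses the common form  eps = eps_uncond + w * (eps_cond - eps_uncond).
    With w = 1 this returns the conditional prediction unchanged;
    w > 1 pushes the result further toward the conditioning signal.
    """
    return [u + guidance_weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

In a frame-conditioned setting, `eps_cond` would come from a forward pass that sees the start and end frames, and `eps_uncond` from a pass with that conditioning dropped.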
VIDIM outperforms state-of-the-art methods in generative metrics (FID, FVD) on challenging datasets with large and ambiguous motions.
Human evaluations strongly favor VIDIM for generating more realistic videos compared to baselines.
Ablation studies confirm the importance of explicit frame conditioning and classifier-free guidance in achieving high-quality results. |
VIDIM currently operates at a fixed resolution and aspect ratio, limiting its flexibility.
Future work includes exploring techniques for arbitrary aspect ratio generation and further enhancing the super-resolution model's quality. |
video interpolation, diffusion models, generative models, classifier-free guidance, deep learning |
2404.01197
Report |
Getting it Right: Improving Spatial Consistency in Text-to-Image Models |
Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang |
One of the key shortcomings in current text-to-image (T2I) models is their
inability to consistently generate images which faithfully follow the spatial
relationships specified in the text prompt. In this paper, we offer a
comprehensive investigation of this limitation, while also developing datasets
and methods that achieve state-of-the-art performance. First, we find that
current vision-language datasets do not represent spatial relationships well
enough; to alleviate this bottleneck, we create SPRIGHT, the first
spatially-focused, large scale dataset, by re-captioning 6 million images from
4 widely used vision datasets. Through a 3-fold evaluation and analysis
pipeline, we find that SPRIGHT largely improves upon existing datasets in
capturing spatial relationships. To demonstrate its efficacy, we leverage only
~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially
accurate images while also improving the FID and CMMD scores. Secondly, we find
that training on images containing a large number of objects results in
substantial improvements in spatial consistency. Notably, we attain
state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by
fine-tuning on <500 images. Finally, through a set of controlled experiments
and ablations, we document multiple findings that we believe will enhance the
understanding of factors that affect spatial consistency in text-to-image
models. We publicly release our dataset and model to foster further research in
this area. |
This paper introduces SPRIGHT, a spatially focused vision-language dataset aimed at improving spatial consistency in text-to-image models. The authors also propose an efficient fine-tuning method that optimizes model performance on spatial relationships. |
Current text-to-image models struggle to accurately represent spatial relationships described in text prompts. This work addresses this limitation by providing a high-quality dataset and an effective training strategy. |
The authors create SPRIGHT by re-captioning 6 million images from existing datasets with a focus on spatial relationships. They fine-tune Stable Diffusion models on SPRIGHT using a novel approach that prioritizes images with a high density of objects. |
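Selecting object-dense images for fine-tuning can be sketched as a simple filter. The sketch assumes per-image object counts are already available, for example from a detector or existing annotations. The threshold, the cap on the number of images, and the function name are hypothetical and are not values reported by the authors.

```python
def select_dense_images(object_counts, min_objects, max_images):
    """Pick images containing at least `min_objects` objects.

    `object_counts` maps an image name to its object count. Results are
    ordered by descending count (ties broken by name) and capped at
    `max_images` entries.
    """
    dense = [(count, name) for name, count in object_counts.items() if count >= min_objects]
    dense.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in dense[:max_images]]

# Toy counts; real values would come from dataset annotations.
counts = {"a.jpg": 3, "b.jpg": 21, "c.jpg": 19, "d.jpg": 25}
subset = select_dense_images(counts, min_objects=18, max_images=2)
```

Here `subset` keeps only the two most object-dense images, which mirrors the idea of fine-tuning on a small, deliberately chosen set.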
SPRIGHT significantly improves the representation of spatial relationships compared to existing datasets.
Fine-tuning on SPRIGHT leads to significant performance gains on spatial reasoning benchmarks (VISOR, T2I-CompBench) while also improving image fidelity metrics (FID, CMMD).
An efficient training methodology utilizing images with many objects achieves state-of-the-art performance on T2I-CompBench Spatial Score. |
SPRIGHT, being a derived dataset, inherits potential limitations from the original datasets used for captioning.
The accuracy of synthetic captions, while high, can be further improved with advanced prompting techniques and models. |
text-to-image synthesis, spatial reasoning, vision-language models, dataset creation, stable diffusion |
2404.01143
Report |
Condition-Aware Neural Network for Controlled Image Generation |
Han Cai, Muyang Li, Zhuoyang Zhang, Qinsheng Zhang, Ming-Yu Liu, Song Han |
We present Condition-Aware Neural Network (CAN), a new method for adding
control to image generative models. In parallel to prior conditional control
methods, CAN controls the image generation process by dynamically manipulating
the weight of the neural network. This is achieved by introducing a
condition-aware weight generation module that generates conditional weight for
convolution/linear layers based on the input condition. We test CAN on
class-conditional image generation on ImageNet and text-to-image generation on
COCO. CAN consistently delivers significant improvements for diffusion
transformer models, including DiT and UViT. In particular, CAN combined with
EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2
while requiring 52x fewer MACs per sampling step. |
Introduces Condition-Aware Neural Network (CAN), a method for controlling image generation by dynamically manipulating neural network weights based on input conditions. |
Improves controllability and efficiency of image generative models, enabling them to better follow user instructions and be deployed on resource-constrained devices. |
Introduces a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on input conditions, which are then fused with static weights during training and inference. |
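A toy version of a condition-aware linear layer may make the mechanism concrete. The sketch assumes the weight generator is a single linear map from the condition vector to a flattened weight matrix, and that fusion with the static weight is additive. Both choices are assumptions for illustration; the summary above does not specify them, and all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def condition_aware_linear(x, condition, static_weight, weight_generator):
    """Apply a linear layer whose weight depends on the condition.

    static_weight:    out_dim x in_dim matrix shared by all conditions.
    weight_generator: (out_dim * in_dim) x cond_dim matrix mapping the
                      condition to a flattened conditional weight.
    Fusion is assumed to be additive (static + conditional).
    """
    out_dim, in_dim = len(static_weight), len(static_weight[0])
    flat = matvec(weight_generator, condition)
    fused = [
        [static_weight[i][j] + flat[i * in_dim + j] for j in range(in_dim)]
        for i in range(out_dim)
    ]
    return matvec(fused, x)

static_weight = [[1.0, 0.0], [0.0, 1.0]]
weight_generator = [[0.0], [1.0], [1.0], [0.0]]  # maps a 1-D condition to 4 weights
y = condition_aware_linear([1.0, 1.0], [2.0], static_weight, weight_generator)
```

With a zero condition the layer reduces to the static weight, while a non-zero condition changes the effective weight matrix, so the same input produces a different output.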
Significantly improves image quality and controllability over baseline models on ImageNet and COCO datasets.
Outperforms prior conditional control methods like adaptive normalization and attention-based methods.
Enables development of CaT, a new family of efficient diffusion transformers that achieve state-of-the-art results with significantly lower computational cost. |
Current implementation incurs 30-40% training overhead compared to static models due to reliance on grouped convolution.
Large-scale text-to-image generation and video generation applications are left for future work. |
controlled image generation, diffusion models, dynamic neural networks, weight generation networks, efficient deep learning |
2404.01133
Report |
CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians |
Yang Liu, He Guan, Chuanchen Luo, Lue Fan, Junran Peng, Zhaoxiang Zhang |
The advancement of real-time 3D scene reconstruction and novel view synthesis
has been significantly propelled by 3D Gaussian Splatting (3DGS). However,
effectively training large-scale 3DGS and rendering it in real-time across
various scales remains challenging. This paper introduces CityGaussian
(CityGS), which employs a novel divide-and-conquer training approach and
Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and
rendering. Specifically, the global scene prior and adaptive training data
selection enables efficient training and seamless fusion. Based on fused
Gaussian primitives, we generate different detail levels through compression,
and realize fast rendering across various scales through the proposed
block-wise detail levels selection and aggregation strategy. Extensive
experimental results on large-scale scenes demonstrate that our approach
attains state-of-the-art rendering quality, enabling consistent real-time
rendering of large-scale scenes across vastly different scales. Our project page
is available at https://dekuliutesla.github.io/citygs/. |
This paper introduces CityGaussian (CityGS), a novel method for real-time, high-quality rendering of large-scale scenes using 3D Gaussian Splatting (3DGS). It employs a divide-and-conquer training approach with a global scene prior and Level-of-Detail (LoD) for efficient rendering across different scales. |
Effectively training large-scale 3DGS models and rendering them in real-time across various scales is challenging due to high memory and computational demands. This paper addresses these limitations. |
CityGS divides the scene into blocks, each trained in parallel with a global Gaussian prior for consistent fusion. It compresses Gaussians into different detail levels and uses a block-wise LoD strategy for efficient rendering. |
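Block-wise level-of-detail selection can be sketched as choosing one detail level per block. The sketch assumes the level depends only on the distance from the camera to the block center, with level 0 as the finest. That distance rule, the thresholds, and the names are hypothetical; the summary above states only that selection is block-wise.

```python
import math

def select_detail_levels(block_centers, camera_position, distance_thresholds):
    """Assign a detail level to each block based on camera distance.

    Level 0 is the finest. Each threshold in `distance_thresholds`
    (ascending) that the distance meets or exceeds moves the block one
    level coarser.
    """
    levels = {}
    for block_id, center in block_centers.items():
        distance = math.dist(center, camera_position)
        levels[block_id] = sum(1 for t in distance_thresholds if distance >= t)
    return levels

blocks = {"near": (0.0, 0.0, 0.0), "mid": (30.0, 0.0, 0.0), "far": (100.0, 0.0, 0.0)}
levels = select_detail_levels(blocks, (0.0, 0.0, 0.0), [10.0, 50.0])
```

A renderer would then gather the Gaussians stored at the chosen level for each visible block, so distant blocks contribute fewer primitives.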
CityGS achieves state-of-the-art rendering quality on large-scale scenes, outperforming NeRF-based methods in SSIM, PSNR, and LPIPS.
The proposed LoD strategy enables real-time rendering even under drastically different scales with minimal quality loss.
CityGS allows for efficient scene manipulation due to its explicit representation of the scene. |
The assumption of a static scene limits the generalization ability of the current method.
Future work includes exploring the application of CityGS in dynamic scenes and improving performance with drastically different training views (e.g., aerial and street views). |
3d scene reconstruction, novel view synthesis, 3d gaussian splatting, level of detail, large-scale scene rendering |
2404.01089
Report |
Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On |
Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, Xiangmin Xu |
Image-based virtual try-on is an increasingly important task for online
shopping. It aims to synthesize images of a specific person wearing a specified
garment. Diffusion model-based approaches have recently become popular, as they
are excellent at image synthesis tasks. However, these approaches usually
employ additional image encoders and rely on the cross-attention mechanism for
texture transfer from the garment to the person image, which affects the
try-on's efficiency and fidelity. To address these issues, we propose a
Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the
fidelity of the results and introduces no additional image encoders.
Accordingly, we make contributions from two aspects. First, we propose to
concatenate the masked person and reference garment images along the spatial
dimension and utilize the resulting image as the input for the diffusion
model's denoising UNet. This enables the original self-attention layers
contained in the diffusion model to achieve efficient and accurate texture
transfer. Second, we propose a novel diffusion-based method that predicts a
precise inpainting mask based on the person and reference garment images,
further enhancing the reliability of the try-on results. In addition, we
integrate mask prediction and image synthesis into a single compact model. The
experimental results show that our approach can be applied to various try-on
tasks, e.g., garment-to-person and person-to-person try-ons, and significantly
outperforms state-of-the-art methods on the popular VITON and VITON-HD databases. |
This paper proposes Texture-Preserving Diffusion (TPD), a novel diffusion-based virtual try-on model that enhances fidelity without additional image encoders. |
Virtual try-on is important for online shopping, but existing methods struggle with fidelity, especially for garments with complex textures and challenging poses. |
TPD introduces two key components: (1) Self-Attention-based Texture Transfer (SATT) concatenates masked person and garment images spatially, leveraging inherent self-attention in diffusion models for efficient texture transfer. (2) Decoupled Mask Prediction (DMP) iteratively predicts a precise inpainting mask based on both person and garment images, preserving details. |
TPD generates high-quality try-on images with fewer artifacts, especially for complex textures.
DMP effectively preserves body details, such as arms or tattoos, by minimizing the removal of irrelevant information.
Quantitative evaluations show TPD consistently outperforms state-of-the-art methods on VITON and VITON-HD datasets. |
The model's performance on images with complex backgrounds, as opposed to single-color backgrounds prevalent in datasets, needs further exploration.
Future work includes extending TPD to handle multi-garment try-on scenarios. |
virtual try-on, diffusion models, image synthesis, self-attention, inpainting |
2404.00987
Report |
FlexiDreamer: Single Image-to-3D Generation with FlexiCubes |
Ruowen Zhao, Zhengyi Wang, Yikai Wang, Zihan Zhou, Jun Zhu |
3D content generation from text prompts or single images has made remarkable
progress in quality and speed recently. One of its dominant paradigms involves
generating consistent multi-view images followed by a sparse-view
reconstruction. However, due to the challenge of directly deforming the mesh
representation to approach the target topology, most methodologies learn an
implicit representation (such as NeRF) during the sparse-view reconstruction
and acquire the target mesh by a post-processing extraction. Although the
implicit representation can effectively model rich 3D information, its training
typically entails a long convergence time. In addition, the post-extraction
operation from the implicit field also leads to undesirable visual artifacts.
In this paper, we propose FlexiDreamer, a novel single image-to-3D generation
framework that reconstructs the target mesh in an end-to-end manner. By
leveraging a flexible gradient-based extraction known as FlexiCubes, our method
circumvents the defects brought by the post-processing and facilitates a direct
acquisition of the target mesh. Furthermore, we incorporate a multi-resolution
hash grid encoding scheme that progressively activates the encoding levels into
the implicit field in FlexiCubes to help capture geometric details for per-step
optimization. Notably, FlexiDreamer recovers a dense 3D structure from a
single-view image in approximately 1 minute on a single NVIDIA A100 GPU,
outperforming previous methodologies by a large margin. |
FlexiDreamer is a novel single image-to-3D generation framework that reconstructs the target mesh in an end-to-end manner by leveraging FlexiCubes for a direct acquisition of the target mesh, bypassing the need for post-processing steps common in NeRF-based methods. |
Existing methods for 3D content generation from single images often rely on implicit representations like NeRF, leading to long training times and potential artifacts during post-processing extraction of the mesh. FlexiDreamer addresses these limitations by directly generating the target mesh in an end-to-end fashion. |
FlexiDreamer uses a pre-trained diffusion model to generate multi-view RGB and normal images from a single input image. Then, it employs FlexiCubes, a flexible gradient-based surface extraction method, to extract an explicit mesh from a signed distance field encoded via a multi-resolution hash grid network. A texture neural field is also integrated to learn mesh surface texture. The entire framework is trained end-to-end using reconstruction losses from the rendered images. |
FlexiDreamer recovers dense 3D structures from single-view images in approximately 1 minute, significantly faster than previous methods.
It generates high-quality textured meshes with sharper geometric details and more distinct textures compared to baselines.
The end-to-end pipeline avoids artifacts often introduced during post-processing extraction in NeRF-based approaches. |
The quality of generated 3D assets depends heavily on the quality of multi-view images, which can be limited by the capabilities of current multi-view diffusion models.
Limited perspectives of input images can hinder the accurate reconstruction of objects with complex geometries. |
3d generation, diffusion models, flexicubes, single image-to-3d, sparse-view reconstruction |
2404.00931
Report |
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields |
Yunsong Wang, Hanlin Chen, Gim Hee Lee |
Recent advancements in vision-language foundation models have significantly
enhanced open-vocabulary 3D scene understanding. However, the generalizability
of existing methods is constrained due to their framework designs and their
reliance on 3D data. We address this limitation by introducing Generalizable
Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a
generalizable implicit representation of 3D scenes with open-vocabulary
semantics. We aggregate the geometry-aware features using a cost volume, and
propose a Multi-view Joint Fusion module to aggregate multi-view features
through a cross-view attention mechanism, which effectively predicts
view-specific blending weights for both colors and open-vocabulary features.
Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and
3D open-vocabulary semantic segmentation, eliminating the need for ground truth
semantic labels or depth priors, and effectively generalizes across scenes and
datasets without fine-tuning. |
Introduces GOV-NeSF, a novel generalizable open-vocabulary neural semantic field for 3D scenes, enabling open-vocabulary semantic segmentation in both 2D and 3D without requiring 3D data, depth priors, or explicit semantic labels during training. |
Addresses the limitations of existing open-vocabulary 3D scene understanding methods that suffer from constrained generalizability due to framework design and reliance on 3D data. |
Leverages a cost volume for geometry-aware feature extraction and proposes a Multi-view Joint Fusion module to blend colors and open-vocabulary features from multi-view images using cross-view attention, trained with supervision from novel views. |
Achieves state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation on ScanNet and Replica datasets.
Demonstrates significant improvements over existing methods when ground truth depth maps are unavailable, effectively learning occlusion reasoning implicitly.
Exhibits strong generalizability, successfully transferring to unseen scenes and datasets without fine-tuning. |
Rendering quality of color images can be blurry compared to methods using depth priors due to the focus on room-scale representation without depth information.
Depth-guided masking, while improving 3D segmentation, can negatively impact 2D segmentation performance by creating empty holes in rendered images. |
open-vocabulary learning, semantic segmentation, neural radiance fields, 3d scene understanding, generalizable vision |
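The GOV-NeSF entry above describes predicting view-specific blending weights for colors and open-vocabulary features. The minimal sketch below shows only the final blending step, assuming the weights come from a softmax over per-view scores. The score values, the helper names `softmax` and `blend`, and the use of a softmax are illustrative assumptions; the cross-view attention that would produce the scores is not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of per-view scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend(per_view_values, scores):
    """Blend per-view vectors (colors or features) with softmax weights."""
    weights = softmax(scores)
    dim = len(per_view_values[0])
    return [sum(w * v[d] for w, v in zip(weights, per_view_values))
            for d in range(dim)]

# Three source views observing one 3D sample point (made-up values).
colors = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
color_scores = [2.0, 2.0, -4.0]  # hypothetical logits; the third view is down-weighted
blended_color = blend(colors, color_scores)
```

The same `blend` call could be applied to per-view feature vectors with a separate set of scores, which is one way to read "view-specific blending weights for both colors and open-vocabulary features".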
2404.00891
Report |
Marrying NeRF with Feature Matching for One-step Pose Estimation |
Ronghan Chen, Yang Cong, Yu Ren |
Given the image collection of an object, we aim at building a real-time
image-based pose estimation method, which requires neither its CAD model nor
hours of object-specific training. Recent NeRF-based methods provide a
promising solution by directly optimizing the pose from pixel loss between
rendered and target images. However, during inference, they require a long
convergence time and suffer from local minima, making them impractical for
real-time robot applications. We aim at solving this problem by marrying image
matching with NeRF. With 2D matches and depth rendered by NeRF, we directly
solve the pose in one step by building 2D-3D correspondences between target and
initial view, thus allowing for real-time prediction. Moreover, to improve the
accuracy of 2D-3D correspondences, we propose a 3D consistent point mining
strategy, which effectively discards unfaithful points reconstructed by NeRF.
In addition, current NeRF-based methods that naively optimize pixel loss fail on
occluded images. Thus, we further propose a 2D matches based sampling strategy
to preclude the occluded area. Experimental results on representative datasets
prove that our method outperforms state-of-the-art methods, and improves
inference efficiency by 90x, achieving real-time prediction at 6 FPS. |
This paper introduces a novel NeRF-based pose estimation method that leverages image matching for real-time, CAD-model-free pose estimation of novel objects. |
Existing NeRF-based pose estimation techniques suffer from slow convergence and are prone to local minima, making them impractical for real-time applications. |
The method uses a pre-trained NeRF model to render depth information and combines it with 2D feature matches to create 2D-3D correspondences. This allows for direct pose solving using PnP in a single step. Additionally, a 3D consistent point mining strategy is employed to enhance the accuracy of the correspondences by filtering out unreliable points. A keypoint-guided sampling strategy is also introduced to address occlusion challenges during pose refinement. |
The proposed method achieves state-of-the-art pose estimation accuracy on both synthetic and real-world datasets.
It significantly improves inference efficiency by 90x compared to previous NeRF-based methods, enabling real-time prediction at 6 FPS.
The method exhibits strong robustness to occlusion, outperforming existing techniques. |
The method's performance relies on the accuracy of the employed image matcher.
Future work could explore extending the approach to handle object scales and incorporate it into robot manipulation or neural field-based SLAM tasks. |
pose estimation, neural radiance fields (nerf), image matching, 3d consistent point mining, occlusion handling |
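The one-step pose entry above builds 2D-3D correspondences from 2D matches and NeRF-rendered depth. A minimal sketch of that step follows, assuming a pinhole camera and that the rendered depth is measured along the optical axis. The function names, the toy depth map, and the intrinsics are hypothetical and only serve to make the example runnable.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth to a 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def build_correspondences(matches, depth_map, intrinsics):
    """Pair 3D points from the initial view with matched 2D target pixels.

    matches: list of ((u_init, v_init), (u_tgt, v_tgt)) pixel pairs.
    """
    fx, fy, cx, cy = intrinsics
    pairs = []
    for (u0, v0), (u1, v1) in matches:
        d = depth_map[v0][u0]
        if d <= 0:  # skip pixels without a valid rendered depth
            continue
        pairs.append((backproject(u0, v0, d, fx, fy, cx, cy), (u1, v1)))
    return pairs

# Toy 4x4 depth map rendered for the initial view (made-up values).
depth_map = [[0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 4.0, 2.0],
             [0.0, 0.0, 0.0, 0.0]]
intrinsics = (2.0, 2.0, 2.0, 2.0)  # fx, fy, cx, cy (hypothetical)
matches = [((2, 2), (3, 1)), ((3, 2), (0, 0)), ((0, 0), (1, 1))]
pairs = build_correspondences(matches, depth_map, intrinsics)
```

The resulting pairs are the kind of input a PnP solver expects, for example OpenCV's `cv2.solvePnPRansac`. The 3D points here live in the initial view's camera frame, so the solved pose would be relative to that view.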
2404.00879
Report |
Model-Agnostic Human Preference Inversion in Diffusion Models |
Jeeyung Kim, Ze Wang, Qiang Qiu |
Efficient text-to-image generation remains a challenging task due to the high
computational costs associated with the multi-step sampling in diffusion
models. Although distillation of pre-trained diffusion models has been
successful in reducing sampling steps, low-step image generation often falls
short in terms of quality. In this study, we propose a novel sampling design to
achieve high-quality one-step image generation aligning with human preferences,
particularly focusing on exploring the impact of the prior noise distribution.
Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the
noise distributions for each prompt based on human preferences without the need
for fine-tuning diffusion models. Our experiments showcase that the tailored
noise distributions significantly improve image quality with only a marginal
increase in computational cost. Our findings underscore the importance of noise
optimization and pave the way for efficient and high-quality text-to-image
synthesis. |
Proposes PAHI, a novel sampling design that optimizes noise distributions for one-step text-to-image generation, aligning with human preferences without fine-tuning diffusion models. |
Efficient text-to-image generation is crucial, but low-step image generation often lacks quality. This work addresses the need for high-quality, efficient synthesis by exploring the impact of prior noise distribution in one-step generation. |
Leverages a distilled diffusion model as the generator and a scoring model (PickScore) to assess image quality based on human preferences. Optimizes the noise distribution parameters to maximize these scores, employing a lightweight noise-predicting model to tailor noise distributions for individual prompts. |
PAHI significantly outperforms standard Gaussian noise in one-step generation, achieving a win rate of 94.0% based on PickScore.
The prompt-adaptive approach (PAHI) shows superior performance (94.0% win rate) compared to a single optimized noise distribution across all prompts (64.7% win rate).
PAHI produces higher-quality images (based on PickScore and ImageReward) than one-step and two-step generation with standard Gaussian noise, while adding only a marginal increase in inference time. |
The study primarily focuses on one-step generation, and further investigation is needed for multi-step scenarios.
Exploration of alternative noise distributions beyond Gaussian could be beneficial. |
text-to-image generation, diffusion models, one-step sampling, noise optimization, human preferences |
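The PAHI entry above optimizes the parameters of the prior noise distribution so that a frozen generator scores higher under a preference model. The toy sketch below shows that idea in one dimension, assuming a Gaussian with a learnable mean and the reparameterization z = mu + sigma * eps. The `generator` and `score` functions are simple stand-ins, not the distilled diffusion model or PickScore, and the single shared mean ignores the prompt-adaptive part of the method.

```python
import random

def generator(z):
    """Stand-in for a frozen one-step generator (hypothetical)."""
    return 2.0 * z + 1.0

def score(x):
    """Stand-in for a preference score; it peaks at x = 5."""
    return -(x - 5.0) ** 2

def optimize_noise_mean(steps=500, batch=16, sigma=0.1, lr=0.02, seed=0):
    """Gradient descent on the negative score with respect to the noise mean."""
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch):
            z = mu + sigma * rng.gauss(0.0, 1.0)  # reparameterized sample
            # d(-score)/d(mu) = 2 * (g(z) - 5) * dg/dz, with dg/dz = 2
            grad += 2.0 * (generator(z) - 5.0) * 2.0
        mu -= lr * grad / batch
    return mu

mu = optimize_noise_mean()
```

Only `mu` is updated and the generator stays fixed, which mirrors the "without fine-tuning diffusion models" claim. In this toy setup the mean drifts toward 2, where the stand-in generator output hits the score peak.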
2404.00878
Report |
TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On |
Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang |
Virtual try-on focuses on adjusting the given clothes to fit a specific
person seamlessly while avoiding any distortion of the patterns and textures of
the garment. However, the clothing identity uncontrollability and training
inefficiency of existing diffusion-based methods, which struggle to maintain
the identity even with full parameter training, are significant limitations
that hinder the widespread applications. In this work, we propose an effective
and efficient framework, termed TryOn-Adapter. Specifically, we first decouple
clothing identity into fine-grained factors: style for color and category
information, texture for high-frequency details, and structure for smooth
spatial adaptive transformation. Our approach utilizes a pre-trained
exemplar-based diffusion model as the fundamental network, whose parameters are
frozen except for the attention layers. We then customize three lightweight
modules (Style Preserving, Texture Highlighting, and Structure Adapting)
incorporated with fine-tuning techniques to enable precise and efficient
identity control. Meanwhile, we introduce the training-free T-RePaint strategy
to further enhance clothing identity preservation while maintaining the
realistic try-on effect during the inference. Our experiments demonstrate that
our approach achieves state-of-the-art performance on two widely-used
benchmarks. Additionally, compared with recent full-tuning diffusion-based
methods, we only use about half of their tunable parameters during training.
The code will be made publicly available at
https://github.com/jiazheng-xing/TryOn-Adapter. |
This paper proposes TryOn-Adapter, an efficient framework for virtual try-on that decouples clothing identity into fine-grained factors for enhanced controllability and training efficiency. |
Existing diffusion-based virtual try-on methods struggle to maintain clothing identity and are computationally expensive to train. |
The paper uses a pre-trained diffusion model with frozen parameters, except attention layers. It then integrates three lightweight modules: Style Preserving, Texture Highlighting, and Structure Adapting. A training-free T-RePaint strategy further enhances identity preservation during inference. An Enhanced Latent Blending Module is used to enhance the visual quality of the generated image. |
Achieves state-of-the-art performance on VITON-HD and Dresscode datasets.
Significantly reduces trainable parameters compared to full fine-tuning methods.
Demonstrates superior preservation of garment style, texture, and structure. |
The method is limited by the existing datasets, which hinders widespread practical application.
Lack of targeted quantitative evaluation metrics for virtual try-on tasks. |
virtual try-on, diffusion models, identity preservation, parameter efficient fine-tuning, generative adversarial networks |
2404.00874
Report |
DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF |
Jie Long Lee, Chen Li, Gim Hee Lee |
We present DiSR-NeRF, a diffusion-guided framework for view-consistent
super-resolution (SR) NeRF. Unlike prior works, we circumvent the requirement
for high-resolution (HR) reference images by leveraging existing powerful 2D
super-resolution models. Nonetheless, independent SR 2D images are often
inconsistent across different views. We thus propose Iterative 3D
Synchronization (I3DS) to mitigate the inconsistency problem via the inherent
multi-view consistency property of NeRF. Specifically, our I3DS alternates
between upscaling low-resolution (LR) rendered images with diffusion models,
and updating the underlying 3D representation with standard NeRF training. We
further introduce Renoised Score Distillation (RSD), a novel score-distillation
objective for 2D image resolution. Our RSD combines features from ancestral
sampling and Score Distillation Sampling (SDS) to generate sharp images that
are also LR-consistent. Qualitative and quantitative results on both synthetic
and real-world datasets demonstrate that our DiSR-NeRF can achieve better
results on NeRF super-resolution compared with existing works. Code and video
results available at the project website. |
This paper proposes DiSR-NeRF, a diffusion-guided framework for view-consistent super-resolution (SR) of Neural Radiance Fields (NeRFs) that enhances the resolution of NeRFs trained on low-resolution images without requiring high-resolution reference images. |
Super-resolution NeRFs have practical applications in scenarios where high-resolution multi-view images are unavailable (e.g., drones, CCTVs), but existing methods require high-resolution references or datasets, which are often costly or impractical to obtain. |
DiSR-NeRF leverages pre-trained 2D super-resolution diffusion models and introduces two key components: 1) Iterative 3D Synchronization (I3DS) to address cross-view inconsistency by alternating between upscaling rendered low-resolution images and refining the 3D representation. 2) Renoised Score Distillation (RSD) to generate sharp and consistent super-resolution images by optimizing denoised latents within an ancestral sampling trajectory. |
DiSR-NeRF generates sharper and more detailed super-resolution NeRFs compared to existing methods, as demonstrated on synthetic and real-world datasets.
RSD effectively produces high-resolution details while maintaining consistency with the original low-resolution input, outperforming both ancestral sampling and Score Distillation Sampling (SDS).
I3DS significantly improves view consistency in super-resolution NeRFs compared to using only SDS optimization. |
The upscaling factor is limited by the specific 2D super-resolution diffusion model used (4x in this case).
Future work can explore cascaded diffusion models for higher upscaling factors. |
neural radiance fields, nerf, super-resolution, diffusion models, view synthesis |
2404.00661
Report |
DeeDSR: Towards Real-World Image Super-Resolution via Degradation-Aware Stable Diffusion |
Chunyang Bi, Xin Luo, Sheng Shen, Mengxi Zhang, Huanjing Yue, Jingyu Yang |
Diffusion models, known for their powerful generative capabilities, play a
crucial role in addressing real-world super-resolution challenges. However,
these models often focus on improving local textures while neglecting the
impacts of global degradation, which can significantly reduce semantic fidelity
and lead to inaccurate reconstructions and suboptimal super-resolution
performance. To address this issue, we introduce a novel two-stage,
degradation-aware framework that enhances the diffusion model's ability to
recognize content and degradation in low-resolution images. In the first stage,
we employ unsupervised contrastive learning to obtain representations of image
degradations. In the second stage, we integrate a degradation-aware module into
a simplified ControlNet, enabling flexible adaptation to various degradations
based on the learned representations. Furthermore, we decompose the
degradation-aware features into global semantics and local details branches,
which are then injected into the diffusion denoising module to modulate the
target generation. Our method effectively recovers semantically precise and
photorealistic details, particularly under significant degradation conditions,
demonstrating state-of-the-art performance across various benchmarks. Codes
will be released at https://github.com/bichunyang419/DeeDSR. |
Introduces DeeDSR, a novel two-stage degradation-aware framework for real-world image super-resolution that enhances the generative capabilities of pre-trained text-to-image diffusion models by leveraging image prompts to represent global degradation. |
Addresses limitations in existing diffusion-based super-resolution models that neglect the impact of global degradation, leading to inaccurate reconstructions and reduced semantic fidelity, especially under severe degradation conditions. |
Employs unsupervised contrastive learning in the first stage to learn representations of image degradations. Integrates a degradation-aware module into a simplified ControlNet in the second stage to adapt to various degradations based on learned representations. Decomposes degradation-aware features into global and local branches, injecting them into the diffusion denoising module for modulated target generation. |
DeeDSR effectively recovers semantically accurate details, particularly under significant degradation, outperforming existing methods on benchmark datasets.
Quantitative evaluations show superior performance in perceptual metrics, including CLIPIQA and MANIQA, indicating high image generation quality and fidelity.
Ablation studies confirm the effectiveness of the degradation learner, global and local representation branches, and the proposed noise guidance strategy for balancing realism and fidelity. |
The model exhibits slightly slower inference speed compared to some diffusion-based methods due to the additional stage for estimating degradations.
Future work could explore incorporating additional priors or optimization techniques to further improve the efficiency of the proposed framework. |
image super-resolution, diffusion models, degradation awareness, contrastive learning, controlnet |
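The DeeDSR entry above learns degradation representations with unsupervised contrastive learning but does not name the loss. The sketch below uses an InfoNCE-style loss, a common choice for this kind of training, purely as an assumption. The embeddings are made-up 2D vectors, and treating two patches that share a degradation as a positive pair is also an assumption of this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical degradation embeddings.
query = [1.0, 0.0]
positive = [0.9, 0.1]                  # assumed to share the query's degradation
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # assumed to have other degradations
loss_aligned = info_nce(query, positive, negatives)
loss_swapped = info_nce(query, negatives[0], [positive, negatives[1]])
```

The loss is small when the query sits close to its positive and far from the negatives, and large when the roles are swapped.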
2404.00648
Report |
SpiralMLP: A Lightweight Vision MLP Architecture |
Haojie Mu, Burhan Ul Tayyab, Nicholas Chua |
We present SpiralMLP, a novel architecture that introduces a Spiral FC layer
as a replacement for the conventional Token Mixing approach. Differing from
several existing MLP-based models that primarily emphasize axes, our Spiral FC
layer is designed as a deformable convolution layer with spiral-like offsets.
We further adapt Spiral FC into two variants: Self-Spiral FC and Cross-Spiral
FC, which enable both local and global feature integration seamlessly,
eliminating the need for additional processing steps. To thoroughly investigate
the effectiveness of the spiral-like offsets and validate our design, we
conduct ablation studies and explore optimal configurations. In empirical
tests, SpiralMLP reaches state-of-the-art performance, similar to Transformers,
CNNs, and other MLPs, benchmarking on ImageNet-1k, COCO and ADE20K. SpiralMLP
still maintains linear computational complexity O(HW) and is compatible with
varying input image resolutions. Our study reveals that targeting the full
receptive field is not essential for achieving high performance; instead,
adopting a refined approach offers better results. |
Proposes SpiralMLP, a lightweight vision architecture using a novel Spiral Fully-Connected (Spiral FC) layer to replace traditional Token Mixing in MLP-based models. |
Aims to address limitations of existing MLPs, such as quadratic computational complexity and fixed input size, while improving spatial information integration for better performance. |
Introduces Spiral FC, inspired by deformable convolution and spiral patterns observed in attention visualizations, using spiral-like offsets to capture local and global features with linear complexity. |
Achieves state-of-the-art accuracy on ImageNet-1k, surpassing comparable MLPs and remaining competitive with Transformers and CNNs.
Demonstrates strong performance in object detection, instance segmentation (COCO), and semantic segmentation (ADE20K) tasks.
Exhibits faster inference latency compared to other MLPs of similar model size. |
Discrete hyperparameter optimization leaves room for further exploration of optimal configurations.
Future work includes investigating a dynamic version of Spiral FC for enhanced adaptability and efficiency. |
mlp, lightweight vision model, spiral fully-connected layer, deformable convolution, spatial information integration |
2404.00485
Report |
DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans |
Akash Sengupta, Thiemo Alldieck, Nikos Kolotouros, Enric Corona, Andrei Zanfir, Cristian Sminchisescu |
We present DiffHuman, a probabilistic method for photorealistic 3D human
reconstruction from a single RGB image. Despite the ill-posed nature of this
problem, most methods are deterministic and output a single solution, often
resulting in a lack of geometric detail and blurriness in unseen or uncertain
regions. In contrast, DiffHuman predicts a probability distribution over 3D
reconstructions conditioned on an input 2D image, which allows us to sample
multiple detailed 3D avatars that are consistent with the image. DiffHuman is
implemented as a conditional diffusion model that denoises pixel-aligned 2D
observations of an underlying 3D shape representation. During inference, we may
sample 3D avatars by iteratively denoising 2D renders of the predicted 3D
representation. Furthermore, we introduce a generator neural network that
approximates rendering with considerably reduced runtime (55x speed up),
resulting in a novel dual-branch diffusion framework. Our experiments show that
DiffHuman can produce diverse and detailed reconstructions for the parts of the
person that are unseen or uncertain in the input image, while remaining
competitive with the state-of-the-art when reconstructing visible surfaces. |
Presents DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image using a conditional diffusion model that predicts a distribution over 3D reconstructions. |
Addresses the limitations of deterministic methods that output a single solution, which often lacks detail and appears blurry in unseen regions, by predicting a probability distribution over plausible 3D human reconstructions. |
Implements a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation, and introduces a generator network to approximate rendering for faster inference. |
Produces diverse and detailed reconstructions for unseen or uncertain regions, such as the back of a person.
Remains competitive with state-of-the-art methods in reconstructing visible surfaces.
Offers a significant speed-up in inference time compared to diffusion-via-rendering approaches. |
Currently requires training data with known 3D geometry, limiting the amount of usable data.
Future work aims to leverage data with partial 2D and 2.5D supervision to overcome training data limitations. |
3d human reconstruction, diffusion models, probabilistic modeling, implicit surfaces, photorealistic rendering |
2404.00409
Report |
3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting |
Xiaoyang Lyu, Yang-Tian Sun, Yi-Hua Huang, Xiuzhe Wu, Ziyi Yang, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi |
In this paper, we present an implicit surface reconstruction method with 3D
Gaussian Splatting (3DGS), namely 3DGSR, that allows for accurate 3D
reconstruction with intricate details while inheriting the high efficiency and
rendering quality of 3DGS. The key insight is incorporating an implicit signed
distance field (SDF) within 3D Gaussians to enable them to be aligned and
jointly optimized. First, we introduce a differentiable SDF-to-opacity
transformation function that converts SDF values into corresponding Gaussians'
opacities. This function connects the SDF and 3D Gaussians, allowing for
unified optimization and enforcing surface constraints on the 3D Gaussians.
During learning, optimizing the 3D Gaussians provides supervisory signals for
SDF learning, enabling the reconstruction of intricate details. However, this
only provides sparse supervisory signals to the SDF at locations occupied by
Gaussians, which is insufficient for learning a continuous SDF. To
address this limitation, we incorporate volumetric rendering and align the
rendered geometric attributes (depth, normal) with those derived from 3D
Gaussians. This consistency regularization introduces supervisory signals to
locations not covered by discrete 3D Gaussians, effectively eliminating
redundant surfaces outside the Gaussian sampling range. Our extensive
experimental results demonstrate that our 3DGSR method enables high-quality 3D
surface reconstruction while preserving the efficiency and rendering quality of
3DGS. Besides, our method competes favorably with leading surface
reconstruction techniques while offering a more efficient learning process and
much better rendering quality. The code will be available at
https://github.com/CVMI-Lab/3DGSR. |
Presents 3DGSR, a novel implicit surface reconstruction method leveraging 3D Gaussian Splatting (3DGS) to achieve accurate 3D reconstructions with intricate details while retaining the high efficiency and rendering quality of 3DGS. |
Addresses the limitations of 3DGS in faithfully representing 3D surfaces due to its unstructured point-based geometry representation by incorporating a neural implicit signed distance field (SDF) within Gaussians for geometry modeling. |
Introduces a differentiable SDF-to-opacity transformation function to connect SDF and Gaussians, enabling joint optimization and enforcing surface constraints. Incorporates volumetric rendering and aligns rendered geometric attributes (depth, normal) with those derived from 3D Gaussians, providing regularization to locations not covered by Gaussians and eliminating redundant surfaces. |
Achieves high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS.
Outperforms leading surface reconstruction techniques on various datasets in terms of rendering quality and reconstruction accuracy.
Offers a more efficient learning process and superior rendering qualities compared to existing methods. |
Trade-off between rendering quality and surface smoothness: high-quality rendering may lead to compromised surface smoothness in cases with complex textures.
Potential limitations in handling scenes with extreme view changes or severe occlusions, as the method relies on multi-view consistency. |
3d gaussian splatting, implicit surface reconstruction, signed distance function, volumetric rendering, novel view synthesis |
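The 3DGSR entry above relies on a differentiable SDF-to-opacity transformation, whose exact form is not given in this summary. The sketch below uses one plausible bell-shaped choice, a scaled logistic density that equals 1 on the surface (SDF = 0) and decays with distance from it. The function name, the sharpness parameter `beta`, and this particular formula are assumptions for illustration only.

```python
import math

def sdf_to_opacity(sdf, beta=10.0):
    """Map a signed distance to an opacity in [0, 1], peaking at the surface.

    Uses 4 * s * (1 - s) with s = sigmoid(beta * sdf), written in terms of
    exp(-beta * |sdf|) so that large distances cannot overflow.
    """
    t = math.exp(-beta * abs(sdf))
    return 4.0 * t / (1.0 + t) ** 2

# Opacity for Gaussians at several hypothetical signed distances.
samples = [-0.5, -0.1, 0.0, 0.1, 0.5]
opacities = [sdf_to_opacity(s) for s in samples]
```

Because the mapping is smooth in the SDF value, gradients from the rendering loss on Gaussian opacities can flow back into the SDF, which is the coupling the entry describes.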
2404.00384
Report |
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias |
Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim |
We identify a critical bias in contemporary CLIP-based models, which we
denote as single tag bias. This bias manifests as a disproportionate focus on a
singular tag (word) while neglecting other pertinent tags, stemming from CLIP's
text embeddings that prioritize one specific tag in image-text relationships.
When deconstructing text into individual tags, only one tag tends to have high
relevancy with CLIP's image embedding, leading to biased tag relevancy. In this
paper, we introduce a novel two-step fine-tuning approach, Text-Tag
Self-Distillation (TTD), to address this challenge. TTD first extracts
image-relevant tags from text based on their similarity to the nearest pixels,
then employs a self-distillation strategy to align combined masks with the
text-derived mask. This approach ensures the unbiased image-text alignment of
the CLIP-based models using only image-text pairs without necessitating
additional supervision. Our technique demonstrates model-agnostic improvements
in multi-tag classification and segmentation tasks, surpassing competing
methods that rely on external resources. The code is available at
https://github.com/shjo-april/TTD. |
This paper identifies and addresses the "single tag bias" in CLIP-based models, where the models overly focus on a single tag in image-text relationships. |
Addressing this bias is crucial for improving the accuracy and reliability of CLIP-based models in downstream tasks like multi-tag classification and segmentation. |
The paper proposes Text-Tag Self-Distillation (TTD), a two-step fine-tuning approach: 1) selecting image-relevant tags from text based on pixel-tag similarity and 2) using these tags to guide the model towards a more holistic understanding of the image-text relationship. |
TTD effectively mitigates single tag bias, leading to improved performance in multi-tag selection compared to methods relying on external NLP models.
Fine-tuning with TTD enhances text-level segmentation performance, as demonstrated by higher CaptionIoU scores and reduced false positive/negative rates.
TTD boosts open-vocabulary semantic segmentation performance, achieving competitive results on benchmarks like Pascal VOC and COCO-Object. |
The performance difference with some methods on datasets with a large number of classes suggests potential improvements in incorporating richer tag information during fine-tuning.
Future work could investigate the underlying causes of single tag bias in CLIP's training process. |
image-text alignment, clip, self-distillation, open-vocabulary segmentation, multi-tag classification |
2404.00358
Report |
Spread Your Wings: A Radial Strip Transformer for Image Deblurring |
Duosheng Chen, Shihao Zhou, Jinshan Pan, Jinglei Shi, Lishen Qu, Jufeng Yang |
Exploring motion information is important for the motion deblurring task.
Recently, window-based transformer approaches have achieved decent performance
in image deblurring. Note that the motion causing blurry results is usually
composed of translation and rotation movements and the window-shift operation
in the Cartesian coordinate system by the window-based transformer approaches
only directly explores translation motion in orthogonal directions. Thus, these
methods have the limitation of modeling the rotation part. To alleviate this
problem, we introduce the polar coordinate-based transformer, which has the
angles and distance to explore rotation motion and translation information
together. In this paper, we propose a Radial Strip Transformer (RST), which is
a transformer-based architecture that restores blurred images in a polar
coordinate system instead of a Cartesian one. RST contains a dynamic radial
embedding module (DRE) to extract the shallow feature by a radial deformable
convolution. We design a polar mask layer to generate the offsets for the
deformable convolution, which can reshape the convolution kernel along the
radius to better capture the rotation motion information. Furthermore, we
propose a radial strip attention solver (RSAS) for deep feature extraction,
where the relationship of windows is organized by azimuth and radius. This
attention module contains radial strip windows to reweight image features in
the polar coordinate, which preserves more useful information in rotation and
translation motion together for better recovering the sharp images.
Experimental results on six synthetic and real-world datasets show that our
method performs favorably against other SOTA methods for the image deblurring
task. |
This paper proposes Radial Strip Transformer (RST), an efficient polar coordinate-based transformer architecture for image deblurring, addressing the limitations of Cartesian coordinate systems in modeling rotation motion blur. |
Existing window-based transformer deblurring methods struggle to effectively model rotation motion blur due to their reliance on the Cartesian coordinate system. RST overcomes this limitation by operating in the polar coordinate system, enabling it to better capture both translation and rotation motion information for improved deblurring performance. |
RST employs a dynamic radial embedding (DRE) module for extracting shallow features using a polar mask and deformable convolution. This is followed by a radial strip attention solver (RSAS) with strip windows along the radius and angular relative position encoding for deep feature extraction. The architecture follows an asymmetric encoder-decoder design, with RSAS applied only in the decoder for efficiency. |
RST outperforms state-of-the-art methods on five synthetic and real-world datasets (GoPro, HIDE, RealBlur, REDS, RSBlur), demonstrating its superior deblurring capability.
The proposed DRE and RSAS modules contribute significantly to RST's performance, highlighting their effectiveness in capturing motion information.
RST achieves a favorable balance between computational efficiency and deblurring performance, exhibiting lower or comparable complexity compared to existing methods. |
Limited cross-window interactions due to the use of radial strip windows.
Reduced deblurring capacity for heavy blur in complex real-world scenarios. |
image deblurring, transformer, motion information, polar coordinate system, deformable convolution |
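The polar-coordinate view described in this report can be illustrated with a minimal sketch. It only shows how a pixel position maps to a radius and an azimuth around an image centre, and how pixels could be bucketed by angle. The function names and the `num_sectors` parameter are hypothetical and are not taken from the RST paper, whose actual windowing and attention design is not reproduced here.

```python
import math


def to_polar(x, y, cx, cy):
    """Return (radius, azimuth) of pixel (x, y) relative to the centre (cx, cy)."""
    dx, dy = x - cx, y - cy
    radius = math.hypot(dx, dy)
    # atan2 returns an angle in (-pi, pi]; wrap it into [0, 2*pi).
    azimuth = math.atan2(dy, dx) % (2 * math.pi)
    return radius, azimuth


def sector_index(x, y, cx, cy, num_sectors):
    """Hypothetical helper: index of the angular sector that contains (x, y)."""
    _, azimuth = to_polar(x, y, cx, cy)
    index = int(azimuth / (2 * math.pi) * num_sectors)
    # Guard against a rounding result equal to num_sectors.
    return min(index, num_sectors - 1)
```

In this parameterisation a rotation about the centre changes only the azimuth, and a motion along a ray from the centre changes only the radius, which is the intuition the report gives for working in polar rather than Cartesian coordinates.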
2404.00345
Report |
MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text |
Takayuki Hara, Tatsuya Harada |
The generation of 3D scenes from user-specified conditions offers a promising
avenue for alleviating the production burden in 3D applications. Previous
studies required significant effort to realize the desired scene, owing to
limited control conditions. We propose a method for controlling and generating
3D scenes under multimodal conditions using partial images, layout information
represented in the top view, and text prompts. Combining these conditions to
generate a 3D scene involves the following significant difficulties: (1) the
creation of large datasets, (2) reflection on the interaction of multimodal
conditions, and (3) domain dependence of the layout conditions. We decompose
the process of 3D scene generation into 2D image generation from the given
conditions and 3D scene generation from 2D images. 2D image generation is
achieved by fine-tuning a pretrained text-to-image model with a small
artificial dataset of partial images and layouts, and 3D scene generation is
achieved by layout-conditioned depth estimation and neural radiance fields
(NeRF), thereby avoiding the creation of large datasets. The use of a common
representation of spatial information using 360-degree images allows for the
consideration of multimodal condition interactions and reduces the domain
dependence of the layout control. The experimental results qualitatively and
quantitatively demonstrated that the proposed method can generate 3D scenes in
diverse domains, from indoor to outdoor, according to multimodal conditions. |
This paper proposes MaGRITTe, a method for controlling and generating 3D scenes from partial images, layout information (floor plans or terrain maps), and text prompts. |
Generating 3D scenes from user specifications is crucial for various applications, and existing methods struggle to integrate multiple control modalities effectively. |
MaGRITTe first converts partial images and layouts into a common equirectangular projection (ERP) format. Then, a fine-tuned text-to-image diffusion model generates a 360° RGB image, leveraging these inputs and text prompts. Finally, layout-conditioned depth estimation and NeRF training produce a navigable 3D scene. |
MaGRITTe generates consistent and controllable 3D scenes reflecting input conditions.
Fine-tuning large text-to-image models with small, targeted datasets proves effective for this task.
The method handles both indoor and outdoor scenes by adapting layout representations. |
MaGRITTe may struggle to separate overlapping objects specified in the layout.
There are limitations in specifying areas where objects should not exist.
Future work includes detecting and resolving inconsistencies between input conditions. |
3d scene generation, 360-degree image generation, text-to-3d, layout-to-3d, image outpainting |
2404.00269
Report |
IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images |
Yushuang Wu, Luyue Shi, Junhao Cai, Weihao Yuan, Lingteng Qiu, Zilong Dong, Liefeng Bo, Shuguang Cui, Xiaoguang Han |
Generalizable 3D object reconstruction from single-view RGB-D images remains
a challenging task, particularly with real-world data. Current state-of-the-art
methods develop Transformer-based implicit field learning, necessitating an
intensive learning paradigm that requires dense query-supervision uniformly
sampled throughout the entire space. We propose a novel approach, IPoD, which
harmonizes implicit field learning with point diffusion. This approach treats
the query points for implicit field learning as a noisy point cloud for
iterative denoising, allowing for their dynamic adaptation to the target object
shape. Such adaptive query points harness diffusion learning's capability for
coarse shape recovery and also enhance the implicit representation's ability
to delineate finer details. Besides, an additional self-conditioning mechanism
is designed to use implicit predictions as the guidance of diffusion learning,
leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset
affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6%
in Chamfer distance over existing methods. The generalizability of IPoD is also
demonstrated on the MVImgNet dataset. Our project page is at
https://yushuang-wu.github.io/IPoD. |
Proposes IPoD, a novel method integrating implicit field learning with point diffusion for generalizable 3D object reconstruction from single RGB-D images. |
Addresses limitations of pure implicit field learning methods, which require dense query-supervision and struggle with fine details, by leveraging diffusion models for adaptive query point positioning. |
Treats query points as a noisy point cloud, iteratively denoising them using a diffusion model while concurrently predicting implicit values (UDF) to refine the shape. Employs a self-conditioning mechanism using predicted UDF values to guide the denoising process. |
Achieves 7.8% improvement in F-score and 28.6% in Chamfer distance over previous state-of-the-art methods on CO3D-v2 dataset.
Demonstrates superior reconstruction quality for both coarse shapes and fine details.
Shows generalizability to unseen object categories in CO3D-v2 and MVImgNet datasets. |
Effectiveness on 3D human and scene reconstruction not yet validated.
Future work includes exploring applications in human and scene reconstruction, addressing challenges like fine-grained details and severe occlusion. |
3d reconstruction, diffusion models, implicit field learning, single-view reconstruction, rgb-d images |
2404.00262
Report |
Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation |
Yuan Wang, Rui Sun, Naisong Luo, Yuwen Pan, Tianzhu Zhang |
Open-vocabulary semantic segmentation (OVS) aims to segment images of
arbitrary categories specified by class labels or captions. However, most
previous best-performing methods, whether pixel grouping methods or region
recognition methods, suffer from false matches between image features and
category labels. We attribute this to the natural gap between the textual
features and visual features. In this work, we rethink how to mitigate false
matches from the perspective of image-to-image matching and propose a novel
relation-aware intra-modal matching (RIM) framework for OVS based on visual
foundation models. RIM achieves robust region classification by firstly
constructing diverse image-modal reference features and then matching them with
region features based on relation-aware ranking distribution. The proposed RIM
enjoys several merits. First, the intra-modal reference features are better
aligned, circumventing potential ambiguities that may arise in cross-modal
matching. Second, the ranking-based matching process harnesses the structure
information implicit in the inter-class relationships, making it more robust
than comparing individually. Extensive experiments on three benchmarks
demonstrate that RIM outperforms previous state-of-the-art methods by large
margins, obtaining a lead of more than 10% in mIoU on PASCAL VOC benchmark. |
This paper proposes RIM, a training-free open-vocabulary semantic segmentation framework that leverages the intra-modal matching between image features, outperforming previous state-of-the-art methods. |
Existing open-vocabulary segmentation methods struggle with false matches between image and category features due to the inherent gap between visual and textual representations. |
RIM utilizes Stable Diffusion and Segment Anything Model (SAM) to construct image-based category reference features. It then performs relation-aware matching based on ranking distribution in the DINOv2 feature space. |
RIM achieves a significant performance improvement over existing zero-shot OVS methods, particularly a 20.4% mIoU gain over SimSeg on COCO Object.
The study validates the effectiveness of intra-modal matching over traditional cross-modal approaches for region classification.
The proposed relation-aware matching strategy, incorporating inter-class relationships, further enhances segmentation accuracy by reducing misclassifications. |
The reliance on multiple foundation models introduces computational complexity.
Future work could explore incorporating temporal information for video segmentation. |
open-vocabulary semantic segmentation, intra-modal matching, visual foundation models, stable diffusion, segment anything model |
2404.00234
Report |
Grid Diffusion Models for Text-to-Video Generation |
Taegyeong Lee, Soyeong Kwon, Taehwan Kim |
Recent advances in diffusion models have significantly improved
text-to-image generation. However, generating videos from text is a more
challenging task than generating images from text, due to the much larger
dataset and higher computational cost required. Most existing video generation
methods use either a 3D U-Net architecture that considers the temporal
dimension or autoregressive generation. These methods require large datasets
and are limited in terms of computational costs compared to text-to-image
generation. To tackle these challenges, we propose a simple but effective
grid diffusion model for text-to-video generation that requires neither a
temporal dimension in the architecture nor a large text-video paired dataset. We can generate a
high-quality video using a fixed amount of GPU memory regardless of the number
of frames by representing the video as a grid image. Additionally, since our
method reduces the dimensions of the video to the dimensions of the image,
various image-based methods can be applied to videos, such as text-guided video
manipulation from image manipulation. Our proposed method outperforms the
existing methods in both quantitative and qualitative evaluations,
demonstrating the suitability of our model for real-world video generation. |
This paper introduces a novel grid diffusion model for text-to-video generation, which represents videos as grid images to reduce computational cost and reliance on large text-video paired datasets. |
Generating videos from text is computationally expensive and often requires large, paired datasets, which this method aims to address. |
The method uses two stages: (1) key grid image generation by fine-tuning a pre-trained text-to-image diffusion model on a small dataset of grid images representing key video frames; (2) autoregressive grid image interpolation to generate intermediate frames while maintaining temporal consistency. |
The model outperforms existing text-to-video generation models on standard benchmarks (MSR-VTT, UCF-101) in terms of CLIP similarity, FVD, and Inception Score, even with less training data.
It generates higher-quality videos with better text alignment according to human evaluation.
The approach maintains a fixed GPU memory footprint regardless of the number of frames generated, showcasing its efficiency. |
The model's reliance on a pre-trained text-to-image model might limit its ability to generate novel or highly complex visual content.
Future work could explore applying this method to other generative tasks involving different modalities, such as sound. |
text-to-video generation, diffusion models, grid images, temporal consistency, computational efficiency |
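The idea of representing a video as a grid image can be sketched as a simple tiling operation. The example below is a minimal illustration only: frames are plain nested lists of pixel values, and the function names, the row-major layout, and the `cols` parameter are assumptions made for this sketch rather than details taken from the paper.

```python
def frames_to_grid(frames, cols):
    """Tile equally sized frames (lists of pixel rows) into one grid image, row-major."""
    if len(frames) % cols != 0:
        raise ValueError("number of frames must be a multiple of cols")
    grid = []
    for start in range(0, len(frames), cols):
        row_of_frames = frames[start:start + cols]
        height = len(row_of_frames[0])
        for r in range(height):
            # Concatenate the r-th pixel row of every frame in this grid row.
            grid.append([px for frame in row_of_frames for px in frame[r]])
    return grid


def grid_to_frames(grid, frame_h, frame_w):
    """Inverse of frames_to_grid: cut the grid image back into individual frames."""
    frames = []
    for top in range(0, len(grid), frame_h):
        for left in range(0, len(grid[0]), frame_w):
            frames.append([row[left:left + frame_w] for row in grid[top:top + frame_h]])
    return frames
```

For example, four 2x2 frames tiled with `cols=2` give a single 4x4 image, and `grid_to_frames(grid, 2, 2)` recovers the original frames in order. Because the result is an ordinary image, an image model can process it without any temporal dimension in its architecture.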
2404.00230
Report |
Latent Watermark: Inject and Detect Watermarks in Latent Diffusion Space |
Zheling Meng, Bo Peng, Jing Dong |
Watermarking is a tool for actively identifying and attributing the images
generated by latent diffusion models. Existing methods face the dilemma of
watermark robustness and image quality. The reason for this dilemma is that
watermark detection is performed in pixel space, implying an intrinsic link
between image quality and watermark robustness. In this paper, we highlight
that an effective solution to the problem is to both inject and detect
watermarks in latent space, and propose Latent Watermark (LW) with a
progressive training strategy. Experiments show that compared to the recently
proposed methods such as StegaStamp, StableSignature, RoSteALS and TreeRing, LW
not only surpasses them in terms of robustness but also offers superior image
quality. When we inject 64-bit messages, LW can achieve an identification
performance close to 100% and an attribution performance above 97% under 9
single-attack scenarios and one all-attack scenario. Our code will be available
on GitHub. |
This paper proposes Latent Watermark (LW), a method for watermarking images generated by latent diffusion models, that injects and detects watermarks directly in the latent space. |
Addresses the critical need to identify and attribute images generated by AI models, especially given the potential for misuse such as spreading misinformation. |
LW uses a message encoder/decoder, coupler, and decoupler, all trained with a three-step progressive strategy. This strategy ensures minimal impact on image quality while enabling robust watermark embedding. |
LW demonstrates superior image quality compared to existing methods, showing minimal differences from non-watermarked images across metrics like FID, SSIM, NIQE, and PIQE.
It exhibits significantly stronger robustness against various attacks, including destructive, constructive, and reconstructive attacks, achieving high Bit Accuracy and TPR@0.01FPR.
The method is environmentally friendly, with a training process that results in significantly lower CO2 emissions compared to training the generative model itself. |
The current work focuses on image watermarking; further investigation is needed to extend its applicability to other generative frameworks like GANs.
Exploring different latent space manipulation techniques within LW could lead to even more robust and imperceptible watermarking. |
latent diffusion model, watermarking, image attribution, information security, aigc |
2403.20312
Report |
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations |
Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati |
Existing vision-language models (VLMs) treat text descriptions as a unit,
confusing individual concepts in a prompt and impairing visual semantic
matching and reasoning. An important aspect of reasoning in logic and language
is negation. This paper highlights the limitations of popular VLMs, such as
CLIP, in understanding the implications of negations, i.e., the effect of the
word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts
with negations, we present CC-Neg, a dataset containing 228,246 images, true
captions and their corresponding negated captions. Using CC-Neg along with
modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework
has an improved understanding of negations. This training paradigm improves
CoN-CLIP's ability to encode semantics reliably, resulting in 3.85% average
gain in top-1 accuracy for zero-shot image classification across 8 datasets.
Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks
such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of
objects, relations, and attributes in text. Overall, our work addresses a
crucial limitation of VLMs by introducing a dataset and framework that
strengthens semantic associations between images and text, demonstrating
improved large-scale foundation models with significantly reduced computational
cost, promoting efficiency and accessibility. |
This paper exposes the weakness of current vision-language models (VLMs) in understanding negations in text descriptions, which limits their ability for accurate image-text matching and reasoning. To address this, the authors introduce a new dataset, CC-Neg, and a novel training framework, CoN-CLIP. |
Understanding negations is crucial for VLMs as it enables finer-grained control over semantic matching, leading to improvements in various tasks like image-text retrieval, text-to-image generation, and zero-shot image classification. |
The authors create CC-Neg, a large-scale dataset with image-caption pairs and their corresponding negated captions. They then propose CoN-CLIP, which fine-tunes CLIP's text encoder using a modified contrastive loss incorporating negated captions and distractor images. |
CoN-CLIP significantly outperforms existing VLMs on CC-Neg, demonstrating a strong grasp of negation in textual descriptions.
CoN-CLIP exhibits enhanced zero-shot image classification accuracy across 8 different datasets, indicating improved semantic understanding.
CoN-CLIP shows superior performance on the SugarCREPE benchmark, demonstrating emergent compositional understanding of objects, attributes, and relations. |
The generation of negated captions relies heavily on the capabilities and potential biases of the chosen large language model.
Future work can investigate the generalization of CoN-CLIP to more nuanced forms of negation and explore its application in other multimodal domains. |
vision-language models, compositionality, multimodal learning, contrastive learning, negation understanding |
2403.20309
Report |
InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds |
Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, Yue Wang |
While novel view synthesis (NVS) has made substantial progress in 3D computer
vision, it typically requires an initial estimation of camera intrinsics and
extrinsics from dense viewpoints. This pre-processing is usually conducted via
a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and
unreliable, particularly in sparse-view scenarios with insufficient matched
features for accurate reconstruction. In this work, we integrate the strengths
of point-based representations (e.g., 3D Gaussian Splatting, 3D-GS) with
end-to-end dense stereo models (DUSt3R) to tackle the complex yet unresolved
issues in NVS under unconstrained settings, which encompass pose-free and
sparse-view challenges. Our framework, InstantSplat, unifies dense stereo
priors with 3D-GS to build 3D Gaussians of large-scale scenes from sparse-view and
pose-free images in less than 1 minute. Specifically, InstantSplat comprises a
Coarse Geometric Initialization (CGI) module that swiftly establishes a
preliminary scene structure and camera parameters across all training views,
utilizing globally-aligned 3D point maps derived from a pre-trained dense
stereo pipeline. This is followed by the Fast 3D-Gaussian Optimization (F-3DGO)
module, which jointly optimizes the 3D Gaussian attributes and the initialized
poses with pose regularization. Experiments conducted on the large-scale
outdoor Tanks & Temples datasets demonstrate that InstantSplat significantly
improves SSIM (by 32%) while concurrently reducing Absolute Trajectory Error
(ATE) by 80%. These results establish InstantSplat as a viable solution for scenarios
involving pose-free and sparse-view conditions. Project page:
instantsplat.github.io. |
Introduces InstantSplat, an efficient framework for simultaneous pose estimation and novel view synthesis from sparse, unposed images, utilizing 3D priors from a dense stereo model. |
Addresses the limitations of traditional NVS methods that require pre-computed camera parameters and dense views, enabling casual capture scenarios. |
Employs a two-stage approach: 1) Coarse Geometric Initialization using DUSt3R for preliminary scene structure and camera parameters. 2) Fast 3D-Gaussian Optimization to refine scene attributes and camera extrinsics. |
Achieves high rendering quality, outperforming baselines in SSIM and LPIPS on Tanks & Temples and MVImgNet datasets.
Demonstrates accurate pose estimation, with lower ATE and RPE compared to pose-free methods.
Significantly faster than existing techniques, reconstructing scenes in under a minute. |
Assumes a single-camera setup, limiting its applicability to multi-view stereo scenarios.
Relies on the accuracy of the pre-trained dense stereo model, which can impact overall performance. Future work can explore online refinement of both the 3D prior and Gaussian attributes. |
novel view synthesis, pose estimation, 3d gaussian splatting, dense stereo, sparse view |
2403.20275
Report |
Snap-it, Tap-it, Splat-it: Tactile-Informed 3D Gaussian Splatting for Reconstructing Challenging Surfaces |
Mauro Comi, Alessio Tonioni, Max Yang, Jonathan Tremblay, Valts Blukis, Yijiong Lin, Nathan F. Lepora, Laurence Aitchison |
Touch and vision go hand in hand, mutually enhancing our ability to
understand the world. From a research perspective, the problem of mixing touch
and vision is underexplored and presents interesting challenges. To this end,
we propose Tactile-Informed 3DGS, a novel approach that incorporates touch data
(local depth maps) with multi-view vision data to achieve surface
reconstruction and novel view synthesis. Our method optimises 3D Gaussian
primitives to accurately model the object's geometry at points of contact. By
creating a framework that decreases the transmittance at touch locations, we
achieve a refined surface reconstruction, ensuring a uniformly smooth depth
map. Touch is particularly useful when considering non-Lambertian objects (e.g.
shiny or reflective surfaces) since contemporary methods tend to fail to
reconstruct specular highlights with fidelity. By combining vision and tactile
sensing, we achieve more accurate geometry reconstructions with fewer images
than prior methods. We conduct evaluation on objects with glossy and reflective
surfaces and demonstrate the effectiveness of our approach, offering
significant improvements in reconstruction quality. |
Introduces Tactile-Informed 3DGS, a novel approach that integrates tactile sensing (local depth maps) with multi-view RGB data for enhanced 3D object reconstruction and novel view synthesis, particularly effective for challenging surfaces like glossy and reflective objects. |
Addresses limitations of vision-only methods that struggle with non-Lambertian surfaces and limited viewpoints, leveraging tactile sensing's robustness to lighting variations and sparse yet accurate geometric information. |
Optimizes 3D Gaussian primitives within a 3D Gaussian Splatting framework, guided by: (1) Photometric loss from multi-view images, (2) 3D transmittance loss minimized at touch locations, (3) Unsupervised edge-aware smoothness loss with proximity-based masking to refine reconstruction beyond contact areas. |
Achieves state-of-the-art geometry reconstruction on glossy/reflective surfaces, outperforming NeRF-based methods in speed (1 hour vs. 25 hours).
Significantly improves reconstruction quality and novel view synthesis with minimal views (5 views) compared to 3DGS and NeRO.
Demonstrates consistent improvement with increasing touch interactions, validating the effectiveness of tactile data integration. |
Current random touch sampling could be improved with an adaptive strategy to complement visual data more effectively.
Future work could explore the application of multimodal interaction for reconstructing transparent objects and integrating surface modeling techniques. |
3d reconstruction, novel view synthesis, tactile sensing, 3d gaussian splatting, non-lambertian surfaces |
2403.20271
Report |
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want |
Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li |
The interaction between humans and artificial intelligence (AI) is a crucial
factor that reflects the effectiveness of multimodal large language models
(MLLMs). However, current MLLMs primarily focus on image-level comprehension
and limit interaction to textual instructions, thereby constraining their
flexibility in usage and depth of response. In this paper, we introduce the
Draw-and-Understand project: a new model, a multi-domain dataset, and a
challenging benchmark for visual prompting. Specifically, we propose SPHINX-V,
a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a
vision encoder, a visual prompt encoder and an LLM for various visual prompts
(points, bounding boxes, and free-form shapes) and language understanding. To
advance visual prompting research for MLLMs, we introduce MDVP-Data and
MDVP-Bench. MDVP-Data features a multi-domain dataset containing 1.6M unique
image-visual prompt-text instruction-following samples, including natural
images, document images, OCR images, mobile screenshots, web screenshots, and
multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and
challenging benchmark to assess a model's capability in understanding visual
prompting instructions. Our experiments demonstrate SPHINX-V's impressive
multimodal interaction capabilities through visual prompting, revealing
significant improvements in detailed pixel-level description and
question-answering abilities. |
This paper introduces SPHINX-V, a novel multimodal large language model (MLLM) designed for enhanced pixel-level image understanding through visual prompting, supporting various prompt types like points, boxes, and free-form shapes. |
Current MLLMs primarily focus on comprehending entire images, limiting their ability to address user queries about specific regions or details within an image. SPHINX-V aims to overcome this limitation and enable more precise, pixel-level understanding. |
SPHINX-V uses a visual prompt encoder and a two-stage training strategy: 1) pre-training for image-visual prompt-text alignment and 2) supervised fine-tuning on a multi-domain dataset (MDVP-Data) with instructions for various tasks like captioning, relationship analysis, and reasoning. |
SPHINX-V demonstrates state-of-the-art performance on referring object classification tasks, surpassing previous methods on LVIS and PACO datasets.
It excels in regional optical character recognition (OCR), significantly outperforming baseline models on the COCO-Text dataset.
SPHINX-V achieves high scores on region-level captioning tasks, as well as comprehensive assessments using LLaVA-Bench, Ferret-Bench, and the proposed MDVP-Bench. |
The model's performance on image-level understanding tasks could be further improved by incorporating more open-source image-level VQA data during training.
Future work could focus on enhancing the visual prompt encoder to better distinguish and model different types of visual prompts. |
multimodal large language model, visual prompting, pixel-level understanding, region-level captioning, optical character recognition |
2403.20249
Report |
Relation Rectification in Diffusion Model |
Yinwei Wu, Xingyi Yang, Xinchao Wang |
Despite their exceptional generative abilities, large text-to-image diffusion
models, much like skilled but careless artists, often struggle with accurately
depicting visual relationships between objects. This issue, as we uncover
through careful analysis, arises from a misaligned text encoder that struggles
to interpret specific relationships and differentiate the logical order of
associated objects. To resolve this, we introduce a novel task termed Relation
Rectification, aiming to refine the model to accurately represent a given
relationship it initially fails to generate. For this task, we propose an
innovative solution utilizing a Heterogeneous Graph Convolutional Network
(HGCN). It models the directional relationships between relation terms and
corresponding objects within the input prompts. Specifically, we optimize the
HGCN on a pair of prompts with identical relational words but reversed object
orders, supplemented by a few reference images. The lightweight HGCN adjusts
the text embeddings generated by the text encoder, ensuring the accurate
reflection of the textual relation in the embedding space. Crucially, our
method retains the parameters of the text encoder and diffusion model,
preserving the model's robust performance on unrelated descriptions. We
validated our approach on a newly curated dataset of diverse relational data,
demonstrating both quantitative and qualitative enhancements in generating
images with precise visual relations. Project page:
https://wuyinwei-hah.github.io/rrnet.github.io/. |
This paper introduces Relation Rectification, a novel task to improve the accuracy of directional relationships depicted in images generated by text-to-image (T2I) diffusion models, and proposes RRNet, an HGCN-based framework to address it. |
Large T2I diffusion models often struggle to accurately depict visual relationships between objects because their text encoders interpret directional or relational terms poorly, treating prompts as 'bags of words'. |
RRNet models object-swapped prompts (OSPs) as heterogeneous graphs to capture directional relationships. It leverages HGCN to generate adjustment vectors that refine the text embeddings, particularly the [EOT] token embedding, to guide the diffusion model towards generating images with correct relationship directions. The model is trained using a combination of positive (denoising) and negative losses to ensure accurate relationship representation and disentanglement of object features from relationships. |
RRNet significantly improves the accuracy of relationship generation in Stable Diffusion (SD) by up to 25%, as evidenced by evaluation using vision-language chatbots.
The approach enhances the interpretability of generated images, allowing for clear depiction of directional transitions in relationships.
RRNet demonstrates robust generalization capabilities, effectively handling even objects unseen during training. |
RRNet's performance is limited by the diffusion model's pre-existing knowledge, struggling with relationships involving unseen concepts.
Extending RRNet to handle more complex, multi-relational scenarios requires further investigation, particularly in managing multiple adjustment vectors without introducing semantic confusion. |
text-to-image synthesis, diffusion models, relation rectification, heterogeneous graph convolutional network, vision-language models |
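The object-swapped prompt pairs described in the RRNet entry above can be illustrated with a small Python sketch. It only shows why an order-insensitive ("bag of words") reading cannot separate the two prompts; it is not the authors' code, and the helper names `object_swapped_prompts` and `same_bag_of_words` are hypothetical.

```python
from collections import Counter


def object_swapped_prompts(subject: str, relation: str, obj: str) -> tuple[str, str]:
    """Build a prompt and its object-swapped counterpart (hypothetical helper)."""
    original = f"{subject} {relation} {obj}"
    swapped = f"{obj} {relation} {subject}"
    return original, swapped


def same_bag_of_words(a: str, b: str) -> bool:
    """True when both prompts contain the same words with the same counts."""
    return Counter(a.lower().split()) == Counter(b.lower().split())


original, swapped = object_swapped_prompts("a cat", "chasing", "a dog")
# The two strings differ, yet an order-insensitive encoder sees the same input.
print(original, "|", swapped, "|", same_bag_of_words(original, swapped))
```

Because the word multisets match, any correction has to come from modeling word order or direction, which is the role the entry assigns to the HGCN.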
2403.20236
Report |
Long-Tailed Anomaly Detection with Learnable Class Names |
Chih-Hui Ho, Kuan-Chuan Peng, Nuno Vasconcelos |
Anomaly detection (AD) aims to identify defective images and localize their
defects (if any). Ideally, AD models should be able to detect defects over many
image classes without relying on hard-coded class names that can be
uninformative or inconsistent across datasets, learn without anomaly
supervision, and be robust to the long-tailed distributions of real-world
applications. To address these challenges, we formulate the problem of
long-tailed AD by introducing several datasets with different levels of class
imbalance and metrics for performance evaluation. We then propose a novel
method, LTAD, to detect defects from multiple and long-tailed classes, without
relying on dataset class names. LTAD combines AD by reconstruction and semantic
AD modules. AD by reconstruction is implemented with a transformer-based
reconstruction module. Semantic AD is implemented with a binary classifier,
which relies on learned pseudo class names and a pretrained foundation model.
These modules are learned over two phases. Phase 1 learns the pseudo-class
names and a variational autoencoder (VAE) for feature synthesis that augments
the training data to combat long-tails. Phase 2 then learns the parameters of
the reconstruction and classification modules of LTAD. Extensive experiments
using the proposed long-tailed datasets show that LTAD substantially
outperforms the state-of-the-art methods for most forms of dataset imbalance.
The long-tailed dataset split is available at
https://zenodo.org/records/10854201 . |
This paper formulates the task of long-tailed anomaly detection, where training datasets exhibit class imbalance, and proposes LTAD, a method designed for this setting. |
Prior anomaly detection methods, designed for balanced datasets, struggle in real-world scenarios with skewed class distributions common in manufacturing. |
The paper proposes LTAD, a new method combining reconstruction-based anomaly detection with semantic anomaly detection. It uses a data augmentation strategy based on a class-sensitive VAE and learns pseudo class names to overcome ambiguity of real class names. |
LTAD consistently outperforms state-of-the-art anomaly detection methods on long-tailed versions of MVTec, VisA, and DAGM datasets.
Both reconstruction and semantic anomaly detection modules contribute to LTAD's superior performance.
Learned pseudo class names prove more effective than real class names, highlighting the ability to handle class ambiguity. |
The paper relies on a single pretrained foundation model (ALIGN) and does not explore the effect of other models.
Future work includes investigating alternative data augmentation strategies beyond VAE. |
anomaly detection, long-tailed learning, data augmentation, computer vision, semantic anomaly detection |
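To make "different levels of class imbalance" in the LTAD entry above concrete, here is a minimal Python sketch that generates per-class training counts with an exponential decay profile. The exponential profile, the function name `long_tailed_counts`, and the example numbers are assumptions borrowed from common long-tailed recognition practice; the entry does not state how the authors built their splits.

```python
def long_tailed_counts(n_max: int, num_classes: int, imbalance_ratio: float) -> list[int]:
    """Per-class sample counts decaying from n_max to about n_max / imbalance_ratio.

    Assumed profile: n_i = n_max * imbalance_ratio ** (-i / (num_classes - 1)).
    """
    if num_classes < 1 or n_max < 1 or imbalance_ratio < 1:
        raise ValueError("need num_classes >= 1, n_max >= 1, imbalance_ratio >= 1")
    if num_classes == 1:
        return [n_max]
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)          # 0 for the head class, 1 for the tail class
        counts.append(max(1, round(n_max * imbalance_ratio ** (-frac))))
    return counts


# Head class keeps all 200 images; the tail class keeps roughly 200 / 100.
print(long_tailed_counts(200, 5, 100.0))
```

A larger `imbalance_ratio` gives a harsher tail, which is one way to sweep the levels of imbalance that such a benchmark would need.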
2403.20231
Report |
U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation |
You Wu, Kean Liu, Xiaoyue Mi, Fan Tang, Juan Cao, Jintao Li |
Concept personalization methods enable large text-to-image models to learn
specific subjects (e.g., objects/poses/3D models) and synthesize renditions in
new contexts. Given that the image references are highly biased towards visual
attributes, state-of-the-art personalization models tend to overfit the whole
subject and cannot disentangle visual characteristics in pixel space. In this
study, we propose a more challenging setting, namely fine-grained visual
appearance personalization. Different from existing methods, we allow users to
provide a sentence describing the desired attributes. A novel decoupled
self-augmentation strategy is proposed to generate target-related and
non-target samples to learn user-specified visual attributes. These augmented
data allow for refining the model's understanding of the target attribute while
mitigating the impact of unrelated attributes. At the inference stage,
adjustments are conducted on semantic space through the learned target and
non-target embeddings to further enhance the disentanglement of target
attributes. Extensive experiments on various kinds of visual attributes with
SOTA personalization methods show the ability of the proposed method to mimic
target visual appearance in novel contexts, thus improving the controllability
and flexibility of personalization. |
This paper introduces U-VAP, a novel method for user-specified visual appearance personalization in text-to-image generation that allows control over fine-grained attributes (e.g., color, pattern, structure) from reference images. |
Existing personalization methods struggle to disentangle fine-grained visual attributes within a concept, limiting controllability in combining specific appearances with new concepts. |
U-VAP employs a decoupled self-augmentation strategy. After an initial personalization, it uses an LLM to generate target- and non-target-specific text prompts. These prompts generate augmented image sets, further fine-tuning the model to learn and disentangle the desired attributes. Semantic adjustment during inference enhances disentanglement. |
U-VAP enables controlled and accurate personalization of specific visual attributes, as demonstrated through quantitative and qualitative comparisons with state-of-the-art methods.
The method exhibits flexibility in applying learned attributes to various novel concepts.
User studies confirm U-VAP's superiority in generating personalized images with high fidelity to both the specified attribute and the new concept. |
U-VAP's performance depends on the capability of the base personalization method used in pre-learning, potentially limiting disentanglement effectiveness.
Strong prior information associated with certain words in the inference prompt might sometimes overshadow the learned target attributes. |
text-to-image generation, personalization, attribute disentanglement, diffusion models, self-augmentation |
2403.20193
Report |
Motion Inversion for Video Customization |
Luozhou Wang, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen |
In this research, we present a novel approach to motion customization in
video generation, addressing the lack of thorough exploration of
motion representation within video generative models. Recognizing the unique
challenges posed by video's spatiotemporal nature, our method introduces Motion
Embeddings, a set of explicit, temporally coherent one-dimensional embeddings
derived from a given video. These embeddings are designed to integrate
seamlessly with the temporal transformer modules of video diffusion models,
modulating self-attention computations across frames without compromising
spatial integrity. Our approach offers a compact and efficient solution to
motion representation and enables complex manipulations of motion
characteristics through vector arithmetic in the embedding space. Furthermore,
we identify the Temporal Discrepancy in video generative models, which refers
to variations in how different motion modules process temporal relationships
between frames. We leverage this understanding to optimize the integration of
our motion embeddings. Our contributions include the introduction of a tailored
motion embedding for customization tasks, insights into the temporal processing
differences in video models, and a demonstration of the practical advantages
and effectiveness of our method through extensive experiments. |
This work introduces motion embeddings for video diffusion models, enabling the isolation and manipulation of motion from a source video, facilitating motion transfer to different text-guided generations. |
Directly manipulating motion in text-guided video generation is challenging. This work offers a way to isolate and transfer motion, enhancing control and creative possibilities in video generation. |
The authors integrate motion embeddings into a video diffusion model's UNet architecture. They explore different training objectives and noise initialization strategies to optimize motion transfer for various scenarios, including camera, object, and hybrid motion. |
Motion embeddings successfully isolate motion from source videos, allowing for transfer to novel text-guided generations.
Different training objectives prove beneficial for specific motion types. For instance, appearance-debiased temporal loss excels in camera motion transfer.
The method allows for flexible motion manipulation, including using partial motion embeddings and interpolating across frames for longer sequences. |
The effectiveness of motion transfer can vary depending on the complexity of the motion and the quality of the source video.
The work primarily focuses on motion representation and transfer, with potential for future exploration in combining it with advanced appearance editing techniques. |
video generation, motion transfer, diffusion models, motion embeddings, text-guided synthesis |
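The entry above mentions interpolating motion embeddings across frames to cover longer sequences. A minimal Python sketch of that idea follows, assuming one 1D embedding per frame and plain linear interpolation between neighboring frames; the interpolation scheme and the name `interpolate_frames` are assumptions, since the entry does not specify them.

```python
def interpolate_frames(embeddings: list[list[float]], target_len: int) -> list[list[float]]:
    """Linearly resample per-frame embeddings to target_len frames (illustrative only)."""
    n = len(embeddings)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one embedding and a positive target length")
    if n == 1 or target_len == 1:
        return [list(embeddings[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1)   # fractional index into the source frames
        lo = int(pos)                          # frame just before pos
        hi = min(lo + 1, n - 1)                # frame just after pos, clamped at the end
        w = pos - lo                           # blend weight toward the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(embeddings[lo], embeddings[hi])])
    return out


# Two source frames stretched to three: the middle frame is the average.
print(interpolate_frames([[0.0, 0.0], [2.0, 4.0]], 3))
```

The same element-wise arithmetic is what makes operations such as adding or scaling embeddings possible, which is the kind of vector manipulation the abstract refers to.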
2403.20159
Report |
HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes |
Ke Wu, Kaizhao Zhang, Zhiwei Zhang, Shanshuai Yuan, Muer Tie, Julong Wei, Zijun Xu, Jieru Zhao, Zhongxue Gan, Wenchao Ding |
Online dense mapping of urban scenes forms a fundamental cornerstone for
scene understanding and navigation of autonomous vehicles. Recent advancements
in mapping methods are mainly based on NeRF, whose rendering speed is too slow
to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering
speed hundreds of times faster than NeRF, holds greater potential in online
dense mapping. However, integrating 3DGS into a street-view dense mapping
framework still faces two challenges, including incomplete reconstruction due
to the absence of geometric information beyond the LiDAR coverage area and
extensive computation for reconstruction in large urban scenes. To this end, we
propose HGS-Mapping, an online dense mapping framework in unbounded large-scale
scenes. To attain complete construction, our framework introduces Hybrid
Gaussian Representation, which models different parts of the entire scene using
Gaussians with distinct properties. Furthermore, we employ a hybrid Gaussian
initialization mechanism and an adaptive update method to achieve high-fidelity
and rapid reconstruction. To the best of our knowledge, we are the first to
integrate Gaussian representation into online dense mapping of urban scenes.
Our approach achieves SOTA reconstruction accuracy while employing only 66%
of the Gaussians, leading to 20% faster reconstruction speed. |
This paper proposes HGS-Mapping, the first online dense mapping framework for urban scenes using a novel 3D Gaussian Splatting-based representation. |
Current NeRF-based mapping methods lack rendering speed for online applications, while existing 3DGS methods struggle with complete reconstruction and computational efficiency in large-scale urban environments. |
The HGS-Mapping framework leverages a Hybrid Gaussian Representation (Sphere Gaussian for sky, 2D Gaussian Plane for roads, and 3D Gaussian for scenery). It employs a hybrid Gaussian initialization mechanism (combining LiDAR and feature matching) and an adaptive update method (silhouette filtering, densify control, and importance pruning) for efficient and accurate reconstruction. |
HGS-Mapping achieves state-of-the-art reconstruction accuracy in urban environments, outperforming NeRF and Gaussian-based baselines in rendering quality.
The method demonstrates significant speed improvements, achieving 20% faster reconstruction than the current SOTA online method (SplaTAM) while using only 66% of the Gaussians.
The proposed Hybrid Gaussian Representation effectively addresses sky and road modeling challenges, leading to more efficient and accurate urban scene reconstruction. |
The RANSAC-based road surface extraction can be limited in scenarios with complex road geometry.
Future work could explore extending the framework to handle arbitrary outdoor scenes and incorporating dynamic object representation. |
gaussian splatting, dense mapping, autonomous driving, 3d reconstruction, urban scenes |
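The limitations in the entry above refer to RANSAC-based road surface extraction. The following Python sketch shows generic RANSAC plane fitting on a 3D point list, which is the textbook form of that technique; it is not the HGS-Mapping implementation, and the threshold, iteration count, and function names are illustrative assumptions.

```python
import math
import random


def plane_from_points(p, q, r):
    """Return (unit_normal, d) with n.x + d = 0 through p, q, r, or None if collinear."""
    u = (q[0] - p[0], q[1] - p[1], q[2] - p[2])
    v = (r[0] - p[0], r[1] - p[1], r[2] - p[2])
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])            # cross product u x v
    norm = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
    if norm < 1e-12:
        return None
    n = (n[0] / norm, n[1] / norm, n[2] / norm)
    d = -(n[0] * p[0] + n[1] * p[1] + n[2] * p[2])
    return n, d


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Keep the candidate plane that has the most points within `threshold` of it."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue                           # degenerate (collinear) sample
        n, d = plane
        inliers = [pt for pt in points
                   if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

A single plane is a reasonable model for flat road segments, which also hints at why the entry lists complex road geometry (slopes, curbs, overpasses) as a limitation.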
2403.20153
Report |
Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior |
Jaehoon Ko, Kyusun Cho, Joungbin Lee, Heeji Yoon, Sangmin Lee, Sangjun Ahn, Seungryong Kim |
Recent methods for audio-driven talking head synthesis often optimize neural
radiance fields (NeRF) on a monocular talking portrait video, leveraging its
capability to render high-fidelity and 3D-consistent novel-view frames.
However, they often struggle to reconstruct complete face geometry due to the
absence of comprehensive 3D information in the input monocular videos. In this
paper, we introduce a novel audio-driven talking head synthesis framework,
called Talk3D, that can faithfully reconstruct plausible facial geometries
by effectively adopting the pre-trained 3D-aware generative prior. Given the
personalized 3D generative model, we present a novel audio-guided attention
U-Net architecture that predicts the dynamic face variations in the NeRF space
driven by audio. In addition, our model is modulated by audio-unrelated
conditioning tokens which effectively disentangle variations unrelated to audio
features. Compared to existing methods, our method excels in generating
realistic facial geometries even under extreme head poses. We also conduct
extensive experiments showing our approach surpasses state-of-the-art
benchmarks in terms of both quantitative and qualitative evaluations. |
Talk3D, a novel framework for high-fidelity 3D talking head synthesis, leverages a 3D-aware GAN prior and region-aware motion prediction. |
Existing audio-driven talking head synthesis methods struggle to reconstruct complete face geometry and lack multi-view consistency, particularly from unseen viewpoints. |
Talk3D uses a personalized 3D generator fine-tuned with VIVE3D and an audio-guided attention U-Net architecture to predict triplane offsets (deltaplanes) that capture audio-driven facial dynamics. |
Talk3D achieves state-of-the-art results in quantitative and qualitative evaluations, outperforming previous methods in terms of image fidelity, lip synchronization accuracy, and robustness to novel viewpoints.
The method successfully disentangles local variations like eye blinks, torso movements, and background motion, ensuring accurate lip-sync and realistic facial animations.
Talk3D allows for facial attribute manipulation (e.g., age, hair length) by leveraging the latent space of the 3D-aware GAN. |
Talk3D, relying on GAN inversion, currently exhibits limited generalizability beyond photorealistic human faces.
The reliance on GAN inversion introduces data preparation complexities, requiring precise frame alignment and cropping. |
talking head synthesis, neural radiance fields (nerf), 3d-aware gans, audio-driven animation, deep learning |
2403.20126
Report |
ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning |
Beomyoung Kim, Joonsang Yu, Sung Ju Hwang |
Panoptic segmentation, combining semantic and instance segmentation, stands
as a cutting-edge computer vision task. Despite recent progress with deep
learning models, the dynamic nature of real-world applications necessitates
continual learning, where models adapt to new classes (plasticity) over time
without forgetting old ones (avoiding catastrophic forgetting). Current continual
segmentation methods often rely on distillation strategies like knowledge
distillation and pseudo-labeling, which are effective but result in increased
training complexity and computational overhead. In this paper, we introduce a
novel and efficient method for continual panoptic segmentation based on Visual
Prompt Tuning, dubbed ECLIPSE. Our approach involves freezing the base model
parameters and fine-tuning only a small set of prompt embeddings, addressing
both catastrophic forgetting and plasticity and significantly reducing the
trainable parameters. To mitigate inherent challenges such as error propagation
and semantic drift in continual segmentation, we propose logit manipulation to
effectively leverage common knowledge across the classes. Experiments on ADE20K
continual panoptic segmentation benchmark demonstrate the superiority of
ECLIPSE, notably its robustness against catastrophic forgetting and its
reasonable plasticity, achieving a new state-of-the-art. The code is available
at https://github.com/clovaai/ECLIPSE. |
This paper introduces ECLIPSE, a novel, efficient method for continual panoptic segmentation based on Visual Prompt Tuning. It freezes base model parameters and fine-tunes only prompt embeddings to learn new classes, mitigating catastrophic forgetting while enhancing plasticity. |
Continual learning in panoptic segmentation is crucial for real-world applications that require adapting to new classes over time without forgetting old ones. Existing methods rely on distillation strategies, leading to increased complexity and overhead. |
ECLIPSE freezes the base model and introduces new prompt embeddings for each set of new classes. It applies logit manipulation, a novel strategy that exploits inter-class knowledge to address error propagation and semantic drift. |
ECLIPSE achieves state-of-the-art results on ADE20K continual panoptic segmentation benchmark with only 1.3% of trainable parameters.
It demonstrates superior robustness against catastrophic forgetting, especially as the number of continual steps increases.
The method also effectively learns new classes, even with limited base knowledge. |
The computational complexity increases with expanding prompt sets as the number of classes grows.
Future work may explore optimizing the computational complexity for scenarios with a massive number of classes. |
continual learning, panoptic segmentation, visual prompt tuning, logit manipulation, catastrophic forgetting |
2403.20105
Report |
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models |
Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord |
Foundation models have exhibited unprecedented capabilities in tackling many
domains and tasks. Models such as CLIP are currently widely used to bridge
cross-modal representations, and text-to-image diffusion models are arguably
the leading models in terms of realistic image generation. Image generative
models are trained on massive datasets that provide them with powerful internal
spatial representations. In this work, we explore the potential benefits of
such representations, beyond image generation, in particular, for dense visual
prediction tasks. We focus on the task of image segmentation, which is
traditionally solved by training models on closed-vocabulary datasets, with
pixel-level annotations. To avoid the annotation cost or training large
diffusion models, we constrain our setup to be zero-shot and training-free. In
a nutshell, our pipeline leverages different and relatively small-sized,
open-source foundation models for zero-shot open-vocabulary segmentation. The
pipeline is as follows: the image is passed to both a captioner model (i.e.
BLIP) and a diffusion model (i.e., Stable Diffusion Model) to generate a text
description and visual representation, respectively. The features are clustered
and binarized to obtain class agnostic masks for each object. These masks are
then mapped to a textual class, using the CLIP model to support
open-vocabulary segmentation. Finally, we add a refinement step that yields a more
precise segmentation mask. Our approach (dubbed FreeSeg-Diff), which does not
rely on any training, outperforms many training-based approaches on both Pascal
VOC and COCO datasets. In addition, we show very competitive results compared
to the recent weakly-supervised segmentation approaches. We provide
comprehensive experiments showing the superiority of diffusion model features
compared to other pretrained models. Project page:
https://bcorrad.github.io/freesegdiff/ |
This paper introduces FreeSeg-Diff, a zero-shot, training-free approach for open-vocabulary image segmentation leveraging pre-trained diffusion models. |
This approach eliminates the need for expensive pixel-level annotations and the training of large diffusion models, potentially making image segmentation more accessible and scalable. |
The method uses a pre-trained diffusion model to extract image features, clusters these features to generate class-agnostic masks, and then employs CLIP to map these masks to textual classes extracted from image captions. |
FreeSeg-Diff outperforms several training-based and weakly supervised approaches on Pascal VOC and COCO datasets.
The study highlights the superior semantic localization capabilities of diffusion models compared to other pre-trained models like CLIP, DINOv2, and ViT.
The approach demonstrates competitive performance against recent state-of-the-art weakly supervised segmentation methods. |
The performance of FreeSeg-Diff still lags behind state-of-the-art supervised segmentation approaches.
The reliance on multiple models, including a large diffusion model, introduces a slight computational overhead compared to traditional segmentation models. |
image segmentation, diffusion models, zero-shot learning, open-vocabulary segmentation, weakly supervised learning |
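The step in the entry above where class-agnostic masks are mapped to textual classes can be sketched as a nearest-neighbor lookup by cosine similarity. The Python example below uses toy vectors in place of real CLIP image and text embeddings and makes no CLIP call; the function names and the choice of a plain argmax are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def assign_labels(mask_embeddings: list[list[float]],
                  text_embeddings: dict[str, list[float]]) -> list[str]:
    """Give each mask the candidate class whose text embedding is most similar."""
    labels = []
    for m in mask_embeddings:
        best = max(text_embeddings, key=lambda name: cosine(m, text_embeddings[name]))
        labels.append(best)
    return labels


# Toy stand-ins: in the pipeline, candidates would come from the image caption.
candidates = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
print(assign_labels([[0.9, 0.1], [0.2, 0.8]], candidates))
```

Restricting the candidates to nouns found in the caption keeps the label set open-vocabulary while avoiding a comparison against every possible class name.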
2403.20079
Report |
SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior |
Zhongrui Yu, Haoran Wang, Jinze Yang, Hanzhang Wang, Zeke Xie, Yunfeng Cai, Jiale Cao, Zhong Ji, Mingming Sun |
Novel View Synthesis (NVS) for street scenes plays a critical role in
autonomous driving simulation. The current mainstream technique to achieve it
is neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian
Splatting (3DGS). Although thrilling progress has been made, when handling
street scenes, current methods struggle to maintain rendering quality at
viewpoints that deviate significantly from the training viewpoints. This issue
stems from the sparse training views captured by a fixed camera on a moving
vehicle. To tackle this problem, we propose a novel approach that enhances the
capacity of 3DGS by leveraging the prior of a Diffusion Model along with
complementary multi-modal data. Specifically, we first fine-tune a Diffusion
Model by adding images from adjacent frames as conditions, while exploiting
depth data from LiDAR point clouds to supply additional spatial information.
Then we apply the Diffusion Model to regularize the 3DGS at unseen views during
training. Experimental results validate the effectiveness of our method
compared with current state-of-the-art models, and demonstrate its advantage in
rendering images from broader views. |
This paper proposes a novel method, SGD, that leverages a fine-tuned Diffusion Model to enhance the free-view rendering capabilities of 3D Gaussian Splatting for street view synthesis. |
Current neural rendering methods for street view synthesis struggle to maintain quality at viewpoints far from training views due to the limited perspective of vehicle-captured data. This limits their use in autonomous driving simulations which require high-quality rendering from diverse perspectives. |
The method fine-tunes a Stable Diffusion Model on driving scenes using adjacent frames as context and LiDAR data for spatial guidance. This fine-tuned model then regularizes the 3DGS training by providing priors for unseen views. |
SGD outperforms state-of-the-art methods in sparse-view settings on KITTI and KITTI-360 datasets.
The method significantly improves rendering quality at novel viewpoints distant from training views.
SGD preserves the real-time inference speed of 3DGS, making it suitable for driving simulations. |
The integration of the Diffusion Model increases training time due to the denoising process.
Future work includes exploring more efficient training strategies. |
novel view synthesis, 3d gaussian splatting, diffusion models, autonomous driving simulation, sparse-view reconstruction |
2403.20034
Report |
NeSLAM: Neural Implicit Mapping and Self-Supervised Feature Tracking With Depth Completion and Denoising |
Tianchen Deng, Yanbo Wang, Hongle Xie, Hesheng Wang, Jingchuan Wang, Danwei Wang, Weidong Chen |
In recent years, there have been significant advancements in 3D
reconstruction and dense RGB-D SLAM systems. One notable development is the
application of Neural Radiance Fields (NeRF) in these systems, which utilizes
implicit neural representation to encode 3D scenes. This extension of NeRF to
SLAM has shown promising results. However, the depth images obtained from
consumer-grade RGB-D sensors are often sparse and noisy, which poses
significant challenges for 3D reconstruction and affects the accuracy of the
representation of the scene geometry. Moreover, the original hierarchical
feature grid with occupancy value is inaccurate for scene geometry
representation. Furthermore, the existing methods select random pixels for
camera tracking, which leads to inaccurate localization and is not robust in
real-world indoor environments. To this end, we present NeSLAM, an advanced
framework that achieves accurate and dense depth estimation, robust camera
tracking, and realistic synthesis of novel views. First, a depth completion and
denoising network is designed to provide dense geometry prior and guide the
neural implicit representation optimization. Second, the occupancy scene
representation is replaced with Signed Distance Field (SDF) hierarchical scene
representation for high-quality reconstruction and view synthesis. Furthermore,
we also propose a NeRF-based self-supervised feature tracking algorithm for
robust real-time tracking. Experiments on various indoor datasets demonstrate
the effectiveness and accuracy of the system in reconstruction, tracking
quality, and novel view synthesis. |
This paper presents NeSLAM, a dense RGB-D SLAM system for accurate and robust 3D reconstruction and novel view synthesis using neural implicit mapping and self-supervised feature tracking. |
Existing dense SLAM systems struggle with sparse, noisy depth images from consumer-grade sensors and inaccurate camera tracking in complex indoor environments. This work aims to address these limitations. |
The system features a depth completion and denoising network for improved geometry prior, utilizes Signed Distance Field (SDF) for enhanced scene representation, and incorporates a NeRF-based self-supervised feature tracking algorithm for robust pose estimation. |
NeSLAM achieves more accurate and complete 3D reconstructions compared to existing NeRF-based SLAM methods like iMAP and NICE-SLAM.
It demonstrates superior camera tracking accuracy, outperforming other NeRF-based SLAM systems and achieving competitive results compared to traditional methods like ORB-SLAM2.
It generates higher-fidelity novel views with better clarity and completeness, as evidenced by qualitative and quantitative (PSNR) evaluation on various datasets. |
The system is currently limited to static environments and does not handle dynamic objects.
Future work will explore extending the approach to dynamic scenes and improving computational efficiency. |
slam, nerf, depth completion, feature tracking, 3d reconstruction |
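The NeSLAM entry above reports novel view quality with PSNR. As a reminder of what that metric measures, here is a minimal Python sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), applied to flat lists of pixel values; the function name and the default `max_value=1.0` (images scaled to [0, 1]) are assumptions, not details from the paper.

```python
import math


def psnr(reference: list[float], rendered: list[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf                        # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Higher values mean the rendered view is closer to the reference, and each 10 dB step corresponds to a tenfold drop in mean squared error.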
2403.20032
Report |
HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes |
Zhuopeng Li, Yilin Zhang, Chenming Wu, Jianke Zhu, Liangjun Zhang |
The rapid growth of 3D Gaussian Splatting (3DGS) has revolutionized neural
rendering, enabling real-time production of high-quality renderings. However,
the previous 3DGS-based methods have limitations in urban scenes due to
reliance on initial Structure-from-Motion (SfM) points and difficulties in
rendering distant, sky and low-texture areas. To overcome these challenges, we
propose a hybrid optimization method named HO-Gaussian, which combines a
grid-based volume with the 3DGS pipeline. HO-Gaussian eliminates the dependency
on SfM point initialization, allowing for rendering of urban scenes, and
incorporates the Point Densitification to enhance rendering quality in
problematic regions during training. Furthermore, we introduce Gaussian
Direction Encoding as an alternative for spherical harmonics in the rendering
pipeline, which enables view-dependent color representation. To account for
multi-camera systems, we introduce neural warping to enhance object consistency
across different cameras. Experimental results on widely used autonomous
driving datasets demonstrate that HO-Gaussian achieves photo-realistic
rendering in real-time on multi-camera urban datasets. |
This paper presents HO-Gaussian, a hybrid optimization method for novel view rendering of multi-camera urban scenes that combines a grid-based volume with a 3D Gaussian Splatting pipeline. |
Existing 3D Gaussian Splatting (3DGS) methods struggle in urban scenes due to reliance on sparse SfM point initialization and difficulties in rendering distant, sky, and low-texture areas. This limits their effectiveness in large-scale urban environments. |
HO-Gaussian uses a grid-based volume to learn Gaussian positions and optimize geometric information, enabling point densification in challenging areas. It introduces Gaussian directional encoding (replacing spherical harmonics) for view-dependent color representation and neural warping to enhance object consistency across multiple cameras. |
HO-Gaussian achieves real-time rendering while maintaining photo-realistic texture details in urban scenes.
The method reduces disk space usage compared to traditional 3DGS by employing efficient encoding techniques.
Extensive evaluations on Waymo and Argoverse datasets demonstrate superior performance compared to state-of-the-art NeRF-based and 3DGS-based methods. |
The current implementation relies on a predefined bounding sphere, potentially limiting scalability to even larger scenes.
Future work could explore incorporating temporal information and dynamic elements for more comprehensive urban scene rendering. |
novel view synthesis, urban scenes, gaussian splatting, neural rendering, hybrid optimization |
2403.20018
Report |
SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image |
Yunhao Li, Xiaodong Wang, Ping Wang, Xin Yuan, Peidong Liu |
In this paper, we explore the potential of Snapshot Compressive Imaging (SCI)
technique for recovering the underlying 3D scene representation from a single
temporal compressed image. SCI is a cost-effective method that enables the
recording of high-dimensional data, such as hyperspectral or temporal
information, into a single image using low-cost 2D imaging sensors. To achieve
this, a series of specially designed 2D masks are usually employed, which not
only reduces storage requirements but also offers potential privacy protection.
Inspired by this, to take one step further, our approach builds upon the
powerful 3D scene representation capabilities of neural radiance fields (NeRF).
Specifically, we formulate the physical imaging process of SCI as part of the
training of NeRF, allowing us to exploit its impressive performance in
capturing complex scene structures. To assess the effectiveness of our method,
we conduct extensive evaluations using both synthetic data and real data
captured by our SCI system. Extensive experimental results demonstrate that our
proposed approach surpasses the state-of-the-art methods in terms of image
reconstruction and novel view image synthesis. Moreover, our method also
exhibits the ability to restore high frame-rate multi-view consistent images by
leveraging SCI and the rendering capabilities of NeRF. The code is available at
https://github.com/WU-CVGL/SCINeRF. |
This paper introduces SCINeRF, a novel method to recover 3D scene representations and multi-view images from a single snapshot compressed image. |
This method addresses limitations of existing SCI image reconstruction techniques that do not consider 3D scene structure and multi-view consistency. |
SCINeRF leverages NeRF to represent the scene and jointly optimizes NeRF parameters and camera poses by minimizing the difference between a synthesized compressed image and the actual measurement. |
SCINeRF achieves superior performance over state-of-the-art SCI image restoration methods on both synthetic and real datasets.
The method shows robustness to high compression ratios, maintaining high image quality even with increased compression.
Experimental results demonstrate the importance of considering 3D scene structure for accurate and consistent multi-view image recovery from SCI data. |
The rendering process in SCINeRF may introduce a marginal loss of image information compared to direct recovery methods.
Future work will focus on improving the capturing and reconstruction speed and exploring applications in dynamic scene capture. |
neural radiance fields, nerf, snapshot compressive imaging, sci, 3d scene representation |
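To illustrate the SCINeRF entry above: the optimization it describes compares a synthesized compressed image with the captured one. A minimal sketch of that comparison follows, assuming the standard SCI image formation model in which the snapshot is the sum of mask-modulated frames. The function names, array shapes, and random binary masks are hypothetical, and the NeRF renderer is replaced by plain arrays, so this is not the authors' implementation.

```python
import numpy as np


def sci_measurement(frames, masks):
    """Compress T frames into one snapshot: Y = sum_t masks[t] * frames[t]."""
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must share the shape (T, H, W)")
    return (frames * masks).sum(axis=0)


def measurement_loss(rendered_frames, masks, measurement):
    """Mean squared error between the synthesized and the captured snapshot."""
    synthesized = sci_measurement(rendered_frames, masks)
    return float(np.mean((synthesized - measurement) ** 2))


# Hypothetical toy data: 8 frames of 4x4 pixels and random binary masks.
rng = np.random.default_rng(0)
T, H, W = 8, 4, 4
true_frames = rng.random((T, H, W))
masks = rng.integers(0, 2, size=(T, H, W))
snapshot = sci_measurement(true_frames, masks)

# Frames that match the scene reproduce the snapshot; wrong frames do not.
loss_true = measurement_loss(true_frames, masks, snapshot)
loss_wrong = measurement_loss(np.zeros((T, H, W)), masks, snapshot)
```

In a full pipeline the `rendered_frames` argument would come from a differentiable renderer, so that the loss can drive both scene parameters and camera poses. Here it only demonstrates the forward model.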
2403.20002
Report |
Grounding and Enhancing Grid-based Models for Neural Fields |
Zelin Zhao, Fenglei Fan, Wenlong Liao, Junchi Yan |
Many contemporary studies utilize grid-based models for neural field
representation, but a systematic analysis of grid-based models is still
missing, hindering the improvement of those models. Therefore, this paper
introduces a theoretical framework for grid-based models. This framework points
out that these models' approximation and generalization behaviors are
determined by grid tangent kernels (GTK), which are intrinsic properties of
grid-based models. The proposed framework facilitates a consistent and
systematic analysis of diverse grid-based models. Furthermore, the introduced
framework motivates the development of a novel grid-based model named the
Multiplicative Fourier Adaptive Grid (MulFAGrid). The numerical analysis
demonstrates that MulFAGrid exhibits a lower generalization bound than its
predecessors, indicating its robust generalization performance. Empirical
studies reveal that MulFAGrid achieves state-of-the-art performance in various
tasks, including 2D image fitting, 3D signed distance field (SDF)
reconstruction, and novel view synthesis, demonstrating superior representation
ability. The project website is available at
https://sites.google.com/view/cvpr24-2034-submission/home. |
This paper introduces a theoretical framework for grid-based neural field models based on grid tangent kernels (GTKs), and proposes a novel model named Multiplicative Fourier Adaptive Grid (MulFAGrid). |
A systematic analysis of grid-based models, which are computationally efficient for neural field representation, has been missing, hindering their improvement. |
The paper introduces the concept of GTKs to analyze the training and generalization behaviors of grid-based models. It then proposes MulFAGrid, which leverages multiplicative filters and Fourier features for effective representation learning. |
MulFAGrid exhibits a wider GTK spectrum in the high-frequency domain, indicating better learning efficiency for high-frequency components.
Numerical studies show MulFAGrid has a tighter generalization bound than existing grid-based models.
Empirical evaluations demonstrate MulFAGrid achieves state-of-the-art performance in 2D image fitting, 3D SDF reconstruction, and novel view synthesis. |
The rendering speed of MulFAGrid is lower than some baselines like 3DGS.
Further research on improving rendering speed and exploring other applications of the GTK theory is warranted. |
neural fields, grid-based models, grid tangent kernel, multiplicative filters, fourier features |
2403.19985
Report |
Stable Surface Regularization for Fast Few-Shot NeRF |
Byeongin Joung, Byeong-Uk Lee, Jaesung Choe, Ukcheol Shin, Minjun Kang, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon |
This paper proposes an algorithm for synthesizing novel views under few-shot
setup. The main concept is to develop a stable surface regularization technique
called Annealing Signed Distance Function (ASDF), which anneals the surface in
a coarse-to-fine manner to accelerate convergence speed. We observe that the
Eikonal loss - which is a widely known geometric regularization - requires
dense training signal to shape different level-sets of SDF, leading to
low-fidelity results under few-shot training. In contrast, the proposed surface
regularization successfully reconstructs scenes and produces high-fidelity
geometry with stable training. Our method is further accelerated by utilizing
grid representation and monocular geometric priors. Finally, the proposed
approach is up to 45 times faster than existing few-shot novel view synthesis
methods, and it produces comparable results in the ScanNet dataset and
NeRF-Real dataset. |
This paper introduces a novel surface regularization technique called Annealing Signed Distance Function (ASDF) for fast few-shot novel view synthesis. |
Existing methods struggle with few-shot novel view synthesis due to the difficulty of extracting reliable geometry information from sparse input views, leading to unstable optimization and low-fidelity results. |
The ASDF loss enforces adaptive geometric smoothing in a coarse-to-fine manner by gradually reducing the smoothing area during training. This allows the network to first learn the overall structure and then progressively recover detailed geometry. The method utilizes multi-level voxel grids, monocular geometric priors, and combines ASDF loss with rendering losses for color, depth, and surface normal. |
The ASDF loss leads to more stable optimization compared to conventional Eikonal loss in few-shot scenarios.
The proposed method achieves comparable performance to state-of-the-art few-shot NeRF methods while being up to 45 times faster.
The approach demonstrates robustness in reconstructing and synthesizing novel views, particularly in homogeneous regions and scenes with limited viewing directions. |
The Annealing Signed Distance Function (ASDF) loss requires hyperparameter tuning depending on scene geometry and SfM results.
Future work could explore adaptive methods for hyperparameter selection and integrate recent advancements like hash encoding for further optimization speed improvements. |
novel view synthesis, neural radiance fields (nerf), few-shot learning, surface regularization, geometric priors |
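For reference, the conventional Eikonal loss that the entry above contrasts with ASDF penalizes deviations of the SDF gradient norm from 1. The sketch below evaluates it on an analytic sphere with finite differences. The helper names and the sampling range are assumptions for illustration, and the ASDF annealing schedule itself is not reproduced here.

```python
import numpy as np


def sphere_sdf(points, radius=1.0):
    """Signed distance from each point to a sphere centered at the origin."""
    return np.linalg.norm(points, axis=-1) - radius


def numerical_gradient(fn, points, eps=1e-4):
    """Central-difference gradient of a scalar field at each point."""
    grads = np.zeros_like(points)
    for axis in range(points.shape[-1]):
        offset = np.zeros(points.shape[-1])
        offset[axis] = eps
        grads[..., axis] = (fn(points + offset) - fn(points - offset)) / (2.0 * eps)
    return grads


def eikonal_loss(fn, points):
    """Mean of (||grad f|| - 1)^2 over the sampled points."""
    grad_norm = np.linalg.norm(numerical_gradient(fn, points), axis=-1)
    return float(np.mean((grad_norm - 1.0) ** 2))


rng = np.random.default_rng(0)
samples = rng.uniform(-2.0, 2.0, size=(256, 3))

# A true SDF has unit gradient norm; a field scaled by 2 has norm 2.
valid_loss = eikonal_loss(sphere_sdf, samples)
scaled_loss = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), samples)
```

The true SDF yields a loss close to zero, while the scaled field yields a loss close to one. The penalty constrains every sampled point regardless of its distance to the surface, which is consistent with the abstract's remark that it needs a dense training signal.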
2403.19975
Report |
Context-Aware Integration of Language and Visual References for Natural Language Tracking |
Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen |
Tracking by natural language specification (TNL) aims to consistently
localize a target in a video sequence given a linguistic description in the
initial frame. Existing methodologies perform language-based and template-based
matching for target reasoning separately and merge the matching results from
two sources, which suffer from tracking drift when language and visual
templates misalign with the dynamic target state and ambiguity in the later
merging stage. To tackle the issues, we propose a joint multi-modal tracking
framework with 1) a prompt modulation module to leverage the complementarity
between temporal visual templates and language expressions, enabling precise
and context-aware appearance and linguistic cues, and 2) a unified target
decoding module to integrate the multi-modal reference cues and execute the
integrated queries on the search image to predict the target location in an
end-to-end manner directly. This design ensures spatio-temporal consistency by
leveraging historical visual information and introduces an integrated solution,
generating predictions in a single step. Extensive experiments conducted on
TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed
approach. The results demonstrate competitive performance against
state-of-the-art methods for both tracking and grounding. |
Proposes QueryNLT, a novel multi-modal tracking framework for tracking by natural language specification (TNL), which leverages the complementarity between visual and language features to improve target localization accuracy. |
Existing TNL methods suffer from tracking drift due to separate language and template matching, leading to misalignment with the dynamic target state and ambiguity in merging results. |
1. **Prompt Modulation Module:** Filters inconsistent descriptions from language and visual references to generate precise, context-aware cues. 2. **Unified Target Decoding Module:** Integrates multi-modal prompts and performs target retrieval from the search image in an end-to-end manner. |
Achieves competitive performance against state-of-the-art trackers on TNL2K, OTB-Lang, and LaSOT benchmarks.
Shows significant improvements over methods relying on separate language and template matching, highlighting the importance of multi-modal integration.
Demonstrates robust performance in handling challenging factors such as appearance variations, background clutter, and similar distractors. |
Limited exploration of more sophisticated language models for richer semantic understanding.
Further investigation into incorporating temporal reasoning mechanisms for enhanced long-term tracking. |
natural language tracking, visual tracking, multi-modal learning, prompt modulation, target decoding |
2403.19967
Report |
Rewrite the Stars |
Xu Ma, Xiyang Dai, Yue Bai, Yizhou Wang, Yun Fu |
Recent studies have drawn attention to the untapped potential of the "star
operation" (element-wise multiplication) in network design. While intuitive
explanations abound, the foundational rationale behind its application remains
largely unexplored. Our study attempts to reveal the star operation's ability
to map inputs into high-dimensional, non-linear feature spaces -- akin to
kernel tricks -- without widening the network. We further introduce StarNet, a
simple yet powerful prototype, demonstrating impressive performance and low
latency under compact network structure and efficient budget. Like stars in the
sky, the star operation appears unremarkable but holds a vast universe of
potential. Our work encourages further exploration across tasks, with codes
available at https://github.com/ma-xu/Rewrite-the-Stars. |
This paper investigates the "star" operation (element-wise multiplication) in neural networks, showing it implicitly maps inputs to high-dimensional, non-linear feature spaces similar to kernel methods. |
Understanding the star operation's power can lead to more efficient and compact network designs. |
The authors analyze the star operation mathematically, rewrite it to reveal its dimensionality expansion, and compare it to summation in various experiments with a simple network (DemoNet). They also introduce StarNet, a proof-of-concept efficient architecture based on these insights. |
Star operation consistently outperforms summation in image classification, especially with narrower networks.
Visualizations of decision boundaries show the star operation allows for more complex representations, similar to polynomial kernels in SVMs.
StarNet achieves competitive performance on ImageNet while being significantly faster than other efficient models with similar complexity. |
The study primarily focuses on image classification, leaving its generalization to other tasks for future work.
While the paper demonstrates the potential of activation-free networks with star operations, further research is needed to fully realize that potential. |
element-wise multiplication, star operation, kernel methods, efficient networks, high-dimensional feature spaces |
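The dimensionality expansion described in the entry above can be checked numerically for a single output channel. In this sketch the product of two linear projections of an n-dimensional input is rewritten as a linear function of all n(n+1)/2 pairwise monomials. The variable names and the choice n = 5 are assumptions for illustration, and the code is not taken from the authors' repository.

```python
import itertools

import numpy as np


def star(w1, w2, x):
    """Star operation for one channel: (w1 . x) * (w2 . x)."""
    return float(np.dot(w1, x) * np.dot(w2, x))


def explicit_features(x):
    """All monomials x_i * x_j with i <= j, the implicit feature space."""
    pairs = itertools.combinations_with_replacement(range(len(x)), 2)
    return np.array([x[i] * x[j] for i, j in pairs])


def explicit_coefficients(w1, w2):
    """Coefficient of each monomial when the product is expanded."""
    coeffs = []
    for i, j in itertools.combinations_with_replacement(range(len(w1)), 2):
        if i == j:
            coeffs.append(w1[i] * w2[i])
        else:
            coeffs.append(w1[i] * w2[j] + w1[j] * w2[i])
    return np.array(coeffs)


rng = np.random.default_rng(0)
n = 5
x, w1, w2 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

implicit = star(w1, w2, x)
explicit = float(explicit_coefficients(w1, w2) @ explicit_features(x))
num_features = len(explicit_features(x))  # n * (n + 1) / 2
```

Both values agree up to floating-point error, while the star form only ever touches n-dimensional vectors. This is the sense in which the operation resembles a polynomial kernel without widening the network.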
2403.19964
Report |
FairRAG: Fair Human Generation via Fair Retrieval Augmentation |
Robik Shrestha, Yang Zou, Qiuyu Chen, Zhiheng Li, Yusheng Xie, Siqi Deng |
Existing text-to-image generative models reflect or even amplify societal
biases ingrained in their training data. This is especially concerning for
human image generation where models are biased against certain demographic
groups. Existing attempts to rectify this issue are hindered by the inherent
limitations of the pre-trained models and fail to substantially improve
demographic diversity. In this work, we introduce Fair Retrieval Augmented
Generation (FairRAG), a novel framework that conditions pre-trained generative
models on reference images retrieved from an external image database to improve
fairness in human generation. FairRAG enables conditioning through a
lightweight linear module that projects reference images into the textual
space. To enhance fairness, FairRAG applies simple-yet-effective debiasing
strategies, providing images from diverse demographic groups during the
generative process. Extensive experiments demonstrate that FairRAG outperforms
existing methods in terms of demographic diversity, image-text alignment, and
image fidelity while incurring minimal computational overhead during inference. |
Introduces Fair Retrieval Augmented Generation (FairRAG), a framework that uses retrieved reference images to improve demographic diversity in human image generation, addressing biases in pre-trained text-to-image models. |
Existing text-to-image models perpetuate societal biases, particularly against certain demographic groups, necessitating fairer generation methods. |
FairRAG trains a linear layer to project reference images into the textual space of a frozen pre-trained model. It employs debiasing techniques like debiased queries and balanced sampling for fair retrieval, and uses a transfer instruction to guide attribute transfer during generation. |
FairRAG outperforms baselines in demographic diversity across various professions, improving from 0.341 to 0.438 compared to the best non-RAG method.
It also shows improvement in image-text alignment and maintains competitive image fidelity.
The framework incurs minimal computational overhead, adding just 0.2 seconds to generate an image compared to the baseline Stable Diffusion model. |
The current implementation uses a one-to-one image mapping; conditioning on multiple reference images could further enhance diversity.
Generated images can still exhibit disfigurements, suggesting the need for incorporating human anatomy knowledge into the models. |
fairness, text-to-image generation, retrieval augmented generation, demographic diversity, bias mitigation |
2403.19963
Report |
Efficient Modulation for Vision Networks |
Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, Lu Yuan |
In this work, we present efficient modulation, a novel design for efficient
vision networks. We revisit the modulation mechanism, which operates on the input
through convolutional context modeling and feature projection layers, and fuses
features via element-wise multiplication and an MLP block. We demonstrate that
the modulation mechanism is particularly well suited for efficient networks and
further tailor the modulation design by proposing the efficient modulation
(EfficientMod) block, which is considered the essential building block for our
networks. Benefiting from the prominent representational ability of modulation
mechanism and the proposed efficient design, our network can accomplish better
trade-offs between accuracy and efficiency and set new state-of-the-art
performance in the zoo of efficient networks. When integrating EfficientMod
with the vanilla self-attention block, we obtain the hybrid architecture which
further improves the performance without loss of efficiency. We carry out
comprehensive experiments to verify EfficientMod's performance. With fewer
parameters, our EfficientMod-s performs 0.6 top-1 accuracy better than
EfficientFormerV2-s2 and is 25% faster on GPU, and 2.9 better than
MobileViTv2-1.0 at the same GPU latency. Additionally, our method presents a
notable improvement in downstream tasks, outperforming EfficientFormerV2-s by
3.6 mIoU on the ADE20K benchmark. Code and checkpoints are available at
https://github.com/ma-xu/EfficientMod. |
This paper proposes Efficient Modulation (EfficientMod), a novel convolutional block designed for efficient vision networks. EfficientMod leverages a modulation mechanism with tailored context modeling and feature projection for enhanced efficiency. |
Existing efficient networks with attention mechanisms or convolutional alternatives often suffer from high computational costs. This work addresses this by introducing an efficient modulation block that balances performance and efficiency. |
The authors revisit the modulation mechanism used in FocalNet and VAN, simplifying the context modeling branch and streamlining the overall design to reduce computational overhead while retaining desirable properties like dynamics and large receptive fields. |
EfficientMod achieves state-of-the-art performance on ImageNet-1K, outperforming EfficientFormerV2-S2 by 0.6% top-1 accuracy while being 25% faster on GPU.
The proposed method demonstrates significant improvements in downstream tasks, surpassing EfficientFormerV2-s by 3.6 mIoU on ADE20K semantic segmentation.
Comprehensive ablation studies validate the contribution of each component in EfficientMod and its superiority over alternative designs like MBConv. |
Further investigation is needed to explore the scalability of EfficientMod and address the latency gap observed with increasing model size.
Exploring more efficient ways to expand receptive fields beyond large kernels and attention mechanisms is crucial for future work. |
efficient networks, convolutional neural networks, modulation mechanism, computer vision, image classification |
2403.19926
Report |
Video-Based Human Pose Regression via Decoupled Space-Time Aggregation |
Jijie He, Wenwu Yang |
By leveraging temporal dependency in video sequences, multi-frame human pose
estimation algorithms have demonstrated remarkable results in complicated
situations, such as occlusion, motion blur, and video defocus. These algorithms
are predominantly based on heatmaps, resulting in high computation and storage
requirements per frame, which limits their flexibility and real-time
application in video scenarios, particularly on edge devices. In this paper, we
develop an efficient and effective video-based human pose regression method,
which bypasses intermediate representations such as heatmaps and instead
directly maps the input to the output joint coordinates. Despite the inherent
spatial correlation among adjacent joints of the human pose, the temporal
trajectory of each individual joint exhibits relative independence. In light of
this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to
separately capture the spatial contexts between adjacent joints and the
temporal cues of each individual joint, thereby avoiding the conflation of
spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token
for each joint to facilitate the modeling of their spatiotemporal dependencies.
With the proposed joint-wise local-awareness attention mechanism, our method is
capable of efficiently and flexibly utilizing the spatial dependency of
adjacent joints and the temporal dependency of each joint itself. Extensive
experiments demonstrate the superiority of our method. Compared to previous
regression-based single-frame human pose estimation methods, DSTA significantly
enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017.
Furthermore, our approach either surpasses or is on par with the
state-of-the-art heatmap-based multi-frame human pose estimation methods.
Project page: https://github.com/zgspose/DSTA. |
This paper presents DSTA, a novel regression-based framework for multi-person pose estimation in video sequences, which efficiently leverages temporal dependencies while reducing computational overhead common in heatmap-based methods. |
Existing multi-frame pose estimation methods rely heavily on heatmaps, leading to high computation and storage costs that limit their application in real-time video scenarios, especially on edge devices. This work explores a more efficient and flexible regression-based approach for this task. |
The proposed DSTA method decouples the modeling of spatial and temporal dependencies in human pose estimation. It first extracts joint-specific feature tokens from backbone features using a Joint-centric Feature Decoder (JFD). Then, a Space-Time Decoupling (STD) module with a joint-wise local-awareness attention mechanism separately captures spatial dependencies between adjacent joints and temporal dependencies of each joint across frames. Finally, aggregated spatiotemporal features are used to directly regress joint coordinates. |
DSTA significantly outperforms previous image-based regression methods, demonstrating the importance of incorporating temporal information.
DSTA achieves comparable or superior performance to state-of-the-art heatmap-based methods on challenging benchmarks like PoseTrack, while being significantly more computationally efficient.
DSTA exhibits strong robustness to low-resolution inputs, making it particularly suitable for resource-constrained scenarios. |
The performance improvement from capturing spatial context is limited as the extracted joint tokens already contain some spatial information.
Future work could explore more sophisticated JFD modules to further enhance the model's representational capacity. |
human pose estimation, video understanding, regression-based methods, spatiotemporal modeling, efficient deep learning |
2403.19924
Report |
SceneTracker: Long-term Scene Flow Estimation Network |
Bo Wang, Jian Li, Yang Yu, Li Liu, Zhenping Sun, Dewen Hu |
Considering the complementarity of scene flow estimation in the spatial
domain's focusing capability and 3D object tracking in the temporal domain's
coherence, this study aims to address a comprehensive new task that can
simultaneously capture fine-grained and long-term 3D motion in an online
manner: long-term scene flow estimation (LSFE). We introduce SceneTracker, a
novel learning-based LSFE network that adopts an iterative approach to
approximate the optimal trajectory. Besides, it dynamically indexes and
constructs appearance and depth correlation features simultaneously and employs
the Transformer to explore and utilize long-range connections within and
between trajectories. With detailed experiments, SceneTracker shows superior
capabilities in handling 3D spatial occlusion and depth noise interference,
highly tailored to the LSFE task's needs. Finally, we build the first
real-world evaluation dataset, LSFDriving, further substantiating
SceneTracker's commendable generalization capacity. The code and data for
SceneTracker are available at https://github.com/wwsource/SceneTracker. |
This paper introduces the novel task of Long-Term Scene Flow Estimation (LSFE) and proposes SceneTracker, a learning-based network to estimate the 3D trajectory of a target point over a video sequence. |
LSFE bridges the gap between Scene Flow Estimation, focusing on instantaneous motion, and 3D Object Tracking, limited to bounding boxes, by enabling fine-grained long-term 3D motion capture for comprehensive scene understanding. |
SceneTracker employs an iterative approach with a sliding window mechanism, dynamically constructing appearance and depth correlation features, and leveraging Transformer to model long-range dependencies within and across trajectories. |
SceneTracker significantly outperforms scene flow and tracking-based baselines on the synthetic LSFOdyssey dataset, demonstrating robustness against occlusion and depth noise.
The paper introduces the first real-world LSFE dataset, LSFDriving, featuring annotated 3D trajectories for static backgrounds, moving vehicles, and non-rigid pedestrians.
Evaluation on LSFDriving showcases SceneTracker's generalization ability from synthetic to real-world data, achieving promising results even for challenging non-rigid motions. |
The reliance on dense depth maps, obtained through completion methods for real-world data, introduces potential limitations.
Future work could explore event cameras or multi-view settings to enhance robustness and accuracy, particularly for non-rigid motion estimation. |
scene flow estimation, 3d object tracking, long-term scene flow estimation, transformer, autonomous driving |
2403.19919
Report |
Diff-Reg v1: Diffusion Matching Model for Registration Problem |
Qianliang Wu, Haobo Jiang, Lei Luo, Jun Li, Yaqing Ding, Jin Xie, Jian Yang |
Establishing reliable correspondences is essential for registration tasks
such as 3D and 2D3D registration. Existing methods commonly leverage geometric
or semantic point features to generate potential correspondences. However,
these features may face challenges such as large deformation, scale
inconsistency, and ambiguous matching problems (e.g., symmetry). Additionally,
many previous methods, which rely on single-pass prediction, may struggle with
local minima in complex scenarios. To mitigate these challenges, we introduce a
diffusion matching model for robust correspondence construction. Our approach
treats correspondence estimation as a denoising diffusion process within the
doubly stochastic matrix space, which gradually denoises (refines) a doubly
stochastic matching matrix to the ground-truth one for high-quality
correspondence estimation. It involves a forward diffusion process that
gradually introduces Gaussian noise into the ground truth matching matrix and a
reverse denoising process that iteratively refines the noisy matching matrix.
In particular, the feature extraction from the backbone occurs only once during
the inference phase. Our lightweight denoising module utilizes the same feature
at each reverse sampling step. Evaluation of our method on both 3D and 2D3D
registration tasks confirms its effectiveness. |
Introduces Diff-Reg, a novel diffusion matching model for robust correspondence construction in 3D and 2D3D registration tasks. |
Addresses challenges of existing methods in handling large deformation, scale inconsistency, and ambiguous matching in registration tasks by treating correspondence estimation as a denoising diffusion process. |
Utilizes a diffusion model within the doubly stochastic matrix space, iteratively refining a noisy matching matrix to the ground truth for optimal correspondence estimation. Employs a lightweight denoising module with Sinkhorn Projection, Weighted SVD, Warping Function, Denoising Transformer, and Matching function. |
Achieves state-of-the-art performance on 4DMatch and 4DLoMatch benchmarks for non-rigid registration, demonstrating improved handling of large deformation and low overlap.
Outperforms single-pass baselines on 3DMatch benchmark for rigid registration, highlighting the effectiveness of iterative refinement through reverse denoising sampling.
Shows promising results on the challenging RGB-D Scenes V2 benchmark for 2D3D registration, effectively addressing scale ambiguity issues. |
Limited performance on 3DLoMatch due to the absence of specialized geometric embedding in the feature backbone.
Generic transformer design in the denoising module might benefit from incorporating task-specific priors for further improvements, especially for challenging local non-rigid motions. |
3d registration, 2d3d registration, diffusion model, correspondence estimation, doubly stochastic matrix |
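The entry above lists a Sinkhorn projection among the denoising components, used to keep the matching matrix in the doubly stochastic space. A minimal sketch of plain Sinkhorn normalization is given below, assuming a square score matrix and alternating row and column normalization. The function name, iteration count, and matrix size are hypothetical, and the paper's projection may differ in detail, for example by working in the log domain or adding slack entries.

```python
import numpy as np


def sinkhorn_projection(scores, num_iters=500, eps=1e-12):
    """Map a square score matrix to an approximately doubly stochastic one."""
    # Exponentiate so every entry is strictly positive (max shift for stability).
    matrix = np.exp(scores - scores.max())
    for _ in range(num_iters):
        matrix = matrix / (matrix.sum(axis=1, keepdims=True) + eps)  # rows
        matrix = matrix / (matrix.sum(axis=0, keepdims=True) + eps)  # columns
    return matrix


# Hypothetical noisy matching scores between 6 source and 6 target points.
rng = np.random.default_rng(0)
noisy_scores = rng.normal(size=(6, 6))
matching = sinkhorn_projection(noisy_scores)

row_sums = matching.sum(axis=1)
col_sums = matching.sum(axis=0)
```

After the iterations, every row and every column sums to approximately one, so each entry can be read as a soft correspondence weight between a source point and a target point.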
2403.19898
Report |
Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting |
Haipeng Liu, Yang Wang, Biao Qian, Meng Wang, Yong Rui |
Denoising diffusion probabilistic models for image inpainting add noise to
the texture of an image during the forward process and recover masked
regions from the unmasked texture via the reverse denoising process.
Despite generating meaningful semantics, existing methods suffer from a
semantic discrepancy between masked and unmasked regions: the
semantically dense unmasked texture fails to be completely degraded while the
masked regions turn to pure noise in the diffusion process, leading to a
large discrepancy between them. In this paper, we aim to answer how unmasked
semantics guide the texture denoising process, together with how to tackle the
semantic discrepancy, to facilitate consistent and meaningful semantics
generation. To this end, we propose a novel structure-guided diffusion model
named StrDiffusion, to reformulate the conventional texture denoising process
under structure guidance to derive a simplified denoising objective for image
inpainting, while revealing: 1) the semantically sparse structure is beneficial
to tackle semantic discrepancy in early stage, while dense texture generates
reasonable semantics in late stage; 2) the semantics from unmasked regions
essentially offer the time-dependent structure guidance for the texture
denoising process, benefiting from the time-dependent sparsity of the structure
semantics. For the denoising process, a structure-guided neural network is
trained to estimate the simplified denoising objective by exploiting the
consistency of the denoised structure between masked and unmasked regions.
Besides, we devise an adaptive resampling strategy as a formal criterion for
whether the structure is competent to guide the texture denoising process, while
regulating their semantic correlations. Extensive experiments validate the merits
of StrDiffusion over state-of-the-art methods. Our code is available at
https://github.com/htyjers/StrDiffusion. |
This paper proposes StrDiffusion, a novel structure-guided diffusion model for image inpainting that leverages structure guidance to improve semantic consistency between masked and unmasked regions during the denoising process. |
Existing diffusion-based inpainting methods often produce semantically meaningful results but struggle to maintain consistency between the restored and original image regions, especially with dense textures. |
The authors reformulate the traditional texture denoising process by incorporating guidance from a progressively sparser structure representation. This structure guides a time-dependent noise network to estimate a simplified denoising objective, balancing semantic consistency and meaningful generation. |
StrDiffusion demonstrates superior performance over state-of-the-art methods in terms of PSNR, SSIM, and FID scores.
The proposed method effectively mitigates semantic discrepancy issues between masked and unmasked regions.
An adaptive resampling strategy further enhances performance by regulating the semantic correlation between denoised texture and structure. |
The computational cost of StrDiffusion is higher than some competing methods due to the use of both structure and texture diffusion processes.
Future work could explore extending StrDiffusion to other image restoration tasks beyond inpainting. |
image inpainting, diffusion models, structure guidance, semantic consistency, adaptive resampling |
2403.19888
Report |
MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection |
Ali Behrouz, Michele Santacatterina, Ramin Zabih |
Recent advances in deep learning have mainly relied on Transformers due to
their data dependency and ability to learn at scale. The attention module in
these architectures, however, exhibits quadratic time and space in input size,
limiting their scalability for long-sequence modeling. State Space Models
(SSMs), and more specifically Selective SSMs (S6), with efficient
hardware-aware implementation, have shown promising potential for long causal
sequence modeling. They, however, use separate blocks for each channel and fail
to filter irrelevant channels and capture inter-channel dependencies. Natural
attempts to mix information across channels using MLP, attention, or SSMs
result in further instability in the training of SSMs for large networks
and/or nearly double the number of parameters. We present the MambaMixer block,
a new SSM-based architecture with data-dependent weights that uses a dual
selection mechanism across tokens and channels, called Selective Token and
Channel Mixer. To mitigate doubling the number of parameters, we present a new
non-causal heuristic of the S6 block using quasi-separable kernels with a
hardware-friendly implementation. We further present an efficient variant of
MambaMixer, called QSMixer, that mixes information along both sequence and
embedding dimensions. As a proof of concept, we design Vision MambaMixer (ViM2)
and Vision QSMixer (ViQS) architectures. To enhance their ability to capture
spatial information in images, we present Switch of Scans (SoS) that
dynamically uses a set of useful image scans to traverse image patches. We
evaluate the performance of our methods in image classification, segmentation,
and object detection. Our results underline the importance of selectively
mixing across both tokens and channels and show the competitive (resp.
superior) performance of our methods with well-established vision models (resp.
SSM-based models). |
The paper introduces MambaMixer and QSMixer, two novel sequence modeling architectures based on selective state space models (SSMs) with dual selection mechanisms across both channels and tokens, enabling efficient and effective information mixing and filtering. |
Existing SSM-based models lack channel mixing, limiting their performance and stability in multi-dimensional data like images and videos. MambaMixer and QSMixer address this by selectively mixing information across both channels and tokens, improving performance and efficiency in vision tasks. |
The authors leverage quasi-separable matrices as a heuristic for non-causal selective channel mixing, leading to a hardware-friendly linear-time training. For vision tasks, they design ViM2 and ViQS models based on MambaMixer and QSMixer, incorporating a Switch of Scans (SoS) module for dynamic scan selection and a gating mechanism with multi-resolution convolutions to enhance receptive fields. |
MambaMixer and QSMixer outperform existing SSM-based models in image classification on ImageNet and sCIFAR datasets, highlighting the importance of selective channel mixing.
ViM2 and ViQS achieve competitive performance compared to well-established vision models in image classification, object detection, and semantic segmentation tasks, with superior efficiency in terms of FLOPs and memory usage.
The quasi-separable formulation of channel mixing significantly improves throughput compared to traditional scan-based implementations. |
The study primarily focuses on vision tasks, leaving the evaluation of selective channel mixing on NLP tasks for future work.
Further exploration of techniques to enhance the efficiency of ViM2 and ViQS, beyond the current simple architecture, is a potential direction for future research. |
sequence modeling, state space models, vision transformers, channel mixing, quasi-separable matrices |
2403.19866
Report |
Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization |
Yuhang Li, Xin Dong, Chen Chen, Jingtao Li, Yuxin Wen, Michael Spranger, Lingjuan Lyu |
Synthetic image data generation represents a promising avenue for training
deep learning models, particularly in the realm of transfer learning, where
obtaining real images within a specific domain can be prohibitively expensive
due to privacy and intellectual property considerations. This work delves into
the generation and utilization of synthetic images derived from text-to-image
generative models in facilitating transfer learning paradigms. Despite the high
visual fidelity of the generated images, we observe that their naive
incorporation into existing real-image datasets does not consistently enhance
model performance due to the inherent distribution gap between synthetic and
real images. To address this issue, we introduce a novel two-stage framework
called bridged transfer, which initially employs synthetic images for
fine-tuning a pre-trained model to improve its transferability and subsequently
uses real data for rapid adaptation. Alongside, we propose a dataset style
inversion strategy to improve the stylistic alignment between synthetic and
real images. Our proposed methods are evaluated across 10 different datasets
and 5 distinct models, demonstrating consistent improvements, with up to 30%
accuracy increase on classification tasks. Intriguingly, we note that the
enhancements were not yet saturated, indicating that the benefits may further
increase with an expanded volume of synthetic data. |
This paper explores using synthetic image data generated by text-to-image models to enhance transfer learning performance in computer vision. |
Transfer learning relies on large datasets, which are often expensive or difficult to acquire for specific domains. Synthetic data offers a solution. |
The authors introduce a two-stage 'bridged transfer' framework. First, an ImageNet-pretrained model is fine-tuned on synthetic data. Second, the model is further fine-tuned on the target domain's real data. They also propose a 'Dataset Style Inversion' technique to align synthetic images' style with the target domain. |
Naively mixing synthetic images into real-image datasets does not consistently improve performance, owing to the distribution gap between synthetic and real images.
Bridged transfer improves model transferability and achieves faster convergence on real data.
Dataset Style Inversion further improves accuracy by aligning synthetic and real image styles. |
The study primarily focuses on image classification tasks.
Future work can investigate extending these techniques to other computer vision tasks. |
transfer learning, synthetic data, text-to-image generation, dataset style inversion, computer vision |
2403.19838
Report |
Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving |
Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi |
Vision-Language Models (VLMs) and Multi-Modal Language models (MMLMs) have
become prominent in autonomous driving research, as these models can provide
interpretable textual reasoning and responses for end-to-end autonomous driving
safety tasks using traffic scene images and other data modalities. However,
current approaches to these systems use expensive large language model (LLM)
backbones and image encoders, making such systems unsuitable for real-time
autonomous driving systems where tight memory constraints exist and fast
inference time is necessary. To address these previous issues, we develop
EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which
performs Visual Question Answering for autonomous driving. In comparison to
previous approaches, EM-VLM4AD requires at least 10 times less memory and
floating point operations, while also achieving higher CIDEr and ROUGE-L scores
than the existing baseline on the DriveLM dataset. EM-VLM4AD also exhibits the
ability to extract relevant information from traffic views related to prompts
and can answer questions for various autonomous driving subtasks. We release
our code to train and evaluate our model at
https://github.com/akshaygopalkr/EM-VLM4AD. |
This paper introduces EM-VLM4AD, an efficient multi-frame vision language model for Visual Question Answering (VQA) in autonomous driving, designed to be lightweight and computationally less demanding than current models. |
Current VLM and MMLM models for autonomous driving rely on large, computationally expensive backbones, making them unsuitable for real-time applications in vehicles with limited resources. This work addresses this by proposing a smaller and more efficient model. |
EM-VLM4AD uses a pretrained ViT model for image encoding and T5 (Base or quantized Large) as the LM backbone. It employs a two-stage training process: 1) align multi-view image embeddings with LM embeddings and 2) finetune the LM. The model is trained and evaluated on the DriveLM dataset. |
EM-VLM4AD requires at least 10 times less memory and FLOPs compared to existing AD-VLMs.
Despite being smaller, EM-VLM4AD achieves higher CIDEr and ROUGE-L scores than the DriveLM baseline.
The model demonstrates the ability to process information from multiple camera views and answer diverse questions related to autonomous driving tasks. |
EM-VLM4AD struggles with questions related to predicting ego-vehicle behavior, possibly due to the lack of temporal context.
Future work includes extending the model to process video inputs for better handling of temporal information and incorporating multimodal retrieval augmented generation for improved context awareness. |
vision language models, multimodal learning, autonomous driving, visual question answering, efficient ai |
2403.19811
Report |
X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization |
Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma |
Lately, there has been growing interest in adapting vision-language models
(VLMs) to image and third-person video classification due to their success in
zero-shot recognition. However, the adaptation of these models to egocentric
videos has been largely unexplored. To address this gap, we propose a simple
yet effective cross-modal adaptation framework, which we call X-MIC. Using a
video adapter, our pipeline learns to align frozen text embeddings to each
egocentric video directly in the shared embedding space. Our novel adapter
architecture retains and improves generalization of the pre-trained VLMs by
disentangling learnable temporal modeling from the frozen visual encoder. This
results in an enhanced alignment of text embeddings to each egocentric video,
leading to a significant improvement in cross-dataset generalization. We
evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for
fine-grained cross-dataset action generalization, demonstrating the
effectiveness of our method. Code is available at
https://github.com/annusha/xmic |
This paper proposes X-MIC, a cross-modal adaptation framework for vision-language models, to improve egocentric video classification through aligning frozen text embeddings to videos. |
Egocentric action recognition suffers from domain gaps between web and egocentric data, making zero-shot generalization challenging. This paper aims to address this for real-world applications. |
The method uses a video adapter to learn aligned text embeddings for each egocentric video directly in the shared embedding space, disentangling temporal modeling and the visual encoder. It introduces a novel egocentric spatial-temporal attention module to enhance hand-object interaction information. |
X-MIC outperforms state-of-the-art VL adaptation methods in both within-dataset and cross-dataset evaluations on Ego4D, Epic-Kitchens, and EGTEA.
Using a separate visual encoder like DINO further enhances performance.
The ego-spatial-temporal attention module effectively captures hand-object interactions, improving recognition. |
The method is currently limited to video classification and doesn't cover text-vision tasks like text-to-video retrieval.
The impact of different pre-training strategies on verb and noun recognition needs further investigation. |
egocentric action recognition, vision-language models, cross-modal adaptation, zero-shot learning, attention mechanisms |
2403.19797
Report |
Efficient 3D Instance Mapping and Localization with Neural Fields |
George Tang, Krishna Murthy Jatavallabhula, Antonio Torralba |
We tackle the problem of learning an implicit scene representation for 3D
instance segmentation from a sequence of posed RGB images. Towards this, we
introduce 3DIML, a novel framework that efficiently learns a label field that
may be rendered from novel viewpoints to produce view-consistent instance
segmentation masks. 3DIML significantly improves upon training and inference
runtimes of existing implicit scene representation based methods. As opposed to
prior art that optimizes a neural field in a self-supervised manner, requiring
complicated training procedures and loss function design, 3DIML leverages a
two-phase process. The first phase, InstanceMap, takes as input 2D segmentation
masks of the image sequence generated by a frontend instance segmentation
model, and associates corresponding masks across images to 3D labels. These
almost view-consistent pseudolabel masks are then used in the second phase,
InstanceLift, to supervise the training of a neural label field, which
interpolates regions missed by InstanceMap and resolves ambiguities.
Additionally, we introduce InstanceLoc, which enables near realtime
localization of instance masks given a trained label field and an off-the-shelf
image segmentation model by fusing outputs from both. We evaluate 3DIML on
sequences from the Replica and ScanNet datasets and demonstrate 3DIML's
effectiveness under mild assumptions for the image sequences. We achieve a
large practical speedup over existing implicit scene representation methods
with comparable quality, showcasing its potential to facilitate faster and more
effective 3D scene understanding. |
This paper introduces 3DIML, an efficient two-phase framework for 3D instance segmentation from posed RGB images using a neural label field. |
Existing neural field-based methods for 3D instance segmentation are computationally expensive and complex to train. 3DIML offers a faster and simpler alternative. |
3DIML uses InstanceMap to associate 2D instance masks across images and generate pseudo-labels. Then, InstanceLift, a neural label field, refines these labels for 3D consistency. Finally, InstanceLoc enables fast instance localization in novel views. |
3DIML achieves comparable accuracy to state-of-the-art methods like Panoptic Lifting but with significantly faster runtime (14-24x).
InstanceLift effectively refines noisy pseudo-labels generated by InstanceMap, improving the overall 3D instance segmentation.
InstanceLoc, leveraging a fast 2D instance segmentation model and the trained label field, enables near real-time instance localization in novel views. |
Extreme viewpoint changes in the input sequence can lead to discontinuous 3D instance labels.
Future work can focus on improving label consistency in challenging scenarios and exploring alternative neural field architectures for faster inference. |
3d instance segmentation, neural fields, instance segmentation, novel view synthesis, scene understanding |
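The 3DIML summary above describes InstanceMap as associating corresponding 2D masks across images to shared 3D labels. One simple way to turn pairwise mask matches into global labels is a union-find structure, sketched below. This is an illustrative assumption, not the authors' procedure: how matches are found is not covered, and the names `assign_global_labels`, `masks`, and `matches` as well as the toy data are hypothetical.

```python
def assign_global_labels(masks, matches):
    """Merge pairwise mask matches into one global label per instance.

    masks:   list of (frame_index, mask_index) identifiers.
    matches: list of pairs of identifiers judged to show the same object.
    """
    parent = {m: m for m in masks}

    def find(m):
        # Walk to the root, shortening the path along the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the connected components in order of first appearance.
    root_to_label = {}
    labels = {}
    for m in masks:
        root = find(m)
        labels[m] = root_to_label.setdefault(root, len(root_to_label))
    return labels


# Hypothetical toy sequence: three frames, two objects.
masks = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
matches = [((0, 0), (1, 1)), ((1, 1), (2, 0)), ((0, 1), (1, 0))]
labels = assign_global_labels(masks, matches)
```

Masks linked through any chain of matches end up with the same label, so a single wrong match merges two instances. That is one reason a later refinement stage, such as the InstanceLift phase described above, is useful.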
2403.19776
Report |
CLoRA: A Contrastive Approach to Compose Multiple LoRA Models |
Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag |
Low-Rank Adaptations (LoRAs) have emerged as a powerful and popular technique
in the field of image generation, offering a highly effective way to adapt and
refine pre-trained deep learning models for specific tasks without the need for
comprehensive retraining. By employing pre-trained LoRA models, such as those
representing a specific cat and a particular dog, the objective is to generate
an image that faithfully embodies both animals as defined by the LoRAs.
However, the task of seamlessly blending multiple concept LoRAs to capture a
variety of concepts in one image proves to be a significant challenge. Common
approaches often fall short, primarily because the attention mechanisms within
different LoRA models overlap, leading to scenarios where one concept may be
completely ignored (e.g., omitting the dog) or where concepts are incorrectly
combined (e.g., producing an image of two cats instead of one cat and one dog).
To overcome these issues, CLoRA updates the attention maps
of multiple LoRA models and leverages them to create semantic masks that
facilitate the fusion of latent representations. Our method enables the
creation of composite images that truly reflect the characteristics of each
LoRA, successfully merging multiple concepts or styles. Our comprehensive
evaluations, both qualitative and quantitative, demonstrate that our approach
outperforms existing methodologies, marking a significant advancement in the
field of image generation with LoRAs. Furthermore, we share our source code,
benchmark dataset, and trained LoRA models to promote further research on this
topic. |
This paper introduces CLoRA, a novel training-free method that addresses the challenges of composing multiple concept and style LoRAs (Low-Rank Adaptations) simultaneously during test time for image generation. |
The ability to combine LoRAs is crucial for leveraging compositionality in image generation. It enables users to create personalized and diverse images by combining various concepts and styles encoded in pre-trained LoRAs. |
CLoRA utilizes contrastive learning and attention map manipulation during test time. It generates multiple prompts with and without LoRA applications, groups attention maps by concept, and uses contrastive loss to guide latent representation updates. This resolves attention overlap and attribute binding issues, ensuring each LoRA contributes correctly to the final image. |
CLoRA successfully integrates multiple content and style LoRAs, generating images that faithfully reflect the characteristics of each LoRA model.
Qualitative comparisons demonstrate CLoRA's superiority over existing methods, showcasing its ability to maintain individual LoRA identities and avoid attribute blending.
Quantitative analysis using DINO-based metrics further confirms CLoRA's effectiveness in merging LoRA content, surpassing baselines in fidelity and accuracy. |
The effectiveness of CLoRA depends on the quality of the input LoRA models.
Computational complexity, especially with numerous LoRAs, might impact processing time and resource requirements. |
image generation, lora, contrastive learning, attention mechanism, compositionality |
2403.19738
Report |
MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models |
Hidir Yesiltepe, Kiymet Akdemir, Pinar Yanardag |
Diffusion-based text-to-image models have rapidly gained popularity for their
ability to generate detailed and realistic images from textual descriptions.
However, these models often reflect the biases present in their training data,
especially impacting marginalized groups. While prior efforts to debias
language models have focused on addressing specific biases, such as racial or
gender biases, efforts to tackle intersectional bias have been limited.
Intersectional bias refers to the unique form of bias experienced by
individuals at the intersection of multiple social identities. Addressing
intersectional bias is crucial because it amplifies the negative effects of
discrimination based on race, gender, and other identities. In this paper, we
introduce a method that addresses intersectional bias in diffusion-based
text-to-image models by modifying cross-attention maps in a disentangled
manner. Our approach utilizes a pre-trained Stable Diffusion model, eliminates
the need for an additional set of reference images, and preserves the original
quality for unaltered concepts. Comprehensive experiments demonstrate that our
method surpasses existing approaches in mitigating both single and
intersectional biases across various attributes. We make our source code and
debiased models for various attributes available to encourage fairness in
generative models and to support further research. |
This paper introduces MIST, a novel method for mitigating intersectional bias in text-to-image diffusion models by disentangled fine-tuning of cross-attention maps. |
Addressing intersectional bias in text-to-image models is crucial for ensuring fairness and preventing the amplification of discrimination against individuals at the intersection of multiple marginalized identities. |
MIST leverages the observation that the token in text embeddings can control image generation in a disentangled way. It optimizes the cross-attention projection matrices by minimizing the difference between the token embeddings of a source prompt and a guidance prompt, thus aligning the model's representation towards the desired, unbiased output. |
MIST effectively mitigates both single and intersectional biases across various attributes like gender, race, age, and eyeglasses, as demonstrated qualitatively and quantitatively.
Compared to existing debiasing methods, MIST achieves superior performance in reducing bias while preserving the fidelity of unrelated concepts, as evidenced by lower biasedness scores and average pixel shifts.
Unlike previous methods, MIST doesn't require additional reference images or manually curated preservation lists, making it more practical and scalable. |
The debiasing capabilities of MIST are limited by the biases present in the pre-trained Stable Diffusion model and the CLIP language model used for evaluation.
Future work includes exploring alternative evaluation metrics and addressing potential biases in the evaluation process itself. |
intersectional bias, text-to-image synthesis, diffusion models, debiasing, fairness |
2403.19716
Report |
Capability-aware Prompt Reformulation Learning for Text-to-Image Generation |
Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jia Chen, Shaoping Ma |
Text-to-image generation systems have emerged as revolutionary tools in the
realm of artistic creation, offering unprecedented ease in transforming textual
prompts into visual art. However, the efficacy of these systems is intricately
linked to the quality of user-provided prompts, which often poses a challenge
to users unfamiliar with prompt crafting. This paper addresses this challenge
by leveraging user reformulation data from interaction logs to develop an
automatic prompt reformulation model. Our in-depth analysis of these logs
reveals that user prompt reformulation is heavily dependent on the individual
user's capability, resulting in significant variance in the quality of
reformulation pairs. To effectively use this data for training, we introduce
the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively
integrates user capability into the reformulation process through two key
components: the Conditional Reformulation Model (CRM) and Configurable
Capability Features (CCF). CRM reformulates prompts according to a specified
user capability, as represented by CCF. The CCF, in turn, offers the
flexibility to tune and guide the CRM's behavior. This enables CAPR to
effectively learn diverse reformulation strategies across various user
capacities and to simulate high-capability user reformulation during inference.
Extensive experiments on standard text-to-image generation benchmarks showcase
CAPR's superior performance over existing baselines and its remarkable
robustness on unseen systems. Furthermore, comprehensive analyses validate the
effectiveness of different components. CAPR can facilitate user-friendly
interaction with text-to-image systems and make advanced artistic creation more
achievable for a broader range of users. |
This paper presents CAPR, a novel capability-aware prompt reformulation framework for text-to-image generation, trained on user interaction logs to address the challenge of poor prompts from users unfamiliar with prompt crafting. |
Crafting effective prompts for text-to-image generation systems is difficult for most users, and existing query reformulation techniques don't translate well due to the lack of system feedback and dependence on user capability in this domain. |
CAPR decomposes the reformulation model into a Conditional Reformulation Model (CRM), trained on prompt pairs and user capability conditions derived from prompt quality metrics, and Configurable Capability Features (CCF) to represent and tune capability levels during inference. |
CAPR significantly outperforms baselines like GPT-4 and existing reformulation models in improving generation quality.
The framework effectively transfers to unseen, more advanced text-to-image generation systems, demonstrating robustness.
Analysis shows CRM can be effectively controlled by CCF conditions, even extrapolating beyond training data limitations. |
The study primarily focuses on overall user satisfaction, requiring further exploration for users with specific needs.
Future work can explore incorporating visual feedback to improve reformulation effectiveness. |
text-to-image generation, prompt reformulation, log analysis, user capability, conditional generation |
2403.19653
Report |
Detecting Image Attribution for Text-to-Image Diffusion Models in RGB and Beyond |
Katherine Xu, Lingzhi Zhang, Jianbo Shi |
Modern text-to-image (T2I) diffusion models can generate images with
remarkable realism and creativity. These advancements have sparked research in
fake image detection and attribution, yet prior studies have not fully explored
the practical and scientific dimensions of this task. In addition to
attributing images to 12 state-of-the-art T2I generators, we provide extensive
analyses on what inference stage hyperparameters and image modifications are
discernible. Our experiments reveal that initialization seeds are highly
detectable, along with other subtle variations in the image generation process
to some extent. We further investigate what visual traces are leveraged in
image attribution by perturbing high-frequency details and employing mid-level
representations of image style and structure. Notably, altering high-frequency
information causes only slight reductions in accuracy, and training an
attributor on style representations outperforms training on RGB images. Our
analyses underscore that fake images are detectable and attributable at more
levels of visual granularity than previously explored. |
This paper presents an in-depth analysis of detecting and attributing images generated by 12 state-of-the-art text-to-image diffusion models, going beyond RGB analysis by exploring detectable traces in high-frequency perturbations and mid-level representations. |
This research is crucial for advancing image forensics, copyright protection, and ensuring the integrity of digital content in the age of increasingly sophisticated AI-generated images. |
The authors generated a dataset of nearly half a million AI-generated images using diverse prompts and hyperparameters. They trained various image attributors (classifiers) and rigorously analyzed their performance under different conditions like hyperparameter variations, post-editing modifications, and varying levels of visual detail. |
Achieved over 90% accuracy in attributing images to their source generators, significantly outperforming random chance.
Demonstrated that even subtle variations in inference-stage hyperparameters, especially initialization seeds, can be detected with high accuracy.
Discovered that stylistic representations of images, captured using Gram matrices, are more effective than RGB data for image attribution, indicating unique stylistic fingerprints of generators. |
Limited exploration of dataset expansion due to budget constraints.
Difficulty in explaining the decision-making process of the attributors despite using Grad-CAM visualizations. |
generative models, image attribution, image forensics, text-to-image synthesis, deep learning |
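The attribution summary above reports that style representations captured with Gram matrices outperform raw RGB input. As a minimal sketch of what a Gram matrix measures, the code below computes channel-by-channel correlations of a feature map. The source does not say which features the authors use, so the toy two-channel feature map, the function name `gram_matrix`, and the normalization by the number of positions are assumptions.

```python
def gram_matrix(features):
    """Channel-by-channel correlation of a feature map.

    features: list of C channels, each a flat list of N activations
              (a spatial map flattened to one dimension).
    Returns a C x C matrix whose (i, j) entry is the mean product of
    channel i and channel j over all positions.
    """
    n_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions
         for ch_j in features]
        for ch_i in features
    ]


# Hypothetical feature map: 2 channels, 4 spatial positions each.
feats = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
]
gram = gram_matrix(feats)
# gram == [[7.5, 1.5], [1.5, 0.5]]
```

Because the sum runs over positions, the Gram matrix discards spatial layout and keeps only which channels co-activate. That is why it is commonly read as a style descriptor rather than a content descriptor.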
2403.19596
Report |
LocCa: Visual Pretraining with Location-aware Captioners |
Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai |
Image captioning has been shown to be an effective pretraining method similar to
contrastive pretraining. However, the incorporation of location-aware
information into visual pretraining remains an area with limited research. In
this paper, we propose a simple visual pretraining method with location-aware
captioners (LocCa). LocCa uses a simple image captioner task interface to
teach a model to read out rich information, i.e., bounding box coordinates and
captions, conditioned on the image pixel input. Thanks to the multitask
capabilities of an encoder-decoder architecture, we show that an image
captioner can easily handle multiple tasks during pretraining. Our experiments
demonstrate that LocCa outperforms standard captioners significantly on
localization downstream tasks while maintaining comparable performance on
holistic tasks. |
Proposes Location-aware Captioner (LocCa), a visual pretraining method using a multi-task decoder for image captioning, referring expression, and grounded captioning tasks. |
Enhances visual representations with location-aware context, improving performance on localization downstream tasks without complex model architectures. |
Pretrains an encoder-decoder model on the WebLI dataset with OWL-ViT pseudo annotations, leveraging task-specific prefixes for multitask learning and predicting bounding boxes and captions sequentially. |
Achieves state-of-the-art results on referring expression comprehension benchmarks (RefCOCO, RefCOCO+, RefCOCOg).
Significantly outperforms baselines on referring expression segmentation and object detection.
Maintains strong performance on holistic image understanding tasks (image classification, captioning, VQA) and surpasses baselines on object-centric tasks (VQAv2, GQA). |
Limited exploration of zero-shot object detection capabilities.
Current decoding strategy struggles to balance the quantity and quality of predicted boxes and labels. |
localization, image captioning, vision language models, multitask learning, visual pretraining |
2403.19593
Report |
Frame by Familiar Frame: Understanding Replication in Video Diffusion Models |
Aimon Rahman, Malsha V. Perera, Vishal M. Patel |
Building on the momentum of image generation diffusion models, there is an
increasing interest in video-based diffusion models. However, video generation
poses greater challenges due to its higher-dimensional nature, the scarcity of
training data, and the complex spatiotemporal relationships involved. Image
generation models, due to their extensive data requirements, have already
strained computational resources to their limits. There have been instances of
these models reproducing elements from the training samples, leading to
concerns and even legal disputes over sample replication. Video diffusion
models, which operate with even more constrained datasets and are tasked with
generating both spatial and temporal content, may be more prone to replicating
samples from their training sets. Compounding the issue, these models are often
evaluated using metrics that inadvertently reward replication. In our paper, we
present a systematic investigation into the phenomenon of sample replication in
video diffusion models. We scrutinize various recent diffusion models for video
synthesis, assessing their tendency to replicate spatial and temporal content
in both unconditional and conditional generation scenarios. Our study
identifies strategies that are less likely to lead to replication. Furthermore,
we propose new evaluation strategies that take replication into account,
offering a more accurate measure of a model's ability to generate original
content. |
This paper investigates sample replication in video diffusion models, exploring the extent, frequency, and strategies for mitigation. |
As video diffusion models gain popularity, it's crucial to understand if they generate truly novel content or simply replicate training data. This has implications for copyright, privacy, and the reliability of AI-generated content. |
The authors define video replication for different generation contexts (conditional and unconditional). They use the VSSCD metric to detect content replication and analyze FVD scores with augmented input frames to assess motion replication. Additionally, they examine data requirements for unique content generation and compare different model architectures. |
Video diffusion models trained on limited datasets are prone to replicating content and motion from the training data.
Image diffusion models require significantly less data than video diffusion models to generate unique content.
Using a pre-trained text-to-image backbone and fine-tuning only the temporal layers can mitigate replication in video diffusion models. |
Limited access to publicly available video diffusion models and their training data poses challenges for comprehensive analysis.
Further research is needed to explore motion replication across varying content and in models trained on large-scale datasets. |
video diffusion models, sample replication, generative ai, content originality, evaluation metrics |
2403.19588
Report |
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs |
Donghyun Kim, Byeongho Heo, Dongyoon Han |
This paper revives Densely Connected Convolutional Networks (DenseNets) and
reveals their underrated effectiveness over predominant ResNet-style
architectures. We believe DenseNets' potential was overlooked due to untouched
training methods and traditional design elements not fully revealing their
capabilities. Our pilot study shows dense connections through concatenation are
strong, demonstrating that DenseNets can be revitalized to compete with modern
architectures. We methodically refine suboptimal components - architectural
adjustments, block redesign, and improved training recipes towards widening
DenseNets and boosting memory efficiency while keeping concatenation shortcuts.
Our models, employing simple architectural elements, ultimately surpass Swin
Transformer, ConvNeXt, and DeiT-III - key architectures in the residual
learning lineage. Furthermore, our models exhibit near state-of-the-art
performance on ImageNet-1K, competing with very recent models, and on
downstream tasks: ADE20K semantic segmentation and COCO object
detection/instance segmentation. Finally, we provide empirical analyses that
uncover the merits of the concatenation over additive shortcuts, steering a
renewed preference towards DenseNet-style designs. Our code is available at
https://github.com/naver-ai/rdnet. |
This paper revitalizes Densely Connected Convolutional Networks (DenseNets) by modernizing their architecture and training methods to compete with prevailing ResNet-like architectures, showing the efficacy of concatenation shortcuts. |
This is important because it challenges the dominance of additive shortcut-based models and highlights the potential of DenseNet-style designs for enhanced performance and efficiency. |
The authors conducted a pilot study with thousands of random networks to validate the effectiveness of concatenation shortcuts. They then systematically refined DenseNets by widening the network, improving feature mixers, introducing more transition layers, and employing a patchification stem. |
The revitalized DenseNets (RDNets) outperform Swin Transformer, ConvNeXt, and DeiT-III on the ImageNet-1K benchmark, with competitive performance on downstream tasks like ADE20K and COCO.
RDNets demonstrate robustness to input image size variations, maintaining accuracy without significant latency or memory increase.
Analysis shows RDNets learn distinct features compared to ConvNeXt, highlighting the unique learning dynamics of concatenation-based models. |
Resource constraints prevented scaling RDNets to extremely large scales like ViT-G.
Future work can explore further optimization of training hyperparameters for downstream tasks to achieve maximum precision. |
densenets, concatenation shortcuts, image classification, semantic segmentation, object detection |
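The contrast between additive and concatenation shortcuts discussed above can be sketched on plain Python lists. The toy layers `double_layer` and `growth_layer` are hypothetical stand-ins for learned layers and are not taken from the RDNet code; the sketch only shows how the feature width behaves under each shortcut type.

```python
def additive_block(x, layer):
    # ResNet-style shortcut: the layer output is summed with the input,
    # so the feature width stays fixed.
    return [xi + yi for xi, yi in zip(x, layer(x))]


def concat_block(x, layer):
    # DenseNet-style shortcut: new features are appended to the input,
    # so earlier features are carried forward unchanged.
    return x + layer(x)


def double_layer(x):
    # Hypothetical width-preserving layer.
    return [2.0 * v for v in x]


def growth_layer(x):
    # Hypothetical layer emitting two new features (growth rate 2).
    return [sum(x), max(x)]


x = [1.0, 2.0, 3.0]
res = additive_block(x, double_layer)
dense = concat_block(concat_block(x, growth_layer), growth_layer)
print(res)    # [3.0, 6.0, 9.0]
print(dense)  # [1.0, 2.0, 3.0, 6.0, 3.0, 15.0, 6.0]
```

With concatenation the width grows at every block, which is why the summary mentions extra transition layers: they are the usual place to bring the width back down.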
2403.19580
Report |
OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation |
Zhenyu Wang, Yali Li, Taichi Liu, Hengshuang Zhao, Shengjin Wang |
In the current state of 3D object detection research, the severe scarcity of
annotated 3D data, substantial disparities across different data modalities,
and the absence of a unified architecture, have impeded the progress towards
the goal of universality. In this paper, we propose OV-Uni3DETR, a
unified open-vocabulary 3D detector via cycle-modality propagation. Compared
with existing 3D detectors, OV-Uni3DETR offers distinct advantages: 1)
Open-vocabulary 3D detection: During training, it leverages various accessible
data, especially extensive 2D detection images, to boost training diversity.
During inference, it can detect both seen and unseen classes. 2) Modality
unifying: It seamlessly accommodates input data from any given modality,
effectively addressing scenarios involving disparate modalities or missing
sensor information, thereby supporting test-time modality switching. 3) Scene
unifying: It provides a unified multi-modal model architecture for diverse
scenes collected by distinct sensors. Specifically, we propose the
cycle-modality propagation, aimed at propagating knowledge bridging 2D and 3D
modalities, to support the aforementioned functionalities. 2D semantic
knowledge from large-vocabulary learning guides novel class discovery in the 3D
domain, and 3D geometric knowledge provides localization supervision for 2D
detection images. OV-Uni3DETR achieves state-of-the-art performance on
various scenarios, surpassing existing methods by more than 6% on average. Its
performance using only RGB images is on par with or even surpasses that of
previous point cloud based methods. Code and pre-trained models will be
released later. |
Proposes OV-Uni3DETR, a unified open-vocabulary 3D object detector that leverages cycle-modality propagation for knowledge transfer between 2D and 3D modalities. |
Addresses limitations of existing 3D detectors, which are restricted to closed-vocabulary detection, specific input modalities, and often limited to either indoor or outdoor scenes. Aims to achieve universality in 3D object detection. |
Introduces a unified multi-modal architecture that accommodates point clouds and RGB images during training, enabling test-time modality switching. Employs cycle-modality propagation: leverages 2D semantic knowledge for 3D novel class discovery and 3D geometric knowledge for supervising 2D detection without 3D annotations. |
Achieves state-of-the-art performance on open-vocabulary 3D object detection benchmarks, surpassing previous methods.
Demonstrates modality-switching capability, with performance using only RGB images on par with or surpassing point cloud-based methods.
Effectively detects objects in both indoor and outdoor scenes, achieving scene-unifying capability. |
Potential for improvement in handling noisy 3D boxes generated from 2D images.
Exploration of incorporating more diverse 2D data sources and larger-scale pre-trained models for enhanced novel class detection. |
open-vocabulary learning, 3d object detection, multi-modal learning, knowledge distillation, scene understanding |
2403.19534
Report |
Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance |
Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang |
Prior studies have made significant progress in image inpainting guided by
either text or subject image. However, the research on editing with their
combined guidance is still in the early stages. To tackle this challenge, we
present LAR-Gen, a novel approach for image inpainting that enables seamless
inpainting of masked scene images, incorporating both the textual prompts and
specified subjects. Our approach adopts a coarse-to-fine manner to ensure
subject identity preservation and local semantic coherence. The process
involves (i) Locate: concatenating the noise with masked scene image to achieve
precise regional editing, (ii) Assign: employing decoupled cross-attention
mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel
RefineNet to supplement subject details. Additionally, to address the issue of
scarce training data, we introduce a novel data construction pipeline. This
pipeline extracts substantial pairs of data consisting of local text prompts
and corresponding visual instances from a vast image dataset, leveraging
publicly available large models. Extensive experiments and varied application
scenarios demonstrate the superiority of LAR-Gen in terms of both identity
preservation and text semantic consistency. The project page can be found at
https://ali-vilab.github.io/largen-page/. |
This paper presents LAR-Gen, a novel text-subject-guided image inpainting approach that seamlessly incorporates specified subjects into scene images while adhering to textual prompts, enhancing customized image editing. |
Existing inpainting methods often struggle to balance subject fidelity and local semantic coherence when guided by both text and subject images. LAR-Gen addresses this gap, enabling more precise and creative image editing. |
LAR-Gen employs a coarse-to-fine strategy: (i) Locate mechanism confines editing to the masked region, (ii) Assign mechanism uses decoupled cross-attention for multi-modal guidance, and (iii) Refine mechanism leverages an auxiliary U-Net (RefineNet) to enhance subject details. |
LAR-Gen demonstrates superior performance in preserving both subject identity and text semantic consistency.
A novel data construction pipeline is introduced to address data scarcity, extracting region-level quadruples from large image datasets.
LAR-Gen acts as a unified framework supporting text-only, image-only, and combined text-subject-guided inpainting within a single model. |
Subject deformation capabilities are limited due to reliance on a single reference image.
The model might prioritize certain conditions over others when multiple conditions conflict. |
image inpainting, diffusion model, text-subject-guided, customized image editing, multi-modal guidance |
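The "Assign" step above relies on decoupled cross-attention. One common form of that idea, sketched below for a single query vector, attends to text tokens and subject-image tokens separately and sums the two results. The additive combination, the `subject_scale` weight, and all function names are assumptions for illustration; the source does not specify how LAR-Gen combines the two branches.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    # Scaled dot-product attention for one query vector.
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    )
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def decoupled_cross_attention(query, text_kv, subject_kv, subject_scale=1.0):
    # Each modality gets its own keys/values; outputs are summed afterwards.
    text_out = attend(query, *text_kv)
    subject_out = attend(query, *subject_kv)
    return [t + subject_scale * s for t, s in zip(text_out, subject_out)]


query = [1.0, 0.0]
text_kv = ([[1.0, 0.0]], [[2.0, 4.0]])     # (keys, values) from text tokens
subject_kv = ([[0.0, 1.0]], [[1.0, 1.0]])  # (keys, values) from subject tokens
out = decoupled_cross_attention(query, text_kv, subject_kv, subject_scale=0.5)
print(out)  # [2.5, 4.5]
```

Setting `subject_scale=0.0` reduces the sketch to text-only attention, which mirrors how a single model could serve text-only and combined guidance.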
2403.19522
Report |
Model Stock: All we need is just a few fine-tuned models |
Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han |
This paper introduces an efficient fine-tuning method for large pre-trained
models, offering strong in-distribution (ID) and out-of-distribution (OOD)
performance. Breaking away from traditional practices that need a multitude of
fine-tuned models for averaging, our approach employs significantly fewer
models to obtain the final weights yet yields superior accuracy. Drawing from key
insights in the weight space of fine-tuned weights, we uncover a strong link
between the performance and proximity to the center of weight space. Based on
this, we introduce a method that approximates a center-close weight using only
two fine-tuned models, applicable during or after training. Our innovative
layer-wise weight averaging technique surpasses state-of-the-art model averaging methods
such as Model Soup, utilizing only two fine-tuned models. This strategy can be
aptly coined Model Stock, highlighting its reliance on selecting a minimal
number of models to draw a more optimized-averaged model. We demonstrate the
efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP
architectures, achieving remarkable performance on both ID and OOD tasks on the
standard benchmarks, all while barely bringing extra computational demands. Our
code and pre-trained models are available at
https://github.com/naver-ai/model-stock. |
This paper proposes "Model Stock," an efficient fine-tuning technique for large pre-trained models achieving strong performance in both in-distribution (ID) and out-of-distribution (OOD) settings using significantly fewer models than traditional averaging methods. |
Fine-tuning is crucial in adapting pre-trained models for specific tasks, impacting both accuracy and robustness against distribution shifts. Model Stock offers an efficient alternative to computationally expensive model averaging techniques like Model Soup. |
The authors analyze the weight space of fine-tuned models, discovering that: 1) weights lie on a thin shell, and 2) proximity to the center of this shell correlates with improved ID and OOD performance. Leveraging these insights and a pre-trained model as a robust anchor, Model Stock approximates the center with minimal fine-tuned models. |
Model Stock achieves comparable or superior performance to Model Soup using only two fine-tuned models, significantly reducing computational cost.
On CLIP ViT-L/14, Model Stock achieves state-of-the-art 87.8% top-1 accuracy on ImageNet (ID) and 74.9% average on five OOD benchmarks.
The method's effectiveness is demonstrated across various CLIP architectures and benchmark datasets. |
Resource limitations prevented evaluation on larger-scale models beyond ViT-L.
Future work will explore applying Model Stock to even larger models like ViT-G. |
fine-tuning, model averaging, distribution shift, robustness, pre-trained models |
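The idea of pulling the average of two fine-tuned models toward a pre-trained anchor can be sketched for a single layer as follows. The interpolation ratio `t = 2*cos / (1 + cos)`, where `cos` is the cosine of the angle between the two fine-tuned weight offsets from the pre-trained weights, is an assumption based on one reading of the method and is not stated in the text above; the function name and flat-list weight layout are also hypothetical.

```python
import math


def model_stock_layer(w0, w1, w2):
    """Merge one layer of two fine-tuned models (w1, w2) using the
    pre-trained weights w0 as an anchor. Weights are flat lists of floats.
    Assumes neither fine-tuned layer is identical to w0 (non-zero offsets).
    """
    d1 = [a - b for a, b in zip(w1, w0)]
    d2 = [a - b for a, b in zip(w2, w0)]
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    cos = dot / norm
    # Assumed ratio: agreement between the models (cos near 1) trusts their
    # average; disagreement (cos near 0) falls back to the anchor.
    t = 2.0 * cos / (1.0 + cos)
    avg = [(a + b) / 2.0 for a, b in zip(w1, w2)]
    merged = [t * m + (1.0 - t) * p for m, p in zip(avg, w0)]
    return merged, t


w0 = [0.0, 0.0]
merged_orth, t_orth = model_stock_layer(w0, [1.0, 0.0], [0.0, 1.0])
merged_same, t_same = model_stock_layer(w0, [2.0, 0.0], [2.0, 0.0])
print(t_orth, merged_orth)  # 0.0 [0.0, 0.0]
print(t_same, merged_same)  # 1.0 [2.0, 0.0]
```

In a layer-wise scheme this function would be applied independently to each layer, so different layers can end up with different ratios.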
2403.19517
Report |
XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold |
Guangyu Wang, Jinzhi Zhang, Fan Wang, Ruqi Huang, Lu Fang |
We propose XScale-NVS for high-fidelity cross-scale novel view synthesis of
real-world large-scale scenes. Existing representations based on explicit
surface suffer from discretization resolution or UV distortion, while implicit
volumetric representations lack scalability for large scenes due to the
dispersed weight distribution and surface ambiguity. In light of the above
challenges, we introduce hash featurized manifold, a novel hash-based
featurization coupled with a deferred neural rendering framework. This approach
fully unlocks the expressivity of the representation by explicitly
concentrating the hash entries on the 2D manifold, thus effectively
representing highly detailed contents independent of the discretization
resolution. We also introduce a novel dataset, namely GigaNVS, to benchmark
cross-scale, high-resolution novel view synthesis of real-world large-scale
scenes. Our method significantly outperforms competing baselines on various
real-world scenes, yielding an average LPIPS that is 40% lower than prior
state-of-the-art on the challenging GigaNVS benchmark. Please see our project
page at: xscalenvs.github.io. |
This paper proposes XScale-NVS, a novel hash featurized manifold representation, coupled with deferred neural rendering, for high-fidelity cross-scale novel view synthesis of large-scale scenes. |
Existing methods struggle to represent large-scale scenes with both macro-structure and micro-details. Explicit surface representations suffer from discretization resolution or UV distortion, while implicit volumetric representations lack scalability and have surface ambiguities. |
The method leverages a pre-computed mesh as a surface proxy. It utilizes volumetric multi-resolution hash encoding to featurize the surface manifold directly. A deferred neural rendering pipeline with surface multisampling and a manifold deformation mechanism decodes the representation. |
Significantly outperforms prior art on the challenging GigaNVS benchmark and Tanks & Temples dataset.
Reduces average LPIPS by 40% on GigaNVS compared to state-of-the-art.
Demonstrates robustness to mesh resolution and superior efficiency. |
Current method cannot fully address the incompleteness and occlusions caused by incorrect geometry.
Future work includes exploring differentiable rendering for better geometry handling. |
novel view synthesis, neural rendering, large-scale scene representation, hash encoding, deferred rendering |
2403.19495
Report |
CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians |
Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, Nima Khademi Kalantari |
The field of 3D reconstruction from images has rapidly evolved in the past
few years, first with the introduction of Neural Radiance Field (NeRF) and more
recently with 3D Gaussian Splatting (3DGS). The latter provides a significant
edge over NeRF in terms of the training and inference speed, as well as the
reconstruction quality. Although 3DGS works well for dense input images, the
unstructured point-cloud like representation quickly overfits to the more
challenging setup of extremely sparse input images (e.g., 3 images), creating a
representation that appears as a jumble of needles from novel views. To address
this issue, we propose regularized optimization and depth-based initialization.
Our key idea is to introduce a structured Gaussian representation that can be
controlled in 2D image space. We then constrain the Gaussians, in particular
their position, and prevent them from moving independently during optimization.
Specifically, we introduce single and multiview constraints through an implicit
convolutional decoder and a total variation loss, respectively. With the
coherency introduced to the Gaussians, we further constrain the optimization
through a flow-based loss function. To support our regularized optimization, we
propose an approach to initialize the Gaussians using monocular depth estimates
at each input view. We demonstrate significant improvements compared to the
state-of-the-art sparse-view NeRF-based approaches on a variety of scenes. |
This paper introduces CoherentGS, a novel approach for sparse novel view synthesis using 3D Gaussian Splatting (3DGS) by enforcing coherency among Gaussians through regularized optimization and depth-based initialization. |
Existing 3DGS methods struggle with sparse inputs, leading to overfitting and poor novel view quality. NeRF-based alternatives, while designed for sparsity, have limitations in regularization and are not directly applicable to the explicit, unstructured nature of 3DGS. |
CoherentGS assigns a Gaussian to each input image pixel, initializes their positions using monocular depth, and regularizes optimization using: 1) An implicit decoder for smooth single-view depth residuals. 2) Total variation loss for multi-view consistent depth. 3) Flow-based loss for similar Gaussian positions in corresponding image pairs. |
Outperforms state-of-the-art sparse-view NeRF methods on LLFF and NVS-RGBD datasets, particularly in perceptual quality (LPIPS).
Reconstructs high-quality textures and smooth geometry even with extremely sparse inputs (2-4 images).
Identifies occluded regions, enabling targeted inpainting for realistic hallucination of missing details. |
Struggles with transparent objects due to the single-Gaussian-per-pixel representation.
Relies on monocular depth quality, potentially impacting performance with inaccurate estimates. |
sparse view synthesis, 3d gaussian splatting, implicit decoder, novel view synthesis, 3d reconstruction |
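The total variation regularizer mentioned above penalizes differences between neighboring values. A minimal sketch on a small depth map is given below; the anisotropic L1 form, the lack of normalization, and the function name are assumptions, since the source does not give the exact formulation used in CoherentGS.

```python
def total_variation(depth):
    """Sum of absolute differences between horizontally and vertically
    adjacent entries of a 2D depth map (list of equal-length rows)."""
    rows, cols = len(depth), len(depth[0])
    horizontal = sum(
        abs(depth[r][c + 1] - depth[r][c])
        for r in range(rows)
        for c in range(cols - 1)
    )
    vertical = sum(
        abs(depth[r + 1][c] - depth[r][c])
        for r in range(rows - 1)
        for c in range(cols)
    )
    return horizontal + vertical


smooth = [[1.0, 1.0], [1.0, 1.0]]
spiky = [[1.0, 3.0], [1.0, 1.0]]
print(total_variation(smooth))  # 0.0
print(total_variation(spiky))   # 4.0
```

A single outlier raises the penalty through every neighbor it touches, which is the mechanism that discourages isolated, needle-like depth values.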
2403.19473
Report |
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM |
Tongyan Hua, Lin Wang |
Implicit neural representation (INR), in combination with geometric
rendering, has recently been employed in real-time dense RGB-D SLAM. Despite
active research efforts, a unified protocol for fair evaluation is still
lacking, impeding the evolution of this area. In this work, we establish, to
our knowledge, the first open-source benchmark framework to evaluate the
performance of a wide spectrum of commonly used INRs and rendering functions
for mapping and localization. The goal of our benchmark is to 1) gain an
intuition of how different INRs and rendering functions impact mapping and
localization and 2) establish a unified evaluation protocol w.r.t. the design
choices that may impact the mapping and localization. With the framework, we
conduct a large suite of experiments, offering various insights in choosing the
INRs and geometric rendering functions: for example, the dense feature grid
outperforms other INRs (e.g. tri-plane and hash grid), even when geometric and
color features are jointly encoded for memory efficiency. To extend the
findings into the practical scenario, a hybrid encoding strategy is proposed to
bring the best of the accuracy and completion from the grid-based and
decomposition-based INRs. We further propose explicit hybrid encoding for
high-fidelity dense grid mapping to comply with the RGB-D SLAM system, which places
a premium on robustness and computational efficiency. |
This paper introduces the first open-source benchmark framework for evaluating the performance of different Implicit Neural Representations (INRs) and rendering functions within a unified RGB-D SLAM system. |
A standardized benchmark is crucial for fair comparison and understanding how different INR and rendering choices impact mapping and localization accuracy, especially given the lack of such a framework in active NeRF-SLAM research. |
The benchmark evaluates various INR structures (MLP, Dense Grid, Sparse Grid, Tri-plane, Factorization) and rendering functions (SDF-based) on two scenarios: a controlled lab setting (Replica dataset) and a practical setting with noisy data and partial scene coverage (NeuralRGBD dataset). Performance is assessed through metrics like ATE, PSNR, Depth L1, Accuracy, Completion, and Completion Ratio. |
Dense grid representation consistently outperforms other INRs in the lab setting, achieving the best accuracy and speed.
Decomposition-based INRs (Tri-plane, Factorization) show advantages in the practical setting with partial scene coverage, indicating better generalization, although they remain less accurate than the dense grid.
A novel "hybrid encoding" strategy combining dense grid and tri-plane achieves superior trajectory estimation and reconstruction fidelity in both scenarios. |
The benchmark mainly focuses on orthogonal spatial splitting representations, neglecting recent advances in point-based methods like Point-NeRF and 3D Gaussians.
Future work should explore diverse scene types and integrate SLAM-centric and NeRF-centric methodologies for a unified evaluation. |
slam, nerf, implicit neural representation, benchmarking, 3d reconstruction |
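Among the metrics listed above, ATE (absolute trajectory error) is usually reported as an RMSE over per-pose translation errors. The sketch below assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame; standard evaluations typically apply a rigid alignment first, which is omitted here. The function name and tuple layout are hypothetical and not taken from the benchmark's code.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translation error between two trajectories,
    each a list of (x, y, z) positions with matching indices."""
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(pose_est, pose_gt))
        for pose_est, pose_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Same trajectory, except the last pose drifts by (3, 4, 0), i.e. 5 units.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (6.0, 4.0, 0.0)]
print(ate_rmse(est, gt))  # 2.5
```

Because the errors are squared before averaging, a single large drift dominates the score, so ATE is sensitive to tracking failures even when most poses are accurate.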
2403.19456
Report |
Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization |
Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Oliver Deussen, Weiming Dong, Jintao Li, Tong-Yee Lee |
Personalized generation paradigms empower designers to customize visual
intellectual properties with the help of textual descriptions by tuning or
adapting pre-trained text-to-image models on a few images. Recent works explore
approaches for concurrently customizing both content and detailed visual style
appearance. However, these existing approaches often generate images where the
content and style are entangled. In this study, we reconsider the customization
of content and style concepts from the perspective of parameter space
construction. Unlike existing methods that utilize a shared parameter space for
content and style, we propose a learning framework that separates the parameter
space to facilitate individual learning of content and style, thereby enabling
disentangled content and style. To achieve this goal, we introduce "partly
learnable projection" (PLP) matrices to separate the original adapters into
divided sub-parameter spaces. We propose a "break-for-make" customization
learning pipeline based on PLP, which is simple yet effective. We break the
original adapters into "up projection" and "down projection", train content and
style PLPs individually with the guidance of corresponding textual prompts in
the separate adapters, and maintain generalization by employing a
multi-correspondence projection learning strategy. Based on the adapters broken
apart for separately training content and style, we then make the entity
parameter space by reconstructing the content and style PLP matrices, followed
by fine-tuning the combined adapter to generate the target object with the
desired appearance. Experiments on various styles, including textures,
materials, and artistic style, show that our method outperforms
state-of-the-art single/multiple concept learning pipelines in terms of
content-style-prompt alignment. |
Introduces "Break-for-Make", a novel learning framework using "partly learnable projection" (PLP) matrices to disentangle content and style customization in text-to-image generation. |
Existing methods for content and style customization in text-to-image generation often lead to entangled results, limiting control over individual aspects. |
Employs PLP matrices to separate parameter space for content and style, enabling individual training guided by corresponding textual prompts. Utilizes a multi-correspondence projection learning strategy for generalization. |
Achieves superior content-style-prompt alignment compared to state-of-the-art methods.
Demonstrates effective disentanglement of content and style customization.
Maintains high fidelity in generated images. |
Exploration of alternative projection learning strategies for potential improvements.
Evaluation of the approach on a wider range of visual styles and complexities. |
text-to-image generation, content-style disentanglement, personalized image generation, deep learning, computer vision |
2403.19386
Report |
PointCloud-Text Matching: Benchmark Datasets and a Baseline |
Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu |
In this paper, we present and study a new instance-level retrieval task:
PointCloud-Text Matching (PTM), which aims to find the exact cross-modal
instance that matches a given point-cloud query or text query. PTM could be
applied to various scenarios, such as indoor/urban-canyon localization and
scene retrieval. However, there exists no suitable and targeted dataset for PTM
in practice. Therefore, we construct three new PTM benchmark datasets, namely
3D2T-SR, 3D2T-NR, and 3D2T-QA. We observe that the data is challenging and exhibits
noisy correspondence due to the sparsity, noise, or disorder of point clouds
and the ambiguity, vagueness, or incompleteness of texts, which make existing
cross-modal matching methods ineffective for PTM. To tackle these challenges,
we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa).
RoMa consists of two modules: a Dual Attention Perception module (DAP) and a
Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages
token-level and feature-level attention to adaptively focus on useful local and
global features, and aggregate them into common representations, thereby
reducing the adverse impact of noise and ambiguity. To handle noisy
correspondence, RNCL divides negative pairs, which are much less error-prone
than positive pairs, into clean and noisy subsets, and assigns them forward and
reverse optimization directions respectively, thus enhancing robustness against
noisy correspondence. We conduct extensive experiments on our benchmarks and
demonstrate the superiority of RoMa. |
This paper introduces PointCloud-Text Matching (PTM), a novel instance-level retrieval task aiming to match point cloud and text data, and proposes RoMa, a robust baseline method for PTM. |
PTM addresses the need for precise instance-level alignment between point clouds and textual descriptions, with applications in indoor/urban localization and scene retrieval. Existing methods are insufficient due to the challenges of noisy, sparse point cloud data and ambiguous textual descriptions. |
The authors propose RoMa, comprising a Dual Attention Perception (DAP) module and a Robust Negative Contrastive Learning (RNCL) module. DAP captures local and global features through token and feature-level attention. RNCL handles noisy correspondences by identifying and differently optimizing for clean and noisy negative pairs. |
RoMa significantly outperforms existing Image-Text Matching methods adapted to PTM, demonstrating its effectiveness.
The study highlights the significant challenge noisy correspondences pose in PTM datasets.
Ablation studies show the contribution of both DAP and RNCL to RoMa's performance. |
The performance on PTM datasets is still relatively low compared to Image-Text Matching datasets, indicating room for improvement.
The paper primarily focuses on indoor scene datasets, and future work could explore other environments like outdoor urban scenes. |
pointcloud-text matching, cross-modal retrieval, 3d vision and language, dual attention, robust contrastive learning |
2403.19322
Report |
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models |
Jiaxing Chen, Yuxuan Liu, Dehu Li, Xiang An, Ziyong Feng, Yongle Zhao, Yin Xie |
The surge of Multimodal Large Language Models (MLLMs), given their prominent
emergent capabilities in instruction following and reasoning, has greatly
advanced the field of visual reasoning. However, constrained by their
non-lossless image tokenization, most MLLMs fall short of comprehensively
capturing details of text and objects, especially in high-resolution images. To
address this, we propose P2G, a novel framework for plug-and-play grounding of
reasoning in MLLMs. Specifically, P2G exploits the tool-usage potential of
MLLMs to employ expert agents to achieve on-the-fly grounding to critical
visual and textual objects of image, thus achieving deliberate reasoning via
multimodal prompting. We further create P2GB, a benchmark aimed at assessing
MLLMs' ability to understand inter-object relationships and text in challenging
high-resolution images. Comprehensive experiments on visual reasoning tasks
demonstrate the superiority of P2G. Notably, P2G achieved comparable
performance with GPT-4V on P2GB, with a 7B backbone. Our work highlights the
potential of plug-and-play grounding of reasoning and opens up a promising
alternative beyond model scaling. |
This paper proposes **P$^2$G (Plug-and-Play Grounding)**, a framework that enhances multimodal large language models (MLLMs) to perform grounded reasoning on high-resolution and text-rich images, by leveraging external agents for retrieving crucial visual and textual clues. |
Current MLLMs struggle to comprehensively capture details in complex images due to limitations in image tokenization and the need for extensive instruction tuning data. P$^2$G addresses these limitations by enabling MLLMs to call upon specialized agents for on-the-fly grounding, leading to more accurate and grounded reasoning. |
P$^2$G employs OCR and visual grounding agents (PaddleOCR and Grounding-DINO) to extract textual and visual clues from images based on the MLLM’s assessment of the complexity of the given query. These clues, along with their positions, are integrated into multimodal prompts for subsequent reasoning by the MLLM. |
P$^2$G significantly outperforms existing MLLMs on text-rich visual reasoning benchmarks, including DocVQA and ChartVQA, achieving up to 3x improvement.
On a newly proposed challenging benchmark P$^2$GB, which includes high-resolution and text-rich images, P$^2$G demonstrates superior performance, even surpassing GPT-4V on certain tasks.
Ablation studies confirm the importance of both grounding agents and the inclusion of spatial information for achieving optimal performance. |
The model's reliance on external agents may introduce latency.
The current implementation has a limited context window size for processing large amounts of textual information. |
multimodal large language models, visual reasoning, plug-and-play, grounding, text recognition |
2403.19319
Report |
Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation |
Yujin Chen, Yinyu Nie, Benjamin Ummenhofer, Reiner Birkl, Michael Paulitsch, Matthias Müller, Matthias Nießner |
We present Mesh2NeRF, an approach to derive ground-truth radiance fields from
textured meshes for 3D generation tasks. Many 3D generative approaches
represent 3D scenes as radiance fields for training. Their ground-truth
radiance fields are usually fitted from multi-view renderings from a
large-scale synthetic 3D dataset, which often results in artifacts due to
occlusions or under-fitting issues. In Mesh2NeRF, we propose an analytic
solution to directly obtain ground-truth radiance fields from 3D meshes,
characterizing the density field with an occupancy function featuring a defined
surface thickness, and determining view-dependent color through a reflection
function considering both the mesh and environment lighting. Mesh2NeRF extracts
accurate radiance fields, which provide direct supervision for training
generative NeRFs and single scene representation. We validate the effectiveness
of Mesh2NeRF across various tasks, achieving a noteworthy 3.12dB improvement in
PSNR for view synthesis in single scene representation on the ABO dataset, a
0.69 PSNR enhancement in the single-view conditional generation of ShapeNet
Cars, and notably improved mesh extraction from NeRF in the unconditional
generation of Objaverse Mugs. |
Presents Mesh2NeRF, a method to derive ground-truth radiance fields directly from textured 3D meshes for improved 3D generation. |
Addresses limitations of existing methods that rely on 2D supervision from multi-view renderings, which can lead to inaccurate reconstructions, particularly with limited or imbalanced views. |
Analytically generates a radiance field from a textured mesh by modeling the density field with an occupancy function and determining view-dependent color using a reflection function considering mesh and environment lighting. |
Achieves a 3.12dB PSNR improvement in single scene representation on the ABO dataset.
Shows a 0.69 PSNR enhancement in single-view conditional generation on ShapeNet Cars.
Generates significantly improved mesh extractions from NeRF in unconditional generation on Objaverse Mugs. |
Current implementation bakes lighting information into the appearance, similar to NeRF.
Relies on existing ray sampling techniques designed for rendered images, limiting efficiency. |
radiance field supervision, nerf generation, mesh prior, 3d generation, novel view synthesis |
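The Mesh2NeRF summary above describes a density field defined by an occupancy function with a fixed surface thickness. The sketch below illustrates that idea only; it is not the authors' implementation. The function names, the thickness, the density constant, and the step size are all hypothetical, and the compositing weights use the standard NeRF quadrature rather than any formulation confirmed by the paper.

```python
import math


def occupancy_density(distance_to_surface: float, thickness: float,
                      sigma_inside: float) -> float:
    """Return a constant density inside a thin shell around the mesh surface.

    Assumption: points closer to the surface than `thickness` count as occupied.
    """
    return sigma_inside if abs(distance_to_surface) < thickness else 0.0


def compositing_weights(densities: list[float], step: float) -> list[float]:
    """Standard volume-rendering weights w_i = T_i * alpha_i along one ray."""
    weights = []
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    for sigma in densities:
        alpha = 1.0 - math.exp(-sigma * step)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


# Hypothetical ray: four samples, only the third lies within the surface shell.
distances = [0.5, 0.2, 0.004, 0.2]
densities = [occupancy_density(d, thickness=0.005, sigma_inside=1e4)
             for d in distances]
weights = compositing_weights(densities, step=0.01)
```

With a large density inside the shell, almost all of the rendering weight lands on the sample at the surface, so the ray colour is determined by the mesh rather than by a fitted approximation.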
2403.19314
Report |
Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction |
Xiaoyang Lyu, Chirui Chang, Peng Dai, Yang-Tian Sun, Xiaojuan Qi |
Scene reconstruction from multi-view images is a fundamental problem in
computer vision and graphics. Recent neural implicit surface reconstruction
methods have achieved high-quality results; however, editing and manipulating
the 3D geometry of reconstructed scenes remains challenging due to the absence
of naturally decomposed object entities and complex object/background
compositions. In this paper, we present Total-Decom, a novel method for
decomposed 3D reconstruction with minimal human interaction. Our approach
seamlessly integrates the Segment Anything Model (SAM) with hybrid
implicit-explicit neural surface representations and a mesh-based
region-growing technique for accurate 3D object decomposition. Total-Decom
requires minimal human annotations while providing users with real-time control
over the granularity and quality of decomposition. We extensively evaluate our
method on benchmark datasets and demonstrate its potential for downstream
applications, such as animation and scene editing. The code is available at
https://github.com/CVMI-Lab/Total-Decom.git. |
Presents Total-Decom, a novel framework for 3D scene reconstruction and decomposition into individual objects and backgrounds from multi-view images, minimizing the need for human annotations by leveraging the Segment Anything Model (SAM). |
Editing and manipulating the 3D geometry of traditionally reconstructed scenes is challenging due to the lack of decomposed object entities. Total-Decom addresses this by enabling the extraction of object-level shapes for applications like editing, animation, and simulation. |
Integrates SAM with a hybrid implicit-explicit neural surface representation. Employs an implicit neural field for reconstruction while distilling features from SAM. Extracts explicit mesh surfaces and distills features into their vertices. Uses SAM decoder to convert user clicks into object masks, guiding a mesh-based region-growing algorithm for object decomposition. |
Achieves superior scene and decomposed object reconstruction quality compared to state-of-the-art methods like ObjSDF++ on the Replica dataset.
Enables interactive decomposition of scenes at varying granularity levels, typically requiring only one click per object.
Demonstrates robust background reconstruction, accurately reconstructing even occluded areas. |
Limitations in handling occluded foreground areas due to the absence of training supervision for invisible regions.
Future work will explore integrating generative methods to complete occluded 3D objects and further improve mesh quality. |
3d reconstruction, scene decomposition, segment anything model (sam), neural implicit surfaces, region growing |
2403.19254
Report |
Imperceptible Protection against Style Imitation from Diffusion Models |
Namhyuk Ahn, Wonhyuk Ahn, KiYoon Yoo, Daesik Kim, Seung-Hun Nam |
Recent progress in diffusion models has profoundly enhanced the fidelity of
image generation. However, this has raised concerns about copyright
infringements. While prior methods have introduced adversarial perturbations to
prevent style imitation, most are accompanied by the degradation of artworks'
visual quality. Recognizing the importance of maintaining this, we develop a
visually improved protection method that preserves its protection capability.
To this end, we create a perceptual map to identify areas most sensitive to
human eyes. We then adjust the protection intensity guided by an instance-aware
refinement. We also integrate a perceptual constraints bank to further improve
the imperceptibility. Results show that our method substantially elevates the
quality of the protected image without compromising on protection efficacy. |
This paper proposes IMPASTO, a novel method to protect artistic styles from unauthorized imitation by diffusion models while preserving the visual quality of the protected artwork. |
The rise of powerful image generation models like Stable Diffusion leads to concerns about copyright infringement as they can be used to replicate artistic styles without permission. |
IMPASTO introduces a perception-aware protection (PAP) strategy using perceptual maps based on Just Noticeable Difference (JND) models to identify areas less sensitive to human perception for perturbation. It further enhances imperceptibility by incorporating a perceptual constraint bank that leverages LPIPS, low-pass filtering, and CLIP features. |
IMPASTO significantly improves the visual quality of protected images compared to existing methods while maintaining comparable protection performance.
The instance-wise refinement in IMPASTO allows adaptation to specific artworks, leading to better trade-offs between imperceptibility and protection.
IMPASTO demonstrates robustness against various countermeasures and generalizes well to unknown personalization methods and diffusion models. |
Current protection methods rely on adversarial perturbations, which are computationally expensive and time-consuming.
Future research could explore more efficient protection mechanisms to address the time constraints. |
style protection, diffusion models, copyright infringement, adversarial perturbation, perceptual quality |
2403.19205
Report |
From Activation to Initialization: Scaling Insights for Optimizing Neural Fields |
Hemanth Saratchandran, Sameera Ramasinghe, Simon Lucey |
In the realm of computer vision, Neural Fields have gained prominence as a
contemporary tool harnessing neural networks for signal representation. Despite
the remarkable progress in adapting these networks to solve a variety of
problems, the field still lacks a comprehensive theoretical framework. This
article aims to address this gap by delving into the intricate interplay
between initialization and activation, providing a foundational basis for the
robust optimization of Neural Fields. Our theoretical insights reveal a
deep-seated connection among network initialization, architectural choices, and
the optimization process, emphasizing the need for a holistic approach when
designing cutting-edge Neural Fields. |
This paper provides theoretical insights into the scaling dynamics of Neural Fields, particularly focusing on how the number of parameters affects gradient descent convergence in relation to dataset size. |
The paper addresses the lack of a comprehensive theoretical framework for Neural Fields, aiming to establish a foundation for their robust optimization. |
The authors theoretically analyze the scaling laws for neural fields with sine, sinc, Gaussian, and wavelet activations, proving the convergence of gradient descent under specific overparameterization conditions. They also develop a novel initialization scheme and empirically validate their findings on various applications. |
Neural Fields with sine, sinc, Gaussian, or wavelet activations require less overparameterization than those with ReLU for gradient descent convergence.
The authors propose a novel initialization scheme that significantly improves parameter efficiency compared to standard methods like LeCun, Xavier, and Kaiming.
Empirical validation on applications like image regression, super-resolution, shape reconstruction, and physics-informed neural networks supports the theoretical findings. |
Theoretical results currently apply only to full-batch gradient descent, not mini-batch training.
Exploring the generalization of the findings to other activation functions and network architectures is left for future work. |
neural fields, overparameterization, initialization, gradient descent, scaling laws |
2403.19164
Report |
RecDiffusion: Rectangling for Image Stitching with Diffusion Models |
Tianhao Zhou, Haipeng Li, Ziyi Wang, Ao Luo, Chen-Lin Zhang, Jiajun Li, Bing Zeng, Shuaicheng Liu |
Image stitching from different captures often results in non-rectangular
boundaries, which are often considered unappealing. To address non-rectangular
boundaries, current solutions involve cropping, which discards image content,
inpainting, which can introduce unrelated content, or warping, which can
distort non-linear features and introduce artifacts. To overcome these issues,
we introduce a novel diffusion-based learning framework, RecDiffusion,
for image stitching rectangling. This framework combines Motion Diffusion
Models (MDM) to generate motion fields, effectively transitioning from the
stitched image's irregular borders to a geometrically corrected intermediary,
followed by Content Diffusion Models (CDM) for image detail refinement.
Notably, our sampling process utilizes a weighted map to identify regions
needing correction during each iteration of CDM. Our RecDiffusion ensures
geometric accuracy and overall visual appeal, surpassing all previous methods
in both quantitative and qualitative measures when evaluated on public
benchmarks. Code is released at https://github.com/lhaippp/RecDiffusion. |
This paper presents RecDiffusion, the first diffusion-based learning framework for rectangling images stitched from multiple captures, overcoming limitations of cropping, inpainting, and warping methods. |
Image stitching often results in non-rectangular boundaries, which are aesthetically unappealing. Existing solutions either sacrifice content, introduce artifacts, or struggle with non-linear features. |
RecDiffusion utilizes a two-step process: 1) Motion Diffusion Models (MDM) generate motion fields to rectify irregular boundaries, and 2) Content Diffusion Models (CDM) refine image details using a weighted sampling map based on the Rank-Nullity Theorem. |
RecDiffusion outperforms previous state-of-the-art methods in both quantitative metrics (FID, SSIM, PSNR) and qualitative comparisons.
The method effectively eliminates white edges and minimizes artifacts like line discontinuities and distortions.
RecDiffusion demonstrates strong generalization ability, effectively rectangling images from different datasets. |
The model currently relies on pre-trained motion fields, potentially limiting performance.
Future work could explore joint optimization of motion estimation and content refinement within the diffusion framework. |
image stitching, image rectangling, diffusion models, motion estimation, content refinement |
2403.19046
Report |
LITA: Language Instructed Temporal-Localization Assistant |
De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz |
There has been tremendous progress in multimodal Large Language Models
(LLMs). Recent works have extended these models to video input with promising
instruction following capabilities. However, an important missing piece is
temporal localization. These models cannot accurately answer the "When?"
questions. We identify three key aspects that limit their temporal localization
capabilities: (i) time representation, (ii) architecture, and (iii) data. We
address these shortcomings by proposing Language Instructed
Temporal-Localization Assistant (LITA) with the following features: (1) We
introduce time tokens that encode timestamps relative to the video length to
better represent time in videos. (2) We introduce SlowFast tokens in the
architecture to capture temporal information at fine temporal resolution. (3)
We emphasize temporal localization data for LITA. In addition to leveraging
existing video datasets with timestamps, we propose a new task, Reasoning
Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for
learning and evaluating this task. Reasoning temporal localization requires
both the reasoning and temporal localization of Video LLMs. LITA demonstrates
strong performance on this challenging task, nearly doubling the temporal mean
intersection-over-union (mIoU) of baselines. In addition, we show that our
emphasis on temporal localization also substantially improves video-based text
generation compared to existing Video LLMs, including a 36% relative
improvement of Temporal Understanding. Code is available at:
https://github.com/NVlabs/LITA |
This paper proposes Language Instructed Temporal-Localization Assistant (LITA), a novel Video LLM framework designed to enable accurate temporal event localization in videos, addressing a key limitation of existing Video LLMs. |
Temporal localization is crucial for comprehensive video understanding, differentiating videos from images. Current Video LLMs struggle to pinpoint event timings, hindering their ability to fully interpret and interact with video content. |
LITA introduces three key innovations: 1) **Time tokens:** Representing relative timestamps (e.g., first 10% of the video) instead of absolute ones for improved time representation. 2) **SlowFast tokens:** Inspired by the SlowFast architecture, LITA uses densely sampled fast tokens for temporal information and sparsely sampled slow tokens for spatial details, enabling efficient processing of numerous frames. 3) **Emphasis on temporal localization data:** LITA is trained on a diverse range of tasks including a novel Reasoning Temporal Localization (RTL) task with the ActivityNet-RTL dataset. RTL requires models to reason about events not explicitly stated, promoting both temporal and contextual understanding. |
On the ActivityNet-RTL benchmark, LITA significantly outperforms baseline models, nearly doubling the temporal mean intersection-over-union (mIoU) score.
LITA demonstrates the ability to provide detailed and accurate explanations for its temporal localization reasoning, showcasing enhanced video understanding.
Beyond accurate temporal localization, LITA exhibits substantial improvements in general video-based text generation tasks compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding on a benchmark by Maaz et al. (2023). |
The discretization of timestamps into time tokens, while beneficial, introduces a level of discretization error in temporal localization.
Future work could explore alternative time representation methods within the LLM framework to potentially mitigate discretization error and further enhance temporal accuracy. |
video language models, temporal localization, reasoning, multimodal learning, computer vision |
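The LITA summary above describes time tokens that encode timestamps relative to the video length, notes the discretization error this introduces, and reports results in temporal mIoU. The sketch below shows one way such an encoding could work. The rounding rule, the default of 100 tokens, and all function names are assumptions for illustration, not details confirmed by the summary.

```python
def encode_time_token(timestamp: float, video_length: float,
                      num_tokens: int = 100) -> int:
    """Map an absolute timestamp (seconds) to a 1-based relative time-token index."""
    if video_length <= 0 or num_tokens < 2:
        raise ValueError("video_length must be positive and num_tokens >= 2")
    if not 0.0 <= timestamp <= video_length:
        raise ValueError("timestamp must lie within the video")
    return round(timestamp / video_length * (num_tokens - 1)) + 1


def decode_time_token(token: int, video_length: float,
                      num_tokens: int = 100) -> float:
    """Map a 1-based time-token index back to an absolute timestamp (seconds)."""
    return video_length * (token - 1) / (num_tokens - 1)


def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals on the time axis."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Round trip for a hypothetical 60-second video: the recovered timestamp differs
# from the input by at most half a token step, which is the discretization error.
token = encode_time_token(12.3, video_length=60.0)
recovered = decode_time_token(token, video_length=60.0)
```

Because the tokens are relative, the same vocabulary covers videos of any length. The cost is that the worst-case error grows with the video length, here half of `video_length / (num_tokens - 1)`.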
2403.18978
Report |
TextCraftor: Your Text Encoder Can be Image Quality Controller |
Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren |
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have
revolutionized the field of content generation, enabling significant
advancements in areas like image editing and video synthesis. Despite their
formidable capabilities, these models are not without their limitations. It is
still challenging to synthesize an image that aligns well with the input text,
and multiple runs with carefully crafted prompts are required to achieve
satisfactory results. To mitigate these limitations, numerous studies have
endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing
various technologies. Yet, amidst these efforts, a pivotal question of
text-to-image diffusion model training has remained largely unexplored: Is it
possible and feasible to fine-tune the text encoder to improve the performance
of text-to-image diffusion models? Our findings reveal that, instead of
replacing the CLIP text encoder used in Stable Diffusion with other large
language models, we can enhance it through our proposed fine-tuning approach,
TextCraftor, leading to substantial improvements in quantitative benchmarks and
human assessments. Interestingly, our technique also empowers controllable
image generation through the interpolation of different text encoders
fine-tuned with various rewards. We also demonstrate that TextCraftor is
orthogonal to UNet finetuning, and can be combined to further improve
generative quality. |
This paper introduces TextCraftor, a novel approach to fine-tuning the text encoder in text-to-image diffusion models for improved image quality and text-image alignment. |
Existing methods for improving diffusion models primarily focus on fine-tuning the UNet or using larger language models, which can be computationally expensive. Fine-tuning the text encoder offers a more efficient way to enhance performance. |
TextCraftor leverages public reward functions (e.g., aesthetics, text-image alignment) to guide the fine-tuning process. It employs a prompt-based approach, eliminating the need for paired text-image datasets and enabling optimization with only text prompts. To ensure generality and avoid mode collapse, it incorporates CLIP space similarity as a constraint during training. |
TextCraftor significantly improves image quality and text-image alignment compared to pre-trained models like SDv1.5, SDv2.0, SDXL Base 0.9, and DeepFloyd-XL.
It outperforms prompt engineering techniques and previous state-of-the-art methods like DDPO.
The approach allows for controllable generation through interpolation of text embeddings from different fine-tuned models, enabling style mixing. |
The reliance on public reward functions can limit performance to the capabilities of those functions.
Fine-tuning larger diffusion models with TextCraftor can be computationally expensive, though the authors demonstrate strong generalization capabilities allowing for fine-tuning on smaller models and transferring to larger ones. |
text-to-image generation, diffusion models, text encoder fine-tuning, reward functions, controllable image synthesis |
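The TextCraftor summary above mentions controllable generation by interpolating text embeddings from encoders fine-tuned with different rewards. A minimal sketch of such a blend follows, assuming a simple per-element linear interpolation. The function name, the blend rule, and the toy embeddings are hypothetical, and the paper's actual mixing scheme may differ.

```python
def interpolate_embeddings(emb_a: list[list[float]], emb_b: list[list[float]],
                           weight: float) -> list[list[float]]:
    """Blend two [tokens x dims] embeddings as (1 - weight) * emb_a + weight * emb_b."""
    if len(emb_a) != len(emb_b) or any(
            len(row_a) != len(row_b) for row_a, row_b in zip(emb_a, emb_b)):
        raise ValueError("embeddings must have identical shapes")
    return [
        [(1.0 - weight) * a + weight * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(emb_a, emb_b)
    ]


# Toy embeddings standing in for the outputs of two fine-tuned text encoders
# (for example, one tuned for aesthetics and one for text-image alignment).
aesthetic_emb = [[0.0, 2.0], [4.0, 6.0]]
alignment_emb = [[2.0, 4.0], [8.0, 2.0]]
blended = interpolate_embeddings(aesthetic_emb, alignment_emb, weight=0.5)
```

Sweeping `weight` from 0 to 1 moves the conditioning from one encoder's output to the other's. The blended embedding would then be passed to the diffusion model in place of a single encoder's output.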
2403.18922
Report |
Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D |
Mukund Varma T, Peihao Wang, Zhiwen Fan, Zhangyang Wang, Hao Su, Ravi Ramamoorthi |
In recent years, there has been an explosion of 2D vision models for numerous
tasks such as semantic segmentation, style transfer or scene editing, enabled
by large-scale 2D image datasets. At the same time, there has been renewed
interest in 3D scene representations such as neural radiance fields from
multi-view images. However, the availability of 3D or multiview data is still
substantially limited compared to 2D image datasets, making extending 2D vision
models to 3D data highly desirable but also very challenging. Indeed, extending
a single 2D vision operator like scene editing to 3D typically requires a
highly creative method specialized to that task and often requires per-scene
optimization. In this paper, we ask the question of whether any 2D vision model
can be lifted to make 3D consistent predictions. We answer this question in the
affirmative; our new Lift3D method trains to predict unseen views on feature
spaces generated by a few visual models (i.e. DINO and CLIP), but then
generalizes to novel vision operators and tasks, such as style transfer,
super-resolution, open vocabulary segmentation and image colorization; for some
of these tasks, there is no comparable previous 3D method. In many cases, we
even outperform state-of-the-art methods specialized for the task in question.
Moreover, Lift3D is a zero-shot method, in the sense that it requires no
task-specific training, nor scene-specific optimization. |
Lift3D is a novel method that leverages generalizable novel view synthesis to lift any 2D vision model to 3D, enabling view-consistent predictions from arbitrary angles without task-specific training or scene-specific optimization. |
Extending 2D vision models to 3D is crucial for applications like autonomous driving and robotics, but is challenging due to the limited availability of 3D data and the complexity of existing methods. |
Lift3D trains a neural renderer to interpolate features from pre-trained 2D vision models across multiple views, using a corrective aggregation strategy to ensure consistency. |
Lift3D achieves comparable or better performance than state-of-the-art methods on 3D semantic segmentation, style transfer, and scene editing.
It exhibits strong zero-shot generalization, enabling the lifting of various 2D vision models for tasks like open vocabulary segmentation and image colorization without additional training.
The method is computationally efficient, particularly when generating predictions for numerous viewpoints. |
Lift3D's performance may be limited in scenes with sparse views or complex light transport where epipolar geometry doesn't hold.
The interpolation strategy may result in a slight loss of visual quality compared to per-scene optimization methods. |
3d vision, novel view synthesis, feature lifting, zero-shot learning, multi-view consistency |
2403.18820
Report |
MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering |
Guoxing Sun, Rishabh Dabral, Pascal Fua, Christian Theobalt, Marc Habermann |
Faithful human performance capture and free-view rendering from sparse RGB
observations is a long-standing problem in Vision and Graphics. The main
challenges are the lack of observations and the inherent ambiguities of the
setting, e.g. occlusions and depth ambiguity. As a result, radiance fields,
which have shown great promise in capturing high-frequency appearance and
geometry details in dense setups, perform poorly when naïvely supervising
them on sparse camera views, as the field simply overfits to the sparse-view
inputs. To address this, we propose MetaCap, a method for efficient and
high-quality geometry recovery and novel view synthesis given very sparse or
even a single view of the human. Our key idea is to meta-learn the radiance
field weights solely from potentially sparse multi-view videos, which can serve
as a prior when fine-tuning them on sparse imagery depicting the human. This
prior provides a good network weight initialization, thereby effectively
addressing ambiguities in sparse-view capture. Due to the articulated structure
of the human body and motion-induced surface deformations, learning such a
prior is non-trivial. Therefore, we propose to meta-learn the field weights in
a pose-canonicalized space, which reduces the spatial feature range and makes
feature learning more effective. Consequently, one can fine-tune our field
parameters to quickly generalize to unseen poses, novel illumination conditions
as well as novel and sparse (even monocular) camera views. For evaluating our
method under different scenarios, we collect a new dataset, WildDynaCap, which
contains subjects captured in both a dense camera dome and in-the-wild sparse
camera rigs, and demonstrate superior results compared to recent
state-of-the-art methods on both public and WildDynaCap datasets. |
MetaCap is a novel method for high-quality human performance capture and rendering from sparse multi-view or even monocular images using a meta-learned implicit human representation. |
Sparse view human capture suffers from inherent ambiguities such as occlusions and depth ambiguity. Existing methods struggle to achieve both high fidelity and fast adaptation to novel poses, views, and illumination. |
The method meta-learns optimal network weights of an implicit human representation in a pose-canonicalized space from multi-view imagery. This prior enables fast fine-tuning on sparse in-the-wild images and handles occlusions via a visibility map and proxy images. |
Outperforms state-of-the-art methods in terms of geometry reconstruction and novel view synthesis on both public datasets and the new WildDynaCap dataset.
Generalizes to novel poses, surface deformations, lighting conditions, and camera parameters.
Supports reconstruction from various sparse multi-view and monocular imagery during both training and inference. |
The method can be sensitive to template fitting and motion capture inaccuracies.
Temporal information is not fully leveraged and could further enhance robustness.
Future work includes exploring real-time fine-tuning and cross-identity prior learning. |
human performance capture, meta-learning, implicit representations, sparse-view reconstruction, novel view synthesis |
2403.18819
Report |
Benchmarking Object Detectors with COCO: A New Path Forward |
Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, Karan Desai |
The Common Objects in Context (COCO) dataset has been instrumental in
benchmarking object detectors over the past decade. Like every dataset, COCO
contains subtle errors and imperfections stemming from its annotation
procedure. With the advent of high-performing models, we ask whether these
errors of COCO are hindering its utility in reliably benchmarking further
progress. In search for an answer, we inspect thousands of masks from COCO
(2017 version) and uncover different types of errors such as imprecise mask
boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to
the prevalence of COCO, we choose to correct these errors to maintain
continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner
set of annotations with visibly better mask quality than COCO-2017. We evaluate
fifty object detectors and find that models that predict visually sharper masks
score higher on COCO-ReM, affirming that they were being incorrectly penalized
due to errors in COCO-2017. Moreover, our models trained using COCO-ReM
converge faster and score higher than their larger variants trained using
COCO-2017, highlighting the importance of data quality in improving object
detectors. With these findings, we advocate using COCO-ReM for future object
detection research. Our dataset is available at https://cocorem.xyz |
The paper introduces COCO-ReM, a refined version of the COCO dataset for object detection with higher-quality instance annotations. |
COCO, while popular, has imperfections like coarse boundaries and non-exhaustive annotations, hindering its reliability in benchmarking object detectors. |
The authors developed a semi-automatic pipeline using SAM for mask refinement, imported instances from LVIS for exhaustiveness, and manually verified the validation set. |
All 50 evaluated object detectors scored higher on COCO-ReM than COCO-2017.
Query-based detectors outperform region-based detectors on COCO-ReM, aligning with human judgment of mask sharpness.
Models trained on COCO-ReM converge faster and perform better than those trained on COCO-2017, demonstrating the impact of data quality. |
Potential noise from SAM's occasional hallucination of disconnected components.
Limited manual verification to the validation set due to the large size of the training set. |
object detection, instance segmentation, dataset, benchmarking, coco |
2403.18814
Report |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models |
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia |
In this work, we introduce Mini-Gemini, a simple and effective framework
enhancing multi-modality Vision Language Models (VLMs). Despite the
advancements in VLMs facilitating basic visual dialog and reasoning, a
performance gap persists compared to advanced models like GPT-4 and Gemini. We
try to narrow the gap by mining the potential of VLMs for better performance
and any-to-any workflow from three aspects, i.e., high-resolution visual
tokens, high-quality data, and VLM-guided generation. To enhance visual tokens,
we propose to utilize an additional visual encoder for high-resolution
refinement without increasing the visual token count. We further construct a
high-quality dataset that promotes precise image comprehension and
reasoning-based generation, expanding the operational scope of current VLMs. In
general, Mini-Gemini further mines the potential of VLMs and empowers current
frameworks with image understanding, reasoning, and generation simultaneously.
Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs)
from 2B to 34B. It is demonstrated to achieve leading performance in several
zero-shot benchmarks and even surpasses the developed private models. Code and
models are available at https://github.com/dvlab-research/MiniGemini. |
Introduces Mini-Gemini, a simple yet effective framework that enhances multi-modality Vision Language Models (VLMs) by focusing on efficient high-resolution solutions, high-quality data, and expanded applications. |
Aims to bridge the performance gap between existing VLMs and advanced models like GPT-4 and Gemini, particularly in academic settings with limited resources. |
Utilizes dual vision encoders for low-resolution embedding and high-resolution candidate generation; employs patch info mining for efficient high-resolution detail extraction; constructs a high-quality dataset for training; and integrates with generative models for text and image generation. |
Achieves leading performance in various zero-shot benchmarks, outperforming existing methods, including LLaVA-1.5 and LLaVA-NeXT.
Demonstrates superior performance even compared to high-resource private models like Gemini Pro and Qwen-VL-Plus on challenging benchmarks like MMB and MMMU.
Showcases strong capabilities in handling complex visual understanding and reasoning tasks, as well as generating contextually relevant images from multi-modal instructions. |
Limitations in counting ability and complex visual reasoning due to potential gaps in training data.
Exploration of more advanced methods for visual understanding, reasoning, and generation, particularly in bridging VLMs and diffusion models. |
vision language models, multi-modality, image understanding, image generation, reasoning |
2403.18807
Report |
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation |
Suraj Patni, Aradhye Agarwal, Chetan Arora |
In the absence of parallax cues, a learning-based single image depth
estimation (SIDE) model relies heavily on shading and contextual cues in the
image. While this simplicity is attractive, it is necessary to train such
models on large and varied datasets, which are difficult to capture. It has
been shown that using embeddings from pre-trained foundational models, such as
CLIP, improves zero shot transfer in several applications. Taking inspiration
from this, in our paper we explore the use of global image priors generated
from a pre-trained ViT model to provide more detailed contextual information.
We argue that the embedding vector from a ViT model, pre-trained on a large
dataset, captures greater relevant information for SIDE than the usual route of
generating pseudo image captions, followed by CLIP based text embeddings. Based
on this idea, we propose a new SIDE model using a diffusion backbone which is
conditioned on ViT embeddings. Our proposed design establishes a new
state-of-the-art (SOTA) for SIDE on the NYUv2 dataset, achieving an Abs Rel error of
0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). On the
KITTI dataset, it achieves a Sq Rel error of 0.139 (2% improvement) compared to
0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model
trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%)
over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%,
18%, 45%, 9%) by ZoeDepth. The project page is available at
https://ecodepth-iitd.github.io |
This paper proposes a novel single image depth estimation (SIDE) model using a diffusion model conditioned on global image priors generated from a pre-trained Vision Transformer (ViT). |
This approach addresses the limitations of learning-based SIDE models that heavily rely on shading and contextual cues, making them domain-specific and difficult to generalize. |
The method utilizes a conditional diffusion architecture where semantic context is provided through embeddings generated using a pre-trained ViT model, rather than relying on pseudo image captions. |
Achieves state-of-the-art performance on NYU Depth v2 and KITTI datasets, significantly outperforming previous methods.
Demonstrates that using ViT embeddings for semantic context is more effective than employing pseudo captions and their CLIP embeddings.
Exhibits strong generalization and zero-shot transfer capabilities, outperforming state-of-the-art methods even when trained on a single dataset. |
The model requires significant computational resources for training.
Further exploration of optimal ViT architectures and embedding dimensions could potentially improve performance. |
single image depth estimation, diffusion models, vision transformer (vit), zero-shot transfer, semantic context |
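The relative improvements quoted in the abstract above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming "improvement" means the fractional reduction of a lower-is-better error metric relative to the previous SOTA; the helper name `relative_improvement` is hypothetical and not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of a lower-is-better error metric relative to the baseline."""
    return (baseline - new) / baseline


# Numbers taken from the abstract; the formula itself is an assumption.
nyu = relative_improvement(0.069, 0.059)    # Abs Rel on NYUv2: VPD vs. proposed model
kitti = relative_improvement(0.142, 0.139)  # Sq Rel on KITTI: GEDepth vs. proposed model

print(f"NYUv2 Abs Rel improvement: {nyu:.1%}")
print(f"KITTI Sq Rel improvement: {kitti:.1%}")
```

Under this assumption the values come out at roughly 14% and 2%, which matches the figures reported in the abstract.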
2403.18795
Report |
Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction |
Qiuhong Shen, Xuanyu Yi, Zike Wu, Pan Zhou, Hanwang Zhang, Shuicheng Yan, Xinchao Wang |
We tackle the challenge of efficiently reconstructing a 3D asset from a
single image with growing demands for automated 3D content creation pipelines.
Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural
Radiance Fields (NeRF). Despite their significant success, these approaches
encounter practical limitations due to lengthy optimization and considerable
memory usage. In this report, we introduce Gamba, an end-to-end amortized 3D
reconstruction model from single-view images, emphasizing two main insights:
(1) 3D representation: leveraging a large number of 3D Gaussians for an
efficient 3D Gaussian splatting process; (2) Backbone design: introducing a
Mamba-based sequential network that facilitates context-dependent reasoning and
linear scalability with the sequence (token) length, accommodating a
substantial number of Gaussians. Gamba incorporates significant advancements in
data preprocessing, regularization design, and training methodologies. We
assessed Gamba against existing optimization-based and feed-forward 3D
generation approaches using the real-world scanned OmniObject3D dataset. Here,
Gamba demonstrates competitive generation capabilities, both qualitatively and
quantitatively, while achieving remarkable speed, approximately 0.6 seconds on a
single NVIDIA A100 GPU. |
Introducing Gamba, an end-to-end amortized 3D reconstruction model from single-view images using 3D Gaussian Splatting and a Mamba-based sequential network. |
Addresses limitations of previous Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF) methods, which suffer from lengthy optimization, high memory usage, and rendering inefficiencies. |
Combines 3D Gaussian Splatting for efficient representation with a Mamba-based sequential network (GambaFormer) for context-dependent reasoning and linear scalability with token length. Employs robust training techniques like Gaussian parameter constraints and data augmentation. |
Achieves competitive generation quality compared to state-of-the-art methods, both qualitatively and quantitatively (PSNR, LPIPS, CLIP Distance).
Exhibits remarkable speed, reconstructing a 3D asset in approximately 0.6 seconds on a single NVIDIA A100 GPU, significantly faster than optimization-based alternatives.
Demonstrates effectiveness on the OmniObject3D dataset, showcasing reasonable geometry understanding and plausible texture generation. |
Struggles to generate sharp textures for occluded areas, particularly with complex textures.
Limited generalization to 'unseen' 3D assets with large domain disparity from the training data (OmniObject3D). |
3d reconstruction, single-view reconstruction, 3d gaussian splatting, mamba network, amortized inference |
2403.18784
Report |
SplatFace: Gaussian Splat Face Reconstruction Leveraging an Optimizable Surface |
Jiahao Luo, Jing Liu, James Davis |
We present SplatFace, a novel Gaussian splatting framework designed for 3D
human face reconstruction without reliance on accurate pre-determined geometry.
Our method is designed to simultaneously deliver both high-quality novel view
rendering and accurate 3D mesh reconstructions. We incorporate a generic 3D
Morphable Model (3DMM) to provide a surface geometric structure, making it
possible to reconstruct faces with a limited set of input images. We introduce
a joint optimization strategy that refines both the Gaussians and the morphable
surface through a synergistic non-rigid alignment process. A novel distance
metric, splat-to-surface, is proposed to improve alignment by considering both
the Gaussian position and covariance. The surface information is also utilized
to incorporate a world-space densification process, resulting in superior
reconstruction quality. Our experimental analysis demonstrates that the
proposed method is competitive with both other Gaussian splatting techniques in
novel view synthesis and other 3D reconstruction methods in producing 3D face
meshes with high geometric precision. |
Introduces SplatFace, a novel Gaussian splatting framework for 3D human face reconstruction from a limited set of input images without relying on accurate pre-determined geometry. |
Existing methods for 3D face reconstruction either rely on a large number of input images or require accurate pre-determined geometry, limiting their practical application. This paper aims to address this limitation. |
SplatFace incorporates a generic 3D Morphable Model (3DMM) and jointly optimizes the Gaussian splats and the morphable surface through a non-rigid alignment process guided by a novel splat-to-surface distance metric and world-space densification. |
SplatFace achieves higher quality novel view synthesis with fewer artifacts compared to baseline Gaussian splatting methods.
SplatFace outperforms state-of-the-art multi-view 3D face reconstruction methods in terms of geometric accuracy.
Joint optimization with a generic 3DMM initialization effectively reconstructs 3D face shapes, achieving comparable results to using ground truth surface initialization. |
The method might suffer from over-regularization in regions with complex geometry, such as teeth and hair, due to limitations of the surface model.
While outperforming existing methods, the rendered images, especially in far test views, are not entirely artifact-free, indicating room for further improvement. |
3d face reconstruction, gaussian splatting, novel view synthesis, 3d morphable model, few-shot learning |
2403.18775
Report |
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object |
Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, Chengzhi Mao |
We establish rigorous benchmarks for visual perception robustness. Synthetic
images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific
types of evaluation over synthetic corruptions, backgrounds, and textures, yet
those robustness benchmarks are restricted to specified variations and have low
synthetic quality. In this work, we introduce generative models as a data source
for synthesizing hard images that benchmark deep models' robustness. Leveraging
diffusion models, we are able to generate images with more diversified
backgrounds, textures, and materials than any prior work; we term this
benchmark ImageNet-D. Experimental results show that ImageNet-D causes a
significant accuracy drop across a range of vision models, from the standard ResNet
visual classifier to the latest foundation models like CLIP and MiniGPT-4,
reducing their accuracy by up to 60%. Our work suggests that
diffusion models can be an effective source to test vision models. The code and
dataset are available at https://github.com/chenshuang-zhang/imagenet_d. |
This paper introduces ImageNet-D, a new synthetic dataset for benchmarking the robustness of visual perception models, particularly against variations in background, texture, and material. |
Existing robustness benchmarks often rely on synthetic images with limited diversity and realism, failing to accurately assess model robustness in real-world scenarios. |
The authors leverage diffusion models to generate a vast pool of images with diverse object and nuisance combinations. Hard images, those misclassified by multiple surrogate models, are selectively retained and further validated by human annotators, forming the final ImageNet-D dataset. |
ImageNet-D causes a significant accuracy drop (up to 60%) across a range of vision models, including ResNets, ViTs, CLIP, LLaVA, and MiniGPT-4.
Existing data augmentation techniques, while effective on benchmarks like ImageNet-C, fail to improve robustness on ImageNet-D, suggesting its unique challenges.
Training models on diffusion-generated images with diverse attributes can enhance robustness on ImageNet-D and generalize better to real-world datasets like ObjectNet. |
The current version of ImageNet-D only includes a subset of ImageNet categories.
Future work could explore generating even more challenging images by leveraging advancements in generative models and incorporating additional nuisance factors. |
robustness, benchmarking, visual perception, diffusion models, synthetic data |
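The hard-image selection step described above (keep images that multiple surrogate models misclassify) can be sketched as a simple filter. This is an illustrative sketch only: the function name, the `min_failures` threshold, and the toy predictions are hypothetical, and the human validation step mentioned in the summary is not modeled.

```python
from typing import Dict, List


def select_hard_images(
    predictions: Dict[str, List[int]],
    labels: List[int],
    min_failures: int,
) -> List[int]:
    """Return indices of images misclassified by at least `min_failures` surrogate models."""
    hard = []
    for i, label in enumerate(labels):
        # Count how many surrogate models predict the wrong class for image i.
        failures = sum(1 for preds in predictions.values() if preds[i] != label)
        if failures >= min_failures:
            hard.append(i)
    return hard


# Toy data: four images, three hypothetical surrogate classifiers.
labels = [0, 1, 2, 3]
predictions = {
    "surrogate_a": [0, 5, 2, 7],
    "surrogate_b": [0, 1, 9, 7],
    "surrogate_c": [0, 6, 2, 7],
}
hard_indices = select_hard_images(predictions, labels, min_failures=2)
print(hard_indices)
```

Raising `min_failures` makes the retained set smaller and harder; the actual threshold and surrogate models used for ImageNet-D are not stated in this record.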
2403.18660
Report |
InstructBrush: Learning Attention-based Instruction Optimization for Image Editing |
Ruoyu Zhao, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Wei Wu, Pengcheng Xu, Mingrui Zhu, Nannan Wang, Xinbo Gao |
In recent years, instruction-based image editing methods have garnered
significant attention in image editing. However, despite encompassing a wide
range of editing priors, these methods are helpless when handling editing tasks
that are challenging to accurately describe through language. We propose
InstructBrush, an inversion method for instruction-based image editing methods
to bridge this gap. It extracts editing effects from exemplar image pairs as
editing instructions, which are further applied for image editing. Two key
techniques are introduced into InstructBrush, Attention-based Instruction
Optimization and Transformation-oriented Instruction Initialization, to address
the limitations of the previous method in terms of inversion effects and
instruction generalization. To explore the ability of instruction inversion
methods to guide image editing in open scenarios, we establish a
Transformation-Oriented Paired Benchmark (TOP-Bench), which contains a rich set
of scenes and editing types. The creation of this benchmark paves the way for
further exploration of instruction inversion. Quantitatively and qualitatively,
our approach achieves superior performance in editing and is more semantically
consistent with the target editing effects. |
Proposes InstructBrush, a novel method to extract editing instructions from exemplar image pairs for image editing, addressing the limitations of language in describing complex editing tasks. |
Instruction-based image editing methods, while powerful, struggle with edits that are challenging to express through language. Instruction inversion, learning instructions from visual examples, offers a solution. |
Introduces Attention-based Instruction Optimization, directly optimizing instructions within the cross-attention layers of a diffusion model for enhanced representation. Also proposes Transformation-oriented Instruction Initialization, incorporating editing-specific priors by identifying unique phrases differentiating before-and-after edit images. |
Outperforms existing methods in both local and global image editing tasks.
Demonstrates superior instruction generalization, avoiding the introduction of irrelevant content from training images.
Achieves higher scores on quantitative metrics such as PSNR, SSIM, LPIPS, and CLIP directional similarity. |
Editing capabilities are limited by the prior of the base instruction-based editing model.
Effectiveness of Transformation-oriented Instruction Initialization is dependent on the vocabulary used for unique phrase extraction. |
image editing, prompt inversion, diffusion models, instruction learning, visual prompts |
2403.18551
Report |
Attention Calibration for Disentangled Text-to-Image Personalization |
Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang |
Recent thrilling progress in large-scale text-to-image (T2I) models has
unlocked unprecedented synthesis quality of AI-generated content (AIGC)
including image generation, 3D and video composition. Further, personalized
techniques enable appealing customized production of a novel concept given only
several images as reference. However, an intriguing problem persists: Is it
possible to capture multiple, novel concepts from one single reference image?
In this paper, we identify that existing approaches fail to preserve visual
consistency with the reference image and eliminate cross-influence from
concepts. To alleviate this, we propose an attention calibration mechanism to
improve the concept-level understanding of the T2I model. Specifically, we
first introduce new learnable modifiers bound with classes to capture
attributes of multiple concepts. Then, the classes are separated and
strengthened following the activation of the cross-attention operation,
ensuring comprehensive and self-contained concepts. Additionally, we suppress
the attention activation of different classes to mitigate mutual influence
among concepts. Together, our proposed method, dubbed DisenDiff, can learn
disentangled multiple concepts from one single image and produce novel
customized images with learned concepts. We demonstrate that our method
outperforms the current state of the art in both qualitative and quantitative
evaluations. More importantly, our proposed techniques are compatible with LoRA
and inpainting pipelines, enabling more interactive experiences. |
This paper introduces DisenDiff, a personalized text-to-image generation model that can learn multiple novel concepts from a single image and use them to generate novel images with these concepts in various contexts while preserving high fidelity to the original image. |
Existing personalized text-to-image generation methods struggle to capture and independently manipulate multiple concepts from a single image, limiting their flexibility for creative editing and content creation. |
DisenDiff achieves this by introducing an attention calibration mechanism that binds new word embeddings to corresponding class tokens and uses a separate-and-strengthen strategy to ensure distinct attention maps for different concepts during training. |
DisenDiff outperforms state-of-the-art methods in both qualitative and quantitative evaluations, demonstrating superior image fidelity and editing capabilities.
The proposed attention calibration mechanism, specifically the binding and separate-and-strengthen constraints, is crucial for achieving high-fidelity and disentangled concept representation.
DisenDiff is compatible with LoRA and inpainting pipelines, showcasing its potential for broader applications like personalized image editing and concept manipulation. |
Disentangling fine-grained categories within the same semantic class (e.g., two dog breeds) remains challenging.
Extending the method to handle more than three concepts effectively requires further algorithmic development. |
text-to-image generation, personalized image synthesis, concept learning, attention mechanism, disentanglement |
2403.18493
Report |
VersaT2I: Improving Text-to-Image Models with Versatile Reward |
Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang |
Recent text-to-image (T2I) models have benefited from large-scale and
high-quality data, demonstrating impressive performance. However, these T2I
models still struggle to produce images that are aesthetically pleasing,
geometrically accurate, faithful to text, and of good low-level quality. We
present VersaT2I, a versatile training framework that can boost the performance
of any T2I model with multiple rewards. We decompose the quality of the image
into several aspects such as aesthetics, text-image alignment, geometry,
low-level quality, etc. Then, for each quality aspect, we select images
generated by the model that score highly on that aspect as the training set to finetune
the T2I model using Low-Rank Adaptation (LoRA). Furthermore, we introduce a
gating function to combine multiple quality aspects, which can avoid conflicts
between different quality aspects. Our method is easy to extend and does not
require any manual annotation, reinforcement learning, or model architecture
changes. Extensive experiments demonstrate that VersaT2I outperforms the
baseline methods across various quality criteria. |
This paper introduces VersaT2I, a novel training framework to improve text-to-image (T2I) models by incorporating various reward signals without relying on resource-intensive reinforcement learning. |
Existing T2I models often struggle to generate images that are aesthetically pleasing, geometrically accurate, and faithful to the input text. VersaT2I aims to address these limitations and improve the overall quality of generated images. |
VersaT2I decomposes image quality into four aspects: aesthetics, text-image alignment, geometry, and low-level quality. It leverages pre-trained evaluation models for each aspect to select high-quality images generated by the T2I model. These selected images form a training set used to fine-tune the model using LoRA. Further, a novel Mixture of LoRA (MoL) approach combines multiple LoRA models trained on different aspects, improving the model's overall performance. |
VersaT2I outperforms baseline methods, including direct LoRA merging and RL approaches, across various quality metrics.
Single-reward LoRA models fine-tuned using VersaT2I show significant improvements in respective evaluation benchmarks for both SD v2.1 and SDXL.
MoL successfully alleviates conflicts between different LoRAs, leading to consistent improvement in overall image quality. |
The current implementation of VersaT2I relies on a limited number of predefined aspects and their corresponding evaluation models. Exploring a wider range of quality aspects and fine-grained annotations could further enhance the framework.
Future work could focus on mitigating the potential societal impact of improved T2I models, such as the generation of deepfakes and manipulated content. |
text-to-image generation, generative models, diffusion models, low-rank adaptation (lora), reward learning |
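The idea of combining several aspect-specific LoRA updates through a gating function can be illustrated with a small NumPy sketch. This is not the paper's MoL implementation: the softmax gate, the per-sample gating on the layer input, and all names (`mixture_of_lora`, `w_gate`) are assumptions chosen to show one plausible form of such a mixture.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mixture_of_lora(x, w_base, loras, w_gate):
    """Base linear layer plus a gate-weighted sum of low-rank updates.

    x: (batch, d_in); w_base: (d_in, d_out);
    loras: list of (A, B) with A: (d_in, r) and B: (r, d_out);
    w_gate: (d_in, n_loras), a hypothetical learned gating matrix.
    """
    gates = softmax(x @ w_gate)  # (batch, n_loras), each row sums to 1
    out = x @ w_base
    for i, (a, b) in enumerate(loras):
        out = out + gates[:, i:i + 1] * (x @ a @ b)
    return out, gates


rng = np.random.default_rng(0)
d_in, d_out, rank, n_loras = 6, 5, 2, 3
x = rng.standard_normal((4, d_in))
w_base = rng.standard_normal((d_in, d_out))
loras = [
    (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out)))
    for _ in range(n_loras)
]
w_gate = rng.standard_normal((d_in, n_loras))
out, gates = mixture_of_lora(x, w_base, loras, w_gate)
print(out.shape, gates.sum(axis=-1))
```

Because the gate weights are normalized, no single aspect's update can dominate unchecked, which is one way a gate could reduce conflicts between LoRAs trained on different rewards.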
2403.18476
Report |
Modeling uncertainty for Gaussian Splatting |
Luca Savant, Diego Valsesia, Enrico Magli |
We present Stochastic Gaussian Splatting (SGS): the first framework for
uncertainty estimation using Gaussian Splatting (GS). GS recently advanced the
novel-view synthesis field by achieving impressive reconstruction quality at a
fraction of the computational cost of Neural Radiance Fields (NeRF). However,
contrary to the latter, it still lacks the ability to provide information about
the confidence associated with its outputs. To address this limitation, in
this paper, we introduce a Variational Inference-based approach that seamlessly
integrates uncertainty prediction into the common rendering pipeline of GS.
Additionally, we introduce the Area Under Sparsification Error (AUSE) as a new
term in the loss function, enabling optimization of uncertainty estimation
alongside image reconstruction. Experimental results on the LLFF dataset
demonstrate that our method outperforms existing approaches in terms of both
image rendering quality and uncertainty estimation accuracy. Overall, our
framework equips practitioners with valuable insights into the reliability of
synthesized views, facilitating safer decision-making in real-world
applications. |
Introduces Stochastic Gaussian Splatting (SGS), the first framework for uncertainty estimation using Gaussian Splatting (GS), enabling real-time synthesis of high-quality images with accurate uncertainty predictions. |
Gaussian Splatting lacks a mechanism for estimating uncertainty in synthesized views, crucial for real-world applications requiring reliability assessments. |
Employs Variational Inference to learn parameters of the GS radiance field in a Bayesian framework, incorporating uncertainty prediction into the rendering pipeline. Introduces Area Under Sparsification Error (AUSE) for optimizing uncertainty estimation alongside image reconstruction. Leverages Empirical Bayes for informative prior initialization. |
Significantly improves rendering quality metrics (PSNR, SSIM, LPIPS) compared to state-of-the-art methods on the LLFF dataset.
Achieves superior uncertainty estimation accuracy, measured by AUSE RMSE, compared to existing approaches.
Demonstrates the effectiveness of the AUSE loss term in enhancing uncertainty map prediction. |
Independence assumption between Gaussian kernels, though more general than previous works, might be limiting.
Exploration of alternative uncertainty estimation metrics beyond AUSE for potential further improvements. |
gaussian splatting, uncertainty estimation, novel view synthesis, variational inference, ause |
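For readers unfamiliar with the Area Under Sparsification Error (AUSE) mentioned above, the following sketch computes it as an evaluation metric in the commonly used way: pixels are removed in order of decreasing predicted uncertainty, the mean error of the remaining pixels is tracked, and the result is compared against an oracle that removes pixels in order of decreasing true error. This is an assumption-laden illustration, not the paper's differentiable loss term; the function names and the number of steps are hypothetical.

```python
import numpy as np


def sparsification_curve(errors, order, fractions):
    """Mean error of the pixels kept after removing the first `fraction` of `order`."""
    sorted_err = errors[order]
    n = len(sorted_err)
    return np.array([sorted_err[int(f * n):].mean() for f in fractions])


def ause(errors, uncertainty, num_steps=20):
    """Area between the uncertainty-based and oracle sparsification curves."""
    errors = np.asarray(errors, dtype=float).ravel()
    uncertainty = np.asarray(uncertainty, dtype=float).ravel()
    # Fractions in [0, 1): at least one pixel is always kept.
    fractions = np.linspace(0.0, 1.0, num_steps, endpoint=False)
    by_uncertainty = np.argsort(-uncertainty)  # most uncertain pixels removed first
    by_error = np.argsort(-errors)             # oracle: largest errors removed first
    gap = (sparsification_curve(errors, by_uncertainty, fractions)
           - sparsification_curve(errors, by_error, fractions))
    # Trapezoidal integration of the gap over the removed fraction.
    return float(np.sum((gap[1:] + gap[:-1]) / 2.0 * np.diff(fractions)))


rng = np.random.default_rng(0)
errors = rng.random(1000)
print(ause(errors, errors))           # uncertainty ranks pixels exactly like the error
print(ause(errors, rng.random(1000))) # uninformative uncertainty
```

A lower value means the predicted uncertainty ranks pixels more like the true error does; an uncertainty map that orders pixels identically to the error gives zero.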
2403.18417
Report |
ECNet: Effective Controllable Text-to-Image Diffusion Models |
Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li |
Conditional text-to-image diffusion models have garnered significant
attention in recent years. However, the precision of these models is often
compromised for two main reasons: ambiguous condition input and inadequate
condition guidance from a single denoising loss. To address these challenges, we
introduce two innovative solutions. Firstly, we propose a Spatial Guidance
Injector (SGI) which enhances conditional detail by encoding text inputs with
precise annotation information. This method directly tackles the issue of
ambiguous control inputs by providing clear, annotated guidance to the model.
Secondly, to overcome the issue of limited conditional supervision, we
introduce Diffusion Consistency Loss (DCL), which applies supervision on the
denoised latent code at any given time step. This encourages consistency
between the latent code at each time step and the input signal, thereby
enhancing the robustness and accuracy of the output. The combination of SGI and
DCL results in our Effective Controllable Network (ECNet), which offers a more
accurate controllable end-to-end text-to-image generation framework with a more
precise conditioning input and stronger controllable supervision. We validate
our approach through extensive experiments on generation under various
conditions, such as human body skeletons, facial landmarks, and sketches of
general objects. The results consistently demonstrate that our method
significantly enhances the controllability and robustness of the generated
images, outperforming existing state-of-the-art controllable text-to-image
models. |
This paper introduces ECNet, a novel framework for controllable text-to-image generation that leverages precise annotation information alongside text descriptions and a new Diffusion Consistency Loss (DCL). |
Existing controllable text-to-image diffusion models often lack precision due to ambiguous condition inputs and inadequate condition guidance. |
ECNet employs a Spatial Guidance Injector (SGI) to combine annotations with text for precise control. It introduces DCL to supervise the denoised latent code at each time step, ensuring consistency with the input signal. |
ECNet achieves state-of-the-art performance on skeleton control tasks, surpassing HumanSD and ControlNet in metrics like AP and CAP.
It demonstrates superior accuracy in facial landmark control tasks, exhibiting significant improvements in NME scores compared to baselines.
ECNet effectively handles sketch control tasks, showcasing its versatility and capability in generating images from various conditions. |
The effectiveness of ECNet's supervision relies on accurate annotation detection, which could be affected by detector performance.
The evaluation of ECNet is limited in scope, lacking comprehensive testing across diverse conditions and scenarios. |
text-to-image generation, diffusion models, controllable generation, spatial guidance injector, diffusion consistency loss |
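Supervising "the denoised latent code at any given time step" requires an estimate of the clean latent from a noisy one. The sketch below uses the standard DDPM relation x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps to recover that estimate and then applies a simple mean-squared consistency term. The `extract_condition` callable, the MSE form of the loss, and all names are assumptions for illustration; the record does not specify how ECNet's DCL is actually computed.

```python
import numpy as np


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean latent from a noisy latent and the predicted noise (DDPM relation)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def consistency_loss(x_t, eps_pred, alpha_bar_t, target, extract_condition):
    """MSE between a condition extracted from the denoised estimate and the input condition.

    `extract_condition` is a hypothetical stand-in for a detector applied to the estimate.
    """
    x0_hat = predict_x0(x_t, eps_pred, alpha_bar_t)
    return float(np.mean((extract_condition(x0_hat) - target) ** 2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(x0.shape)
alpha_bar_t = 0.3  # cumulative noise-schedule product at some time step (toy value)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def identity(z):
    # Toy condition extractor: compares latents directly.
    return z


loss_perfect = consistency_loss(x_t, eps, alpha_bar_t, x0, identity)
loss_bad = consistency_loss(x_t, np.zeros_like(eps), alpha_bar_t, x0, identity)
print(loss_perfect, loss_bad)
```

When the noise prediction is exact, the estimate equals the clean latent and the loss vanishes; a poor noise prediction yields a larger loss, which is the signal such a consistency term would provide at every time step.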
2403.18361
Report |
ViTAR: Vision Transformer with Any Resolution |
Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang |
This paper tackles a significant challenge faced by Vision Transformers
(ViTs): their constrained scalability across different image resolutions.
Typically, ViTs experience a performance decline when processing resolutions
different from those seen during training. Our work introduces two key
innovations to address this issue. Firstly, we propose a novel module for
dynamic resolution adjustment, designed with a single Transformer block,
specifically to achieve highly efficient incremental token integration.
Secondly, we introduce fuzzy positional encoding in the Vision Transformer to
provide consistent positional awareness across multiple resolutions, thereby
preventing overfitting to any single training resolution. Our resulting model,
ViTAR (Vision Transformer with Any Resolution), demonstrates impressive
adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and
80.4% accuracy at a 4032x4032 resolution, all while reducing computational
costs. ViTAR also shows strong performance in downstream tasks such as instance
and semantic segmentation and can easily be combined with self-supervised learning
techniques like Masked AutoEncoder. Our work provides a cost-effective solution
for enhancing the resolution scalability of ViTs, paving the way for more
versatile and efficient high-resolution image processing. |
This paper introduces ViTAR (Vision Transformer with Any Resolution) to enhance the scalability of Vision Transformers (ViTs) across different image resolutions. |
Existing ViTs often suffer performance degradation when processing resolutions different from training data, limiting their real-world applicability. |
ViTAR incorporates two key innovations: (1) Adaptive Token Merger (ATM) for efficient incremental token integration across resolutions, and (2) Fuzzy Positional Encoding (FPE) to enhance positional awareness consistency across resolutions. |
ViTAR achieves strong resolution generalization, reaching 83.3% top-1 accuracy at 1120x1120 and 80.4% at 4032x4032 resolution while reducing computational costs.
ViTAR demonstrates robust performance in downstream tasks like instance and semantic segmentation.
The model effectively combines with self-supervised learning techniques like Masked AutoEncoder (MAE). |
The impact of varying the number of iterations in ATM across different tasks and datasets requires further exploration.
Investigating the effectiveness of FPE in other self-supervised learning frameworks beyond MAE is a promising direction. |
vision transformer, multi-resolution, positional encoding, adaptive token merger, self-supervised learning |
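The fuzzy positional encoding described for ViTAR can be illustrated with a small sketch. It assumes that "fuzzy" means jittering each token's grid coordinate by a uniform offset in [-0.5, 0.5] during training and reading a learned embedding table by bilinear interpolation; the report does not spell out these details, so the function names, the offset range, and the table layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def fuzzy_positions(height, width, rng=None):
    """Return (height*width, 2) token coordinates; jittered when rng is given."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float64)
    if rng is not None:  # assumed training mode: add uniform offsets
        coords += rng.uniform(-0.5, 0.5, size=coords.shape)
    return coords

def sample_pos_embed(table, coords):
    """Bilinearly interpolate a (H, W, D) embedding table at fractional coords."""
    h, w, _ = table.shape
    y = np.clip(coords[:, 0], 0, h - 1)
    x = np.clip(coords[:, 1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (y - y0)[:, None]
    wx = (x - x0)[:, None]
    top = (1 - wx) * table[y0, x0] + wx * table[y0, x1]
    bottom = (1 - wx) * table[y1, x0] + wx * table[y1, x1]
    return (1 - wy) * top + wy * bottom

rng = np.random.default_rng(0)
table = rng.normal(size=(4, 4, 8))          # hypothetical learned table
exact = sample_pos_embed(table, fuzzy_positions(4, 4))        # inference
fuzzy = sample_pos_embed(table, fuzzy_positions(4, 4, rng))   # training
```

Without jitter the lookup returns the table entries unchanged; with jitter each token sees a slightly blended embedding, which is one plausible way to avoid tying the model to exact grid positions.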
2403.18036
Report |
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance |
Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang |
Despite significant advancements in text-to-motion synthesis, generating
language-guided human motion within 3D environments poses substantial
challenges. These challenges stem primarily from (i) the absence of powerful
generative models capable of jointly modeling natural language, 3D scenes, and
human motion, and (ii) the generative models' intensive data requirements
contrasted with the scarcity of comprehensive, high-quality,
language-scene-motion datasets. To tackle these issues, we introduce a novel
two-stage framework that employs scene affordance as an intermediate
representation, effectively linking 3D scene grounding and conditional motion
generation. Our framework comprises an Affordance Diffusion Model (ADM) for
predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model
(AMDM) for generating plausible human motions. By leveraging scene affordance
maps, our method overcomes the difficulty in generating human motion under
multimodal condition signals, especially when training with limited data
lacking extensive language-scene-motion pairs. Our extensive experiments
demonstrate that our approach consistently outperforms all baselines on
established benchmarks, including HumanML3D and HUMANISE. Additionally, we
validate our model's exceptional generalization capabilities on a specially
curated evaluation set featuring previously unseen descriptions and scenes. |
This paper introduces a novel two-stage model for generating human motion in 3D scenes guided by language descriptions, using scene affordance maps as an intermediate representation to connect scene grounding with motion generation. |
Generating realistic human-scene interactions within 3D environments from language instructions is challenging due to the complexity of joint modeling and the scarcity of comprehensive language-scene-motion datasets. |
The model consists of two stages: an Affordance Diffusion Model (ADM) predicts affordance maps from scene point clouds and language descriptions, and an Affordance-to-Motion Diffusion Model (AMDM) synthesizes human motions conditioned on the predicted affordance maps and language. |
The method outperforms baselines in text-to-motion generation on HumanML3D and scene-aware motion generation on HUMANISE datasets.
It exhibits strong generalization ability, generating plausible motions for novel language-scene pairs.
Using scene affordance as an intermediate representation enhances both scene grounding and motion detail. |
The model's reliance on diffusion models leads to slower inference times, which can be addressed in future work.
Although the use of affordance maps alleviates the data scarcity issue, collecting more diverse and comprehensive language-scene-motion data remains crucial. |
human-scene interaction, motion generation, scene affordance, diffusion model, 3d scene understanding |
2403.18035
Report |
Bidirectional Consistency Models |
Liangchen Li, Jiajun He |
Diffusion models (DMs) are capable of generating remarkably high-quality
samples by iteratively denoising a random vector, a process that corresponds to
moving along the probability flow ordinary differential equation (PF ODE).
Interestingly, DMs can also invert an input image to noise by moving backward
along the PF ODE, a key operation for downstream tasks such as interpolation
and image editing. However, the iterative nature of this process restricts its
speed, hindering its broader application. Recently, Consistency Models (CMs)
have emerged to address this challenge by approximating the integral of the PF
ODE, largely reducing the number of iterations. Yet, the absence of an explicit
ODE solver complicates the inversion process. To resolve this, we introduce the
Bidirectional Consistency Model (BCM), which learns a single neural network
that enables both forward and backward traversal along the PF ODE, efficiently
unifying generation and inversion tasks within one framework. Notably, our
proposed method enables one-step generation and inversion while also allowing
the use of additional steps to enhance generation quality or reduce
reconstruction error. Furthermore, by leveraging our model's bidirectional
consistency, we introduce a sampling strategy that can enhance FID while
preserving the generated image content. We further showcase our model's
capabilities in several downstream tasks, such as interpolation and inpainting,
and present demonstrations of potential applications, including blind
restoration of compressed images and defending against black-box adversarial attacks. |
The paper introduces Bidirectional Consistency Model (BCM), which learns a single neural network for both forward and backward traversal along the Probability Flow ODE, unifying generation and inversion tasks for diffusion models within a single framework. |
Diffusion models are powerful generative models but their iterative nature for generation and inversion limits their speed. This paper aims to accelerate both tasks while maintaining or improving quality. |
The paper extends Consistency Models by learning a bidirectional mapping between points on the same trajectory of the PF ODE. It introduces Bidirectional Consistency Training (BCT) that combines a consistency term with a soft trajectory constraint. The model enables one-step generation and inversion and also supports multi-step sampling strategies like ancestral and zigzag sampling. |
BCM achieves comparable or better generation quality than earlier diffusion models with a significantly smaller number of function evaluations (NFEs).
BCM can achieve lower reconstruction error than ODE-based diffusion models with significantly fewer NFEs.
BCM enables applications like image interpolation between real images, superior inpainting, blind restoration of compressed images, and defending against black-box adversarial attacks. |
While multi-step sampling in BCM improves results, the gains plateau quickly beyond a certain point due to error accumulation.
The inversion process can sometimes alter image content, impacting downstream applications. |
diffusion models, generative models, image generation, image inversion, consistency models |
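The idea of one function that moves a point both forward and backward along a PF ODE trajectory (the BCM report above) can be checked on a toy case with a closed-form solution. The sketch assumes Gaussian data N(0, sigma_data^2 I) and the noising rule x_t = x_0 + t * eps, so that p_t = N(0, (sigma_data^2 + t^2) I) and the PF ODE is dx/dt = t * x / (sigma_data^2 + t^2). The analytic map `traverse` stands in for the learned network; it is not the BCM model, and the noise-level values are illustrative.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed standard deviation of the toy data distribution

def traverse(x, t_from, t_to, sigma_data=SIGMA_DATA):
    """Move x from noise level t_from to t_to along the toy PF ODE trajectory."""
    scale = np.sqrt((sigma_data**2 + t_to**2) / (sigma_data**2 + t_from**2))
    return x * scale

x0 = np.array([0.3, -0.7, 1.1])
noise = traverse(x0, 0.0, 80.0)             # one-step "inversion" to noise
recon = traverse(noise, 80.0, 0.0)          # one-step "generation" back
two_step = traverse(traverse(noise, 80.0, 1.0), 1.0, 0.0)  # multi-step path
```

Here inversion followed by generation returns the input, and going through an intermediate noise level gives the same end point as a single step. A learned model only approximates these identities, which is consistent with the reported use of extra steps to reduce reconstruction error.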
2403.17998
Report |
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval |
Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao |
The increasing prevalence of video clips has sparked growing interest in
text-video retrieval. Recent advances focus on establishing a joint embedding
space for text and video, relying on consistent embedding representations to
compute similarity. However, the text content in existing datasets is generally
short and concise, making it hard to fully describe the redundant semantics of
a video. Correspondingly, a single text embedding may not be expressive enough to
capture the video embedding and support retrieval. In this study, we
propose a new stochastic text modeling method T-MASS, i.e., text is modeled as
a stochastic embedding, to enrich text embedding with a flexible and resilient
semantic range, yielding a text mass. To be specific, we introduce a
similarity-aware radius module to adapt the scale of the text mass upon the
given text-video pairs. Plus, we design and develop a support text
regularization to further control the text mass during the training. The
inference pipeline is also tailored to fully exploit the text mass for accurate
retrieval. Empirical evidence suggests that T-MASS not only effectively
attracts relevant text-video pairs while distancing irrelevant ones, but also
enables the determination of precise text embeddings for relevant pairs. Our
experimental results show a substantial improvement of T-MASS over the baseline (3%
to 6.3% by R@1). Also, T-MASS achieves state-of-the-art performance on five
benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades. |
This paper presents T-MASS, a new stochastic text modeling approach for text-video retrieval, enhancing text embedding with a resilient semantic range to better capture video clues. |
Current methods struggle to align short, semantically limited text with the rich content of videos, hindering accurate retrieval. |
T-MASS models text as a "mass" using stochastic embedding, incorporating a similarity-aware radius module for scale adaptation and support text regularization for position and scale control. |
T-MASS effectively bridges relevant pairs while distancing irrelevant ones.
The method facilitates precise text semantics mapping, adapting to video variations.
T-MASS achieves state-of-the-art performance on five benchmark datasets, surpassing baselines by 3-6.3% on R@1. |
The study primarily focuses on text embedding without extensively exploring advanced video feature extraction techniques.
Further research could investigate the incorporation of additional modalities, like audio, to enhance retrieval accuracy. |
text-video retrieval, stochastic text modeling, text mass, similarity-aware radius, support text regularization |
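A minimal sketch of the "text mass" idea from the T-MASS report: the text embedding becomes the center of a distribution, and samples are drawn around it with a radius that depends on text-video similarity. The reparameterized form `text + radius * noise` and the radius rule `exp(mean cosine similarity)` are assumptions made for illustration; the report only states that a similarity-aware radius module adapts the scale.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def sample_text_mass(text_emb, frame_embs, rng, n_samples=4):
    """Draw stochastic text embeddings around text_emb.

    The radius rule below is an assumed, non-learned stand-in for the
    similarity-aware radius module.
    """
    sims = np.array([cosine(text_emb, f) for f in frame_embs])
    radius = np.exp(sims.mean())
    noise = rng.standard_normal((n_samples, text_emb.shape[0]))
    return text_emb + radius * noise, radius

rng = np.random.default_rng(0)
text_emb = rng.normal(size=16)         # hypothetical text embedding
frame_embs = rng.normal(size=(8, 16))  # hypothetical per-frame video embeddings
samples, radius = sample_text_mass(text_emb, frame_embs, rng)
```

At retrieval time one could score a video against several samples and keep the best match, which is one way to read "exploit the text mass", although the report does not specify the inference rule.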
2403.17935
Report |
OmniVid: A Generative Framework for Universal Video Understanding |
Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang |
The core of video understanding tasks, such as recognition, captioning, and
tracking, is to automatically detect objects or actions in a video and analyze
their temporal evolution. Despite sharing a common goal, different tasks often
rely on distinct model architectures and annotation formats. In contrast,
natural language processing benefits from a unified output space, i.e., text
sequences, which simplifies the training of powerful foundational language
models, such as GPT-3, with extensive training corpora. Inspired by this, we
seek to unify the output space of video understanding tasks by using languages
as labels and additionally introducing time and box tokens. In this way, a
variety of video tasks could be formulated as video-grounded token generation.
This enables us to address various types of video tasks, including
classification (such as action recognition), captioning (covering clip
captioning, video question answering, and dense video captioning), and
localization tasks (such as visual object tracking) within a fully shared
encoder-decoder architecture, following a generative framework. Through
comprehensive experiments, we demonstrate such a simple and straightforward
idea is quite effective and can achieve state-of-the-art or competitive results
on seven video benchmarks, providing a novel perspective for more universal
video understanding. Code is available at https://github.com/wangjk666/OmniVid. |
This paper introduces OmniVid, a generative framework that unifies various video understanding tasks by representing the output as a sequence of tokens from an enriched vocabulary, encompassing words, time tokens, and box tokens. |
Existing video understanding models typically rely on task-specific architectures and annotations, hindering generalization. OmniVid addresses this limitation by unifying the output space, enabling a single framework to handle diverse video tasks. |
OmniVid utilizes an encoder-decoder architecture. A video encoder extracts features, while a language encoder processes prompts. A novel Mixed Q-former aggregates frame features into content, sentence, and box queries. A token decoder generates the final token sequence based on the multimodal input. |
OmniVid achieves state-of-the-art performance on multiple video benchmarks, including action recognition (83.6% on Kinetics-400), clip captioning (56.6 CIDEr on MSRVTT), and dense video captioning (5.6 SODA_c on ActivityNet).
The framework effectively handles both coarse-grained tasks like action recognition and fine-grained tasks like object tracking.
Jointly training the model across different video tasks shows promising results for classification and captioning while revealing challenges for localization tasks. |
Joint training of OmniVid for spatial-temporal localization tasks currently shows performance degradation compared to separate training.
Sparse frame sampling in dense video captioning can lead to overlooking subtle activity changes. |
video understanding, generative model, unified framework, video captioning, object tracking |
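To make the OmniVid notion of time and box tokens concrete, the sketch below quantizes normalized coordinates and timestamps into discrete bins and renders them as vocabulary entries. The token spellings (`<box_k>`, `<time_k>`), the bin counts, and the helper names are hypothetical; the report says only that the vocabulary is enriched with time and box tokens.

```python
def quantize(value, n_bins):
    """Map a normalized value in [0, 1] to an integer bin in [0, n_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * n_bins), n_bins - 1)

def box_tokens(box, width, height, n_bins=1000):
    """Turn a pixel-space (x0, y0, x1, y1) box into four box tokens."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    return [f"<box_{quantize(v, n_bins)}>" for v in normalized]

def time_tokens(start_s, end_s, duration_s, n_bins=100):
    """Turn a (start, end) segment in seconds into two time tokens."""
    return [f"<time_{quantize(t / duration_s, n_bins)}>" for t in (start_s, end_s)]

tracking_target = box_tokens((0, 0, 320, 240), width=640, height=480)
event_span = time_tokens(0.0, 30.0, duration_s=30.0)
```

With such a scheme a tracking result or a dense caption segment becomes an ordinary token sequence, so the same decoder can emit words, times, and boxes.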
2403.17931
Report |
Track Everything Everywhere Fast and Robustly |
Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis |
We propose a novel test-time optimization approach for efficiently and
robustly tracking any pixel at any time in a video. The latest state-of-the-art
optimization-based tracking technique, OmniMotion, requires a prohibitively
long optimization time, rendering it impractical for downstream applications.
OmniMotion is sensitive to the choice of random seeds, leading to unstable
convergence. To improve efficiency and robustness, we introduce a novel
invertible deformation network, CaDeX++, which factorizes the function
representation into a local spatial-temporal feature grid and enhances the
expressivity of the coupling blocks with non-linear functions. While CaDeX++
incorporates a stronger geometric bias within its architectural design, it also
takes advantage of the inductive bias provided by the vision foundation models.
Our system utilizes monocular depth estimation to represent scene geometry and
enhances the objective by incorporating DINOv2 long-term semantics to regulate
the optimization process. Our experiments demonstrate a substantial improvement
in training speed (more than 10 times faster), robustness, and
accuracy in tracking over the SoTA optimization-based method OmniMotion. |
This paper introduces an optimization-based approach for fast and robust tracking of any pixel in a video, improving upon the efficiency and robustness of OmniMotion. |
Long-term pixel tracking is fundamental for various computer vision tasks, but existing methods struggle with efficiency, robustness, or accuracy. This work addresses these limitations. |
The authors propose CaDeX++, an invertible deformation network with local feature grid factorization and non-linear interpolation. They leverage monocular depth estimation (ZoeDepth) for geometry initialization and integrate DINOv2 semantics for long-term correspondence. |
CaDeX++ significantly improves training speed (over 10 times faster) compared to OmniMotion.
The proposed method achieves higher accuracy and robustness in tracking, particularly in challenging scenarios with occlusions and complex motions.
Depth prior initialization and long-term semantic integration are shown to contribute significantly to the performance gains. |
The method's performance heavily depends on the accuracy of the input depth and pixel correspondences.
Future work includes exploring the application of CaDeX++ to other tasks like 3D reconstruction and object pose estimation. |
pixel tracking, long-term tracking, test-time optimization, invertible deformation network, vision foundation models |
2403.17924
Report |
AID: Attention Interpolation of Text-to-Image Diffusion |
Qiyuan He, Jinghao Wang, Ziwei Liu, Angela Yao |
Conditional diffusion models can create unseen images in various settings,
aiding image interpolation. Interpolation in latent spaces is well-studied, but
interpolation with specific conditions like text or poses is less understood.
Simple approaches, such as linear interpolation in the space of conditions,
often result in images that lack consistency, smoothness, and fidelity. To that
end, we introduce a novel training-free technique named Attention Interpolation
via Diffusion (AID). Our key contributions include 1) proposing an inner/outer
interpolated attention layer; 2) fusing the interpolated attention with
self-attention to boost fidelity; and 3) applying a beta distribution to
selection to increase smoothness. We also present a variant, Prompt-guided
Attention Interpolation via Diffusion (PAID), that considers interpolation as a
condition-dependent generative process. This method enables the creation of new
images with greater consistency, smoothness, and efficiency, and offers control
over the exact path of interpolation. Our approach demonstrates effectiveness
for conceptual and spatial interpolation. Code and demo are available at
https://github.com/QY-H00/attention-interpolation-diffusion. |
This paper proposes AID, a training-free technique for text-to-image diffusion models, enabling nuanced spatial and conceptual interpolations between images with different text prompts. |
Existing methods for interpolation in the latent space fail to generate consistent, smooth, and high-fidelity images when interpolating between distinct textual conditions. |
AID introduces an inner/outer interpolated attention layer, fuses it with self-attention, and utilizes beta distribution for sequence selection to enhance interpolation quality. It also presents PAID, a variant allowing prompt-guided interpolation paths. |
AID significantly improves smoothness, consistency, and fidelity of interpolated image sequences compared to text embedding interpolation.
Inner attention interpolation (AID-I) excels in conceptual blending, while outer attention interpolation (AID-O) is superior in spatial blending.
Prompt guidance in PAID enables the generation of compositional scenes and offers control over the interpolation path. |
The selection of optimal hyperparameters for beta distribution requires Bayesian optimization, adding computational overhead.
The effectiveness of prompt guidance can be sensitive to the choice of warm-up steps. |
text-to-image synthesis, diffusion models, image interpolation, attention mechanism, prompt engineering |
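The difference between inner and outer attention interpolation in the AID report can be sketched as follows. The definitions used here are an assumption: "inner" blends the keys and values of the two prompts before attention, and "outer" blends the two attention outputs. The fusion with self-attention and the beta-distribution selection are not shown.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for 2D query, key, and value arrays."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def outer_interp(q, k1, v1, k2, v2, t):
    """Assumed AID-O: interpolate the outputs of two attention calls."""
    return (1 - t) * attention(q, k1, v1) + t * attention(q, k2, v2)

def inner_interp(q, k1, v1, k2, v2, t):
    """Assumed AID-I: interpolate keys and values, then attend once."""
    return attention(q, (1 - t) * k1 + t * k2, (1 - t) * v1 + t * v2)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))                                   # image queries
k1, v1 = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))     # prompt A
k2, v2 = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))     # prompt B
mid_outer = outer_interp(q, k1, v1, k2, v2, 0.5)
mid_inner = inner_interp(q, k1, v1, k2, v2, 0.5)
```

Both variants reduce to plain attention on one prompt at t = 0 or t = 1 and differ only in between, which fits the reported split between conceptual and spatial blending.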
2403.17898
Report |
Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians |
Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai |
The recent 3D Gaussian splatting (3D-GS) has shown remarkable rendering
fidelity and efficiency compared to NeRF-based neural scene representations.
While demonstrating the potential for real-time rendering, 3D-GS encounters
rendering bottlenecks in large scenes with complex details due to an excessive
number of Gaussian primitives located within the viewing frustum. This
limitation is particularly noticeable in zoom-out views and can lead to
inconsistent rendering speeds in scenes with varying details. Moreover, it
often struggles to capture the corresponding level of details at different
scales with its heuristic density control operation. Inspired by the
Level-of-Detail (LOD) techniques, we introduce Octree-GS, featuring an
LOD-structured 3D Gaussian approach supporting level-of-detail decomposition
for scene representation that contributes to the final rendering results. Our
model dynamically selects the appropriate level from the set of
multi-resolution anchor points, ensuring consistent rendering performance with
adaptive LOD adjustments while maintaining high-fidelity rendering results. |
Octree-GS introduces a novel Level-of-Detail (LOD) structure to 3D Gaussian Splatting using an octree for hierarchical organization of anchor Gaussians, enabling consistent real-time rendering in large scenes. |
Existing 3D Gaussian Splatting methods struggle with inconsistent rendering speeds and compromised quality in large, detail-rich scenes due to the lack of LOD awareness. |
An octree partitions the scene, assigning anchor Gaussians to LOD levels based on observation distance and scene richness. Progressive training refines anchors, and opacity blending ensures smooth LOD transitions during rendering. |
Octree-GS achieves competitive rendering quality with significantly fewer Gaussian primitives compared to baselines, leading to faster rendering.
The method effectively handles multi-resolution datasets and addresses aliasing issues inherent in previous approaches.
Ablation studies demonstrate the effectiveness of the LOD structure, adaptive anchor control, and progressive training. |
The octree construction and progressive training require hyperparameter tuning.
Future work includes addressing the inherent limitations of 3D-GS, such as dependency on initial sparse point clouds and lack of geometry support. |
neural scene rendering, 3d gaussian splatting, consistent real-time rendering, level-of-detail, octree |
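The distance-based level selection described for Octree-GS can be sketched with a simple rule: each halving of the observation distance unlocks one finer level. The logarithmic rule, the clamping, and the names `lod_level` and `active_anchors` are assumptions for illustration; the report states only that anchors are assigned to LOD levels based on observation distance and scene richness.

```python
import math

def lod_level(distance, d_max, num_levels):
    """Pick an LOD index in [0, num_levels - 1]; closer views get finer levels."""
    raw = math.log2(d_max / max(distance, 1e-9))
    return int(min(max(math.floor(raw), 0), num_levels - 1))

def active_anchors(anchor_levels, distance, d_max, num_levels):
    """Return indices of anchors whose level is at or below the selected one."""
    level = lod_level(distance, d_max, num_levels)
    return [i for i, lvl in enumerate(anchor_levels) if lvl <= level]

anchor_levels = [0, 1, 2, 3, 0]   # hypothetical per-anchor LOD assignments
far_view = active_anchors(anchor_levels, distance=8.0, d_max=8.0, num_levels=4)
near_view = active_anchors(anchor_levels, distance=2.0, d_max=8.0, num_levels=4)
```

A zoomed-out view then renders only the coarse anchors, which bounds the number of primitives in the frustum regardless of scene size.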
2403.17888
Report |
2D Gaussian Splatting for Geometrically Accurate Radiance Fields |
Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, Shenghua Gao |
3D Gaussian Splatting (3DGS) has recently revolutionized radiance field
reconstruction, achieving high quality novel view synthesis and fast rendering
speed without baking. However, 3DGS fails to accurately represent surfaces due
to the multi-view inconsistent nature of 3D Gaussians. We present 2D Gaussian
Splatting (2DGS), a novel approach to model and reconstruct geometrically
accurate radiance fields from multi-view images. Our key idea is to collapse
the 3D volume into a set of 2D oriented planar Gaussian disks. Unlike 3D
Gaussians, 2D Gaussians provide view-consistent geometry while modeling
surfaces intrinsically. To accurately recover thin surfaces and achieve stable
optimization, we introduce a perspective-accurate 2D splatting process
utilizing ray-splat intersection and rasterization. Additionally, we
incorporate depth distortion and normal consistency terms to further enhance
the quality of the reconstructions. We demonstrate that our differentiable
renderer allows for noise-free and detailed geometry reconstruction while
maintaining competitive appearance quality, fast training speed, and real-time
rendering. Our code will be made publicly available. |
This paper introduces 2D Gaussian Splatting (2DGS), a novel method for reconstructing geometrically accurate radiance fields from multi-view images, using 2D oriented planar Gaussian disks as primitives. |
Existing methods like 3D Gaussian Splatting (3DGS) struggle to accurately capture intricate surface details. This new approach aims to improve geometric accuracy in radiance field reconstruction while maintaining high-quality novel view synthesis. |
The method utilizes 2D Gaussian primitives, employs a perspective-accurate 2D splatting process leveraging ray-splat intersection and rasterization, and incorporates depth distortion and normal consistency terms to enhance reconstruction quality. |
2DGS achieves state-of-the-art geometry reconstruction compared to other explicit representation methods on DTU and Tanks and Temples datasets.
It offers competitive novel view synthesis results compared to leading implicit and explicit methods on the Mip-NeRF360 dataset.
The method boasts significantly faster reconstruction times, approximately 100 times faster than implicit methods and more than 3 times faster than concurrent work. |
2DGS assumes surfaces with full opacity, potentially causing inaccuracies when handling semi-transparent surfaces.
The current densification strategy might not adequately represent fine geometric details in texture-less regions, requiring further investigation. |
novel view synthesis, radiance fields, surface reconstruction, 2d gaussian splatting, differentiable rendering |
2403.17870
Report |
Boosting Diffusion Models with Moving Average Sampling in Frequency Domain |
Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei |
Diffusion models have recently brought a powerful revolution in image
generation. Despite showing impressive generative capabilities, most of these
models rely on the current sample to denoise the next one, possibly resulting
in denoising instability. In this paper, we reinterpret the iterative denoising
process as model optimization and leverage a moving average mechanism to
ensemble all the prior samples. Instead of simply applying moving average to
the denoised samples at different timesteps, we first map the denoised samples
to data space and then perform moving average to avoid distribution shift
across timesteps. Given that diffusion models progressively recover images from
low-frequency components to high-frequency details, we further decompose the
samples into different frequency components and execute moving average
separately on each component. We name the complete approach "Moving Average
Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into
mainstream pre-trained diffusion models and sampling schedules. Extensive
experiments on both unconditional and conditional diffusion models demonstrate
that our MASF leads to superior performances compared to the baselines, with
almost negligible additional complexity cost. |
This paper introduces MASF (Moving Average Sampling in Frequency domain), a training-free method to enhance the stability of diffusion models during image generation. |
Existing diffusion models often suffer from denoising instability due to relying solely on the current sample for denoising and not fully exploiting frequency evolution during generation. |
MASF reinterprets denoising as model optimization and utilizes moving average on prior samples in the data space. It then leverages DWT to apply moving average separately on different frequency components, further enhanced by a dynamic weighting scheme that prioritizes low-frequency components initially and gradually shifts focus to high-frequency details. |
MASF consistently improves FID scores across various datasets (ImageNet, MS-COCO, LSUN, FFHQ), especially for smaller NFEs where instability is more prominent.
MASF is compatible with different solvers (DDIM, DPM-Solver++, UniPC, F-PNDM) and sampling techniques like Classifier Guidance, demonstrating its generalizability.
Ablation studies confirm the effectiveness of each component in MASF, with moving average in the frequency domain and dynamic weighting contributing most significantly. |
The paper primarily focuses on image generation; MASF has not yet been explored for other diffusion model applications.
Exploring more sophisticated frequency decomposition techniques beyond DWT might further enhance MASF. |
diffusion models, image generation, denoising stability, moving average, frequency domain |
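The MASF recipe above (map to data space, decompose by frequency, average each band separately) can be sketched with a single-level Haar transform. The report names DWT but not the wavelet, so the Haar choice is an assumption, as are the fixed per-band weights, which stand in for the dynamic weighting scheme. `x0_pred` denotes a hypothetical data-space prediction at the current timestep.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def masf_update(avg_bands, x0_pred, weights):
    """Blend a new data-space prediction into per-band running averages."""
    new_bands = haar_dwt2(x0_pred)
    if avg_bands is None:          # first step: nothing to average yet
        return new_bands
    return tuple(w * new + (1 - w) * old
                 for w, new, old in zip(weights, new_bands, avg_bands))

rng = np.random.default_rng(0)
weights = (0.9, 0.5, 0.5, 0.5)     # assumed: trust new low-frequency content more
bands = None
for _ in range(3):                 # stand-in for three denoising steps
    x0_pred = rng.normal(size=(8, 8))
    bands = masf_update(bands, x0_pred, weights)
smoothed = haar_idwt2(*bands)      # averaged prediction back in pixel space
```

In a sampler, `smoothed` would replace the raw prediction before the next step is computed; how that step is formed depends on the solver and is outside this sketch.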
2403.17839
Report |
ReMamber: Referring Image Segmentation with Mamba Twister |
Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, Yanfeng Wang |
Referring Image Segmentation (RIS) leveraging transformers has achieved great
success on the interpretation of complex visual-language tasks. However, the
quadratic computation cost makes it resource-consuming in capturing long-range
visual-language dependencies. Fortunately, Mamba addresses this with efficient
linear complexity in processing. However, directly applying Mamba to
multi-modal interactions presents challenges, primarily due to inadequate
channel interactions for the effective fusion of multi-modal data. In this
paper, we propose ReMamber, a novel RIS architecture that integrates the power
of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly
models image-text interaction, and fuses textual and visual features through
its unique channel and spatial twisting mechanism. We achieve
state-of-the-art results on three challenging benchmarks. Moreover, we conduct thorough
analyses of ReMamber and discuss other fusion designs using Mamba. These
provide valuable perspectives for future research. |
This paper presents ReMamber, a novel architecture for Referring Image Segmentation (RIS) that leverages the Mamba framework for efficient and effective multi-modal understanding. |
Existing transformer-based RIS models face limitations in efficiently capturing long-range visual-language dependencies due to quadratic computation costs. ReMamber addresses this by utilizing Mamba, which offers linear complexity. |
ReMamber employs Mamba Twister blocks, consisting of visual state space (VSS) layers and a Twisting layer. The VSS layers process spatial features, while the Twisting layer injects textual information via global and local interactions, enhancing cross-modality communication using a twisting mechanism. |
ReMamber achieves state-of-the-art results on three challenging RIS benchmarks: RefCOCO, RefCOCO+, and G-Ref.
The proposed Mamba Twister outperforms other multi-modal fusion designs, including attention-based, in-context, and norm adaptation approaches.
Ablation studies highlight the importance of both Channel and Spatial Scans within the twisting mechanism for effective modality fusion. |
The current segmentation decoder uses a simple convolutional design, which could be improved by exploring more sophisticated multi-modal decoders.
Further research is needed to address the sub-optimal compatibility of cross-attention mechanisms within the Mamba architecture. |
referring image segmentation, multi-modal understanding, mamba architecture, state space models, vision-language fusion |
2403.17823
Report |
Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders |
Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck |
Self-supervised pre-training of image encoders is omnipresent in the
literature, particularly following the introduction of Masked autoencoders
(MAE). Current efforts attempt to learn object-centric representations from
motion in videos. In particular, SiamMAE recently introduced a Siamese network,
training a shared-weight encoder from two frames of a video with a high
asymmetric masking ratio (95%). In this work, we propose CropMAE, an
alternative approach to the Siamese pre-training introduced by SiamMAE. Our
method specifically differs by exclusively considering pairs of cropped images
sourced from the same image but cropped differently, deviating from the
conventional pairs of frames extracted from a video. CropMAE therefore
alleviates the need for video datasets, while maintaining competitive
performances and drastically reducing pre-training time. Furthermore, we
demonstrate that CropMAE learns similar object-centric representations without
explicit motion, showing that current self-supervised learning methods do not
learn objects from motion, but rather thanks to the Siamese architecture.
Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling
the reconstruction of images using only two visible patches. Our code is
available at https://github.com/alexandre-eymael/CropMAE. |
This paper introduces CropMAE, a self-supervised pre-training method that uses cropped image pairs with high asymmetric masking to learn object-centric representations, eliminating the need for video data. |
Siamese MAEs rely on video data and extensive training; CropMAE addresses these limitations by enabling faster and more efficient pre-training on image datasets while achieving competitive performance. |
The method trains a Siamese ViT encoder-decoder to reconstruct a highly masked random crop of an image using another crop as reference, explores different cropping strategies, and pushes the masking ratio to 98.5%. |
CropMAE achieves faster pre-training and better performance on DAVIS-2017 object propagation than SiamMAE trained on K400.
It learns object-centric representations from still images without explicit motion, challenging the assumption that motion is essential for such representations.
It shows that extremely high masking ratios (98.5%) with only two visible patches are effective, exceeding previous limits. |
Scalability to larger models and datasets requires further investigation.
Understanding the unique contributions of video frames beyond still images for pre-training is crucial. |
self-supervised learning, masked autoencoders, siamese networks, image pre-training, video segmentation |
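The 98.5% masking ratio and the "two visible patches" figure in the CropMAE report above can be checked with a little arithmetic. The sketch below is illustrative only: the 224 x 224 crop size, the 16 x 16 patch size, and the helper name `random_keep_indices` are assumptions, not details taken from the report.

```python
import numpy as np


def random_keep_indices(num_patches, mask_ratio, rng):
    """Return the sorted indices of patches left visible by random masking."""
    # Truncate toward zero, so a fractional patch count is rounded down.
    num_keep = int(num_patches * (1.0 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep])


# Assumed setup (not stated in the report): 224x224 crop, 16x16 patches.
num_patches = (224 // 16) ** 2  # 14 * 14 patches
rng = np.random.default_rng(0)
visible = random_keep_indices(num_patches, mask_ratio=0.985, rng=rng)
print(len(visible), "visible patches out of", num_patches)
```

Under these assumed sizes, only a couple of patch tokens of the masked crop reach the encoder, so the reference crop has to supply nearly all the information needed for reconstruction.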
2403.17804
Report |
Improving Text-to-Image Consistency via Automatic Prompt Optimization |
Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal |
Impressive advances in text-to-image (T2I) generative models have yielded a
plethora of high performing models which are able to generate aesthetically
appealing, photorealistic images. Despite the progress, these models still
struggle to produce images that are consistent with the input prompt,
oftentimes failing to capture object quantities, relations and attributes
properly. Existing solutions to improve prompt-image consistency suffer from
the following challenges: (1) they oftentimes require model fine-tuning, (2)
they only focus on nearby prompt samples, and (3) they are affected by
unfavorable trade-offs among image quality, representation diversity, and
prompt-image consistency. In this paper, we address these challenges and
introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a
large language model (LLM) to improve prompt-image consistency in T2I models.
Our framework starts from a user prompt and iteratively generates revised
prompts with the goal of maximizing a consistency score. Our extensive
validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost
the initial consistency score by up to 24.9% in terms of DSG score while
preserving the FID and increasing the recall between generated and real data.
Our work paves the way toward building more reliable and robust T2I systems by
harnessing the power of LLMs. |
This paper introduces OPT2I, the first text-to-image (T2I) optimization-by-prompting framework, designed to enhance prompt-image consistency. |
Existing methods for improving consistency often require modifying model weights, limiting their applicability. OPT2I addresses this by working exclusively in text space, making it compatible with various T2I models, even those accessible only through APIs. |
OPT2I employs an iterative process involving a pre-trained T2I model, a large language model (LLM), and a consistency metric (e.g., decomposed CLIPScore or Davidsonian Scene Graph). The LLM refines user prompts by leveraging past prompt-score pairs to generate alternatives that maximize consistency. |
OPT2I consistently improves prompt-image consistency, outperforming paraphrasing baselines and achieving up to 24.9% improvement over user prompts.
The framework demonstrates robustness across various LLMs, T2I models, and consistency metrics.
Qualitative analysis reveals that OPT2I emphasizes initially ignored visual elements by either adding detail or strategically reordering prompt components. |
The method relies on the reliability of prompt-image consistency scores, which can be inaccurate due to limitations in current metrics (e.g., bag-of-words behavior in CLIP).
The iterative optimization process introduces runtime overhead compared to directly using the user prompt. |
text-to-image generation, prompt optimization, large language models, prompt-image consistency, in-context learning |
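The iterative loop described in the OPT2I report above (propose revised prompts from past prompt-score pairs, score them, keep the best) can be sketched as follows. This is a toy sketch, not the authors' implementation: `toy_score` stands in for "generate images with a T2I model and compute a consistency metric", `toy_propose` stands in for the LLM, and all names and defaults here are hypothetical.

```python
REQUIRED = ("two", "red", "cubes")  # toy stand-in for the elements a prompt should express


def toy_score(prompt):
    """Hypothetical consistency score in [0, 1]: fraction of required words present."""
    words = prompt.lower().split()
    return sum(w in words for w in REQUIRED) / len(REQUIRED)


def toy_propose(ranked_history, n):
    """Hypothetical 'LLM': extend the best prompt so far with missing elements."""
    best_prompt = ranked_history[-1][0]
    words = best_prompt.lower().split()
    missing = [w for w in REQUIRED if w not in words]
    return [f"{best_prompt} {w}" for w in missing[:n]]


def optimize_prompt(user_prompt, propose, score, iterations=3, per_iter=2):
    """Iteratively revise a prompt to maximize a consistency score."""
    history = [(user_prompt, score(user_prompt))]
    for _ in range(iterations):
        # Show the proposer past (prompt, score) pairs, best last.
        ranked = sorted(history, key=lambda ps: ps[1])
        for candidate in propose(ranked, per_iter):
            history.append((candidate, score(candidate)))
    return max(history, key=lambda ps: ps[1])


best_prompt, best_score = optimize_prompt("cubes on a table", toy_propose, toy_score)
print(best_prompt, best_score)
```

Because the loop only reads and writes text and scores, a real proposer and scorer could wrap API-only models, which matches the report's point that no weight access is required.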
2403.17782
Report |
GenesisTex: Adapting Image Denoising Diffusion to Texture Space |
Chenjian Gao, Boyan Jiang, Xinghui Li, Yingpeng Zhang, Qian Yu |
We present GenesisTex, a novel method for synthesizing textures for 3D
geometries from text descriptions. GenesisTex adapts the pretrained image
diffusion model to texture space by texture space sampling. Specifically, we
maintain a latent texture map for each viewpoint, which is updated with
predicted noise on the rendering of the corresponding viewpoint. The sampled
latent texture maps are then decoded into a final texture map. During the
sampling process, we focus on both global and local consistency across multiple
viewpoints: global consistency is achieved through the integration of style
consistency mechanisms within the noise prediction network, and low-level
consistency is achieved by dynamically aligning latent textures. Finally, we
apply reference-based inpainting and img2img on denser views for texture
refinement. Our approach overcomes the limitations of slow optimization in
distillation-based methods and instability in inpainting-based methods.
Experiments on meshes from various sources demonstrate that our method
surpasses the baseline methods quantitatively and qualitatively. |
Presents GenesisTex, a novel method for synthesizing textures on 3D geometries from text descriptions using texture space sampling in an image diffusion model. |
Addresses limitations of existing methods (slow optimization in distillation-based and instability in inpainting-based) for generating high-quality textures directly from text input. |
Adapts a pretrained image diffusion model (Stable Diffusion) to texture space. It utilizes texture space sampling for multi-view consistent generation, enhanced by style consistency mechanisms and dynamic alignment. Further refinement is achieved through reference-based inpainting and Img2Img on denser views. |
Achieves state-of-the-art texture synthesis quality, surpassing baselines in FID/KID metrics and user studies.
Generates detailed, clean, and naturally colored textures for diverse geometries within minutes.
Demonstrates the effectiveness of texture space sampling, style consistency, and dynamic alignment in achieving multi-view consistency. |
Significant memory cost limits the number of viewpoints during generation, requiring post-processing steps.
Future work could explore hierarchical style consistency to reduce memory cost and investigate texture map generation compatible with PBR workflows. |
texture synthesis, text-to-3d, image diffusion models, multi-view consistency, 3d content generation |
2403.17765
Report |
MUTE-SLAM: Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations |
Yifan Yan, Ruomin He, Zhenghua Liu |
We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing
multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM
effectively tracks camera positions and incrementally builds a scalable
multi-map representation for both small and large indoor environments. It
dynamically allocates sub-maps for newly observed local regions, enabling
constraint-free mapping without prior scene information. Unlike traditional
grid-based methods, we use three orthogonal axis-aligned planes for
hash-encoding scene properties, significantly reducing hash collisions and the
number of trainable parameters. This hybrid approach not only speeds up
convergence but also enhances the fidelity of surface reconstruction.
Furthermore, our optimization strategy concurrently optimizes all sub-maps
intersecting with the current camera frustum, ensuring global consistency.
Extensive testing on both real-world and synthetic datasets has shown that
MUTE-SLAM delivers state-of-the-art surface reconstruction quality and
competitive tracking performance across diverse indoor settings. The code will
be made public upon acceptance of the paper. |
Introduces MUTE-SLAM, a real-time neural RGB-D SLAM system that uses multiple tri-plane hash-encodings for efficient, scalable scene representation and detailed mapping of unknown indoor environments. |
Existing neural implicit SLAM methods struggle with scalability and often require pre-defined scene boundaries, limiting their use in large and unknown environments. |
The system dynamically allocates sub-maps with tri-plane hash-encoding for new regions. It jointly optimizes all currently observed sub-maps and camera poses, and employs global bundle adjustment for consistency. |
Achieves state-of-the-art surface reconstruction quality on Replica, surpassing baselines in detail preservation.
Demonstrates competitive tracking performance on ScanNet and TUM-RGBD, outperforming some methods even without pre-defined boundaries.
Exhibits strong scalability on the large-scale Apartment dataset, maintaining efficient run-time performance. |
Remains sensitive to illumination changes and depth measurement inaccuracies inherent to RGB-D sensors.
Global bundle adjustment, based on random keyframe sampling, may inadequately optimize less frequently observed areas, potentially impacting reconstruction in those regions. |
slam, neural implicit representation, tri-plane encoding, hash-encoding, multi-map representation |
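The tri-plane hash-encoding idea in the MUTE-SLAM report above (project a 3D point onto three axis-aligned planes and look up features through a 2D hash per plane) can be illustrated roughly as below. Everything specific here is an assumption made for illustration: the class name `TriPlaneHash`, nearest-cell lookup without interpolation, a single resolution, the hashing constant, and concatenation of the three plane features.

```python
import numpy as np

# Large odd multiplier for the second coordinate; a common choice in spatial hashing.
HASH_PRIME = 2654435761


def hash2d(ix, iy, table_size):
    """Map integer 2D cell coordinates to a slot in a fixed-size table."""
    return (ix ^ (iy * HASH_PRIME)) % table_size


class TriPlaneHash:
    """Toy tri-plane encoder: one hashed feature table per axis-aligned plane."""

    def __init__(self, table_size=2**10, feat_dim=4, resolution=64, seed=0):
        rng = np.random.default_rng(seed)
        # Three tables, one each for the xy, xz, and yz planes.
        self.tables = rng.normal(scale=1e-2, size=(3, table_size, feat_dim))
        self.table_size = table_size
        self.resolution = resolution

    def encode(self, point):
        """Encode a point in [0, 1]^3 by concatenating its three plane features."""
        cell = np.floor(np.asarray(point) * (self.resolution - 1)).astype(np.int64)
        x, y, z = (int(v) for v in cell)
        planes = ((x, y), (x, z), (y, z))
        feats = [
            self.tables[k][hash2d(a, b, self.table_size)]
            for k, (a, b) in enumerate(planes)
        ]
        return np.concatenate(feats)


encoder = TriPlaneHash()
f_low = encoder.encode((0.2, 0.4, 0.1))
f_high = encoder.encode((0.2, 0.4, 0.9))
print(f_low.shape)
```

Two points that differ only in z share the same xy-plane feature in this sketch. Three planes at resolution R index on the order of 3R^2 cells rather than the R^3 cells of a dense grid, which is the intuition behind the reported reduction in parameters and hash collisions.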
2403.17695
Report |
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition |
Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, Elliot J. Crowley |
We present PlainMamba: a simple non-hierarchical state space model (SSM)
designed for general visual recognition. The recent Mamba model has shown how
SSMs can be highly competitive with other architectures on sequential data and
initial attempts have been made to apply it to images. In this paper, we
further adapt the selective scanning process of Mamba to the visual domain,
enhancing its ability to learn features from two-dimensional images by (i) a
continuous 2D scanning process that improves spatial continuity by ensuring
adjacency of tokens in the scanning sequence, and (ii) direction-aware updating
which enables the model to discern the spatial relations of tokens by encoding
directional information. Our architecture is designed to be easy to use and
easy to scale, formed by stacking identical PlainMamba blocks, resulting in a
model with constant width throughout all layers. The architecture is further
simplified by removing the need for special tokens. We evaluate PlainMamba on a
variety of visual recognition tasks including image classification, semantic
segmentation, object detection, and instance segmentation. Our method achieves
performance gains over previous non-hierarchical models and is competitive with
hierarchical alternatives. For tasks requiring high-resolution inputs, in
particular, PlainMamba requires much less computing while maintaining high
performance. Code and models are available at
https://github.com/ChenhongyiYang/PlainMamba |
This work introduces PlainMamba, a simple non-hierarchical State Space Model (SSM) for visual recognition that enhances the selective scanning process of Mamba for 2D image data processing. |
Plain non-hierarchical visual encoders like ViT are favored for their simplicity and widespread adoption in vision foundation models, offering ease of feature integration across levels and modalities, scalability, and hardware optimization. |
PlainMamba replaces hierarchical structures with identical blocks of constant width, eliminating the need for special tokens. It introduces 'Continuous 2D Scanning' for spatial continuity and 'Direction-Aware Updating' to encode directional information in selective scanning. |
PlainMamba outperforms non-hierarchical counterparts, including SSMs and Transformers, on ImageNet1K classification, COCO object detection/instance segmentation, and ADE20K semantic segmentation.
The model shows competitive performance compared to hierarchical models while maintaining simplicity.
PlainMamba exhibits high efficiency with high-resolution inputs, requiring significantly less computation than ViTs in such cases. |
The model's performance slightly lags behind hierarchical models on tasks that benefit from multi-resolution architectures.
Future work could explore enhancements in efficiency for low-resolution inputs to match ViT's performance in that domain. |
state space models, visual recognition, non-hierarchical architecture, continuous 2d scanning, direction-aware updating |
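The "continuous 2D scanning" property in the PlainMamba report above (consecutive tokens in the scan sequence are spatially adjacent) can be illustrated with a serpentine ordering. This is a minimal sketch of one ordering with that property; the function names are hypothetical, and the paper's actual scan directions may differ.

```python
def raster_scan_order(height, width):
    """Row-major order: jumps from the end of one row to the start of the next."""
    return [(r, c) for r in range(height) for c in range(width)]


def snake_scan_order(height, width):
    """Serpentine order: every other row is reversed, so no jumps occur."""
    order = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def is_spatially_continuous(order):
    """True if each consecutive pair of tokens is 4-neighbour adjacent."""
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(order, order[1:])
    )


print(is_spatially_continuous(raster_scan_order(3, 4)))
print(is_spatially_continuous(snake_scan_order(3, 4)))
```

Both orderings visit every token exactly once. Only the serpentine one keeps neighbouring positions in the sequence neighbouring in the image, which is the adjacency the report refers to.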
2403.17638
Report |
Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency |
Yingjie Xu, Bangzhen Liu, Hao Tang, Bailin Deng, Shengfeng He |
We propose a voxel-based optimization framework, ReVoRF, for few-shot
radiance fields that strategically addresses the unreliability in pseudo novel
view synthesis. Our method pivots on the insight that relative depth
relationships within neighboring regions are more reliable than the absolute
color values in disoccluded areas. Consequently, we devise a bilateral
geometric consistency loss that carefully navigates the trade-off between color
fidelity and geometric accuracy in the context of depth consistency for
uncertain regions. Moreover, we present a reliability-guided learning strategy
to discern and utilize the variable quality across synthesized views,
complemented by a reliability-aware voxel smoothing algorithm that smoothens
the transition between reliable and unreliable data patches. Our approach
allows for a more nuanced use of all available data, promoting enhanced
learning from regions previously considered unsuitable for high-quality
reconstruction. Extensive experiments across diverse datasets reveal that our
approach attains significant gains in efficiency and accuracy, delivering
rendering speeds of 3 FPS, 7 mins to train a 360° scene, and a 5%
improvement in PSNR over existing few-shot methods. Code is available at
https://github.com/HKCLynn/ReVoRF. |
This paper presents ReVoRF, a voxel-based optimization framework for fast few-shot radiance field reconstruction that leverages the relative depth information within unreliable regions of synthesized novel views, enabling enhanced multi-view consistency learning. |
Few-shot NeRF methods struggle to maintain geometric and texture accuracy due to the sparsity of input views. Utilizing unreliable areas in synthesized views, which contain relative depth information, can enhance multi-view consistency and improve reconstruction quality. |
The method involves: 1) Synthesizing novel views from sparse inputs using depth-guided warping. 2) Identifying reliable and unreliable regions in warped views based on pixel correlation. 3) Introducing a bilateral geometric consistency loss that leverages color and density for reliable regions and relative depth for unreliable ones. 4) Employing a reliability-aware voxel smoothing procedure and a learning strategy that prioritizes reliable areas during training. |
ReVoRF achieves state-of-the-art accuracy in PSNR and LPIPS on the Realistic Synthetic 360° dataset.
It demonstrates superior performance in capturing fine details and preserving structural integrity compared to existing methods on both synthetic and real-world datasets.
The method achieves fast reconstruction, with rendering speeds of 3 FPS and a training time of 7 minutes for a 360° scene. |
The voxel-based nature of ReVoRF can lead to the smoothing of fine details in the reconstructed scenes.
The method's performance in highly complex and large-scale scenes remains to be explored. |
neural radiance fields, few-shot learning, view synthesis, 3d reconstruction, unreliability modeling |
2403.17465
Report |
LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection |
Yunpeng Luo, Junlong Du, Ke Yan, Shouhong Ding |
The evolution of Diffusion Models has dramatically improved image generation
quality, making it increasingly difficult to differentiate between real and
generated images. This development, while impressive, also raises significant
privacy and security concerns. In response to this, we propose a novel Latent
REconstruction error guided feature REfinement method (LaRE^2) for detecting
the diffusion-generated images. We come up with the Latent Reconstruction Error
(LaRE), the first reconstruction-error based feature in the latent space for
generated image detection. LaRE surpasses existing methods in terms of feature
extraction efficiency while preserving crucial cues required to differentiate
between the real and the fake. To exploit LaRE, we propose an Error-Guided
feature REfinement module (EGRE), which can refine the image feature guided by
LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an
align-then-refine mechanism, which effectively refines the image feature for
generated-image detection from both spatial and channel perspectives. Extensive
experiments on the large-scale GenImage benchmark demonstrate the superiority
of our LaRE^2, which surpasses the best SoTA method by up to 11.9%/12.1%
average ACC/AP across 8 different image generators. LaRE also surpasses
existing methods in terms of feature extraction cost, delivering an impressive
speed enhancement of 8 times. |
This paper introduces LaRE², a novel method for detecting diffusion-generated images using latent reconstruction errors and an error-guided feature refinement module. |
The rise of highly realistic diffusion models necessitates robust detection methods to address privacy and security concerns arising from the potential misuse of generated images. |
LaRE² extracts Latent Reconstruction Error (LaRE) in the latent space through single-step reconstruction. Then, it uses an Error-guided Feature REfinement module (EGRE) to refine image features spatially and channel-wise based on LaRE, improving discriminative capability for generated image detection. |
LaRE² significantly outperforms existing methods, achieving up to 11.9%/12.1% ACC/AP gain on the large-scale GenImage benchmark.
LaRE feature extraction is 8 times faster than previous reconstruction-based methods.
Ablation studies confirm the effectiveness of EGRE and the robustness of LaRE² to hyperparameter choices like noise ensemble size and sample step. |
The model's generalizability to entirely unseen diffusion models or future, more advanced generative models needs further investigation.
Further research can explore incorporating class-specific prompts or leveraging textual information for more informative LaRE extraction. |
diffusion model, image generation, image forensics, reconstruction error, feature refinement |
2403.17422
Report |
InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion |
Jihyun Lee, Shunsuke Saito, Giljoo Nam, Minhyuk Sung, Tae-Kyun Kim |
We present InterHandGen, a novel framework that learns the generative prior
of two-hand interaction. Sampling from our model yields plausible and diverse
two-hand shapes in close interaction with or without an object. Our prior can
be incorporated into any optimization or learning methods to reduce ambiguity
in an ill-posed setup. Our key observation is that directly modeling the joint
distribution of multiple instances imposes high learning complexity due to its
combinatorial nature. Thus, we propose to decompose the modeling of joint
distribution into the modeling of factored unconditional and conditional single
instance distribution. In particular, we introduce a diffusion model that
learns the single-hand distribution both unconditionally and conditioned on the
other hand via conditioning dropout. For sampling, we combine anti-penetration and
classifier-free guidance to enable plausible generation. Furthermore, we
establish the rigorous evaluation protocol of two-hand synthesis, where our
method significantly outperforms baseline generative models in terms of
plausibility and diversity. We also demonstrate that our diffusion prior can
boost the performance of two-hand reconstruction from monocular in-the-wild
images, achieving new state-of-the-art accuracy. |
This paper introduces InterHandGen, a novel framework that learns a generative prior of two-hand interactions, enabling the generation of plausible and diverse two-hand shapes with or without an object. |
Modeling two-hand interactions is crucial for capturing human behavior, with applications in AR/VR and HCI. Existing methods primarily focus on reconstruction, while generative modeling remains underexplored. |
The framework decomposes the complex joint distribution of two hands into unconditional and conditional single-hand distributions, learned using a cascaded diffusion model with conditioning dropout. It employs anti-penetration and classifier-free guidance during inference to ensure plausibility and diversity. |
InterHandGen outperforms baseline generative models in terms of plausibility and diversity, as measured by newly introduced two-hand interaction generation metrics.
The framework effectively generalizes to two-hand and object interactions, demonstrating superior performance on the ARCTIC dataset.
Integrating the learned prior into downstream tasks, such as monocular two-hand reconstruction, results in improved accuracy, achieving state-of-the-art results. |
The current prior, while effective for two-hand interactions, does not yet offer a significant advantage as a universal hand prior across all hand-related tasks.
Future work includes exploring temporal extensions for generating hand interaction sequences and expanding the framework to other interaction synthesis problems beyond hands. |
two-hand interaction, generative prior, diffusion model, cascaded inference, hand pose estimation |
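The decomposition described in the InterHandGen report above is the chain rule applied to the two-hand joint distribution. The symbols $x_l$ and $x_r$ (left-hand and right-hand parameters) are notation assumed here for illustration, not taken from the report:

```latex
p(x_l, x_r) \;=\; p(x_l)\, p(x_r \mid x_l),
\qquad
x_l \sim p(x_l), \quad x_r \sim p(x_r \mid x_l).
```

Read this way, cascaded sampling first draws one hand from the unconditional model and then draws the other hand conditioned on it. Training with conditioning dropout lets a single network serve as both factors.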
2403.17410
Report |
On permutation-invariant neural networks |
Masanari Kimura, Ryotaro Shimizu, Yuki Hirakawa, Ryosuke Goto, Yuki Saito |
Conventional machine learning algorithms have traditionally been designed
under the assumption that input data follows a vector-based format, with an
emphasis on vector-centric paradigms. However, as the demand for tasks
involving set-based inputs has grown, there has been a paradigm shift in the
research community towards addressing these challenges. In recent years, the
emergence of neural network architectures such as Deep Sets and Transformers
has presented a significant advancement in the treatment of set-based data.
These architectures are specifically engineered to naturally accommodate sets
as input, enabling more effective representation and processing of set
structures. Consequently, there has been a surge of research endeavors
dedicated to exploring and harnessing the capabilities of these architectures
for various tasks involving the approximation of set functions. This
comprehensive survey aims to provide an overview of the diverse problem
settings and ongoing research efforts pertaining to neural networks that
approximate set functions. By delving into the intricacies of these approaches
and elucidating the associated challenges, the survey aims to equip readers
with a comprehensive understanding of the field. Through this comprehensive
perspective, we hope that researchers can gain valuable insights into the
potential applications, inherent limitations, and future directions of
set-based neural networks. Indeed, from this survey we gain two insights: i)
Deep Sets and its variants can be generalized by differences in the aggregation
function, and ii) the behavior of Deep Sets is sensitive to the choice of the
aggregation function. From these observations, we show that Deep Sets, one of
the well-known permutation-invariant neural networks, can be generalized in the
sense of a quasi-arithmetic mean. |
This paper surveys neural network architectures for approximating set functions. It particularly highlights Deep Sets and its variants, emphasizing their generalization potential through different aggregation functions, especially quasi-arithmetic means. |
With the growing need to process set-based data in machine learning, understanding and improving neural networks capable of handling permutation-invariant inputs like sets is crucial. |
The paper reviews existing architectures like Deep Sets, PointNet, and Set Transformers, analyzing their strengths, limitations, and theoretical properties. It connects them through the lens of Janossy pooling and explores the impact of aggregation functions. It introduces "Hölder's Power Deep Sets", a novel generalization based on power mean, and evaluates its performance on various datasets. |
Deep Sets, PointNet, and Set Transformers can be unified and analyzed under the framework of Janossy pooling.
Theoretical analysis reveals limitations in Deep Sets' expressive power depending on latent space dimensionality and set size.
Experiments show that Hölder's Power Deep Sets, with its power mean aggregation, can outperform standard Deep Sets and PointNet depending on the dataset and optimized power exponent. |
The paper primarily focuses on linear cases for Hölder's Power Deep Sets. Further investigation is needed for non-linear scenarios.
While promising, the proposed generalization requires more extensive experimental validation and theoretical analysis across diverse datasets and tasks. |
set function approximation, permutation invariance, deep sets, janossy pooling, power mean |
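The power-mean aggregation mentioned in the report above can be sketched as a drop-in replacement for the sum or mean pooling of Deep Sets. This is a minimal illustration, not the paper's code: `power_mean`, `power_deep_sets`, the softplus feature map, and the random weights are all assumptions, and features are kept positive so that fractional and negative exponents are well defined.

```python
import numpy as np


def power_mean(features, p, axis=0):
    """Power (Hölder) mean over the set axis; features must be positive."""
    features = np.asarray(features, dtype=float)
    if p == 0:
        # Limit p -> 0 is the geometric mean.
        return np.exp(np.mean(np.log(features), axis=axis))
    return np.mean(features ** p, axis=axis) ** (1.0 / p)


def power_deep_sets(x_set, phi, rho, p):
    """rho(power_mean(phi(x))) with phi applied element-wise to the set."""
    return rho(power_mean(phi(x_set), p))


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # hypothetical feature weights


def phi(x):
    return np.log1p(np.exp(x @ W))  # softplus keeps features positive


def rho(z):
    return float(np.sum(z))


x_set = rng.normal(size=(5, 3))  # a set of five 3-dimensional elements
out = power_deep_sets(x_set, phi, rho, p=2.0)
out_shuffled = power_deep_sets(x_set[::-1], phi, rho, p=2.0)
print(np.isclose(out, out_shuffled))
```

Setting `p=1` gives the arithmetic mean and large `p` approaches max pooling, so the exponent moves between familiar pooling operators. The output is unchanged when the set elements are reordered, which is the permutation invariance the survey is about.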
2403.17377
Report |
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance |
Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, Seungryong Kim |
Recent studies have demonstrated that diffusion models are capable of
generating high-quality samples, but their quality heavily depends on sampling
guidance techniques, such as classifier guidance (CG) and classifier-free
guidance (CFG). These techniques are often not applicable in unconditional
generation or in various downstream tasks such as image restoration. In this
paper, we propose a novel sampling guidance, called Perturbed-Attention
Guidance (PAG), which improves diffusion sample quality across both
unconditional and conditional settings, achieving this without requiring
additional training or the integration of external modules. PAG is designed to
progressively enhance the structure of samples throughout the denoising
process. It involves generating intermediate samples with degraded structure by
substituting selected self-attention maps in diffusion U-Net with an identity
matrix, by considering the self-attention mechanisms' ability to capture
structural information, and guiding the denoising process away from these
degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves
sample quality in conditional and even unconditional scenarios. Moreover, PAG
significantly improves the baseline performance in various downstream tasks
where existing guidances such as CG or CFG cannot be fully utilized, including
ControlNet with empty prompts and image restoration such as inpainting and
deblurring. |
This paper introduces Perturbed-Attention Guidance (PAG), a novel sampling guidance method for diffusion models that enhances sample quality by perturbing the self-attention maps in the model's U-Net architecture. |
Existing guidance methods like Classifier-Free Guidance (CFG) rely on additional training or external modules, are not applicable for unconditional generation, and may decrease sample diversity. PAG addresses these limitations. |
PAG perturbs the self-attention maps in the diffusion U-Net by replacing them with identity matrices, disrupting structural information while preserving appearance. The denoising process is then guided away from this structurally degraded output, toward more structurally coherent samples. |
PAG significantly improves FID and IS scores in both conditional and unconditional image generation with ADM and Stable Diffusion.
PAG complements CFG, leading to further quality improvements when used together.
PAG enhances performance in downstream tasks like image restoration (PSLD) and ControlNet with empty prompts, where CFG is not applicable. |
High guidance scales in PAG can lead to over-saturation, requiring careful scale calibration.
PAG requires two forward passes per generation step, impacting computational efficiency. |
diffusion models, image generation, sampling guidance, self-attention, unconditional generation |
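The two ingredients in the PAG report above can be sketched in a few lines: replacing a self-attention map with the identity, and steering the prediction away from the perturbed one. The sketch uses toy NumPy arrays rather than a real U-Net. The extrapolation form in `pag_guidance`, written by analogy with classifier-free guidance, is an assumption here, as are all function names.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v, perturb=False):
    """Single-head self-attention; with perturb=True the map is the identity."""
    n, d = q.shape
    if perturb:
        attn = np.eye(n)  # each token attends only to itself
    else:
        attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v


def pag_guidance(eps, eps_perturbed, scale):
    """Move the noise prediction away from the perturbed prediction (assumed form)."""
    return eps + scale * (eps - eps_perturbed)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
normal_out = self_attention(q, k, v)
perturbed_out = self_attention(q, k, v, perturb=True)
guided = pag_guidance(normal_out, perturbed_out, scale=2.0)
print(np.allclose(perturbed_out, v))
```

With an identity attention map the layer simply passes the values through, so tokens no longer exchange information. In a real sampler the normal and perturbed predictions each need their own forward pass, consistent with the two-pass cost listed in the limitations.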
2403.17237
Report |
DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion |
Yuanze Lin, Ronald Clark, Philip Torr |
We present DreamPolisher, a novel Gaussian Splatting based method with
geometric guidance, tailored to learn cross-view consistency and intricate
detail from textual descriptions. While recent progress on text-to-3D
generation methods has been promising, prevailing methods often fail to ensure
view-consistency and textural richness. This problem becomes particularly
noticeable for methods that work with text input alone. To address this, we
propose a two-stage Gaussian Splatting based approach that enforces geometric
consistency among views. Initially, a coarse 3D generation undergoes refinement
via geometric optimization. Subsequently, we use a ControlNet driven refiner
coupled with the geometric consistency term to improve both texture fidelity
and overall consistency of the generated 3D asset. Empirical evaluations across
diverse textual prompts spanning various object categories demonstrate the
efficacy of DreamPolisher in generating consistent and realistic 3D objects,
aligning closely with the semantics of the textual instructions. |
DreamPolisher, a novel text-to-3D generation method based on 3D Gaussian Splatting, generates high-quality and view-consistent 3D assets from textual descriptions. |
Existing text-to-3D methods often struggle with view-consistency and lack intricate textural details. DreamPolisher addresses this gap by combining Gaussian Splatting with geometric diffusion and ControlNet refinement. |
Two-stage approach: 1) Coarse optimization learns coarse 3D Gaussians from text using a point cloud diffusion model and ISM loss. 2) Appearance refinement enhances texture and consistency using a ControlNet-driven refiner and a novel view-consistency loss. |
Significantly outperforms existing methods in visual quality and view consistency.
Demonstrates robust generality across diverse object categories (food, vehicles, furniture, etc.).
Generates high-fidelity 3D objects with fine details and accurate geometry. |
Current implementation requires 30 minutes generation time per object.
Relies solely on text prompts, limiting the ability to guide generation using images. |
text-to-3d generation, gaussian splatting, geometric diffusion, controlnet, view consistency |
2403.17213
Report |
AnimateMe: 4D Facial Expressions via Diffusion Models |
Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Stefanos Zafeiriou |
The field of photorealistic 3D avatar reconstruction and generation has
garnered significant attention in recent years; however, animating such avatars
remains challenging. Recent advances in diffusion models have notably enhanced
the capabilities of generative models in 2D animation. In this work, we
directly utilize these models within the 3D domain to achieve controllable and
high-fidelity 4D facial animation. By integrating the strengths of diffusion
processes and geometric deep learning, we employ Graph Neural Networks (GNNs)
as denoising diffusion models in a novel approach, formulating the diffusion
process directly on the mesh space and enabling the generation of 3D facial
expressions. This facilitates the generation of facial deformations through a
mesh-diffusion-based model. Additionally, to ensure temporal coherence in our
animations, we propose a consistent noise sampling method. Under a series of
both quantitative and qualitative experiments, we showcase that the proposed
method outperforms prior work in 4D expression synthesis by generating
high-fidelity extreme expressions. Furthermore, we applied our method to
textured 4D facial expression generation, implementing a straightforward
extension that involves training on a large-scale textured 4D facial expression
database. |
Introduces AnimateMe, the first diffusion-based method for customizable 4D facial expression generation directly on the mesh space using Graph Neural Networks (GNNs) as denoising models. |
Addresses the limitations of prior 4D facial expression generation methods, particularly in producing high-fidelity extreme expressions and capturing fine details, by leveraging the power of diffusion models. |
Presents a novel mesh diffusion process using GNNs to capture mesh structure and introduces a consistent noise sampling strategy for smooth animations. The method is trained on deformations from a neutral mesh and conditioned on expression progression and intensity, enabling customization. |
Achieves state-of-the-art performance on 4D expression synthesis, outperforming previous methods in both quantitative metrics (classification accuracy, specificity) and qualitative evaluations.
Successfully generates high-fidelity extreme expressions, a challenge that previous methods struggled with.
Demonstrates the adaptability of the method by extending it to textured 4D animation on a large-scale dataset, showcasing its potential for realistic and detailed facial animation. |
Reliance on an expression progression signal for conditioning limits versatility.
The diffusion-based approach can be computationally expensive, especially for high-resolution meshes, despite efforts to improve efficiency. |
4d facial expression, diffusion models, graph neural networks, mesh animation, consistent noise sampling |
2403.17064
Report |
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions |
Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Vincent Tao Hu, Björn Ommer |
In recent years, advances in text-to-image (T2I) diffusion models have
substantially elevated the quality of their generated images. However,
achieving fine-grained control over attributes remains a challenge due to the
limitations of natural language prompts (such as no continuous set of
intermediate descriptions existing between "person" and "old person"). Even
though many methods were introduced that augment the model or generation
process to enable such control, methods that do not require a fixed reference
image are limited to either enabling global fine-grained attribute expression
control or coarse attribute expression control localized to specific subjects,
not both simultaneously. We show that there exist directions in the commonly
used token-level CLIP text embeddings that enable fine-grained subject-specific
control of high-level attributes in text-to-image models. Based on this
observation, we introduce one efficient optimization-free and one robust
optimization-based method to identify these directions for specific attributes
from contrastive text prompts. We demonstrate that these directions can be used
to augment the prompt text input with fine-grained control over attributes of
specific subjects in a compositional manner (control over multiple attributes
of a single subject) without having to adapt the diffusion model. Project page:
https://compvis.github.io/attribute-control. Code is available at
https://github.com/CompVis/attribute-control. |
This paper introduces a method for fine-grained, subject-specific attribute control in text-to-image (T2I) generation by identifying semantic directions in token-level CLIP text embeddings, allowing manipulation of attributes like age, style, and even vehicle price. |
Current methods for attribute control in T2I models either offer fine-grained global control or coarse subject-specific control, but not both. This work bridges this gap, enabling nuanced manipulation of specific subjects within complex scenes. |
The method involves two approaches: (1) an optimization-free method that computes differences between CLIP embeddings of contrasting prompts (e.g., "young person" vs. "old person") and (2) a robust learning-based method that trains edit deltas using contrastive prompts to guide a diffusion model's predictions. |
Identified directions in token-level CLIP embeddings effectively control attributes of specific subjects.
Learned edit deltas capture semantic differences and are transferable across different prompts and subjects of similar categories.
The method allows for compositional attribute editing, enabling control over multiple attributes of a single subject or different subjects within a scene. |
The approach is limited by the diffusion model's capacity to disentangle attributes, potentially leading to unwanted correlations.
Future work could explore combining this method with complementary approaches to further reduce attribute mixing between subjects. |
text-to-image synthesis, diffusion models, attribute control, clip embeddings, semantic directions |
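The optimization-free variant described in the entry above (2403.17064) can be illustrated with a minimal sketch. It assumes the direction is the difference between the subject token's embedding in an attribute prompt and in a plain prompt, and that it is added with a continuous scale to that one token only. The random vectors stand in for real CLIP token embeddings, and the function names `attribute_direction` and `apply_direction` are hypothetical, not the authors' API.

```python
import random

def attribute_direction(emb_with_attr, emb_plain):
    """Difference of two token embeddings from contrastive prompts."""
    return [a - b for a, b in zip(emb_with_attr, emb_plain)]

def apply_direction(prompt_emb, subject_idx, direction, scale):
    """Shift only the subject token; all other tokens stay untouched."""
    edited = [list(tok) for tok in prompt_emb]  # copy, do not mutate the input
    edited[subject_idx] = [
        x + scale * d for x, d in zip(edited[subject_idx], direction)
    ]
    return edited

rng = random.Random(0)
dim = 8

def rand_vec():
    # Placeholder for a real token-level text embedding (assumption).
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

emb_plain = rand_vec()      # stand-in: "person" token in a plain prompt
emb_with_attr = rand_vec()  # stand-in: "person" token in an "old person" prompt
delta_age = attribute_direction(emb_with_attr, emb_plain)

prompt_emb = [rand_vec() for _ in range(5)]  # 5 tokens, subject at index 2
edited = apply_direction(prompt_emb, subject_idx=2, direction=delta_age, scale=0.5)
```

Because `scale` is a real number, intermediate attribute strengths are reachable without any intermediate text description, and editing a single token index is what makes the control subject-specific in this sketch.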
2403.17008
Report |
FlashFace: Human Image Personalization with High-fidelity Identity Preservation |
Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo |
This work presents FlashFace, a practical tool with which users can easily
personalize their own photos on the fly by providing one or a few reference
face images and a text prompt. Our approach is distinguishable from existing
human photo customization methods by higher-fidelity identity preservation and
better instruction following, benefiting from two subtle designs. First, we
encode the face identity into a series of feature maps instead of one image
token as in prior arts, allowing the model to retain more details of the
reference faces (e.g., scars, tattoos, and face shape). Second, we introduce a
disentangled integration strategy to balance the text and image guidance during
the text-to-image generation process, alleviating the conflict between the
reference faces and the text prompts (e.g., personalizing an adult into a
"child" or an "elder"). Extensive experimental results demonstrate the
effectiveness of our method on various applications, including human image
personalization, face swapping under language prompts, making virtual
characters into real people, etc. Project Page:
https://jshilong.github.io/flashface-page. |
This paper introduces FlashFace, a novel approach for human image personalization that preserves high-fidelity facial identity and follows text prompts effectively. |
Existing methods for human image customization often struggle to balance preserving detailed facial features with accurately following text instructions, particularly when there's a conflict between the two. |
The paper proposes two key innovations: 1) encoding reference faces into feature maps instead of tokens for detailed preservation and 2) a disentangled integration strategy to balance reference image and text prompt influence during generation. They also introduce a new ID dataset construction pipeline for training. |
FlashFace demonstrates superior identity preservation while effectively incorporating text prompts, even with conflicting instructions (e.g., changing age or gender).
Increasing the number of reference images significantly improves identity fidelity, as evidenced by quantitative metrics and visual comparisons.
Ablation studies highlight the importance of reference attention layer placement in the U-Net decoder and the role of reference strength parameters for fine-tuning generation. |
The method may still produce artifacts in some generated images, suggesting limitations in the base model's capabilities.
Controlling head pose through text prompts remains challenging, indicating an area for future improvement in controllability. |
human image personalization, identity preservation, text-to-image generation, disentangled representation learning, reference attention |
2403.17007
Report |
DreamLIP: Language-Image Pre-training with Long Captions |
Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen |
Language-image pre-training largely relies on how precisely and thoroughly a
text describes its paired image. In practice, however, the contents of an image
can be so rich that well describing them requires lengthy captions (e.g., with
10 sentences), which are usually missing in existing datasets. Consequently,
there is currently no clear evidence on whether and how language-image
pre-training could benefit from long captions. To figure this out, we first
re-caption 30M images with detailed descriptions using a pre-trained
Multi-modality Large Language Model (MLLM), and then study the usage of the
resulting captions under a contrastive learning framework. We observe that
each sentence within a long caption is very likely to describe the image
partially (e.g., an object). Motivated by this, we propose to dynamically
sample sub-captions from the text label to construct multiple positive pairs,
and introduce a grouping loss to match the embeddings of each sub-caption with
its corresponding local image patches in a self-supervised manner. Experimental
results on a wide range of downstream tasks demonstrate the consistent
superiority of our method, termed DreamLIP, over previous alternatives,
highlighting its fine-grained representational capacity. It is noteworthy that,
on the tasks of image-text retrieval and semantic segmentation, our model
trained with 30M image-text pairs achieves on par or even better performance
than CLIP trained with 400M pairs. Project page is available at
https://zyf0619sjtu.github.io/dream-lip. |
This paper studies the use of long captions generated by a pre-trained Multi-modality Large Language Model (MLLM) for improving language-image pre-training. |
Existing language-image pre-training datasets use short captions that fail to capture the richness of real-world images. Long captions can provide a more detailed and comprehensive description, unlocking new potential for semantic understanding. |
The authors re-caption 30M images with detailed descriptions using a pre-trained MLLM. They propose DreamLIP, a framework that dynamically samples sub-captions from the long captions to create multiple positive image-text pairs and utilizes a grouping loss to align sub-captions with their corresponding local image patches. |
DreamLIP consistently outperforms previous state-of-the-art methods on a wide range of downstream tasks including image-text retrieval, semantic segmentation, and image recognition.
Notably, DreamLIP trained on 30M image-text pairs achieves comparable or even superior performance to CLIP trained on 400M pairs for certain tasks.
Analysis demonstrates the effectiveness of long captions and the proposed sampling and alignment strategy in enhancing fine-grained representation learning. |
The work relies on the quality of the MLLM-generated captions, which can be prone to hallucinations.
Future work could explore methods to mitigate the impact of potential hallucinations in long captions. |
language-image pre-training, long captions, multi-modal learning, contrastive learning, fine-grained representation |
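The sub-caption sampling step in the entry above (2403.17007) can be sketched as follows. This is a simplified illustration, assuming sub-captions are individual sentences split on end punctuation and that each sampled sentence forms one positive pair with the same image. The helper names, the `image_id` placeholder, and the example caption are hypothetical, and the grouping loss over local image patches is not modeled here.

```python
import random
import re

def split_sentences(caption):
    """Split a long caption into sentences at '.', '!' or '?' boundaries."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]

def sample_positive_pairs(image_id, long_caption, k, rng):
    """Draw up to k distinct sub-captions and pair each with the same image."""
    sentences = split_sentences(long_caption)
    k = min(k, len(sentences))
    chosen = rng.sample(sentences, k)
    return [(image_id, sentence) for sentence in chosen]

caption = (
    "A brown dog runs across a grassy field. "
    "A red ball lies near a wooden fence. "
    "The sky is overcast with grey clouds."
)
pairs = sample_positive_pairs("img_0", caption, k=2, rng=random.Random(0))
```

Re-sampling at every training step, rather than fixing the sub-captions once, is one way to read "dynamically sample" in the abstract; each pair would then enter the contrastive loss as a positive.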
2403.17005
Report |
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models |
Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei |
Recent advances in text-to-video generation have demonstrated the utility of
powerful diffusion models. Nevertheless, the problem is not trivial when
shaping diffusion models to animate a static image (i.e., image-to-video
generation). The difficulty originates from the aspect that the diffusion
process of subsequent animated frames should not only preserve the faithful
alignment with the given image but also pursue temporal coherence among
adjacent frames. To alleviate this, we present TRIP, a new recipe of
image-to-video diffusion paradigm that pivots on image noise prior derived from
static image to jointly trigger inter-frame relational reasoning and ease the
coherent temporal modeling via temporal residual learning. Technically, the
image noise prior is first attained through one-step backward diffusion process
based on both static image and noised video latent codes. Next, TRIP executes a
residual-like dual-path scheme for noise prediction: 1) a shortcut path that
directly takes image noise prior as the reference noise of each frame to
amplify the alignment between the first frame and subsequent frames; 2) a
residual path that employs 3D-UNet over noised video and static image latent
codes to enable inter-frame relational reasoning, thereby easing the learning
of the residual noise for each frame. Furthermore, both reference and residual
noise of each frame are dynamically merged via attention mechanism for final
video generation. Extensive experiments on WebVid-10M, DTDB and MSR-VTT
datasets demonstrate the effectiveness of our TRIP for image-to-video
generation. Please see our project page at https://trip-i2v.github.io/TRIP/. |
Presents TRIP, a novel image-to-video diffusion model that leverages temporal residual learning with image noise prior for coherent video generation. |
Addresses the challenge of maintaining temporal coherence and alignment with the input image in image-to-video generation. |
Calculates image noise prior from the input image and noisy video latent codes, then uses it as reference for residual noise prediction via a dual-path scheme with a 3D-UNet and a Transformer-based temporal noise fusion module. |
Achieves state-of-the-art performance on WebVid-10M, DTDB, and MSR-VTT datasets, demonstrating superior temporal coherence and visual quality.
Significantly outperforms baselines in terms of frame consistency (F-Consistency) and Frechet Video Distance (FVD).
Shows strong generalization ability for customized image animation, enabling text-to-video generation and integration with image editing models. |
Current implementation focuses on generating relatively short video clips.
Exploring more sophisticated noise scheduling and sampling strategies for further quality improvement. |
image-to-video generation, diffusion models, temporal residual learning, image noise prior, temporal coherence |
2403.17004
Report |
SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer |
Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen |
Diffusion Transformer (DiT) has emerged as the new trend of generative
diffusion models on image generation. In view of extremely slow convergence in
typical DiT, recent breakthroughs have been driven by mask strategy that
significantly improves the training efficiency of DiT with additional
intra-image contextual learning. Despite this progress, mask strategy still
suffers from two inherent limitations: (a) training-inference discrepancy and
(b) fuzzy relations between mask reconstruction & generative diffusion process,
resulting in sub-optimal training of DiT. In this work, we address these
limitations by novelly unleashing the self-supervised discrimination knowledge
to boost DiT training. Technically, we frame our DiT in a teacher-student
manner. The teacher-student discriminative pairs are built on the diffusion
noises along the same Probability Flow Ordinary Differential Equation (PF-ODE).
Instead of applying mask reconstruction loss over both DiT encoder and decoder,
we decouple DiT encoder and decoder to separately tackle discriminative and
generative objectives. In particular, by encoding discriminative pairs with
student and teacher DiT encoders, a new discriminative loss is designed to
encourage the inter-image alignment in the self-supervised embedding space.
After that, student samples are fed into student DiT decoder to perform the
typical generative diffusion task. Extensive experiments are conducted on
ImageNet dataset, and our method achieves a competitive balance between
training cost and generative capacity. |
This paper proposes SD-DiT, a Diffusion Transformer architecture that leverages self-supervised discrimination knowledge distillation to enhance training efficiency and generative capacity. |
Existing Diffusion Transformers suffer from slow convergence and limitations in mask strategies. This paper addresses these by introducing a novel approach to mask modeling based on self-supervised discrimination. |
SD-DiT employs a teacher-student scheme with decoupled encoder-decoder structure. The teacher branch provides discriminative knowledge to the student branch, enhancing the generative diffusion process in the student branch. This is achieved through a novel discriminative loss that encourages inter-image alignment between teacher and student encoders. |
SD-DiT achieves a better balance between training speed and generative performance compared to state-of-the-art DiT models.
SD-DiT demonstrates superior FID scores compared to other DiT-based methods, especially with larger scale backbones.
SD-DiT shows faster convergence speed, achieving comparable performance to other models with significantly fewer training steps. |
The paper mainly focuses on image generation and evaluation on a single dataset (ImageNet).
Exploring different self-supervised learning techniques beyond the teacher-student scheme could be a potential future direction. |
diffusion models, diffusion transformer, self-supervised learning, image generation, mask modeling |
2403.16999
Report |
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models |
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li |
This paper presents Visual CoT, a novel pipeline that leverages the reasoning
capabilities of multi-modal large language models (MLLMs) by incorporating
visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in
various visual tasks, they often lack interpretability and struggle with
complex visual inputs. To address these challenges, we propose a multi-turn
processing pipeline that dynamically focuses on visual inputs and provides
interpretable thoughts. We collect and introduce the Visual CoT dataset
comprising 373k question-answer pairs, annotated with intermediate bounding
boxes highlighting key regions essential for answering the questions.
Importantly, the introduced benchmark is capable of evaluating MLLMs in
scenarios requiring specific local region identification. Extensive experiments
demonstrate the effectiveness of our framework and shed light on better
inference strategies. The Visual CoT dataset, benchmark, and pre-trained models
are available to foster further research in this direction. |
This paper introduces Visual CoT, a novel pipeline and dataset for enhancing Multi-Modal Large Language Models (MLLMs) with visual Chain-of-Thought (CoT) reasoning, improving their interpretability and ability to process complex visual inputs. |
Existing MLLMs often lack interpretability and struggle with dynamic, multi-turn visual reasoning, hindering their efficacy in complex tasks. |
The authors curate a Visual CoT dataset with 373k question-answer pairs annotated with bounding boxes highlighting key regions. They propose a multi-turn pipeline where the MLLM first identifies the key region, then uses both original and localized information for reasoning. |
Visual CoT significantly improves performance on document/text-related tasks and high-resolution image processing.
The model achieves comparable results on various multi-modal benchmarks, demonstrating enhanced visual understanding.
Visual CoT outperforms previous state-of-the-art models on visual grounding benchmarks, highlighting its effectiveness in locating and understanding objects. |
The model may struggle to identify the most relevant region in images with extensive information or complex questions.
Future work can explore incorporating more sophisticated visual reasoning modules and extending the approach to other multi-modal tasks. |
multi-modal language models, chain-of-thought reasoning, visual reasoning, interpretability, visual grounding |
2403.16990
Report |
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation |
Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or |
Text-to-image diffusion models have an unprecedented ability to generate
diverse and high-quality images. However, they often struggle to faithfully
capture the intended semantics of complex input prompts that include multiple
subjects. Recently, numerous layout-to-image extensions have been introduced to
improve user control, aiming to localize subjects represented by specific
tokens. Yet, these methods often produce semantically inaccurate images,
especially when dealing with multiple semantically or visually similar
subjects. In this work, we study and analyze the causes of these limitations.
Our exploration reveals that the primary issue stems from inadvertent semantic
leakage between subjects in the denoising process. This leakage is attributed
to the diffusion model's attention layers, which tend to blend the visual
features of different subjects. To address these issues, we introduce Bounded
Attention, a training-free method for bounding the information flow in the
sampling process. Bounded Attention prevents detrimental leakage among subjects
and enables guiding the generation to promote each subject's individuality,
even with complex multi-subject conditioning. Through extensive
experimentation, we demonstrate that our method empowers the generation of
multiple subjects that better align with given prompts and layouts. |
Introduces Bounded Attention, a training-free method to improve semantic fidelity in multi-subject image generation with diffusion models by bounding information flow during sampling to mitigate semantic leakage between subjects. |
Existing text-to-image diffusion models struggle to accurately generate scenes with multiple subjects, especially when they are semantically or visually similar due to attention mechanisms blending features and causing semantic leakage. |
Bounded Attention operates in two modes: (1) Bounded Guidance: Backpropagates through the model to steer the latent signal towards desired layout using a loss based on attention map concentration. (2) Bounded Denoising: Uses masks to restrict attention and reduce semantic leakage during the denoising process, refined in later stages using self-attention map clustering. |
Bounded Attention successfully generates multiple subjects with distinct features, even with complex layouts and semantically similar subjects.
Outperforms baselines, including training-based methods, in qualitative comparisons demonstrating reduced semantic leakage and improved layout fidelity.
Shows significant improvement in quantitative evaluation on DrawBench dataset, particularly in counting accuracy and spatial precision. |
Residual leakage persists due to imperfect optimization during guidance and segmentation inaccuracies.
Success is contingent on the match between seed and layout, necessitating future work on seed generation tailored to layouts. |
text-to-image generation, diffusion models, semantic leakage, layout-to-image synthesis, bounded attention |
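The masking idea behind the Bounded Denoising mode in the entry above (2403.16990) can be illustrated with a toy self-attention routine. This is a simplified sketch under stated assumptions, not the authors' implementation: each token carries a subject label (with -1 for background), and a query belonging to one subject is blocked from attending to keys of a different subject. The function name, the labeling scheme, and the random inputs are hypothetical; the paper's time-dependent mask refinement and the guidance loss are omitted.

```python
import math
import random

def bounded_attention(q, k, v, subject_of_token):
    """Softmax attention where cross-subject query/key pairs are masked out.

    subject_of_token[i] is a subject id (>= 0) or -1 for background.
    """
    n, d = len(q), len(q[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            si, sj = subject_of_token[i], subject_of_token[j]
            blocked = si >= 0 and sj >= 0 and si != sj
            if blocked:
                scores.append(-math.inf)  # no information flow between subjects
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d))
        m = max(scores)  # finite: a token is never blocked from itself
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        )
    return outputs, weights

rng = random.Random(0)
n_tokens, dim = 5, 4
q = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
k = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
v = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
labels = [0, 0, 1, 1, -1]  # two subjects plus one background token
out, attn = bounded_attention(q, k, v, labels)
```

In this toy setup, tokens of subject 0 receive exactly zero weight on tokens of subject 1 and vice versa, while background tokens remain visible to everyone, which is one simple way to stop features of similar subjects from blending.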
2403.16954
Report |
Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance |
Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan |
Large-scale text-to-image diffusion models have achieved great success in
synthesizing high-quality and diverse images given target text prompts. Despite
the revolutionary image generation ability, current state-of-the-art models
still struggle to deal with multi-concept generation accurately in many cases.
This phenomenon is known as "concept bleeding" and displays as the unexpected
overlapping or merging of various concepts. This paper presents a general
approach for text-to-image diffusion models to address the mutual interference
between different subjects and their attachments in complex scenes, pursuing
better text-image consistency. The core idea is to isolate the synthesizing
processes of different concepts. We propose to bind each attachment to
corresponding subjects separately with split text prompts. Besides, we
introduce a revision method to fix the concept bleeding problem in
multi-subject synthesis. We first depend on pre-trained object detection and
segmentation models to obtain the layouts of subjects. Then we isolate and
resynthesize each subject individually with corresponding text prompts to avoid
mutual interference. Overall, we achieve a training-free strategy, named
Isolated Diffusion, to optimize multi-concept text-to-image synthesis. It is
compatible with the latest Stable Diffusion XL (SDXL) and prior Stable
Diffusion (SD) models. We compare our approach with alternative methods using a
variety of multi-concept text prompts and demonstrate its effectiveness with
clear advantages in text-image consistency and user study. |
Introduces Isolated Diffusion, a training-free method to address the "concept bleeding" problem in multi-concept text-to-image generation with Stable Diffusion models. |
Current text-to-image models struggle to maintain text-image consistency when generating images with multiple concepts, often leading to overlapping or merging of concepts. |
Isolates the denoising processes for different concepts using split text prompts. Employs pre-trained object detection (YOLO) and segmentation (SAM) models to identify and revise concepts in generated images. |
Achieves accurate assignment of attributes to multiple attachments within an image.
Effectively revises images to separate and accurately depict multiple subjects, avoiding concept merging.
Outperforms existing methods in maintaining text-image consistency, as demonstrated by qualitative and quantitative evaluations and user studies. |
Relies on successful subject detection by YOLO, which may fail with unseen objects.
Cannot correct for missing subjects if the initial generation by SD models is incomplete. |
text-to-image generation, diffusion models, multi-concept generation, concept bleeding, stable diffusion |
2403.16897
Report |
Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text |
Junshu Tang, Yanhong Zeng, Ke Fan, Xuheng Wang, Bo Dai, Kai Chen, Lizhuang Ma |
Creating and animating 3D biped cartoon characters is crucial and valuable in
various applications. Compared with geometry, the diverse texture design plays
an important role in making 3D biped cartoon characters vivid and charming.
Therefore, we focus on automatic texture design for cartoon characters based on
input instructions. This is challenging for domain-specific requirements and a
lack of high-quality data. To address this challenge, we propose Make-It-Vivid,
the first attempt to enable high-quality texture generation from text in UV
space. We prepare detailed text-texture paired data for 3D characters by
using vision-question-answering agents. Then we customize a pretrained
text-to-image model to generate texture map with template structure while
preserving the natural 2D image knowledge. Furthermore, to enhance fine-grained
details, we propose a novel adversarial learning scheme to shorten the domain
gap between original dataset and realistic texture domain. Extensive
experiments show that our approach outperforms current texture generation
methods, resulting in efficient character texturing and faithful generation
with prompts. Besides, we showcase various applications such as out of domain
generation and texture stylization. We also provide an efficient generation
system for automatic text-guided textured character generation and animation. |
Make-It-Vivid is the first attempt to generate high-quality textures in UV space for 3D biped cartoon characters from text input. |
Texture design is crucial for creating vivid and charming 3D cartoon characters but current methods struggle with domain-specific requirements and limited high-quality data. |
The authors 1) use vision-question-answering agents to create a text-texture paired dataset, 2) customize a pretrained text-to-image diffusion model to generate texture maps, and 3) introduce adversarial training to enhance fine-grained details. |
Outperforms existing texture generation methods in quality and text-fidelity.
Enables out-of-domain generation and texture stylization.
Supports efficient text-guided character generation and animation. |
Limited to 3D models with pre-defined topology and UV maps.
Future work includes exploring automated meshing and texturing for arbitrary 3D models. |
text-guided texture generation, 3d cartoon characters, uv space, diffusion models, adversarial training |
2403.16885
Report |
CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs |
Yingji Zhong, Lanqing Hong, Zhenguo Li, Dan Xu |
Neural Radiance Fields (NeRF) have shown impressive capabilities for
photorealistic novel view synthesis when trained on dense inputs. However, when
trained on sparse inputs, NeRF typically encounters issues of incorrect density
or color predictions, mainly due to insufficient coverage of the scene causing
partial and sparse supervision, thus leading to significant performance
degradation. While existing works mainly consider ray-level consistency to
construct 2D learning regularization based on rendered color, depth, or
semantics on image planes, in this paper we propose a novel approach that
models 3D spatial field consistency to improve NeRF's performance with sparse
inputs. Specifically, we first adopt a voxel-based ray sampling strategy to
ensure that the sampled rays intersect with a certain voxel in 3D space. We
then randomly sample additional points within the voxel and apply a Transformer
to infer the properties of other points on each ray, which are then
incorporated into the volume rendering. By backpropagating through the
rendering loss, we enhance the consistency among neighboring points.
Additionally, we propose to use a contrastive loss on the encoder output of the
Transformer to further improve consistency within each voxel. Experiments
demonstrate that our method yields significant improvement over different
radiance fields in the sparse inputs setting, and achieves comparable
performance with current works. |
This paper proposes CVT-xRF, a novel approach to improve Neural Radiance Fields (NeRF) performance with sparse inputs by modeling 3D spatial field consistency. |
NeRF typically struggles with sparse inputs, leading to inaccurate density and color predictions and degraded performance, particularly due to insufficient scene coverage and sparse supervision. |
The method employs a voxel-based ray sampling strategy and introduces a Contrastive In-Voxel Transformer (CVT) structure with local implicit and global explicit constraints to enforce 3D field consistency during training. |
CVT-xRF significantly improves performance over various NeRF baselines, achieving state-of-the-art results on DTU and Synthetic datasets.
The approach enhances 3D field consistency, evidenced by reduced floating artifacts and better object detail recovery in rendered images.
CVT-xRF demonstrates fast convergence speed and learns more discriminative features compared to baseline models. |
The performance improvement of object-level LPIPS evaluation for 6/9-view inputs on the DTU dataset is less pronounced.
Future work includes exploring alternative sampling methods for encoding local context within voxels beyond sphere and line sampling. |
neural radiance fields, novel view synthesis, sparse input, 3d field consistency, contrastive learning |
2403.16848
Report |
Multiple Object Tracking as ID Prediction |
Ruopeng Gao, Yijun Zhang, Limin Wang |
In Multiple Object Tracking (MOT), tracking-by-detection methods have stood
the test for a long time, which split the process into two parts according to
the definition: object detection and association. They leverage robust
single-frame detectors and treat object association as a post-processing step
through hand-crafted heuristic algorithms and surrogate tasks. However, the
nature of heuristic techniques prevents end-to-end exploitation of training
data, leading to increasingly cumbersome and challenging manual modification
while facing complicated or novel scenarios. In this paper, we regard this
object association task as an End-to-End in-context ID prediction problem and
propose a streamlined baseline called MOTIP. Specifically, we form the target
embeddings into historical trajectory information while considering the
corresponding IDs as in-context prompts, then directly predict the ID labels
for the objects in the current frame. Thanks to this end-to-end process, MOTIP
can learn tracking capabilities straight from training data, freeing itself
from burdensome hand-crafted algorithms. Without bells and whistles, our method
achieves impressive state-of-the-art performance in complex scenarios like
DanceTrack and SportsMOT, and it performs competitively with other
transformer-based methods on MOT17. We believe that MOTIP demonstrates
remarkable potential and can serve as a starting point for future research. The
code is available at https://github.com/MCG-NJU/MOTIP. |
Presents MOTIP, a novel multiple object tracking system that formulates object association as an end-to-end ID prediction problem. |
Existing tracking-by-detection methods rely on hand-crafted heuristics and surrogate tasks, while tracking-by-query methods suffer from training-inference discrepancy and potential conflicts between detection and association. MOTIP overcomes these limitations with a streamlined and end-to-end trainable pipeline. |
MOTIP leverages a DETR detector, a learnable ID dictionary, and a transformer-based ID Decoder. Object embeddings from DETR and learnable ID embeddings are combined to form historical trajectories. The ID Decoder then predicts ID labels for new detections based on these trajectories. |
Achieves state-of-the-art performance on DanceTrack and SportsMOT, outperforming previous methods by a significant margin.
Performs competitively with other transformer-based methods on MOT17.
Ablation studies validate the effectiveness of each component and the advantages of the ID prediction pipeline. |
Lacks explicit motion modeling which can be crucial in crowded scenes.
Trajectory representation could be further improved with more sophisticated temporal modeling. |
multiple object tracking, tracking-by-detection, end-to-end tracking, id prediction, transformer |
2403.16530
Report |
An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models |
Zizhao Hu, Shaochong Jia, Mohammad Rostami |
Diffusion models have been widely used for conditional data cross-modal
generation tasks such as text-to-image and text-to-video. However,
state-of-the-art models still fail to align the generated visual concepts with
high-level semantics in a language such as object count, spatial relationship,
etc. We approach this problem from a multimodal data fusion perspective and
investigate how different fusion strategies can affect vision-language
alignment. We discover that compared to the widely used early fusion of
conditioning text in a pretrained image feature space, a specially designed
intermediate fusion can: (i) boost text-to-image alignment with improved
generation quality and (ii) improve training and inference efficiency by
reducing low-rank text-to-image attention calculations. We perform experiments
using a text-to-image generation task on the MS-COCO dataset. We compare our
intermediate fusion mechanism with the classic early fusion mechanism on two
common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion
model achieves a higher CLIP Score and lower FID, with 20% reduced FLOPs, and
50% increased training speed compared to a strong U-ViT baseline with an early
fusion. |
This paper introduces an intermediate fusion mechanism for text-to-image diffusion models that improves text-image alignment and efficiency compared to the commonly used early fusion method. |
Existing text-to-image diffusion models struggle to align generated images with high-level semantics in text and often introduce redundant computations due to early fusion of text embeddings. |
The authors propose a U-ViT-based diffusion backbone with dedicated trainable layers for text and image, fusing them at intermediate layers. They compare this approach with early fusion under different conditioning methods (concatenation and cross-attention) on the MS-COCO dataset. |
Intermediate fusion leads to better text-image alignment, evidenced by higher CLIP Scores and lower FID values compared to early fusion.
Human evaluation confirms that intermediate fusion models generate images with more accurate object counts and are generally preferred over early fusion models.
Intermediate fusion also improves efficiency by reducing FLOPs and increasing training speed due to fewer text-to-image attention calculations. |
The study primarily focuses on concatenation and cross-attention conditioning methods, leaving other conditioning strategies unexplored.
The impact of varying model parameters and hyperparameters for intermediate fusion, especially in scaled-up foundation models, requires further investigation. |
diffusion models, text-to-image generation, multimodal fusion, vision-language alignment, intermediate fusion |
2403.16510
Report |
Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework |
Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan Cao, Jintao Li, Tong-Yee Lee |
Despite the remarkable progress of talking-head-based avatar-creating
solutions, directly generating anchor-style videos with full-body motions
remains challenging. In this study, we propose Make-Your-Anchor, a novel system
necessitating only a one-minute video clip of an individual for training,
subsequently enabling the automatic generation of anchor-style videos with
precise torso and hand movements. Specifically, we finetune a proposed
structure-guided diffusion model on input video to render 3D mesh conditions
into human appearances. We adopt a two-stage training strategy for the
diffusion model, effectively binding movements with specific appearances. To
produce arbitrarily long videos, we extend the 2D U-Net in the frame-wise
diffusion model to a 3D style without additional training cost, and a simple
yet effective batch-overlapped temporal denoising module is proposed to bypass
the constraints on video length during inference. Finally, a novel
identity-specific face enhancement module is introduced to improve the visual
quality of facial regions in the output videos. Comparative experiments
demonstrate the effectiveness and superiority of the system in terms of visual
quality, temporal coherence, and identity preservation, outperforming SOTA
diffusion/non-diffusion methods. Project page:
https://github.com/ICTMCG/Make-Your-Anchor. |
This paper introduces "Make-Your-Anchor," a diffusion-based system for generating personalized 2D avatar videos from one-minute video clips. This system accurately synthesizes full-body anchor videos with realistic torso and hand movements. |
The proposed system addresses the limitations of current talking-head avatar systems that struggle to generate realistic full-body motions, particularly for anchor-style videos. |
The system utilizes a two-stage training strategy for a structure-guided diffusion model. It first pre-trains on a multi-identity dataset for motion generation and then fine-tunes on a specific individual's video to bind appearance to motion. For temporal consistency and arbitrary video length, the system employs batch-overlapped temporal denoising during inference. It also includes an identity-specific face enhancement module for improving facial detail realism. |
The system outperforms state-of-the-art GAN-based and diffusion-based methods in visual quality, temporal consistency, and identity preservation.
A two-stage training strategy effectively binds motion to a specific individual's appearance, allowing for personalized avatar creation.
Batch-overlapped temporal denoising enables the generation of long, temporally consistent videos without additional training. |
The system may struggle to preserve appearance when presented with poses significantly different from those seen during fine-tuning.
The current system does not model foreground occlusions, which may lead to ghosting artifacts. Future work could address this by explicitly segmenting and preserving occluded elements. |
2d avatar generation, diffusion models, video generation, motion-to-appearance synthesis, identity preservation |
2403.16379
Report |
FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models |
Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, Yu Wang |
In recent years, there has been significant progress in the development of
text-to-image generative models. Evaluating the quality of the generative
models is one essential step in the development process. Unfortunately, the
evaluation process could consume a significant amount of computational
resources, making the required periodic evaluation of model performance (e.g.,
monitoring training progress) impractical. Therefore, we seek to improve the
evaluation efficiency by selecting the representative subset of the text-image
dataset. We systematically investigate the design choices, including the
selection criteria (textual features or image-based metrics) and the selection
granularity (prompt-level or set-level). We find that the insights from prior
work on subset selection for training data do not generalize to this problem,
and we propose FlashEval, an iterative search algorithm tailored to evaluation
data selection. We demonstrate the effectiveness of FlashEval on ranking
diffusion models with various configurations, including architectures,
quantization levels, and sampler schedules on COCO and DiffusionDB datasets.
Our searched 50-item subset could achieve comparable evaluation quality to the
randomly sampled 500-item subset for COCO annotations on unseen models,
achieving a 10x evaluation speedup. We release the condensed subset of these
commonly used datasets to help facilitate diffusion algorithm design and
evaluation, and open-source FlashEval as a tool for condensing future datasets,
accessible at https://github.com/thu-nics/FlashEval. |
This paper introduces FlashEval, an iterative search algorithm that identifies representative subsets of text-image datasets for faster and more accurate evaluation of text-to-image diffusion generative models. |
Evaluating text-to-image diffusion models is computationally expensive, especially when iterating on model design or training. Existing methods, like random subset sampling, offer poor accuracy-efficiency trade-offs. FlashEval aims to improve this trade-off by finding small, highly representative subsets for evaluation. |
FlashEval, inspired by evolutionary algorithms, iteratively searches for representative prompts in the dataset. It combines the strengths of prompt-wise search (efficiency) and set-wise search (accuracy). It employs a frequency-based prompt selection strategy to identify prompts that consistently contribute to well-performing subsets. The search process involves constructing and evaluating numerous subsets based on Kendall's Tau (KD) correlation with the full dataset ranking and iteratively refining the selection of prompts. |
FlashEval significantly outperforms random sampling and baseline search methods, achieving high ranking correlation (KD) with smaller subset sizes (e.g., 50-item subset comparable to a 500-item random subset).
The subsets found by FlashEval generalize well to unseen models with different architectures, parameters, solvers, and step sizes.
The search cost of FlashEval can be further reduced by using a smaller randomly sampled subset as a proxy for the full dataset ranking during the search process. |
The current implementation of FlashEval primarily focuses on ranking tasks; extending it to other evaluation metrics could be explored.
Further investigation into optimizing the search efficiency of FlashEval, especially for very large datasets, is beneficial. |
text-to-image generation, diffusion models, evaluation metrics, subset selection, evolutionary algorithms |
2403.16368
Report |
Distilling Semantic Priors from SAM to Efficient Image Restoration Models |
Quan Zhang, Xiaoyu Liu, Wei Li, Hanting Chen, Junchao Liu, Jie Hu, Zhiwei Xiong, Chun Yuan, Yunhe Wang |
In image restoration (IR), leveraging semantic priors from segmentation
models has been a common approach to improve performance. The recent segment
anything model (SAM) has emerged as a powerful tool for extracting advanced
semantic priors to enhance IR tasks. However, the computational cost of SAM is
prohibitive for IR, compared to existing smaller IR models. The incorporation
of SAM for extracting semantic priors considerably hampers the model inference
efficiency. To address this issue, we propose a general framework to distill
SAM's semantic knowledge to boost existing IR models without interfering with
their inference process. Specifically, our proposed framework consists of the
semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD)
scheme. SPF fuses two kinds of information between the restored image predicted
by the original IR model and the semantic mask predicted by SAM for the refined
restored image. SPD leverages a self-distillation manner to distill the fused
semantic priors to boost the performance of original IR models. Additionally,
we design a semantic-guided relation (SGR) module for SPD, which ensures
semantic feature representation space consistency to fully distill the priors.
We demonstrate the effectiveness of our framework across multiple IR models and
tasks, including deraining, deblurring, and denoising. |
This paper introduces a novel framework designed to enhance existing image restoration (IR) models by distilling semantic knowledge from the Segment Anything Model (SAM) without compromising inference speed. |
SAM, despite its potential for extracting rich semantic priors, presents a computational bottleneck for IR tasks due to its large size. This framework addresses this limitation, enabling the utilization of SAM's strengths without sacrificing efficiency. |
The framework comprises two core schemes: Semantic Priors Fusion (SPF) fuses restored images from the IR model with SAM's semantic masks for refinement. Semantic Priors Distillation (SPD), incorporating a semantic-guided relation (SGR) module, transfers this fused knowledge to the original IR model, boosting its performance. |
The framework consistently outperforms baseline IR models, demonstrating substantial improvements in both fidelity metrics (PSNR, SSIM) and perceptual quality (FID) across various IR tasks.
Evaluations on downstream segmentation tasks using cityscape-syn datasets further highlight the framework's efficacy, exhibiting consistent enhancements in IoU, PA, and DICE metrics.
Ablation studies validate the contribution of individual components (SPF, SPD, SGR) within the framework, underscoring their significance in enhancing IR performance. |
The framework necessitates the training of an additional IR model (f^IR2), potentially increasing training complexity.
Future exploration could focus on extending the framework to incorporate semantic priors from diverse sources beyond SAM, further enriching its capabilities. |
image restoration, semantic priors, segment anything model (sam), knowledge distillation, semantic-guided relation |
2403.16365
Report |
Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion |
Hossein Souri, Arpit Bansal, Hamid Kazemi, Liam Fowl, Aniruddha Saha, Jonas Geiping, Andrew Gordon Wilson, Rama Chellappa, Tom Goldstein, Micah Goldblum |
Modern neural networks are often trained on massive datasets that are web
scraped with minimal human inspection. As a result of this insecure curation
pipeline, an adversary can poison or backdoor the resulting model by uploading
malicious data to the internet and waiting for a victim to scrape and train on
it. Existing approaches for creating poisons and backdoors start with randomly
sampled clean data, called base samples, and then modify those samples to craft
poisons. However, some base samples may be significantly more amenable to
poisoning than others. As a result, we may be able to craft more potent poisons
by carefully choosing the base samples. In this work, we use guided diffusion
to synthesize base samples from scratch that lead to significantly more potent
poisons and backdoors than previous state-of-the-art attacks. Our Guided
Diffusion Poisoning (GDP) base samples can be combined with any downstream
poisoning or backdoor attack to boost its effectiveness. Our implementation
code is publicly available at: https://github.com/hsouri/GDP . |
The paper introduces Guided Diffusion Poisoning (GDP), a method that leverages guided diffusion models to synthesize highly potent poisoned training data for computer vision tasks. |
Existing data poisoning and backdoor attacks often rely on randomly selected base samples, limiting their effectiveness. This work demonstrates that carefully chosen base samples can significantly enhance the potency of such attacks. |
GDP employs a three-step process: (1) Generate base samples with a diffusion model, weakly guided by a poisoning loss function while maintaining clean labels. (2) Utilize these base samples as initialization for downstream poisoning algorithms. (3) Filter generated poisons, selecting those with the lowest poisoning loss. |
GDP achieves significantly higher attack success rates compared to state-of-the-art targeted poisoning and backdoor attacks, even with very small poison budgets.
The method is effective even with small perturbation budgets, making the poisons less detectable.
GDP enhances the transferability of poisons, demonstrating improved performance in black-box settings where the victim model's architecture is unknown. |
GDP requires a dataset-specific diffusion model, which can be computationally expensive to train.
Generating and filtering a large number of poisons is inefficient; exploring more reliable optimization strategies could improve this aspect. |
data poisoning, backdoor attacks, diffusion models, computer vision, adversarial machine learning |
2403.16210
Report |
Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane |
Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma, Hongdong Li, Pan Ji |
We present Frankenstein, a diffusion-based framework that can generate
semantic-compositional 3D scenes in a single pass. Unlike existing methods that
output a single, unified 3D shape, Frankenstein simultaneously generates
multiple separated shapes, each corresponding to a semantically meaningful
part. The 3D scene information is encoded in one single tri-plane tensor, from
which multiple Signed Distance Function (SDF) fields can be decoded to
represent the compositional shapes. During training, an auto-encoder compresses
tri-planes into a latent space, and then the denoising diffusion process is
employed to approximate the distribution of the compositional scenes.
Frankenstein demonstrates promising results in generating room interiors as
well as human avatars with automatically separated parts. The generated scenes
facilitate many downstream applications, such as part-wise re-texturing, object
rearrangement in the room or avatar cloth re-targeting. |
This paper presents Frankenstein, a novel tri-plane diffusion-based framework for generating semantic-compositional 3D scenes in a single pass. |
Downstream applications often require semantically-decomposed 3D shapes, e.g., for realistic animation or part replacement. Existing methods struggle to generate such decompositions directly. |
The method encodes multiple SDFs, each representing a semantic part, within a single tri-plane. It uses a three-stage training process: 1) per-scene tri-plane fitting, 2) VAE compression of tri-planes into a latent space, 3) diffusion model training on the latent space for controllable generation. |
Frankenstein generates semantic-compositional 3D scenes for both rooms and avatars with clean part separation.
The generated scenes allow for applications like part-wise texturing, object rearrangement, and cloth re-targeting.
Coarse-to-fine optimization and semantic-aware point sampling during tri-plane fitting are crucial for high-quality reconstruction. |
Limited details due to using a single tri-plane, potentially solvable by incorporating block-wise scene representation.
Slow VAE training, requiring exploration of more efficient architectures. |
3d scene generation, semantic composition, diffusion model, tri-plane representation, conditional generation |
2403.16141
Report |
Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes |
Takashi Otonari, Satoshi Ikehata, Kiyoharu Aizawa |
Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic
scenes often involve explicit modeling of scene dynamics. However, this
approach faces challenges in modeling scene dynamics in urban environments,
where moving objects of various categories and scales are present. In such
settings, it becomes crucial to effectively eliminate moving objects to
accurately reconstruct static backgrounds. Our research introduces an
innovative method, termed here as Entity-NeRF, which combines the strengths of
knowledge-based and statistical strategies. This approach utilizes entity-wise
statistics, leveraging entity segmentation and stationary entity classification
through thing/stuff segmentation. To assess our methodology, we created an
urban scene dataset masked with moving objects. Our comprehensive experiments
demonstrate that Entity-NeRF notably outperforms existing techniques in
removing moving objects and reconstructing static urban backgrounds, both
quantitatively and qualitatively. |
This paper presents Entity-NeRF, a novel method for building NeRFs of dynamic urban scenes by identifying and removing multiple moving objects of various types and scales. |
Existing NeRF methods struggle with the complexity of dynamic urban scenes, where numerous moving objects of different sizes and categories are present. Explicitly modeling scene dynamics or treating moving objects as outliers using existing approaches proves ineffective. |
Entity-NeRF combines knowledge-based and statistical methods. It leverages entity segmentation for object identification, thing/stuff segmentation for stationary entity classification, and entity-wise statistics of reconstruction errors (EARR) for robust distractor labeling. |
Entity-NeRF effectively removes moving objects and reconstructs static backgrounds in urban scenes, outperforming existing methods like RobustNeRF in terms of foreground and background PSNR.
The method demonstrates robustness to variations in object scale and scene complexity, accurately identifying distractors without excessively excluding static elements.
Stationary entity classification using thing/stuff segmentation significantly improves training efficiency and final PSNR by incorporating complex backgrounds from the early stages of training. |
Entity-NeRF might face difficulties reconstructing backgrounds occluded by large moving objects.
Shadows cast by moving objects are not explicitly handled and could be inadvertently incorporated into the training. |
neural radiance fields, dynamic scenes, urban environments, entity segmentation, novel view synthesis |
2403.16131
Report |
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement |
Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen |
DETR-like methods have significantly increased detection performance in an
end-to-end manner. Their mainstream two-stage frameworks perform dense
self-attention and select a fraction of queries for sparse cross-attention,
which is proven effective for improving performance but also introduces a heavy
computational burden and high dependence on stable query selection. This paper
demonstrates that suboptimal two-stage selection strategies result in scale
bias and redundancy due to the mismatch between selected queries and objects in
two-stage initialization. To address these issues, we propose hierarchical
salience filtering refinement, which performs transformer encoding only on
filtered discriminative queries, for a better trade-off between computational
efficiency and precision. The filtering process overcomes scale bias through a
novel scale-independent salience supervision. To compensate for the semantic
misalignment among queries, we introduce elaborate query refinement modules for
stable two-stage initialization. Based on the above improvements, the proposed
Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, +4.4% AP
on three challenging task-specific detection datasets, as well as 49.2% AP on
COCO 2017 with fewer FLOPs. The code is available at
https://github.com/xiuqhou/Salience-DETR. |
This paper proposes Salience DETR, a novel end-to-end object detection framework that addresses scale bias and redundancy in two-stage DETR-like detectors through hierarchical salience filtering refinement. |
Existing two-stage DETR methods suffer from heavy computational burden and scale bias in query selection, resulting in suboptimal performance, especially for small object detection. |
Salience DETR introduces: (1) Scale-independent salience supervision for unbiased query filtering. (2) Hierarchical query filtering to encode only selected discriminative queries. (3) Query refinement modules to address semantic misalignment among queries. |
Salience DETR achieves state-of-the-art performance on three task-specific detection datasets (ESD, CSD, MSSD) and competitive results on COCO 2017.
It outperforms other methods with fewer FLOPs, demonstrating a better trade-off between computational efficiency and accuracy.
The proposed scale-independent supervision and query refinement modules prove effective in mitigating scale bias and redundancy. |
The redundancy removal for two-stage queries relies on hand-crafted NMS and lacks an end-to-end solution.
Exploring the potential of salience supervision for pixel-level tasks like instance segmentation is a promising future direction. |
object detection, detr, transformer, salience, query filtering |
2403.16111
Report |
EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing |
Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang |
Current diffusion-based video editing primarily focuses on local editing
(e.g., object/background editing) or global style editing by utilizing
various dense correspondences. However, these methods often fail to accurately
edit the foreground and background simultaneously while preserving the original
layout. We find that the crux of the issue stems from the imprecise
distribution of attention weights across designated regions, including
inaccurate text-to-attribute control and attention leakage. To tackle this
issue, we introduce EVA, a zero-shot and multi-attribute
video editing framework tailored for human-centric videos with complex motions.
We incorporate a Spatial-Temporal Layout-Guided Attention mechanism that
leverages the intrinsic positive and negative correspondences of cross-frame
diffusion features. To avoid attention leakage, we utilize these
correspondences to boost the attention scores of tokens within the same
attribute across all video frames while limiting interactions between tokens of
different attributes in the self-attention layer. For precise text-to-attribute
manipulation, we use discrete text embeddings focused on specific layout areas
within the cross-attention layer. Benefiting from the precise attention weight
distribution, EVA can be easily generalized to multi-object editing scenarios
and achieves accurate identity mapping. Extensive experiments demonstrate EVA
achieves state-of-the-art results in real-world scenarios. Full results are
provided at https://knightyxp.github.io/EVA/ |
This paper presents EVA, a zero-shot multi-attribute video editing framework for human-centric videos that uses a novel Spatial-Temporal Layout-Guided Attention mechanism. |
Current video editing methods struggle with accurate multi-attribute editing while preserving layout and background, especially in videos with complex human motion. |
EVA leverages: 1) Spatially disentangled semantic masks for layout information and accurate text-to-attribute control. 2) Cross-frame diffusion feature similarity to enhance attention scores within attributes and minimize attention leakage between them. |
Achieves state-of-the-art results on benchmark datasets for both single and multi-object editing.
Enables identity swapping in multi-object scenes.
Outperforms existing methods in quantitative metrics (CLIP-T, Warp-error) and user studies evaluating subject edit accuracy, layout preservation, motion alignment, and overall preference. |
Relies on user-provided layout masks, limiting scalability.
Future work includes automating mask generation and exploring higher-resolution video editing. |
video editing, text-to-video generation, diffusion models, attention mechanisms, layout preservation |
2403.16095
Report |
CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field |
Jiarui Hu, Xianhao Chen, Boyin Feng, Guanglin Li, Liangjing Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui |
Recently neural radiance fields (NeRF) have been widely exploited as 3D
representations for dense simultaneous localization and mapping (SLAM). Despite
their notable successes in surface modeling and novel view synthesis, existing
NeRF-based methods are hindered by their computationally intensive and
time-consuming volume rendering pipeline. This paper presents an efficient
dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D
Gaussian field with high consistency and geometric stability. Through an
in-depth analysis of Gaussian Splatting, we propose several techniques to
construct a consistent and stable 3D Gaussian field suitable for tracking and
mapping. Additionally, a novel depth uncertainty model is proposed to ensure
the selection of valuable Gaussian primitives during optimization, thereby
improving tracking efficiency and accuracy. Experiments on various datasets
demonstrate that CG-SLAM achieves superior tracking and mapping performance
with a notable tracking speed of up to 15 Hz. We will make our source code
publicly available. Project page: https://zju3dv.github.io/cg-slam. |
This paper presents CG-SLAM, an efficient dense RGB-D SLAM system based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. |
Existing NeRF-based SLAM methods are computationally intensive and time-consuming, hindering their ability to achieve both accuracy and efficiency. This paper aims to address this challenge by leveraging the efficiency of 3D Gaussian Splatting while ensuring mapping and tracking quality. |
The authors propose several techniques: 1) a CUDA framework for real-time dense RGB-D SLAM based on the derivatives of camera poses in 3D Gaussian Splatting, 2) a scale regularization term and depth alignment strategy to construct a consistent and stable 3D Gaussian field, and 3) a novel depth uncertainty model to select valuable Gaussian primitives for optimization. |
CG-SLAM achieves superior tracking accuracy compared to NeRF-based SLAM methods on Replica, TUM-RGBD, and ScanNet datasets.
CG-SLAM demonstrates state-of-the-art reconstruction quality with high mapping accuracy in observed areas.
CG-SLAM achieves real-time performance with a tracking speed of up to 15 Hz due to its efficient Gaussian-based representation and GPU acceleration. |
The Gaussian-based representation requires considerable memory usage.
The method exhibits a weak prediction ability for unobserved areas. |
dense visual slam, neural rendering, 3d gaussian field, uncertainty modeling, real-time |
2403.16048
Report |
Edit3K: Universal Representation Learning for Video Editing Components |
Xin Gu, Libo Zhang, Fan Chen, Longyin Wen, Yufei Wang, Tiejian Luo, Sijie Zhu |
This paper focuses on understanding the predominant video creation pipeline,
i.e., compositional video editing with six main types of editing components,
including video effects, animation, transition, filter, sticker, and text. In
contrast to existing visual representation learning of visual materials (i.e.,
images/videos), we aim to learn visual representations of editing
actions/components that are generally applied on raw materials. We start by
proposing the first large-scale dataset for editing components of video
creation, which covers about 3,094 editing components with 618,800 videos.
Each video in our dataset is rendered by various image/video materials with a
single editing component, which supports atomic visual understanding of
different editing components. It can also benefit several downstream tasks,
e.g., editing component recommendation, editing component
recognition/retrieval, etc. Existing visual representation methods perform
poorly because it is difficult to disentangle the visual appearance of editing
components from raw materials. To that end, we benchmark popular alternative
solutions and propose a novel method that learns to attend to the appearance of
editing components regardless of raw materials. Our method achieves favorable
results on editing component retrieval/recognition compared to the alternative
solutions. A user study is also conducted to show that our representations
cluster visually similar editing components better than other alternatives.
Furthermore, our learned representations, applied to transition recommendation
tasks, achieve state-of-the-art results on the AutoTransition dataset. The code
and dataset will be released for academic use. |
This paper introduces Edit3K, the first large-scale dataset for learning representations of video editing components (e.g., effects, transitions, filters). It also proposes a novel embedding guidance architecture and contrastive loss for learning these representations. |
Understanding video editing components is crucial for many downstream tasks like effect recommendation, detection, recognition, and automatic video editing. Existing datasets and methods are not designed for this task. |
Edit3K dataset is created by rendering videos using existing image/video materials and a diverse set of editing components. The proposed model utilizes a guided spatial-temporal encoder, a guided embedding decoder, and an embedding queue mechanism to learn disentangled representations of editing components. |
The proposed method significantly outperforms existing video representation learning approaches on editing component retrieval.
User studies demonstrate that the learned embeddings cluster visually similar editing components better than alternative methods.
The learned representations achieve state-of-the-art results on transition recommendation when applied to the AutoTransition dataset. |
The model currently uses low frames per second, limiting its ability to handle fast motion.
The model might struggle to recognize editing components with subtle changes without access to the raw, unedited video. |
video editing, representation learning, dataset, contrastive learning, attention mechanism |
2403.16020
Report |
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference |
Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu |
As deep neural networks evolve from convolutional neural networks (ConvNets)
to advanced vision transformers (ViTs), there is an increased need to eliminate
redundant data for faster processing without compromising accuracy. Previous
methods are often architecture-specific or necessitate re-training, restricting
their applicability with frequent model updates. To solve this, we first
introduce a novel property of lightweight ConvNets: their ability to identify
key discriminative patch regions in images, irrespective of the model's final
accuracy or size. We demonstrate that fully-connected layers are the primary
bottleneck for ConvNets performance, and their suppression with simple weight
recalibration markedly enhances discriminative patch localization performance.
Using this insight, we introduce PaPr, a method for substantially pruning
redundant patches with minimal accuracy loss using lightweight ConvNets across
a variety of deep learning architectures, including ViTs, ConvNets, and hybrid
transformers, without any re-training. Moreover, the simple early-stage
one-step patch pruning with PaPr enhances existing patch reduction methods.
Through extensive testing on diverse architectures, PaPr achieves significantly
higher accuracy over state-of-the-art patch reduction methods with similar FLOP
count reduction. More specifically, PaPr reduces about 70% of redundant patches
in videos with less than 0.8% drop in accuracy, and up to 3.7x FLOPs reduction,
which is a 15% greater reduction with 2.5% higher accuracy. |
Proposes PaPr, a training-free, one-step patch pruning method using lightweight ConvNets to accelerate inference in various deep learning models (ViTs, ConvNets, hybrid transformers). |
Addresses limitations of existing patch pruning techniques that require retraining, perform gradual reduction, and lack architectural generality. |
Leverages the inherent ability of lightweight ConvNets to identify discriminative regions by generating a Patch Significance Map (PSM) to guide patch pruning in larger models. |
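A minimal sketch of the pruning step described above, assuming the Patch Significance Map has already been resized to the patch grid of the larger model. The function name `select_patches`, the `num_keep` parameter, and the example scores are hypothetical and are not taken from the PaPr code.

```python
def select_patches(psm, num_keep):
    """Return the sorted flat indices of the num_keep highest-scoring patches.

    psm is a 2D list of significance scores, one per patch (assumed to be
    already resized to the patch grid of the model being accelerated).
    """
    scores = [value for row in psm for value in row]
    # Rank patch indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top patches and restore their original spatial order.
    return sorted(ranked[:num_keep])


# Made-up 3x3 significance map: the object sits in the centre-right region.
psm = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.8],
    [0.1, 0.7, 0.2],
]
kept = select_patches(psm, 3)
print(kept)
```

Only the kept indices would be passed on to the larger model, so the selection happens once, before inference, rather than gradually across layers.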
Achieves significantly higher accuracy with lower computational cost compared to state-of-the-art patch reduction methods.
Demonstrates robustness in patch localization across varying ConvNet proposal models and challenging image scenarios.
Effectively reduces spatio-temporal redundancy in videos, leading to substantial FLOPs reduction with minimal accuracy loss. |
Current work focuses on discriminative tasks; future exploration in dense prediction tasks is promising.
Further investigation is needed into the impact of different upsampling methods on PSM generation and performance. |
patch pruning, vision transformers, convolutional neural networks, efficient inference, computer vision |
2403.16016
Report |
Fill in the ____ (a Diffusion-based Image Inpainting Pipeline) |
Eyoel Gebre, Krishna Saxena, Timothy Tran |
Image inpainting is the process of taking an image and generating lost or
intentionally occluded portions. Inpainting has countless applications
including restoring previously damaged pictures, restoring the quality of
images that have been degraded due to compression, and removing unwanted
objects/text. Modern inpainting techniques have shown remarkable ability in
generating sensible completions for images with mask occlusions. In our paper,
an overview of the progress of inpainting techniques will be provided, along
with identifying current leading approaches, focusing on their strengths and
weaknesses. A critical gap in these existing models will be addressed, focusing
on the ability to prompt and control what exactly is generated. We will
additionally justify why we think this is the natural next progressive step
that inpainting models must take, and provide multiple approaches to
implementing this functionality. Finally, we will evaluate the results of our
approaches by qualitatively checking whether they generate high-quality images
that correctly inpaint regions with the objects that they are instructed to
produce. |
This paper presents "Fill in the ____," a diffusion-based image inpainting pipeline that allows users to specify an object to be inserted into a scene using a target image. |
Existing inpainting models lack control over generated content, limiting their use in applications requiring specific object insertion. This work addresses this gap by enabling object-guided inpainting with diffusion models. |
The pipeline builds upon the RePaint algorithm, incorporating a target image and mask as inputs. It modifies the denoising process by combining information from the target image with the generated inpainting, resolving mask conflicts and ensuring seamless object integration. Several masking techniques and lambda scheduling are explored to enhance boundary realism and control the influence of the target image. |
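The role of lambda scheduling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear schedule, the per-pixel convex blend, and the names `linear_lambda` and `blend_pixel` are all hypothetical.

```python
def linear_lambda(step, total_steps, start=1.0, end=0.0):
    """Weight given to the target image at a denoising step.

    Assumed schedule: linear decay from `start` to `end`, so the target
    dominates early and the generated content takes over later.
    """
    if total_steps <= 1:
        return end
    fraction = step / (total_steps - 1)
    return start + (end - start) * fraction


def blend_pixel(generated, target, in_target_mask, lam):
    """Mix one generated pixel value with the target pixel value.

    Outside the target mask the generated value is kept unchanged.
    """
    if not in_target_mask:
        return generated
    return lam * target + (1.0 - lam) * generated


total_steps = 5
for step in range(total_steps):
    lam = linear_lambda(step, total_steps)
    print(step, lam, blend_pixel(0.2, 0.8, True, lam))
```

A larger lambda keeps the result faithful to the target object, while a smaller lambda lets the diffusion model adapt the object to the scene; the schedule decides when that hand-over happens.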
The pipeline successfully inserts target objects into scenes with varying degrees of realism and faithfulness to the target, depending on chosen hyperparameters.
Lambda scheduling, controlling the balance between the target image and the generated inpainting, proves crucial for achieving optimal results.
Failure modes, such as high variance in generated content and biases from the DDPM training data, are identified. |
Current limitations include reliance on manual mask creation and potential biases from the DDPM training data.
Future work involves automating mask generation, exploring alternative masking techniques, and refining lambda scheduling for enhanced adaptability. The ultimate goal is to develop a fully automated inpainting pipeline. |
image inpainting, diffusion models, generative ai, object insertion, repaint |
2403.15789
Report |
In-Context Matting |
He Guo, Zixuan Ye, Zhiguo Cao, Hao Lu |
We introduce in-context matting, a novel task setting of image matting. Given
a reference image of a certain foreground and guided priors such as points,
scribbles, and masks, in-context matting enables automatic alpha estimation on
a batch of target images of the same foreground category, without additional
auxiliary input. This setting marries good performance in auxiliary input-based
matting and ease of use in automatic matting, which finds a good trade-off
between customization and automation. To overcome the key challenge of accurate
foreground matching, we introduce IconMatting, an in-context matting model
built upon a pre-trained text-to-image diffusion model. Conditioned on inter-
and intra-similarity matching, IconMatting can make full use of reference
context to generate accurate target alpha mattes. To benchmark the task, we
also introduce a novel testing dataset ICM-57, covering 57 groups of
real-world images. Quantitative and qualitative results on the ICM-57 testing
set show that IconMatting rivals the accuracy of trimap-based matting while
retaining the automation level akin to automatic matting. Code is available at
https://github.com/tiny-smart/in-context-matting |
This paper introduces "in-context matting", a new image matting task that enables automatic alpha matte generation for a group of images with similar foregrounds using a single reference image and user-provided guidance (e.g., points, scribbles, masks) on that reference image. |
In-context matting bridges the gap between accuracy and efficiency, and between customization and automation, by combining the advantages of automatic matting (efficiency) and auxiliary input-based matting (customization and accuracy). |
The authors propose IconMatting, a model based on a pre-trained text-to-image diffusion model (Stable Diffusion) for in-context matting. IconMatting leverages inter-image similarity (matching between reference and target images) and intra-image similarity (self-attention within the target image) to accurately identify and extract the target foreground. |
IconMatting achieves comparable accuracy to trimap-based matting while maintaining the automation level of automatic matting.
A novel testing dataset, ICM-57, is introduced for benchmarking in-context matting.
Experiments demonstrate the effectiveness of IconMatting in handling various foreground categories and scenes. |
The performance of IconMatting improves with more reference inputs, but the gains diminish after a certain number.
The current model is trained only on real-world datasets, and incorporating composited data could potentially further enhance performance. |
image matting, in-context learning, diffusion models, stable diffusion, semantic correspondence |
2403.15698
Report |
SceneX: Procedural Controllable Large-scale Scene Generation via Large-language Models |
Mengqi Zhou, Jun Hou, Chuanchen Luo, Yuxi Wang, Zhaoxiang Zhang, Junran Peng |
Due to its great application potential, large-scale scene generation has
drawn extensive attention in academia and industry. Recent research employs
powerful generative models to create desired scenes and achieves promising
results. However, most of these methods represent the scene using 3D primitives
(e.g. point cloud or radiance field) incompatible with the industrial pipeline,
which leads to a substantial gap between academic research and industrial
deployment. Procedural Controllable Generation (PCG) is an efficient technique
for creating scalable and high-quality assets, but it is unfriendly for
ordinary users as it demands profound domain expertise. To address these
issues, we resort to using the large language model (LLM) to drive the
procedural modeling. In this paper, we introduce a large-scale scene generation
framework, SceneX, which can automatically produce high-quality procedural
models according to designers' textual descriptions. Specifically, the proposed
method comprises two components, PCGBench and PCGPlanner. The former
encompasses an extensive collection of accessible procedural assets and
thousands of hand-crafted API documents. The latter aims to generate executable
actions for Blender to produce controllable and precise 3D assets guided by the
user's instructions. Our SceneX can generate a city spanning 2.5 km times 2.5
km with delicate layout and geometric structures, drastically reducing the time
cost from several weeks for professional PCG engineers to just a few hours for
an ordinary user. Extensive experiments demonstrated the capability of our
method in controllable large-scale scene generation and editing, including
asset placement and season translation. |
This paper introduces SceneX, a novel framework for generating large-scale 3D scenes from textual descriptions using Large Language Models (LLMs) and Procedural Content Generation (PCG). |
SceneX bridges the gap between academic research and industrial applications by generating scenes directly compatible with industrial pipelines, unlike methods relying on point clouds or radiance fields. |
SceneX uses PCGBench, a vast dataset of PCG assets and API documentation, and PCGPlanner, an LLM agent hierarchy for task planning, asset retrieval, and action execution in Blender. |
SceneX generates highly realistic and detailed large-scale scenes, including natural environments and cities, significantly faster than previous methods and human experts.
The generated scenes exhibit high aesthetic quality, surpassing existing text-to-3D and Blender-driven generation methods in user and expert evaluations.
SceneX enables controllable and personalized scene editing, allowing users to modify generated assets and scenes based on their instructions. |
SceneX's performance depends on the capabilities of the pre-trained LLM, potentially limiting its generalizability.
The current version of PCGBench has a limited number of assets and APIs, which can restrict the diversity of generated scenes. |
large-scale scene generation, llm agents, pcg, blender, text-to-3d |
2403.15679
Report |
DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes |
Hao Yan, Zhihui Ke, Xiaobo Zhou, Tie Qiu, Xidong Shi, Dadong Jiang |
Implicit neural representations for video (NeRV) have recently become a novel
way for high-quality video representation. However, existing works employ a
single network to represent the entire video, which implicitly confuses static
and dynamic information. This leads to an inability to effectively compress the
redundant static information and a lack of explicit modeling of global
temporal-coherent dynamic details. To solve the above problems, we propose DS-NeRV,
which decomposes videos into sparse learnable static codes and dynamic codes
without the need for explicit optical flow or residual supervision. By setting
different sampling rates for two codes and applying weighted sum and
interpolation sampling methods, DS-NeRV efficiently utilizes redundant static
information while maintaining high-frequency details. Additionally, we design a
cross-channel attention-based (CCA) fusion module to efficiently fuse these two
codes for frame decoding. Our approach achieves a high quality reconstruction
of 31.2 PSNR with only 0.35M parameters thanks to separate static and dynamic
codes representation and outperforms existing NeRV methods in many downstream
tasks. Our project website is at https://haoyan14.github.io/DS-NeRV. |
This paper presents DS-NeRV, a new video INR that decomposes videos into separate learnable static and dynamic codes, improving compression and quality without explicit optical flow or residual supervision. |
Existing NeRV methods struggle to efficiently compress videos due to mixing static and dynamic information, leading to difficulties in reducing redundancy and modeling temporal coherence. DS-NeRV aims to address these issues. |
DS-NeRV uses sparse learnable static codes with weighted sum sampling and dynamic codes with interpolation sampling to represent video content. It employs a cross-channel attention-based fusion module to combine these codes for frame reconstruction. |
DS-NeRV achieves state-of-the-art video reconstruction quality, outperforming previous NeRV methods on Bunny, UVG, and DAVIS datasets.
The method demonstrates strong performance in downstream tasks like video interpolation and inpainting, highlighting its ability to capture temporal coherence.
DS-NeRV exhibits efficient compression capabilities, achieving competitive results compared to traditional codecs like H.264 and HEVC. |
Determining the optimal lengths for static and dynamic codes currently requires manual adjustment for each video.
Finding the best dimensions for static and dynamic codes involves a testing phase. |
video representation, implicit neural representations (inr), video compression, video inpainting, video interpolation |
2403.15624
Report |
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting |
Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li |
Open-vocabulary 3D scene understanding presents a significant challenge in
computer vision, with wide-ranging applications in embodied agents and augmented
reality systems. Previous approaches have adopted Neural Radiance Fields (NeRFs)
to analyze 3D scenes. In this paper, we introduce Semantic Gaussians, a novel
open-vocabulary scene understanding approach based on 3D Gaussian Splatting.
Our key idea is distilling pre-trained 2D semantics into 3D Gaussians. We design
a versatile projection approach that maps various 2D semantic features from
pre-trained image encoders into a novel semantic component of 3D Gaussians,
without the additional training required by NeRFs. We further build a 3D
semantic network that directly predicts the semantic component from raw 3D
Gaussians for fast inference. We explore several applications of Semantic
Gaussians: semantic segmentation on ScanNet-20, where our approach attains a
4.2% mIoU and 4.0% mAcc improvement over prior open-vocabulary scene
understanding counterparts; object part segmentation, scene editing, and
spatial-temporal segmentation with better qualitative results over 2D and 3D
baselines, highlighting its versatility and effectiveness on supporting diverse
downstream tasks. |
This paper proposes Semantic Gaussians, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting by distilling knowledge from pre-trained 2D encoders. |
Open-vocabulary 3D scene understanding is crucial for various real-world applications like robotics and augmented reality, enabling machines to interact effectively with diverse environments. |
The method projects semantic features from pre-trained 2D models (e.g., OpenSeg, CLIP) onto 3D Gaussian points. Additionally, a 3D semantic network (MinkowskiNet) is introduced to predict semantic components directly from raw 3D Gaussians. |
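The projection step described above can be illustrated with a small sketch. It assumes a pinhole camera, Gaussian centres already expressed in each view's camera frame, nearest-pixel sampling, and plain averaging across views; occlusion handling is omitted. The names `project` and `fuse_features` and all numbers are hypothetical, not taken from the paper's code.

```python
from math import floor


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to an integer pixel (u, v)."""
    x, y, z = point_cam
    if z <= 0.0:
        return None  # point is behind the camera
    return (floor(fx * x / z + cx + 0.5), floor(fy * y / z + cy + 0.5))


def fuse_features(centres_per_view, feature_maps, intrinsics):
    """Average, per Gaussian, the 2D features it projects onto in each view.

    centres_per_view[k][g] is the centre of Gaussian g in the frame of view k;
    feature_maps[k][v][u] is the feature vector at pixel (u, v) of view k.
    Gaussians that are never visible get None.
    """
    fused = []
    for g in range(len(centres_per_view[0])):
        samples = []
        for centres, fmap in zip(centres_per_view, feature_maps):
            pixel = project(centres[g], *intrinsics)
            if pixel is None:
                continue
            u, v = pixel
            if 0 <= v < len(fmap) and 0 <= u < len(fmap[0]):
                samples.append(fmap[v][u])
        if samples:
            dim = len(samples[0])
            fused.append([sum(s[d] for s in samples) / len(samples) for d in range(dim)])
        else:
            fused.append(None)
    return fused


intrinsics = (1.0, 1.0, 1.0, 1.0)  # fx, fy, cx, cy (made-up values)
map_a = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
map_b = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
map_a[1][1] = [1.0, 0.0]  # feature seen at pixel (1, 1) in view A
map_b[1][2] = [0.0, 1.0]  # feature seen at pixel (2, 1) in view B
centres_a = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
centres_b = [(1.0, 0.0, 1.0), (0.0, 0.0, -2.0)]
fused = fuse_features([centres_a, centres_b], [map_a, map_b], intrinsics)
print(fused)
```

Because the fusion is a direct lookup and average, no per-scene optimisation is involved in this step, which matches the training-free character of the projection described in the abstract.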
Semantic Gaussians outperforms OpenSeg on ScanNet-20 semantic segmentation, demonstrating effective multi-view information integration.
It achieves high-quality part segmentation consistent across different views, outperforming OpenSeg and LERF.
The method exhibits promising results in spatiotemporal tracking and language-guided editing. |
Scene understanding performance depends on the accuracy of 2D pre-trained models and the quality of 3D Gaussians.
Future work includes exploring better 3D Gaussian representation and multi-modal pre-training. |
open-vocabulary scene understanding, 3d gaussian splatting, semantic segmentation, part segmentation, spatiotemporal tracking |
2403.15583
Report |
U-ARE-ME: Uncertainty-Aware Rotation Estimation in Manhattan Environments |
Aalok Patwardhan, Callum Rhodes, Gwangbin Bae, Andrew J. Davison |
Camera rotation estimation from a single image is a challenging task, often
requiring depth data and/or camera intrinsics, which are generally not
available for in-the-wild videos. Although external sensors such as inertial
measurement units (IMUs) can help, they often suffer from drift and are not
applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm
that estimates camera rotation along with uncertainty from uncalibrated RGB
images. Using a Manhattan World assumption, our method leverages the per-pixel
geometric priors encoded in single-image surface normal predictions and
performs optimisation over the SO(3) manifold. Given a sequence of images, we
can use the per-frame rotation estimates and their uncertainty to perform
multi-frame optimisation, achieving robustness and temporal consistency. Our
experiments demonstrate that U-ARE-ME performs comparably to RGB-D methods and
is more robust than sparse feature-based SLAM methods. We encourage the reader
to view the accompanying video at https://callum-rhodes.github.io/U-ARE-ME for
a visual overview of our method. |
This paper presents U-ARE-ME, an algorithm that estimates camera rotation and uncertainty from uncalibrated RGB images using surface normal predictions and a Manhattan World assumption. |
Accurate and robust rotation estimation from monocular images is crucial for various applications, especially in-the-wild videos where depth data or camera intrinsics are often unavailable. Existing methods struggle with textureless environments, image degradation, or require calibrated cameras. |
The method leverages single-image surface normal predictions and optimizes camera rotation by aligning predicted normals to principal directions. It introduces an uncertainty-weighted cost function to handle unreliable predictions and performs multi-frame optimization using a factor graph for temporal consistency. |
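To make the idea of an uncertainty-weighted alignment cost concrete, here is a simplified sketch. It is not the paper's cost function or optimiser: the rotation is restricted to yaw about the z-axis, the minimum is found by a one-degree grid search rather than optimisation on SO(3), and the cost term `1 - max|component|` as well as all names and values are assumptions made for illustration.

```python
from math import cos, sin, radians


def rotate_z(vec, angle):
    """Rotate a 3D vector about the z-axis by `angle` radians."""
    c, s = cos(angle), sin(angle)
    x, y, z = vec
    return (c * x - s * y, s * x + c * y, z)


def manhattan_cost(normals, weights, angle):
    """Weighted misalignment between rotated normals and the nearest axis.

    A normal that lies exactly on a Manhattan axis contributes zero cost.
    """
    total = 0.0
    for normal, weight in zip(normals, weights):
        rotated = rotate_z(normal, angle)
        total += weight * (1.0 - max(abs(c) for c in rotated))
    return total


def best_yaw_degrees(normals, weights):
    """Grid search over 0..89 degrees (Manhattan frames repeat every 90)."""
    return min(range(90), key=lambda d: manhattan_cost(normals, weights, radians(d)))


# Axis-aligned surfaces seen by a camera that is yawed by 20 degrees,
# plus one unreliable normal prediction (for example near an object boundary).
true_yaw = radians(20)
axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
normals = [rotate_z(axis, -true_yaw) for axis in axes] + [(0.6, 0.8, 0.0)]

confident = best_yaw_degrees(normals, [1.0, 1.0, 1.0, 0.01])  # outlier down-weighted
uniform = best_yaw_degrees(normals, [1.0, 1.0, 1.0, 1.0])     # outlier trusted fully
print(confident, uniform)
```

With the unreliable normal down-weighted, the search recovers the 20-degree yaw; with uniform weights the single bad prediction pulls the estimate away from it, which is the failure the uncertainty weighting is meant to prevent.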
U-ARE-ME achieves comparable accuracy to RGB-D methods and outperforms feature-based SLAM (ORB-SLAM) in challenging, real-world scenarios (ScanNet).
The method is robust to image degradation and does not require camera intrinsics, making it suitable for in-the-wild videos.
The estimated up-vector enables applications like ground segmentation, demonstrating the versatility of the approach. |
The accuracy depends on the quality of surface normal predictions, which can be affected by factors like object boundaries and small object size.
The assumption of a Manhattan World may not hold true for all environments. |
rotation estimation, manhattan world, surface normals, uncertainty quantification, temporal consistency |
2403.15530
Report |
Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting |
Zheng Zhang, Wenbo Hu, Yixing Lao, Tong He, Hengshuang Zhao |
3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis
results while advancing real-time rendering performance. However, it relies
heavily on the quality of the initial point cloud, resulting in blurring and
needle-like artifacts in areas with insufficient initializing points. This is
mainly attributed to the point cloud growth condition in 3DGS, which only
considers the average gradient magnitude of points from observable views and
therefore fails to grow large Gaussians that are observable from many
viewpoints but covered only at their boundaries in many of them. To this end,
we propose a novel method, named Pixel-GS, to take into account the number of
pixels covered by the Gaussian in each view during the computation of the
growth condition. We regard the covered pixel numbers as the weights to
dynamically average the gradients from different views, such that the growth of
large Gaussians can be promoted. As a result, points within the areas with
insufficient initializing points can be grown more effectively, leading to a
more accurate and detailed reconstruction. In addition, we propose a simple yet
effective strategy to scale the gradient field according to the distance to the
camera, to suppress the growth of floaters near the camera. Extensive
experiments both qualitatively and quantitatively demonstrate that our method
achieves state-of-the-art rendering quality while maintaining real-time
rendering speed, on the challenging Mip-NeRF 360 and Tanks & Temples datasets. |
Pixel-GS enhances 3D Gaussian Splatting by enabling effective point growth in areas with insufficient initial points, thereby reducing blurring and needle-like artifacts. |
The effectiveness of 3D Gaussian Splatting heavily relies on the quality of the initial point cloud. Inadequate initializing points lead to rendering artifacts. |
Pixel-GS introduces a pixel-aware gradient that considers the number of pixels covered by each Gaussian in each view during the point cloud growth condition calculation. Additionally, it scales the gradient field according to the distance to the camera to suppress floaters. |
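The difference between a plain per-view average and the pixel-weighted average described above can be shown with a short sketch. The function names, the threshold, and the gradient values are hypothetical; the distance-based gradient scaling is not modelled here.

```python
from math import hypot


def average_gradient(view_grads):
    """Plain mean of per-view gradient magnitudes (each view counts equally)."""
    norms = [hypot(gx, gy) for gx, gy, _ in view_grads]
    return sum(norms) / len(norms)


def pixel_weighted_gradient(view_grads):
    """Mean of per-view gradient magnitudes, weighted by covered pixel count."""
    total_pixels = sum(pixels for _, _, pixels in view_grads)
    if total_pixels == 0:
        return 0.0
    weighted = sum(pixels * hypot(gx, gy) for gx, gy, pixels in view_grads)
    return weighted / total_pixels


# Each tuple is (grad_x, grad_y, covered_pixels) for one view of a large Gaussian.
# It is well covered in one view and only touched at its boundary in three others.
views = [
    (0.0003, 0.0004, 900),
    (0.00003, 0.00004, 10),
    (0.00003, 0.00004, 10),
    (0.00003, 0.00004, 10),
]
THRESHOLD = 0.0002  # hypothetical growth threshold

print(average_gradient(views) > THRESHOLD)         # plain average: no growth
print(pixel_weighted_gradient(views) > THRESHOLD)  # pixel-weighted: growth
```

In the plain average, the three boundary-only views dilute the one informative view, so the Gaussian never crosses the threshold; weighting by pixel count lets the well-covered view dominate.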
Pixel-GS achieves state-of-the-art rendering quality on challenging datasets like Mip-NeRF 360 and Tanks & Temples.
It significantly reduces blurring and needle-like artifacts in sparse regions.
Pixel-GS demonstrates robustness to the sparsity of the initial point cloud. |
The increased number of points in Pixel-GS leads to slightly higher memory consumption compared to 3DGS.
The strategy to address floaters is inspired by NeRF's rendering mechanism and may not generalize well to other rendering techniques.
Future work could investigate optimizing the trade-off between point cloud density and rendering efficiency. |
view synthesis, point-based radiance field, real-time rendering, 3d gaussian splatting, adaptive density control |
2403.15389
Report |
DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data |
Hanrong Ye, Dan Xu |
Recently, there has been an increased interest in the practical problem of
learning multiple dense scene understanding tasks from partially annotated
data, where each training sample is only labeled for a subset of the tasks. The
missing task labels in training lead to low-quality and noisy predictions,
as can be observed from state-of-the-art methods. To tackle this issue, we
reformulate the partially-labeled multi-task dense prediction as a pixel-level
denoising problem, and propose a novel multi-task denoising diffusion framework
coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to
model a potential noisy distribution in the task prediction or feature maps and
generate rectified outputs for different tasks. To exploit multi-task
consistency in denoising, we further introduce a Multi-Task Conditioning
strategy, which can implicitly utilize the complementary nature of the tasks to
help learn the unlabeled tasks, leading to an improvement in the denoising
performance of the different tasks. Extensive quantitative and qualitative
experiments demonstrate that the proposed multi-task denoising diffusion model
can significantly improve multi-task prediction maps, and outperform the
state-of-the-art methods on three challenging multi-task benchmarks, under two
different partial-labeling evaluation settings. The code is available at
https://prismformore.github.io/diffusionmtl/. |
This paper presents DiffusionMTL, a novel multi-task denoising diffusion framework designed to address noisy predictions in multi-task learning from partially annotated data. |
Annotating multi-task datasets at pixel level is expensive, and training with partially annotated data often results in noisy predictions. Existing methods, though improving label efficiency, still suffer from this issue. Hence, a new methodology is needed to rectify noisy predictions and enhance multi-task prediction quality. |
DiffusionMTL utilizes a two-step approach: (i) generating initial multi-task predictions with a shared encoder-decoder backbone and (ii) refining these predictions using a Multi-Task Denoising Diffusion Network (MTDNet). MTDNet employs two diffusion mechanisms: Prediction Diffusion (denoising in output space) and Feature Diffusion (refining in latent feature space). A Multi-Task Conditioning strategy is introduced to facilitate denoising and enable learning of unlabeled tasks via cross-task information sharing. |
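As background for the Prediction Diffusion mechanism described above, the following sketch shows the generic DDPM forward (noising) step applied to an initial prediction map. It is an assumption that the method uses this standard formulation; the linear beta schedule, the function names, and the example values are illustrative only, and the learned denoiser is not shown.

```python
import random
from math import sqrt


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, product = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(1, num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def diffuse(prediction, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise."""
    return [
        sqrt(alpha_bar) * value + sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for value in prediction
    ]


rng = random.Random(0)
alpha_bars = linear_alpha_bars(10)
initial_prediction = [0.2, 0.7, 0.5]  # flattened toy prediction map
noisy = diffuse(initial_prediction, alpha_bars[-1], rng)
print(noisy)
```

A denoising network would then be trained to recover the clean map from `noisy`, optionally conditioned on the other tasks' predictions, which is where the Multi-Task Conditioning strategy would enter.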
DiffusionMTL demonstrates substantial performance improvements, outperforming competing methods on three benchmarks (PASCAL, NYUD, Cityscapes) under different partial-labeling settings.
Ablation studies confirm the effectiveness of the denoising network, multi-task conditioning, and both diffusion mechanisms.
Qualitative analysis showcases DiffusionMTL's ability to effectively denoise noisy predictions and generate cleaner, more accurate multi-task prediction maps. |
The current implementation primarily focuses on the one-label setting; exploring its generalization to more complex scenarios with varying label availability per task is a potential avenue.
Further research on efficiently scaling DiffusionMTL to a larger number of tasks with diverse characteristics and computational demands is warranted. |
multi-task learning, denoising diffusion models, partially supervised learning, dense prediction, computer vision |
2403.15383
Report |
ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars |
Zhenwei Wang, Tengfei Wang, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau |
Real-world applications often require a large gallery of 3D assets that share
a consistent theme. While remarkable advances have been made in general 3D
content creation from text or image, synthesizing customized 3D assets
following the shared theme of input 3D exemplars remains an open and
challenging problem. In this work, we present ThemeStation, a novel approach
for theme-aware 3D-to-3D generation. ThemeStation synthesizes customized 3D
assets based on a few given exemplars with two goals: 1) unity for generating 3D
assets that thematically align with the given exemplars and 2) diversity for
generating 3D assets with a high degree of variation. To this end, we design a
two-stage framework that draws a concept image first, followed by a
reference-informed 3D modeling stage. We propose a novel dual score
distillation (DSD) loss to jointly leverage priors from both the input
exemplars and the synthesized concept image. Extensive experiments and user
studies confirm that ThemeStation surpasses prior works in producing diverse
theme-aware 3D models with impressive quality. ThemeStation also enables
various applications such as controllable 3D-to-3D generation. |
This paper presents ThemeStation, a novel two-stage framework for theme-aware 3D-to-3D generation. It synthesizes diverse 3D assets thematically consistent with a few input exemplars, balancing unity and diversity. |
Addresses limitations of text/image-based 3D generation (ambiguity, inconsistency) by leveraging richer information from 3D exemplars, enabling automatic synthesis of large, thematically consistent 3D asset galleries. |
1. **Theme-driven concept image generation:** Fine-tunes a text-to-image diffusion model on exemplar renderings to generate diverse concept images. 2. **Reference-informed 3D asset modeling:** Uses concept images as rough guidance and refines them into 3D models via dual score distillation (DSD). DSD leverages concept prior (from concept images) at high noise levels for global layout and reference prior (from exemplars) at low noise levels for fine details. |
Outperforms state-of-the-art image-to-3D and 3D variation methods in generative diversity, quality, and multi-view semantic coherence.
Generates compelling and diverse 3D models with finer details, even with a single exemplar.
Enables applications like controllable 3D-to-3D generation. |
Current pipeline requires hours for optimization; future work can explore faster diffusion/rendering techniques.
Reliance on a two-stage pipeline introduces potential for poor initialization; future work can explore feed-forward models. |
3d generation, exemplar-based generation, diffusion models, dual score distillation, theme-aware generation |
2403.15382
Report |
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects |
Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi |
We introduce DragAPart, a method that, given an image and a set of drags as
input, can generate a new image of the same object in a new state, compatible
with the action of the drags. Differently from prior works that focused on
repositioning objects, DragAPart predicts part-level interactions, such as
opening and closing a drawer. We study this problem as a proxy for learning a
generalist motion model, not restricted to a specific kinematic structure or
object category. To this end, we start from a pre-trained image generator and
fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce.
Combined with a new encoding for the drags and dataset randomization, the new
model generalizes well to real images and different categories. Compared to
prior motion-controlled generators, we demonstrate much better part-level
motion understanding. |
This paper introduces DragAPart, an interactive image generator that synthesizes images of objects in new states compatible with user-specified drags, focusing on part-level interactions like opening drawers instead of just repositioning objects. |
Current generative models struggle to capture nuanced part-level motion, often resorting to unrealistic object manipulation. DragAPart addresses this by learning a generalist motion model applicable to diverse objects and their articulations. |
The authors created Drag-a-Move, a synthetic dataset with drag annotations, by animating and rendering objects from GAPartNet. They then trained DragAPart, which uses a novel drag encoding mechanism, on this dataset, leveraging pre-trained diffusion models like Stable Diffusion and DiT. |
DragAPart significantly outperforms state-of-the-art methods in quantitative metrics like PSNR, SSIM, and LPIPS on both synthetic and real-world datasets.
Qualitative comparisons demonstrate DragAPart's ability to generate realistic object articulations while preserving object identity and visual details.
The learned motion model proves useful for downstream applications such as motion analysis for articulated objects and segmentation of moving parts. |
The model currently lacks explicit enforcement of consistency for generated images of the same object across different viewpoints and drag conditions.
The authors trained separate models for everyday objects and humans, limiting its generalizability to all moving entities. |
generative models, motion synthesis, part-level interaction, drag-based control, synthetic data |
2403.15378
Report |
Long-CLIP: Unlocking the Long-Text Capability of CLIP |
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang |
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for
zero-shot classification, text-image retrieval, and text-image generation by
aligning image and text modalities. Despite its widespread adoption, a
significant limitation of CLIP lies in the inadequate length of text input. The
length of the text token is restricted to 77, and an empirical study shows the
actual effective length is even less than 20. This prevents CLIP from handling
detailed descriptions, limiting its applications for image retrieval and
text-to-image generation with extensive prerequisites. To this end, we propose
Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input,
retains or even surpasses its zero-shot generalizability, and aligns the CLIP
latent space, so that it can readily replace CLIP without any further adaptation in
downstream frameworks. Nevertheless, achieving this goal is far from
straightforward, as simplistic fine-tuning can result in a significant
degradation of CLIP's performance. Moreover, substituting the text encoder with
a language model supporting longer contexts necessitates pretraining with vast
amounts of data, incurring significant expenses. Accordingly, Long-CLIP
introduces an efficient fine-tuning solution on CLIP with two novel strategies
designed to maintain the original capabilities, including (1) a
knowledge-preserved stretching of positional embedding and (2) a primary
component matching of CLIP features. Leveraging just one million extra
long text-image pairs, Long-CLIP outperforms CLIP by about
20% in long-caption text-image retrieval and 6% in traditional text-image
retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers
enhanced capabilities for generating images from detailed text descriptions by
replacing CLIP in a plug-and-play manner. |
Introduces Long-CLIP, a plug-and-play alternative to CLIP that supports long-text input while retaining or surpassing CLIP's zero-shot generalizability. |
CLIP's limited text input length (77 tokens, with an effective length of fewer than 20) hinders its ability to handle detailed descriptions and capture fine-grained information, limiting its applications in image retrieval and text-to-image generation. |
Long-CLIP employs two novel strategies: 1) Knowledge-Preserved Stretching of positional embedding, preserving well-trained positions while interpolating others. 2) Primary Component Matching of CLIP features, aligning both fine-grained and coarse-grained image features with corresponding long and short captions. |
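The first strategy, keeping the well-trained positions and interpolating the rest, can be sketched in plain Python. This is an illustrative sketch only: the function name, the defaults `keep=20` and `ratio=4`, and the use of linear interpolation are assumptions made here, not details confirmed by the summary above.

```python
def stretch_positional_embedding(pos_emb, keep=20, ratio=4):
    """Stretch a positional-embedding table (hypothetical helper).

    pos_emb: list of embedding rows, one per position.
    The first `keep` rows are copied unchanged; each remaining row is
    expanded by linear interpolation so the tail becomes `ratio` times longer.
    """
    kept = [row[:] for row in pos_emb[:keep]]
    tail = pos_emb[keep:]
    stretched = []
    for j in range(len(tail) * ratio):
        src = j / ratio                      # fractional index into the tail
        lo = min(int(src), len(tail) - 1)
        hi = min(lo + 1, len(tail) - 1)      # clamp at the last row
        frac = src - lo
        stretched.append(
            [(1 - frac) * a + frac * b for a, b in zip(tail[lo], tail[hi])]
        )
    return kept + stretched
```

Under these assumed defaults, a 77-row table grows to 20 + 57 × 4 = 248 rows, and the first 20 rows stay bit-identical to the originals.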
Long-CLIP achieves up to 25% improvement in recall rate for long-text image retrieval tasks.
It shows a 6% improvement in recall rate for traditional short-text image retrieval tasks on COCO and Flickr30k.
Long-CLIP maintains zero-shot classification performance and enables plug-and-play integration for enhanced text-to-image generation with detailed prompts. |
Long-CLIP still has an upper bound on input token length, though significantly extended.
Future work includes exploring the impact of scaling up training data with long text-image pairs. |
multimodality, zero-shot image classification, text-image retrieval, text-to-image generation, clip |
2403.15377
Report |
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding |
Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang |
We introduce InternVideo2, a new video foundation model (ViFM) that achieves
the state-of-the-art performance in action recognition, video-text tasks, and
video-centric dialogue. Our approach employs a progressive training paradigm
that unifies the different self- or weakly-supervised learning frameworks of
masked video token reconstruction, cross-modal contrastive learning, and next
token prediction. Different training stages would guide our model to capture
different levels of structure and semantic information through different
pretext tasks. At the data level, we prioritize the spatiotemporal consistency
by semantically segmenting videos and generating video-audio-speech captions.
This improves the alignment between video and text. We scale both data and
model size for our InternVideo2. Through extensive experiments, we validate our
designs and demonstrate the state-of-the-art performance on over 60 video and
audio tasks. Notably, our model outperforms others on various video-related
captioning, dialogue, and long video understanding benchmarks, highlighting its
ability to reason and comprehend long temporal contexts. Code and models are
available at https://github.com/OpenGVLab/InternVideo2/. |
Introduces InternVideo2, a video foundation model (ViFM) for action recognition, video-text tasks, and video-centric dialogue, achieving state-of-the-art performance on 65 out of 74 video/audio tasks. |
Transferable spatiotemporal representations are critical for diverse applications like video searching, robotics, and self-driving. |
Employs progressive training with three stages: masked video token reconstruction, video-audio-speech-language contrastive learning, and next token prediction with a large language model (LLM). |
Achieves new state-of-the-art results on Kinetics (92.1%/91.9%/85.9% on K400/600/700), SomethingSomething V2, Moments in Time, ActivityNet, and HACS.
Outperforms previous state-of-the-art methods in zero-shot video retrieval across various benchmarks, demonstrating strong video-language alignment.
Excels in video-centric dialogue and long video understanding, showing the ability to reason and comprehend long temporal contexts. |
Limitations in input resolution, sampling rate, and compressed tokens restrict the expression of rich video information.
Scalability and computational feasibility considerations limit joint learning of all optimization objectives. |
video foundation model, multimodal learning, action recognition, video retrieval, video-centric dialogue |
2403.15360
Report |
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series |
Badri N. Patro, Vijay S. Agneeswaran |
Transformers have widely adopted attention networks for sequence mixing and
MLPs for channel mixing, playing a pivotal role in achieving breakthroughs
across domains. However, recent literature highlights issues with attention
networks, including low inductive bias and quadratic complexity concerning
input sequence length. State Space Models (SSMs) like S4 and others (Hippo,
Global Convolutions, liquid S4, LRU, Mega, and Mamba), have emerged to address
the above issues to help handle longer sequence lengths. Mamba, while being the
state-of-the-art SSM, has a stability issue when scaled to large networks for
computer vision datasets. We propose SiMBA, a new architecture that introduces
Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations
and uses the Mamba block for sequence modeling. Extensive performance studies
across image and time-series benchmarks demonstrate that SiMBA outperforms
existing SSMs, bridging the performance gap with state-of-the-art transformers.
Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet
and transfer learning benchmarks such as Stanford Car and Flower, as well as on
task learning benchmarks and seven time series benchmark datasets. The
project page is available at https://github.com/badripatro/Simba. |
This paper proposes SiMBA, a novel architecture for vision and multivariate time series modeling that leverages the strengths of Mamba (a state-of-the-art State Space Model) for sequence modeling and introduces EinFFT, a new technique for channel modeling. |
Existing State Space Models (SSMs) often struggle with information propagation in long sequences and lack efficient channel modeling techniques. SiMBA addresses these limitations, aiming to bridge the performance gap between SSMs and attention-based transformers. |
SiMBA utilizes the Mamba block for sequence modeling to handle long-range dependencies and introduces EinFFT, based on Einstein FFT and learnable layers, for efficient and stable channel modeling. |
SiMBA outperforms existing SSMs on ImageNet, achieving state-of-the-art performance for SSMs on this benchmark.
The architecture demonstrates excellent generalization capabilities, achieving superior results on seven standard time series datasets for multivariate forecasting.
SiMBA shows competitive performance in transfer learning tasks on CIFAR, Stanford Car, and Flower datasets, as well as in downstream tasks like instance segmentation on MS COCO. |
While SiMBA closes the performance gap for small and base-sized models, a gap still exists with large-sized transformers, requiring further exploration in scaling SiMBA.
The paper primarily focuses on vision and time series data, leaving potential applications in other domains like natural language processing for future investigation. |
state space models, transformers, channel modeling, sequence modeling, computer vision, time series forecasting |
2403.15330
Report |
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization |
Jimyeong Kim, Jungwon Park, Wonjong Rhee |
In text-to-image personalization, a timely and crucial challenge is the
tendency of generated images overfitting to the biases present in the reference
images. We initiate our study with a comprehensive categorization of the biases
into background, nearby-object, tied-object, substance (in style
re-contextualization), and pose biases. These biases manifest in the generated
images due to their entanglement into the subject embedding. This undesired
embedding entanglement not only results in the reflection of biases from the
reference images into the generated images but also notably diminishes the
alignment of the generated images with the given generation prompt. To address
this challenge, we propose SID (Selectively Informative Description), a text
description strategy that deviates from the prevalent approach of only
characterizing the subject's class identification. SID is generated utilizing
multimodal GPT-4 and can be seamlessly integrated into optimization-based
models. We present comprehensive experimental results along with analyses of
cross-attention maps, subject-alignment, non-subject-disentanglement, and
text-alignment. |
This paper introduces SID (Selectively Informative Description) as a novel description format for personalized text-to-image diffusion models. SID aims to alleviate the problem of undesired embedding entanglement. |
Existing personalized text-to-image models often exhibit undesired entanglement, limiting their ability to generate images that faithfully represent the intended subject in diverse contexts. SID addresses this by providing more informative descriptions that explicitly differentiate the subject from its background and associated objects. |
The authors leverage the capabilities of instruction-following Vision-Language Models (VLMs) to automatically generate SIDs from reference images. These SIDs, incorporating details about both the subject and its surroundings, are then used to train personalized text-to-image diffusion models. |
SID significantly improves the performance of personalized text-to-image generation, particularly in scenarios with highly biased reference images.
The method proves effective even with a single reference image, surpassing the performance of both encoder-based and fine-tuning-based personalization methods.
Human evaluation confirms the superiority of SID-integrated models, showcasing significant improvements in text alignment, subject preservation, and background disentanglement. |
The study highlights the occasional imperfections in VLM-generated descriptions, which can sometimes lead to undesired artifacts in the generated images.
The authors acknowledge the limitations of their evaluation measures in capturing style re-contextualization and plan to explore suitable measures for this aspect in future work. |
text-to-image generation, personalized image synthesis, diffusion models, vision-language models, embedding entanglement |
2403.15309
Report |
Controlled Training Data Generation with Diffusion Models |
Teresa Yeo, Andrei Atanov, Harold Benoit, Aleksandr Alekseev, Ruchira Ray, Pooya Esmaeil Akhoondi, Amir Zamir |
In this work, we present a method to control a text-to-image generative model
to produce training data specifically "useful" for supervised learning. Unlike
previous works that employ an open-loop approach and pre-define prompts to
generate new data using either a language model or human expertise, we develop
an automated closed-loop system which involves two feedback mechanisms. The
first mechanism uses feedback from a given supervised model and finds
adversarial prompts that result in image generations that maximize the model
loss. While these adversarial prompts result in diverse data informed by the
model, they are not informed of the target distribution, which can be
inefficient. Therefore, we introduce the second feedback mechanism that guides
the generation process towards a certain target distribution. We call the
method combining these two mechanisms Guided Adversarial Prompts. We perform
our evaluations on different tasks, datasets and architectures, with different
types of distribution shifts (spuriously correlated data, unseen domains) and
demonstrate the efficiency of the proposed feedback mechanisms compared to
open-loop approaches. |
This paper presents a novel closed-loop method for generating useful training data for supervised learning models. It employs two feedback mechanisms to control a text-to-image generative model, specifically finding prompts that are both adversarial to the model and relevant to a target distribution. |
This is important to address the limitations of static datasets and the need for adaptive, cost-efficient methods to improve model generalization under distribution shifts, especially in scenarios where real-world test conditions change over time. |
The method uses 1) Adversarial Prompt Optimization to identify prompts that maximize the loss of a given supervised model, reflecting its failure modes, and 2) Target Distribution Informed Generation, implemented with CLIP guidance, to guide the generation process towards a target distribution, leveraging textual descriptions or unlabeled image samples. |
Guided Adversarial Prompts (GAP) demonstrate higher data efficiency compared to open-loop and solely model/target-informed methods for image classification tasks, particularly on the Waterbirds and iWildCam datasets.
Model-informed adversarial prompts significantly improve performance under distribution shifts for depth estimation tasks, outperforming baselines on Common Corruptions, 3D Common Corruptions, and cross-dataset shifts.
The effectiveness of both model and target distribution feedback mechanisms is validated on different tasks (image classification, depth estimation), architectures (convolutional and transformer), and datasets exhibiting distribution shifts. |
The method is currently limited by the potential for label shift in certain scenarios, such as changes in label distribution due to domain shifts.
The computational cost of backpropagation through the diffusion model's denoising process can be demanding, presenting a limitation for scalability. |
data augmentation, data generation, diffusion models, distribution shift, adversarial training |
2403.15249
Report |
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models |
Geon Yeong Park, Hyeonho Jeong, Sang Wan Lee, Jong Chul Ye |
The evolution of diffusion models has greatly impacted video generation and
understanding. Particularly, text-to-video diffusion models (VDMs) have
significantly facilitated the customization of input video with target
appearance, motion, etc. Despite these advances, challenges persist in
accurately distilling motion information from video frames. While existing
works leverage the consecutive frame residual as the target motion vector, they
inherently lack global motion context and are vulnerable to frame-wise
distortions. To address this, we present Spectral Motion Alignment (SMA), a
novel framework that refines and aligns motion vectors using Fourier and
wavelet transforms. SMA learns motion patterns by incorporating
frequency-domain regularization, facilitating the learning of whole-frame
global motion dynamics, and mitigating spatial artifacts. Extensive experiments
demonstrate SMA's efficacy in improving motion transfer while maintaining
computational efficiency and compatibility across various video customization
frameworks. |
This paper presents Spectral Motion Alignment (SMA), a frequency-domain motion vector refinement and alignment framework for improved motion transfer in videos using diffusion models. |
Current video motion transfer methods, which rely on latent frame residuals as motion vectors, lack global motion context and are susceptible to frame-wise distortions, leading to inaccurate motion transfer. |
SMA utilizes Fourier and wavelet transforms to refine and align motion vectors. It uses a wavelet-based global motion alignment loss to capture whole-frame motion dynamics and a Fourier-based local motion refinement loss to mitigate spatial artifacts, prioritizing low-frequency components. |
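The Fourier-based part of this idea, penalizing motion-vector errors with more weight on low frequencies, can be sketched with a naive DFT. The sketch is an assumption-laden toy rather than SMA itself: frames are 1D lists, motion vectors are consecutive frame residuals, and the weighting `1 / (1 + decay * distance_from_DC)` and all function names are hypothetical. The wavelet-based global alignment term is not shown.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def frame_residuals(frames):
    """Motion vectors approximated as differences of consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]


def low_freq_weighted_loss(pred_frames, target_frames, decay=1.0):
    """Spectral error between motion vectors, weighted toward low frequencies."""
    loss = 0.0
    pairs = zip(frame_residuals(pred_frames), frame_residuals(target_frames))
    for pred, target in pairs:
        spectrum = dft([p - t for p, t in zip(pred, target)])
        n = len(spectrum)
        for f, coeff in enumerate(spectrum):
            dist = min(f, n - f)  # distance from the DC bin (spectrum is symmetric)
            loss += abs(coeff) / (1.0 + decay * dist)
    return loss
```

With this weighting, a constant (low-frequency) motion error costs more than an alternating (high-frequency) error of the same magnitude.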
SMA significantly improves motion accuracy in video motion transfer tasks, accurately distinguishing dynamic and static elements.
SMA is computationally efficient and universally compatible with various video motion transfer frameworks, including text-to-video and text-to-image diffusion-based methods.
Quantitative and qualitative evaluations demonstrate SMA's superiority over baselines like MotionDirector, VMC, Tune-A-Video, and ControlVideo across diverse motion patterns and subjects. |
The selection of wavelet levels and frequency weighting parameters in SMA currently relies on empirical observations.
Future work includes exploring the application of SMA to more complex video editing tasks beyond motion transfer, such as motion interpolation and video generation. |
diffusion models, video motion transfer, wavelet transform, fourier transform, frequency-domain analysis |
2403.15234
Report |
Shadow Generation for Composite Image Using Diffusion model |
Qingyang Liu, Junqi You, Jianting Wang, Xinhao Tao, Bo Zhang, Li Niu |
In the realm of image composition, generating realistic shadow for the
inserted foreground remains a formidable challenge. Previous works have
developed image-to-image translation models which are trained on paired
training data. However, they struggle to generate shadows with accurate
shapes and intensities, hindered by data scarcity and inherent task complexity.
In this paper, we resort to a foundation model with rich prior knowledge of
natural shadow images. Specifically, we first adapt ControlNet to our task and
then propose intensity modulation modules to improve the shadow intensity.
Moreover, we extend the small-scale DESOBA dataset to DESOBAv2 using a novel
data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2
datasets as well as real composite images demonstrate the superior capability
of our model for the shadow generation task. The dataset, code, and model are
released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2. |
This paper introduces DESOBAv2, a large-scale shadow generation dataset, and proposes SGDiffusion, a novel diffusion-based method for generating plausible shadows for composite foregrounds. |
Generating realistic shadows for inserted foregrounds in image composition is crucial for realism but challenging due to complex lighting and geometry. Existing methods struggle with data scarcity and generating accurate shadows. |
The authors first create DESOBAv2 by using object-shadow detection and inpainting to automatically generate composite images without foreground shadows and their corresponding ground-truth images with shadows. They then develop SGDiffusion, which adapts ControlNet by adding an intensity encoder to modulate shadow darkness based on background shadows. They also introduce weighted noise loss to focus on the shadow region and employ post-processing to refine the generated image. |
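A weighted noise loss of the kind mentioned above can be sketched as a weighted mean squared error over flattened pixels. This is a minimal sketch under stated assumptions: the function name, the default `shadow_weight=4.0`, and the normalization by the sum of weights are illustrative choices, not values taken from the paper.

```python
def weighted_noise_loss(pred_noise, true_noise, shadow_mask, shadow_weight=4.0):
    """Weighted MSE between predicted and true noise (hypothetical form).

    All inputs are flat lists of equal length. Pixels where shadow_mask is
    truthy get weight `shadow_weight`; all other pixels get weight 1.
    """
    total = 0.0
    weight_sum = 0.0
    for pred, true, in_shadow in zip(pred_noise, true_noise, shadow_mask):
        weight = shadow_weight if in_shadow else 1.0
        total += weight * (pred - true) ** 2
        weight_sum += weight
    if weight_sum == 0.0:
        raise ValueError("inputs must be non-empty")
    return total / weight_sum
```

The same error therefore contributes more to the loss when it falls inside the shadow region than when it falls outside it.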
SGDiffusion outperforms previous state-of-the-art methods on both DESOBAv2 and real composite images, exhibiting superior performance in generating realistic shadows with accurate shapes, locations, and intensities.
Ablation studies demonstrate the effectiveness of each component in SGDiffusion, including weighted noise loss, intensity modulation, and post-processing.
Subjective evaluation using human raters further validates the superiority of SGDiffusion in producing realistic shadow effects. |
The reliance on object-shadow detection in dataset construction might introduce bias from the detector's limitations.
Future work can explore incorporating object material and lighting information for more accurate shadow generation. |
shadow generation, image composition, diffusion models, dataset, deep learning |
2403.15059
Report |
MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration |
Zhichao Wei, Qingkun Su, Long Qin, Weizhi Wang |
Recent advances in tuning-free personalized image generation based on
diffusion models are impressive. However, to improve subject fidelity, existing
methods either retrain the diffusion model or infuse it with dense visual
embeddings, both of which suffer from poor generalization and efficiency. Also,
these methods falter in multi-subject image generation due to the unconstrained
cross-attention mechanism. In this paper, we propose MM-Diff, a unified and
tuning-free image personalization framework capable of generating high-fidelity
images of both single and multiple subjects in seconds. Specifically, to
simultaneously enhance text consistency and subject fidelity, MM-Diff employs a
vision encoder to transform the input image into CLS and patch embeddings. CLS
embeddings are used on the one hand to augment the text embeddings, and on the
other hand together with patch embeddings to derive a small number of
detail-rich subject embeddings, both of which are efficiently integrated into
the diffusion model through the well-designed multimodal cross-attention
mechanism. Additionally, MM-Diff introduces cross-attention map constraints
during the training phase, ensuring flexible multi-subject image sampling
during inference without any predefined inputs (e.g., layout). Extensive
experiments demonstrate the superior performance of MM-Diff over other leading
methods. |
MM-Diff is a tuning-free image personalization framework that enables fast generation of high-fidelity single and multi-subject images using vision-augmented text embeddings and detail-rich subject embeddings. |
Existing personalized image generation methods struggle with slow generation speed, poor generalization, and the attribute binding issue in multi-subject scenarios. |
MM-Diff leverages a vision encoder to extract subject features, employs a Subject Embedding Refiner to enhance these features, and integrates them into a diffusion model through LoRA layers. Cross-attention map constraints are introduced during training to address attribute binding in multi-subject generation. |
MM-Diff achieves superior subject fidelity compared to other state-of-the-art tuning-free methods on single-subject generation.
It achieves high face similarity scores for both single and multi-subject portrait generation.
The proposed cross-attention map constraints effectively mitigate attribute binding in multi-subject generation. |
The training dataset size is relatively limited compared to some top-tier methods.
The dataset for general subject generation only contains one subject per image, limiting multi-subject generation capabilities. Future work could focus on using larger and more diverse datasets. |
image personalization, subject fidelity, multi-subject generation, diffusion models, cross-attention |
2403.15019
Report |
BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation |
Jiahao Lu, Jiacheng Deng, Tianzhu Zhang |
3D instance segmentation (3DIS) is a crucial task, but point-level
annotations are tedious in fully supervised settings. Thus, using bounding
boxes (bboxes) as annotations has shown great potential. The current mainstream
approach is a two-step process, involving the generation of pseudo-labels from
box annotations and the training of a 3DIS network with the pseudo-labels.
However, due to the presence of intersections among bboxes, not every point has
a determined instance label, especially in overlapping areas. To generate
higher quality pseudo-labels and achieve more precise weakly supervised 3DIS
results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D
Instance Segmentation (BSNet), which devises a novel pseudo-labeler called
Simulation-assisted Transformer. The labeler consists of two main components.
The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher
for the first time in this task and constructs simulated samples to assist the
labeler in acquiring prior knowledge about overlapping areas. To better model
local-global structure, we also propose Local-Global Aware Attention as the
decoder for teacher and student labelers. Extensive experiments conducted on
the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is
available at https://github.com/peoplelu/BSNet. |
This paper proposes BSNet, a weakly supervised 3D instance segmentation method that uses bounding boxes as annotations. It features a novel pseudo-labeler called SAFormer. |
Point-level annotations for 3D instance segmentation are tedious. BSNet addresses this by using easier-to-obtain bounding box annotations while achieving high accuracy. |
BSNet generates pseudo-labels using SAFormer, which leverages a Simulation-assisted Mean Teacher (SMT) strategy and a Local-Global Aware Attention (LGA) decoder. SMT constructs simulated overlapping samples to train the labeler, while LGA effectively models local and global structures within the point cloud. |
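In the standard Mean Teacher scheme, the teacher's weights track an exponential moving average (EMA) of the student's weights. The sketch below shows only that generic update; the dictionary-of-floats parameter layout, the function name, and the default momentum are assumptions, and nothing here reproduces SAFormer's simulated samples or its decoder.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    teacher, student: dicts mapping parameter names to float values.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

Calling `ema_update` after each student optimization step keeps the teacher a smoothed copy of the student, which is what makes its pseudo-labels more stable than the student's own predictions.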
BSNet outperforms previous box-supervised methods on ScanNetV2 and S3DIS benchmarks.
The simulated samples and Mean Teacher strategy in SAFormer lead to higher-quality pseudo-labels and faster training.
The LGA decoder effectively captures both local and global information, improving pseudo-label accuracy. |
The simulated overlapping samples may not perfectly represent all real-world scenarios.
Future work could explore extending BSNet to other weakly supervised 3D vision tasks. |
3d instance segmentation, weakly supervised learning, bounding box supervision, mean teacher, transformer |
2403.15009
Report |
TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization |
Jinbo Wu, Xing Liu, Chenming Wu, Xiaobo Gao, Jialun Liu, Xinqi Liu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang |
This paper presents TexRO, a novel method for generating delicate textures of
a known 3D mesh by optimizing its UV texture. The key contributions are
two-fold. We propose an optimal viewpoint selection strategy that finds the
smallest set of viewpoints covering all the faces of a mesh. Our
viewpoint selection strategy guarantees the completeness of a generated result.
We propose a recursive optimization pipeline that optimizes a UV texture at
increasing resolutions, with an adaptive denoising method that re-uses existing
textures for new texture generation. Through extensive experimentation, we
demonstrate the superior performance of TexRO in terms of texture quality,
detail preservation, visual consistency, and, notably, runtime speed,
outperforming other current methods. The broad applicability of TexRO is
further confirmed through its successful use on diverse 3D models. |
TexRO is a novel method for generating delicate textures of a known 3D mesh by optimizing its UV texture using recursive optimization at increasing resolutions with an adaptive denoising strategy and optimal viewpoint selection. |
Controllable creation of detailed and delicate textures for 3D models remains challenging while existing methods suffer from limitations such as blurry results, lengthy optimization times, and the inability to maintain multi-view consistency. |
TexRO uses an optimal viewpoint selection strategy based on a heuristic greedy strategy to find the smallest set of views covering all faces of a mesh. Then, it recursively optimizes the UV texture at increasing resolutions in RGB space with an adaptive denoising strategy that re-uses existing textures to generate new textures by adaptively injecting noise. |
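The greedy viewpoint selection can be sketched as a classic greedy set cover. This is a sketch under assumptions, not TexRO's code: the function name and the `visible_faces` mapping (viewpoint name to the set of face indices visible from it) are hypothetical, and a real pipeline would compute visibility by rendering the mesh.

```python
def select_viewpoints(visible_faces, all_faces):
    """Greedily pick viewpoints until every face is covered.

    visible_faces: dict mapping a viewpoint id to the set of faces it sees.
    all_faces: iterable of all face ids in the mesh.
    Returns (chosen viewpoints in order, faces no candidate can see).
    """
    uncovered = set(all_faces)
    chosen = []
    if not visible_faces:
        return chosen, uncovered
    while uncovered:
        # Pick the viewpoint that covers the most still-uncovered faces.
        best = max(visible_faces, key=lambda v: len(visible_faces[v] & uncovered))
        gain = visible_faces[best] & uncovered
        if not gain:
            break  # remaining faces are invisible from every candidate
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered
```

Greedy set cover is a heuristic: it yields a small covering set but is not guaranteed to find the minimum one.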
TexRO outperforms state-of-the-art methods in terms of texture quality, detail preservation, and visual consistency.
TexRO achieves significantly faster texture generation, completing it in approximately 1 minute.
Experiments on widely-used 3D datasets and user studies validate the effectiveness and efficiency of TexRO. |
TexRO requires water-tight input meshes, limiting its application to non-water-tight meshes.
Texturing areas within meshes with complex topologies can be challenging for TexRO. |
texture generation, multi-view diffusion, recursive optimization, adaptive denoising, 3d model texturing |
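The greedy viewpoint selection described for TexRO can be sketched as a set-cover heuristic. This is a minimal illustration, not the authors' implementation: the function name `greedy_view_cover`, the `visible_faces` mapping, and the toy data are all hypothetical, and per-view face visibility is assumed to be precomputed. A greedy cover is a heuristic, so it is not guaranteed to return the true minimum set of views.

```python
def greedy_view_cover(visible_faces, num_faces):
    """Pick views greedily until every face is covered (or no view helps).

    visible_faces: dict mapping a view id to the set of face indices
                   visible from that view (assumed precomputed).
    num_faces:     total number of mesh faces, indexed 0..num_faces-1.
    Returns (chosen view ids in selection order, faces left uncovered).
    """
    uncovered = set(range(num_faces))
    chosen = []
    while uncovered:
        # Choose the view that covers the most still-uncovered faces.
        best = max(visible_faces, key=lambda v: len(visible_faces[v] & uncovered))
        gain = visible_faces[best] & uncovered
        if not gain:
            break  # remaining faces are not visible from any candidate view
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical toy mesh with 8 faces and 4 candidate views.
views = {
    "front": {0, 1, 2, 3},
    "back": {4, 5, 6},
    "top": {2, 3, 4},
    "side": {6, 7},
}
chosen, missed = greedy_view_cover(views, num_faces=8)
print(chosen, missed)
```

Returning the uncovered faces alongside the chosen views makes it easy to check completeness, which the abstract highlights as a property of the selection strategy.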
2403.14966
Report |
DreamFlow: High-Quality Text-to-3D Generation by Approximating Probability Flow |
Kyungmin Lee, Kihyuk Sohn, Jinwoo Shin |
Recent progress in text-to-3D generation has been achieved through the
utilization of score distillation methods: they make use of the pre-trained
text-to-image (T2I) diffusion models by distilling via the diffusion model
training objective. However, such an approach inevitably results in the use of
random timesteps at each update, which increases the variance of the gradient
and ultimately prolongs the optimization process. In this paper, we propose to
enhance the text-to-3D optimization by leveraging the T2I diffusion prior in
the generative sampling process with a predetermined timestep schedule. To this
end, we interpret text-to-3D optimization as a multi-view image-to-image
translation problem, and propose a solution by approximating the probability
flow. By leveraging the proposed novel optimization algorithm, we design
DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization
framework that enables fast generation of high-quality and high-resolution
(i.e., 1024x1024) 3D contents. For example, we demonstrate that DreamFlow is 5
times faster than the existing state-of-the-art text-to-3D method, while
producing more photorealistic 3D contents. Visit our project page
(https://kyungmnlee.github.io/dreamflow.github.io/) for visualizations. |
This paper proposes DreamFlow, a text-to-3D generation method that leverages the generative process of text-to-image diffusion models by approximating the reverse generative probability flow, leading to faster optimization and high-quality results. |
Existing score distillation methods for text-to-3D generation suffer from high-variance gradients, requiring lengthy optimization and limiting scalability to high-resolution 3D content. |
The method interprets text-to-3D optimization as a multi-view image-to-image translation problem and uses a novel optimization algorithm based on approximate probability flow ODE (APFO) with a predetermined timestep schedule to transport multi-view images to the data distribution learned by a pre-trained diffusion model. A three-stage coarse-to-fine optimization framework generates NeRF, extracts and fine-tunes a 3D mesh, and refines the mesh with a high-resolution diffusion prior. |
DreamFlow produces more photorealistic 3D content compared to existing methods like DreamFusion, Magic3D, and ProlificDreamer, as demonstrated by human preference studies.
DreamFlow achieves better CLIP R-precision scores than prior methods in both NeRF generation and 3D mesh fine-tuning.
DreamFlow is significantly faster (5x) than ProlificDreamer in generating 3D content. |
The reliance on pre-trained diffusion priors without 3D understanding may limit results in some cases.
Unwanted biases from the pre-trained diffusion model might be inherited. |
text-to-3d generation, diffusion models, probability flow ode, neural radiance fields, 3d mesh |
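The difference between random timesteps (as in score distillation) and a predetermined timestep schedule (as in DreamFlow) can be illustrated with a small sketch. It only contrasts the two sampling patterns and does not implement APFO. The function names, the bounds 0.02 and 0.98, and the linear spacing are assumptions made for illustration, not values taken from the paper.

```python
import random


def random_timesteps(num_steps, t_min=0.02, t_max=0.98, seed=0):
    """Draw an independent uniform timestep for every update."""
    rng = random.Random(seed)
    return [rng.uniform(t_min, t_max) for _ in range(num_steps)]


def decreasing_timesteps(num_steps, t_min=0.02, t_max=0.98):
    """Fixed schedule that moves linearly from t_max down to t_min."""
    if num_steps == 1:
        return [t_max]
    step = (t_max - t_min) / (num_steps - 1)
    return [t_max - i * step for i in range(num_steps)]


print(random_timesteps(5))      # noise level jumps around between updates
print(decreasing_timesteps(5))  # noise level decreases monotonically
```

With the fixed schedule, consecutive updates see similar noise levels, which is one intuition for the lower gradient variance the summary attributes to the method.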
2403.14944
Report |
CLIP-VQDiffusion : Langauge Free Training of Text To Image generation using CLIP and vector quantized diffusion model |
Seungdae Han, Joohee Kim |
There has been significant progress in text conditional image generation
models. Recent advancements in this field depend not only on improvements in
model structures, but also vast quantities of text-image paired datasets.
However, creating these kinds of datasets is very costly and requires a
substantial amount of labor. Famous face datasets don't have corresponding text
captions, making it difficult to develop text conditional image generation
models on these datasets. Some research has focused on developing text to image
generation models using only images without text captions. Here, we propose
CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide
multimodal text-image representations and strong image generation capabilities.
On the FFHQ dataset, our model outperformed previous state-of-the-art methods
by 4.4% in clipscore and generated very realistic images even when the text was
both in and out of distribution. The pretrained models and codes will soon be
available at https://github.com/INFINIQ-AI1/CLIPVQDiffusion |
This paper proposes CLIP-VQDiffusion, a novel text-conditional image generation model that leverages CLIP for multimodal representations and a vector quantized diffusion model for image generation, enabling language-free training using only image datasets. |
Creating large text-image paired datasets for training text-conditional image generation models is expensive and laborious. This approach addresses the challenge by utilizing CLIP's ability to connect visual and textual information without requiring paired data. |
The method involves pretraining a VQ-GAN to learn a codebook for image quantization. A diffusion model is then trained to denoise noisy latent codes conditioned on CLIP image embeddings. During inference, text prompts are transformed into CLIP text embeddings to guide image generation. |
CLIP-VQDiffusion outperforms previous state-of-the-art language-free methods on FFHQ by 4.4% in CLIP score.
The model generates highly realistic and text-aligned images on both FFHQ and COCO datasets.
Gaussian noise injection to CLIP image embeddings during training proves crucial for bridging the gap between image and text embeddings. |
The model exhibits a trade-off between image fidelity (FID) and diversity (IS) when varying guidance scale and truncation ratio.
Further investigation into mitigating this trade-off and improving performance on datasets like COCO is needed. |
text-to-image generation, language-free training, clip, vq-diffusion, multimodal learning |
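The Gaussian noise injection into CLIP image embeddings mentioned for CLIP-VQDiffusion could look roughly like the sketch below. The exact perturbation used in the paper is not given in this summary, so the form here (unit-normalized Gaussian noise added to a unit-normalized embedding, then renormalized) and the names `perturb_embedding` and `noise_scale` are assumptions chosen for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def perturb_embedding(image_emb, noise_scale=0.25, rng=None):
    """Add scaled Gaussian noise to an embedding and renormalize.

    Hypothetical form: the noise direction is drawn from a standard
    Gaussian, normalized, and mixed in with weight `noise_scale`.
    """
    rng = rng or random.Random(0)
    unit = l2_normalize(image_emb)
    noise = l2_normalize([rng.gauss(0.0, 1.0) for _ in unit])
    return l2_normalize([u + noise_scale * n for u, n in zip(unit, noise)])


# Toy 3-dimensional "embedding"; real CLIP embeddings are much larger.
noisy = perturb_embedding([3.0, 4.0, 0.0])
print(noisy)
```

Training on perturbed image embeddings exposes the generator to conditioning vectors that lie near, but not exactly on, the image embedding, which is the intuition behind bridging the gap to text embeddings at inference.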
2403.14939
Report |
STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians |
Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, Yao Yao |
Recent progress in pre-trained diffusion models and 3D generation has
spurred interest in 4D content creation. However, achieving high-fidelity 4D
generation with spatial-temporal consistency remains a challenge. In this work,
we propose STAG4D, a novel framework that combines pre-trained diffusion models
with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing
inspiration from 3D generation techniques, we utilize a multi-view diffusion
model to initialize multi-view images anchoring on the input video frames,
where the video can be either real-world captured or generated by a video
diffusion model. To ensure the temporal consistency of the multi-view sequence
initialization, we introduce a simple yet effective fusion strategy to leverage
the first frame as a temporal anchor in the self-attention computation. With
the almost consistent multi-view sequences, we then apply the score
distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian
splatting is specially crafted for the generation task, where an adaptive
densification strategy is proposed to mitigate the unstable Gaussian gradient
for robust optimization. Notably, the proposed pipeline does not require any
pre-training or fine-tuning of diffusion networks, offering a more accessible
and practical solution for the 4D generation task. Extensive experiments
demonstrate that our method outperforms prior 4D generation works in rendering
quality, spatial-temporal consistency, and generation robustness, setting a new
state-of-the-art for 4D generation from diverse inputs, including text, image,
and video. |
This paper introduces STAG4D, a novel framework leveraging pre-trained diffusion models and dynamic 3D Gaussian splatting for high-fidelity 4D generation. |
High-quality 4D content generation is crucial for various applications, but existing methods face challenges in rendering quality, spatial-temporal consistency, and generation speed. |
The method uses a multi-view diffusion model to generate consistent multi-view image sequences. It then employs score distillation sampling to optimize a 4D Gaussian point cloud representation of the dynamic scene, aided by an adaptive densification strategy. |
STAG4D achieves state-of-the-art results in 4D generation from text, image, and video inputs, demonstrating superior rendering quality and spatial-temporal consistency.
The method exhibits robustness and generalizability across diverse dynamic scenes.
Adaptive densification based on Gaussian gradient distribution proves effective for robust 4D Gaussian optimization. |
The approach may be limited in handling complex, fast motions due to constraints of 4D Gaussian representation.
Inherent video limitations, such as blurriness, can impact diffusion effectiveness and subsequent 4D optimization. |
4d generation, 3d gaussian splatting, diffusion models, spatial-temporal consistency, adaptive densification |
2403.14870
Report |
VidLA: Video-Language Alignment at Scale |
Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi |
In this paper, we propose VidLA, an approach for video-language alignment at
scale. There are two major limitations of previous video-language alignment
approaches. First, they do not capture both short-range and long-range temporal
dependencies and typically employ complex hierarchical deep network
architectures that are hard to integrate with existing pretrained image-text
foundation models. To effectively address this limitation, we instead keep the
network architecture simple and use a set of data tokens that operate at
different temporal resolutions in a hierarchical manner, accounting for the
temporally hierarchical nature of videos. By employing a simple two-tower
architecture, we are able to initialize our video-language model with
pretrained image-text foundation models, thereby boosting the final
performance. Second, existing video-language alignment works struggle due to
the lack of semantically aligned large-scale training data. To overcome it, we
leverage recent LLMs to curate the largest video-language dataset to date with
better visual grounding. Furthermore, unlike existing video-text datasets which
only contain short clips, our dataset is enriched with video clips of varying
durations to aid our temporally hierarchical data tokens in extracting better
representations at varying temporal scales. Overall, empirical results show
that our proposed approach surpasses state-of-the-art methods on multiple
retrieval benchmarks, especially on longer videos, and performs competitively
on classification benchmarks. |
This paper introduces VidLA, a new approach for video-language alignment that scales effectively by utilizing large language models (LLMs) for data curation and a novel hierarchical temporal attention mechanism. |
Video-language alignment is challenging due to the difficulty in gathering large, semantically aligned datasets and the complexity of capturing temporal dependencies in videos. Existing methods often struggle with these limitations, hindering performance. |
The authors curate YT-VidLA-800M, a massive video-text dataset, by using LLMs to generate captions and summarize texts for better visual grounding. They design a hierarchical temporal attention mechanism that models both local and global temporal relationships in videos, enabling the use of pretrained image-text encoders. |
VidLA surpasses state-of-the-art methods on multiple video-text retrieval benchmarks, particularly with longer videos.
The hierarchical temporal attention mechanism effectively captures temporal dependencies at different scales, significantly improving performance.
The data curation techniques, including caption generation and text summarization using LLMs, prove crucial for enhancing video-language alignment. |
The model's performance could be further enhanced by exploring more advanced LLM architectures and data curation strategies.
Investigating the effectiveness of VidLA on a broader range of downstream tasks, beyond retrieval and classification, would provide a more comprehensive evaluation. |
video-language alignment, large language models, hierarchical temporal attention, video-text retrieval, data curation |
2403.14828
Report |
Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing |
Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara |
Fashion illustration is a crucial medium for designers to convey their
creative vision and transform design concepts into tangible representations
that showcase the interplay between clothing and the human body. In the context
of fashion design, computer vision techniques have the potential to enhance and
streamline the design process. Departing from prior research primarily focused
on virtual try-on, this paper tackles the task of multimodal-conditioned
fashion image editing. Our approach aims to generate human-centric fashion
images guided by multimodal prompts, including text, human body poses, garment
sketches, and fabric textures. To address this problem, we propose extending
latent diffusion models to incorporate these multiple modalities and modifying
the structure of the denoising network, taking multimodal prompts as input. To
condition the proposed architecture on fabric textures, we employ textual
inversion techniques and let diverse cross-attention layers of the denoising
network attend to textual and texture information, thus incorporating different
granularity conditioning details. Given the lack of datasets for the task, we
extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal
annotations. Experimental evaluations demonstrate the effectiveness of our
proposed approach in terms of realism and coherence concerning the provided
multimodal inputs. |
This paper introduces Textual-inverted Multimodal Garment Designer (Ti-MGD), a novel approach for multimodal-conditioned fashion image editing that leverages latent diffusion models to generate human-centric fashion images guided by text, human body poses, garment sketches, and fabric textures. |
This task is important because it supports fashion designers in creating new fashion items, facilitating the exploration of the interplay between their sketches, available fabric textures, and diverse human body shapes. |
The authors propose a denoising network that takes multimodal prompts as input. They incorporate fabric textures by employing textual inversion techniques and designing a novel component to project texture images into the textual space of the diffusion model. Different cross-attention layers of the denoising network then attend to textual and texture information to incorporate different granularity conditioning details. The authors also define a semi-automatic framework for extending existing fashion datasets with multimodal data and introduce three novel evaluation metrics. |
Ti-MGD outperforms state-of-the-art competitors in terms of realism and coherence concerning the provided multimodal inputs.
The authors demonstrate the ability of the model to handle multiple conditions in a distinct manner efficiently.
The proposed approach enables fine-grained control over the generated images without adding denoising network parameters. |
The model may not fully capture body shape nuances solely from keypoints, necessitating exploration of dense or 3D pose representations.
Texture conditioning might be limited when a sketch comprises distinct areas, prompting future research on spatial control for texture generation. |
fashion product design, latent diffusion models, textual inversion, generative ai, multimodal learning |
2403.14781
Report |
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance |
Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, Siyu Zhu |
In this study, we introduce a methodology for human image animation by
leveraging a 3D human parametric model within a latent diffusion framework to
enhance shape alignment and motion guidance in current human generative
techniques. The methodology utilizes the SMPL (Skinned Multi-Person Linear)
model as the 3D human parametric model to establish a unified representation of
body shape and pose. This facilitates the accurate capture of intricate human
geometry and motion characteristics from source videos. Specifically, we
incorporate rendered depth images, normal maps, and semantic maps obtained from
SMPL sequences, alongside skeleton-based motion guidance, to enrich the
conditions to the latent diffusion model with comprehensive 3D shape and
detailed pose attributes. A multi-layer motion fusion module, integrating
self-attention mechanisms, is employed to fuse the shape and motion latent
representations in the spatial domain. By representing the 3D human parametric
model as the motion guidance, we can perform parametric shape alignment of the
human body between the reference image and the source video motion.
Experimental evaluations conducted on benchmark datasets demonstrate the
methodology's superior ability to generate high-quality human animations that
accurately capture both pose and shape variations. Furthermore, our approach
also exhibits superior generalization capabilities on the proposed wild
dataset. Project page: https://fudan-generative-vision.github.io/champ. |
This paper proposes Champ, a novel approach for human image animation that leverages the SMPL 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion guidance. |
Current human image animation techniques using reference images and pose guidance often struggle with accurate pose alignment and motion guidance, especially when there are significant variations in body shape and intricate movements. |
Champ utilizes the SMPL model to establish a unified representation of body shape and pose, enabling parametric shape alignment. It renders depth, normal, and semantic maps from SMPL sequences, along with skeleton-based motion guidance, to enrich the conditions for the latent diffusion model. A multi-layer motion fusion module with self-attention mechanisms fuses shape and motion latent representations, guiding the generation of high-quality human animation videos. |
Champ outperforms state-of-the-art methods on benchmark datasets like TikTok, demonstrating superior performance in quantitative metrics and qualitative results.
The method exhibits strong generalization capabilities, effectively animating images from unseen domains with variations in shape, pose, and appearance.
Ablation studies confirm the contribution of each component, highlighting the importance of the SMPL model, multi-layer guidance, and self-attention mechanisms. |
The modeling capacity of the SMPL model for faces and hands is limited, requiring additional constraints for optimal animation in those areas.
Solving SMPL and DWpose independently introduces a potential discrepancy in consistency, which could be addressed in future work. |
human image animation, latent diffusion model, 3d human parametric model, smpl, motion guidance |
2403.14773
Report |
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text |
Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi |
Text-to-video diffusion models enable the generation of high-quality videos
that follow text instructions, making it easy to create diverse and individual
content. However, existing approaches mostly focus on high-quality short video
generation (typically 16 or 24 frames), ending up with hard-cuts when naively
extended to the case of long video synthesis. To overcome these limitations, we
introduce StreamingT2V, an autoregressive approach for long video generation of
80, 240, 600, 1200 or more frames with smooth transitions. The key components
are: (i) a short-term memory block called conditional attention module (CAM),
which conditions the current generation on the features extracted from the
previous chunk via an attentional mechanism, leading to consistent chunk
transitions, (ii) a long-term memory block called appearance preservation
module, which extracts high-level scene and object features from the first
video chunk to prevent the model from forgetting the initial scene, and (iii) a
randomized blending approach that allows a video enhancer to be applied
autoregressively to infinitely long videos without inconsistencies between
chunks. Experiments show that StreamingT2V generates videos with a high amount
of motion. In
contrast, all competing image-to-video methods are prone to video stagnation
when applied naively in an autoregressive manner. Thus, StreamingT2V is a
high-quality, seamless text-to-long-video generator that
outperforms competitors in consistency and motion. Our code will be available
at: https://github.com/Picsart-AI-Research/StreamingT2V |
Introduces StreamingT2V, an autoregressive text-to-video diffusion model for generating long, consistent videos with rich motion dynamics. |
Existing text-to-video models struggle to create long, seamless videos and often suffer from stagnation or inconsistencies when extended temporally. |
Combines a short-term memory module (CAM) for smooth chunk transitions and a long-term memory module (APM) for preserving object/scene features across generations. It also employs a randomized blending approach for enhancing long videos without chunk inconsistencies. |
Generates consistent long videos with significantly higher motion amount compared to baselines.
Successfully preserves object and scene details across long video generations, unlike many existing methods.
Demonstrates superior performance in user studies regarding motion quality, text alignment, and overall video quality. |
The model relies on a pre-trained text-to-video model and its performance is limited by the base model's capabilities.
Training data for long, high-quality videos is limited, potentially impacting the model's ability to generate diverse and complex scenes over extended periods. Future work will explore training with large-scale datasets. |
text-to-video generation, diffusion models, long video synthesis, temporal consistency, appearance preservation |
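The idea of randomized blending between overlapping chunks, as described for StreamingT2V, can be sketched on plain frame lists. This is a simplification under stated assumptions: the split point is drawn uniformly inside the overlap, frames before it come from the earlier chunk and frames after it from the later chunk. The function name `blend_chunks` and the frame labels are hypothetical, and the sketch ignores that the actual method operates inside a video enhancer rather than on finished frames.

```python
import random


def blend_chunks(prev_chunk, next_chunk, overlap, rng=None):
    """Join two chunks whose last/first `overlap` frames cover the same time span.

    A random split index inside the overlap decides where the output
    switches from the previous chunk to the next one.
    """
    rng = rng or random.Random(0)
    if not 0 < overlap <= min(len(prev_chunk), len(next_chunk)):
        raise ValueError("overlap must be positive and fit inside both chunks")
    split = rng.randint(0, overlap)  # inclusive on both ends
    head = prev_chunk[: len(prev_chunk) - overlap + split]
    tail = next_chunk[split:]
    return head + tail


# Frames a2/a3 and b2/b3 are two versions of the same two time steps.
prev_chunk = ["a0", "a1", "a2", "a3"]
next_chunk = ["b2", "b3", "b4", "b5"]
video = blend_chunks(prev_chunk, next_chunk, overlap=2)
print(video)
```

Because the split point changes from call to call, the transition does not always fall on the same frame, which is one way to avoid a visible seam at fixed chunk boundaries.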
2403.14760
Report |
Can 3D Vision-Language Models Truly Understand Natural Language? |
Weipeng Deng, Runyu Ding, Jihan Yang, Jiahui Liu, Yijiang Li, Xiaojuan Qi, Edith Ngai |
Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new
avenues for human interaction with embodied agents or robots using natural
language. Despite this progress, we find a notable limitation: existing 3D-VL
models exhibit sensitivity to the styles of language input, struggling to
understand sentences with the same semantic meaning but written in different
variants. This observation raises a critical question: Can 3D vision-language
models truly understand natural language? To test the language
understandability of 3D-VL models, we first propose a language robustness task
for systematically assessing 3D-VL models across various tasks, benchmarking
their performance when presented with different language style variants.
Importantly, these variants are commonly encountered in applications requiring
direct interaction with humans, such as embodied robotics, given the diversity
and unpredictability of human language. We propose a 3D Language Robustness
Dataset, designed based on the characteristics of human language, to facilitate
the systematic study of robustness. Our comprehensive evaluation uncovers a
significant drop in the performance of all existing models across various 3D-VL
tasks. Even the state-of-the-art 3D-LLM fails to understand some variants of
the same sentences. Further in-depth analysis suggests that the existing models
have a fragile and biased fusion module, which stems from the low diversity of
the existing dataset. Finally, we propose a training-free module driven by LLM,
which improves language robustness. Datasets and code will be available on
GitHub. |
This paper introduces a new benchmark and dataset, called 3D Language Robustness (3D-LR), to evaluate how well 3D vision-language models understand natural language variations commonly used in human communication. |
Existing 3D-VL models often struggle to understand sentences with the same meaning but expressed in different styles, hindering their use in real-world applications like robotics. |
The authors define five key language characteristics (syntax, voice, modifier, accent, and tone) and use a large language model (LLM) to create paraphrased versions of sentences from existing 3D-VL datasets. They then evaluate various 3D-VL models on these paraphrased datasets. |
Existing 3D-VL models, even those using LLMs, show significant performance drops (up to 32%) when presented with sentences rephrased using common language variations.
The fusion module, responsible for combining visual and language features, is identified as a major point of failure due to its bias towards training data style.
A simple LLM-based pre-alignment module is proposed, which improves robustness without retraining and achieves performance comparable to models trained on double the data size. |
The 3D-LR dataset may not fully capture the entire spectrum of natural language variations.
Future work should focus on more efficient data augmentation techniques and model architectures that generalize better to unseen language styles. |
3d vision language, language robustness, open world understanding, natural language processing, embodied ai |
2403.14623
Report |
Simplified Diffusion Schrödinger Bridge |
Zhicong Tang, Tiankai Hang, Shuyang Gu, Dong Chen, Baining Guo |
This paper introduces a novel theoretical simplification of the Diffusion
Schrödinger Bridge (DSB) that facilitates its unification with Score-based
Generative Models (SGMs), addressing the limitations of DSB in complex data
generation and enabling faster convergence and enhanced performance. By
employing SGMs as an initial solution for DSB, our approach capitalizes on the
strengths of both frameworks, ensuring a more efficient training process and
improving the performance of SGM. We also propose a reparameterization
technique that, despite theoretical approximations, practically improves the
network's fitting capabilities. Our extensive experimental evaluations confirm
the effectiveness of the simplified DSB, demonstrating its significant
improvements. We believe the contributions of this work pave the way for
advanced generative modeling. The code is available at
https://github.com/checkcrab/SDSB. |
This paper presents Simplified Diffusion Schrödinger Bridge (S-DSB), a novel theoretical simplification of Diffusion Schrödinger Bridge (DSB) that enables its unification with Score-based Generative Models (SGMs) and improves its training efficiency and performance. |
DSB holds theoretical advantages over SGMs for handling complex data and arbitrary distributions, but its slow convergence and training difficulties hinder practical application. This work bridges this gap and unlocks the potential of DSB in advanced generative modeling. |
The authors propose a simplified optimization objective for DSB, demonstrating its equivalence to the original formulation while requiring fewer computations. This allows using pretrained SGMs as initialization for DSB, leading to faster convergence. Further, a reparameterization technique inspired by SGMs significantly enhances the network's fitting capabilities. |
S-DSB, even with random initialization, matches vanilla DSB in performance but with faster training.
Using pretrained SGMs as initialization significantly accelerates S-DSB convergence and improves generation quality.
The proposed reparameterization method further boosts DSB's performance, surpassing vanilla DSB even with random initialization. |
The convergence of DSB, even with the proposed improvements, remains computationally intensive, limiting scalability to larger datasets.
The reparameterization technique (R-DSB, the reparameterized variant) relies on specific assumptions, which might introduce errors in practical scenarios, necessitating further research on improved approximations and error analysis. |
diffusion schrödinger bridge, score-based generative models, generative modeling, reparameterization, optimal transport |
2403.14621
Report |
GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation |
Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, Gordon Wetzstein |
We introduce GRM, a large-scale reconstructor capable of recovering a 3D
asset from sparse-view images in around 0.1s. GRM is a feed-forward
transformer-based model that efficiently incorporates multi-view information to
translate the input pixels into pixel-aligned Gaussians, which are unprojected
to create a set of densely distributed 3D Gaussians representing a scene.
Together, our transformer architecture and the use of 3D Gaussians unlock a
scalable and efficient reconstruction framework. Extensive experimental results
demonstrate the superiority of our method over alternatives regarding both
reconstruction quality and efficiency. We also showcase the potential of GRM in
generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with
existing multi-view diffusion models. Our project website is at:
https://justimyhxu.github.io/projects/grm/. |
This paper introduces GRM, a novel large-scale, feed-forward 3D reconstruction model based on transformers and 3D Gaussian splatting for efficient and high-fidelity 3D object generation. |
Existing 3D generative models either suffer from slow optimization processes or inefficient triplane representations. This new method leverages the efficiency of 3D Gaussians and a novel transformer architecture to improve both speed and quality. |
GRM uses a transformer-based encoder-decoder network to predict pixel-aligned Gaussian attributes from multi-view images. It employs a novel transformer-based upsampler with windowed attention for efficient non-local information aggregation and detail reconstruction. These Gaussians are then splatted to render novel views. |
GRM significantly outperforms previous state-of-the-art methods in sparse-view 3D reconstruction tasks, achieving higher fidelity with fewer input views.
Combined with multi-view diffusion models, GRM achieves state-of-the-art quality and speed for text-to-3D and single image-to-3D object generation.
Ablation studies demonstrate the effectiveness of each component, including the transformer-based upsampler, pixel-aligned Gaussian representation, and scale activation function. |
The current model relies heavily on input view consistency and struggles with hallucination in unseen regions.
The framework is limited to object-centric scenes due to the lack of large-scale 3D scene datasets. |
3d reconstruction, 3d generation, gaussian splatting, transformers, sparse-view reconstruction |
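The unprojection of pixel-aligned Gaussians into 3D, as described for GRM, can be illustrated with a standard pinhole camera model. This sketch only computes Gaussian centers in camera space. It assumes a per-pixel depth is available and that the intrinsics `fx`, `fy`, `cx`, `cy` are known; the function names are hypothetical, and the other Gaussian attributes (scale, rotation, opacity, color) and the camera-to-world transform are left out.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with a given depth to a camera-space 3D point."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def unproject_depth_map(depth_map, fx, fy, cx, cy):
    """Return one 3D center per pixel of a row-major depth map."""
    centers = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            centers.append(unproject_pixel(u, v, depth, fx, fy, cx, cy))
    return centers


# Toy 2x2 depth map; every pixel yields exactly one Gaussian center.
centers = unproject_depth_map([[1.0, 1.0], [1.0, 1.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(len(centers), centers[0])
```

Since every input pixel contributes one Gaussian, the number of Gaussians grows with the number and resolution of the input views, which matches the "densely distributed" description in the abstract.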
2403.14617
Report |
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion |
Xiang Fan, Anand Bhattad, Ranjay Krishna |
We introduce Videoshop, a training-free video editing algorithm for localized
semantic edits. Videoshop allows users to use any editing software, including
Photoshop and generative inpainting, to modify the first frame; it
automatically propagates those changes, with semantic, spatial, and temporally
consistent motion, to the remaining frames. Unlike existing methods that enable
edits only through imprecise textual instructions, Videoshop allows users to
add or remove objects, semantically change objects, insert stock photos into
videos, etc. with fine-grained control over locations and appearance. We
achieve this through image-based video editing by inverting latents with noise
extrapolation, from which we generate videos conditioned on the edited image.
Videoshop produces higher quality edits against 6 baselines on 2 editing
benchmarks using 10 evaluation metrics. |
Videoshop is a training-free video editing method that allows users to make localized semantic edits by modifying the first frame using any image editing tool and propagating these changes to the rest of the video. |
Existing video editing methods lack the precision for localized edits, often relying on coarse textual instructions or requiring extensive fine-tuning. |
The method leverages the near-linear trajectory of video latents during denoising diffusion and introduces: (1) inversion with noise extrapolation for accurate latent reconstruction, and (2) latent normalization and rescaling for consistency and quality. |
Videoshop enables diverse localized edits such as object addition/removal, color changes, semantic edits, and appearance adjustments, while preserving source video fidelity.
Outperforms 6 baselines on 2 editing benchmarks, using 10 evaluation metrics, demonstrating superior edit fidelity, source faithfulness, and temporal consistency.
A user study confirms Videoshop's advantage in editing and video generation quality over text-based methods, with a 2.23x speedup compared to the average baseline. |
Limitations: Information loss during VAE encoding and potential temporal inconsistency in videos with large movements.
Future work: Combining image editing with motion controls for more seamless results and extending the method for 3D mesh editing. |
video editing, diffusion models, semantic editing, noise extrapolation, latent space manipulation |
2403.14614
Report |
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation |
Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan |
In the image acquisition process, various forms of degradation, including
noise, haze, and rain, are frequently introduced. These degradations typically
arise from the inherent limitations of cameras or unfavorable ambient
conditions. To recover clean images from degraded versions, numerous
specialized restoration methods have been developed, each targeting a specific
type of degradation. Recently, all-in-one algorithms have garnered significant
attention by addressing different types of degradations within a single model
without requiring prior information of the input degradation type. However,
these methods purely operate in the spatial domain and do not delve into the
distinct frequency variations inherent to different degradation types. To
address this gap, we propose an adaptive all-in-one image restoration network
based on frequency mining and modulation. Our approach is motivated by the
observation that different degradation types impact the image content on
different frequency subbands, thereby requiring different treatments for each
restoration task. Specifically, we first mine low- and high-frequency
information from the input features, guided by the adaptively decoupled spectra
of the degraded image. The extracted features are then modulated by a
bidirectional operator to facilitate interactions between different frequency
components. Finally, the modulated features are merged into the original input
for a progressively guided restoration. With this approach, the model achieves
adaptive reconstruction by accentuating the informative frequency subbands
according to different input degradations. Extensive experiments demonstrate
that the proposed method achieves state-of-the-art performance on different
image restoration tasks, including denoising, dehazing, deraining, motion
deblurring, and low-light image enhancement. Our code is available at
https://github.com/c-yn/AdaIR. |
An adaptive all-in-one image restoration framework, called AdaIR, is proposed, which leverages both spatial and frequency domain information to effectively decouple degradations from the desired clean image content. |
Existing deep learning-based image restoration methods lack generalizability beyond specific degradation types or require training separate models for each task, which is computationally expensive and impractical for deployment on resource-constrained devices. |
AdaIR is built on a Transformer-based encoder-decoder architecture with Adaptive Frequency Learning Blocks (AFLBs). Each AFLB uses a Frequency Mining Module (FMiM) to extract low- and high-frequency feature maps guided by the adaptively decoupled spectra of the degraded image, and a Frequency Modulation Module (FMoM) to calibrate these features by enabling information exchange across different frequency bands. |
AdaIR achieves state-of-the-art performance on several all-in-one image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement.
Under a three-degradation all-in-one setting (dehazing, deraining, denoising), AdaIR outperforms the recent best method PromptIR by 0.63 dB PSNR.
Under a five-degradation all-in-one setting, AdaIR achieves a 1.86 dB gain compared to the recent best method IDR, when averaged across five restoration tasks. |
The paper only evaluates the method on a limited set of degradation types.
Further investigation is needed to explore the effectiveness of the method on real-world images with complex and mixed degradations. |
image restoration, all-in-one model, frequency analysis, deep learning, transformer |
2403.14613
Report |
DreamReward: Text-to-3D Generation with Human Preference |
Junliang Ye, Fangfu Liu, Qixiu Li, Zhengyi Wang, Yikai Wang, Xinzhou Wang, Yueqi Duan, Jun Zhu |
3D content creation from text prompts has shown remarkable success recently.
However, current text-to-3D methods often generate 3D results that do not align
well with human preferences. In this paper, we present a comprehensive
framework, coined DreamReward, to learn and improve text-to-3D models from
human preference feedback. To begin with, we collect 25k expert comparisons
based on a systematic annotation pipeline including rating and ranking. Then,
we build Reward3D -- the first general-purpose text-to-3D human preference
reward model to effectively encode human preferences. Building upon the 3D
reward model, we finally perform theoretical analysis and present the Reward3D
Feedback Learning (DreamFL), a direct tuning algorithm to optimize the
multi-view diffusion models with a redefined scorer. Grounded by theoretical
proof and extensive experiment comparisons, our DreamReward successfully
generates high-fidelity and 3D consistent results with significant boosts in
prompt alignment with human intention. Our results demonstrate the great
potential for learning from human feedback to improve text-to-3D models. |
Presents DreamReward, a novel text-to-3D generation framework that leverages human preference feedback (RLHF) to improve the alignment of generated 3D assets with human intentions. |
Existing text-to-3D methods struggle to generate content that aligns well with human preferences, often resulting in outputs that lack in quality, text-3D alignment, and multi-view consistency. |
1. Collects and annotates a diverse 3D dataset with human preference feedback, focusing on text-3D alignment, overall quality, and multi-view consistency.
2. Trains Reward3D, a 3D-aware scoring model, to effectively evaluate the quality of generated 3D content.
3. Introduces DreamFL (Reward3D Feedback Learning), an optimization algorithm that incorporates the Reward3D model to guide the training of multi-view diffusion models towards generating high-quality and human-preferred 3D assets. |
DreamReward successfully generates 3D assets exhibiting superior text alignment, overall quality, and multi-view consistency compared to existing state-of-the-art methods.
Reward3D demonstrates strong alignment with human preferences, making it a promising automatic evaluation metric for text-to-3D generation.
DreamFL effectively utilizes the guidance of Reward3D to optimize 3D models, leading to significant improvements in generation quality and human preference alignment. |
The diversity of generated 3D assets is limited by the size of the annotated dataset.
Future work includes expanding the annotated dataset and incorporating more camera and orientation information into the Reward3D architecture. |
3d generation, rlhf, human preference, text-to-3d, reward model |
2403.14610
Report |
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy |
Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang |
We present T-Rex2, a highly practical model for open-set object detection.
Previous open-set object detection methods relying on text prompts effectively
encapsulate the abstract concept of common objects, but struggle with rare or
complex object representation due to data scarcity and descriptive limitations.
Conversely, visual prompts excel in depicting novel objects through concrete
visual examples, but fall short in conveying the abstract concept of objects as
effectively as text prompts. Recognizing the complementary strengths and
weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes
both prompts within a single model through contrastive learning. T-Rex2 accepts
inputs in diverse formats, including text prompts, visual prompts, and the
combination of both, so that it can handle different scenarios by switching
between the two prompt modalities. Comprehensive experiments demonstrate that
T-Rex2 exhibits remarkable zero-shot object detection capabilities across a
wide spectrum of scenarios. We show that text prompts and visual prompts can
benefit from each other within the synergy, which is essential to cover massive
and complicated real-world scenarios and pave the way towards generic object
detection. Model API is now available at
https://github.com/IDEA-Research/T-Rex. |
The paper introduces T-Rex2, a novel open-set object detection model that unifies text and visual prompts within a single framework, enabling both generic and interactive object detection. |
Open-set object detection, crucial for real-world applications, requires identifying objects beyond pre-defined categories. This work addresses limitations of existing methods relying solely on text or visual prompts by combining their strengths. |
T-Rex2 utilizes a DETR-like architecture with parallel encoders for text and visual prompts. It employs contrastive learning to align both modalities, enabling them to benefit from each other's strengths. |
T-Rex2 demonstrates strong zero-shot object detection capabilities, achieving state-of-the-art performance on COCO, LVIS, ODinW, and Roboflow100 benchmarks.
The study reveals a complementary relationship between text and visual prompts, with text prompts excelling in common objects and visual prompts proving more effective for rare or hard-to-describe objects.
The model's interactive capabilities are highlighted through its impressive performance in few-shot object counting tasks, showing its potential in applications like automatic annotation. |
While showing promising results, the integration of text and visual prompts requires further refinement, particularly in scenarios with common objects where performance slightly dips.
The current method requires up to 16 visual examples for reliable detection, necessitating further research to enhance the efficiency of visual prompts with fewer examples. |
open-set object detection, text prompts, visual prompts, contrastive learning, interactive object detection |
2403.14608
Report |
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey |
Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang |
Large models represent a groundbreaking advancement in multiple application
fields, enabling remarkable achievements across various tasks. However, their
unprecedented scale comes with significant computational costs. These models,
often consisting of billions of parameters, require vast amounts of
computational resources for execution. Especially, the expansive scale and
computational demands pose considerable challenges when customizing them for
particular downstream tasks, particularly over the hardware platforms
constrained by computational capabilities. Parameter Efficient Fine-Tuning
(PEFT) provides a practical solution by efficiently adapting large models to
various downstream tasks. In particular, PEFT refers to the process of
adjusting the parameters of a pre-trained large model to adapt it to a
specific task while minimizing the number of additional parameters introduced
or computational resources required. This approach is particularly important
when dealing with large language models with high parameter counts, as
fine-tuning these models from scratch can be computationally expensive and
resource-intensive, posing considerable challenges in the supporting system
platform design. In this survey, we present comprehensive studies of various
PEFT algorithms, examining their performance and computational overhead.
Moreover, we provide an overview of applications developed using different PEFT
algorithms and discuss common techniques employed to mitigate computation costs
for PEFT. In addition to the algorithmic perspective, we overview various
real-world system designs to investigate the implementation costs associated
with different PEFT algorithms. This survey serves as an indispensable resource
for researchers aiming to understand both the PEFT algorithm and its system
implementation, offering detailed insights into recent advancements and
practical applications. |
This paper presents a comprehensive survey of Parameter-Efficient Fine-Tuning (PEFT) methods for large models, encompassing algorithmic designs, computational efficiency considerations, applications, and system implementation challenges. |
The large size of modern models makes full fine-tuning computationally expensive and resource-intensive. PEFT offers a practical solution by adapting pre-trained models to specific tasks with minimal parameter adjustments, thereby reducing storage, memory, and computation costs. |
The paper categorizes PEFT algorithms into four types: additive, selective, reparameterized, and hybrid. It discusses their mechanisms, advantages, limitations, and notable variations. Additionally, it explores strategies for enhancing PEFT efficiency, such as pruning, quantization, and memory optimization techniques. |
The effectiveness of various PEFT methods can differ significantly across different tasks.
Pruning and quantization can substantially enhance the efficiency of PEFT methods.
Memory-efficient PEFT methods are crucial for reducing the memory overhead during training. |
Lack of a unified benchmark for fair comparison of PEFT approaches.
Need for improved training efficiency and simplified hyperparameter tuning in PEFT methods. |
large language models, parameter-efficient fine-tuning, pruning, quantization, memory optimization |
2403.14554
Report |
Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering |
Antoine Guédon, Vincent Lepetit |
We propose Gaussian Frosting, a novel mesh-based representation for
high-quality rendering and editing of complex 3D effects in real-time. Our
approach builds on the recent 3D Gaussian Splatting framework, which optimizes
a set of 3D Gaussians to approximate a radiance field from images. We propose
first extracting a base mesh from Gaussians during optimization, then building
and refining an adaptive layer of Gaussians with a variable thickness around
the mesh to better capture the fine details and volumetric effects near the
surface, such as hair or grass. We call this layer Gaussian Frosting, as it
resembles a coating of frosting on a cake. The fuzzier the material, the
thicker the frosting. We also introduce a parameterization of the Gaussians to
constrain them to stay inside the frosting layer and automatically adjust their
parameters when deforming, rescaling, editing or animating the mesh. Our
representation allows for efficient rendering using Gaussian splatting, as well
as editing and animation by modifying the base mesh. We demonstrate the
effectiveness of our method on various synthetic and real scenes, and show that
it outperforms existing surface-based approaches. We will release our code and
a web-based viewer as additional contributions. Our project page is the
following: https://anttwo.github.io/frosting/ |
Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real-time, building upon the 3D Gaussian Splatting framework. |
Enables both efficient rendering using Gaussian splatting and easy editing and animation by modifying the base mesh, surpassing previous methods in quality and/or efficiency. |
Extracts a base mesh from optimized Gaussians, builds an adaptive layer of Gaussians (Frosting) with variable thickness around the mesh based on Gaussian density, and parameterizes Gaussians to stay within the layer during mesh deformation. |
Outperforms existing surface-based and even some non-editable volumetric methods in rendering quality on challenging datasets like Shelly and Mip-NeRF 360.
Allows for efficient real-time rendering and editing due to its hybrid representation.
Enables seamless animation by automatically adjusting Gaussian parameters based on mesh deformation. |
Current implementation uses a simple, piecewise linear deformation model.
Models are larger than vanilla Gaussian Splatting due to the inclusion of barycentric coordinates and mesh vertices. |
gaussian splatting, mesh, differentiable rendering, 3d reconstruction, image-based rendering |
2403.14530
Report |
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression |
Yihang Chen, Qianyi Wu, Jianfei Cai, Mehrtash Harandi, Weiyao Lin |
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel
view synthesis, boasting rapid rendering speed with high fidelity. However, the
large number of Gaussians and their associated attributes necessitates effective
compression techniques. Yet the sparse and unorganized nature of the
point cloud of Gaussians (or anchors in our paper) presents challenges for
compression. To address this, we make use of the relations between the
unorganized anchors and the structured hash grid, leveraging their mutual
information for context modeling, and propose a Hash-grid Assisted Context
(HAC) framework for highly compact 3DGS representation. Our approach introduces
a binary hash grid to establish continuous spatial consistencies, allowing us
to unveil the inherent spatial relations of anchors through a carefully
designed context model. To facilitate entropy coding, we utilize Gaussian
distributions to accurately estimate the probability of each quantized
attribute, where an adaptive quantization module is proposed to enable
high-precision quantization of these attributes for improved fidelity
restoration. Additionally, we incorporate an adaptive masking strategy to
eliminate invalid Gaussians and anchors. Importantly, our work is the first
to explore context-based compression for 3DGS representation, resulting in a
remarkable size reduction of over 75× compared to vanilla 3DGS, while
simultaneously improving fidelity, and achieving over 11× size reduction
over the SOTA 3DGS compression approach Scaffold-GS. Our code is available here:
https://github.com/YihangChen-ee/HAC |
This paper proposes HAC, a Hash-grid Assisted Context framework for highly compact 3D Gaussian Splatting (3DGS) representation by exploiting spatial consistencies among unorganized 3D Gaussians. |
3DGS offers fast and high-fidelity novel view synthesis but requires substantial storage space for storing Gaussian attributes, necessitating effective compression techniques. |
HAC leverages a structured hash grid to model the context of anchor attributes in Scaffold-GS, predicting their value distributions for efficient entropy coding. It also incorporates an adaptive quantization module and a masking strategy for enhanced compression. |
HAC achieves a remarkable size reduction of over 75x compared to vanilla 3DGS while improving fidelity.
It outperforms SOTA 3DGS compression approaches like Scaffold-GS by achieving over 11x size reduction.
The proposed context modeling and adaptive components are shown to effectively improve rate-distortion performance. |
Integrating additional models in HAC increases training time compared to Scaffold-GS.
Future work could explore faster entropy coding algorithms on CPU or GPU for reduced encoding/decoding time. |
3d gaussian splatting, compression, context models, novel view synthesis, rate-distortion optimization |
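The entropy-coding idea in the HAC entry above can be sketched roughly as follows. When a Gaussian with a predicted mean and scale models a quantized attribute, the probability of a quantized value is the Gaussian mass inside its quantization bin, and the ideal code length is the negative base-2 logarithm of that mass. This is a minimal sketch of that general technique, not the paper's implementation: the function names, the step size `step`, and the probability floor are hypothetical, and the hash-grid context model that would predict `mu` and `sigma` is not reproduced.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantize(value, step):
    """Round a value to the nearest multiple of the quantization step."""
    return round(value / step) * step


def bin_probability(q, mu, sigma, step):
    """Gaussian mass inside the quantization bin centered on q."""
    upper = gaussian_cdf(q + step / 2.0, mu, sigma)
    lower = gaussian_cdf(q - step / 2.0, mu, sigma)
    # Floor the probability (hypothetical choice) so the log stays finite.
    return max(upper - lower, 1e-12)


def estimated_bits(q, mu, sigma, step):
    """Ideal code length in bits for symbol q under the predicted Gaussian."""
    return -math.log2(bin_probability(q, mu, sigma, step))


if __name__ == "__main__":
    q = quantize(0.52, step=0.1)
    # An accurate prediction (mu near q) costs fewer bits than a poor one.
    print(estimated_bits(q, mu=0.5, sigma=0.1, step=0.1))
    print(estimated_bits(q, mu=1.5, sigma=0.1, step=0.1))
```

The closer the predicted distribution sits to the true value, the fewer bits the attribute costs, which is why a better context model translates into a smaller file.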
2403.14520
Report |
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference |
Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang |
In recent years, the application of multimodal large language models (MLLM)
in various fields has achieved remarkable success. However, as the foundation
model for many downstream tasks, current MLLMs are composed of the well-known
Transformer network, which has inefficient quadratic computational
complexity. To improve the efficiency of such basic models, we propose Cobra, a
linear computational complexity MLLM. Specifically, Cobra integrates the
efficient Mamba language model into the visual modality. Moreover, we explore
and study various modal fusion schemes to create an effective multi-modal
Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely
competitive performance with current computationally efficient state-of-the-art
methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due
to Cobra's linear sequential modeling. (2) Interestingly, the results of
closed-set challenging prediction benchmarks show that Cobra performs well in
overcoming visual illusions and spatial relationship judgments. (3) Notably,
Cobra even achieves comparable performance to LLaVA with about 43% of the
number of parameters. We will make all codes of Cobra open-source and hope that
the proposed method can facilitate future research on complexity problems in
MLLM. Our project page is available at: https://sites.google.com/view/cobravlm. |
Cobra, a multimodal large language model (MLLM) with linear computational complexity, addressing the inefficiency of quadratic complexity in Transformer-based MLLMs. |
Existing MLLMs suffer from quadratic computational complexity due to the Transformer architecture, hindering their efficiency and practicality. |
Cobra integrates the efficient Mamba language model (linear complexity) with visual modality using DINOv2 and SigLIP as encoders and explores various modal fusion schemes. |
Cobra achieves competitive performance with state-of-the-art efficient MLLMs (LLaVA-Phi, TinyLLaVA, MobileVLM v2) with faster inference speed.
Cobra excels in overcoming visual illusions and spatial relationship judgments in closed-set prediction benchmarks.
Cobra exhibits comparable performance to the larger LLaVA model with only 43% of its parameters, highlighting its efficiency. |
Cobra shows weaker performance in text recognition tasks compared to some baselines.
Cobra's recurrent dynamics require relatively high numerical precision, limiting memory reduction through quantization. |
multimodal large language model, mamba, state space model, computation efficiency, vision language model |
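To make the linear-complexity claim in the Cobra entry above concrete, here is a minimal sketch of a state-space style recurrence. It processes a sequence in a single pass with a fixed-size state, so the work grows linearly with sequence length, whereas self-attention compares every pair of positions. The scalar coefficients `a` and `b` are hypothetical and fixed; Mamba itself uses input-dependent (selective) parameters and a hardware-aware scan, neither of which is reproduced here.

```python
def linear_recurrence(xs, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t over the sequence in one pass.

    Each step touches only the previous state, so the total cost is
    proportional to len(xs).
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs


def attention_pair_count(n):
    """Number of query-key pairs that full self-attention scores for n tokens."""
    return n * n


if __name__ == "__main__":
    print(linear_recurrence([1.0, 1.0, 1.0], a=0.5, b=1.0))
    # Doubling the sequence length doubles the recurrence steps
    # but quadruples the attention pairs.
    print(attention_pair_count(1000), attention_pair_count(2000))
```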
2403.14487
Report |
DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing |
Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang |
Recently, how to achieve precise image editing has attracted increasing
attention, especially given the remarkable success of text-to-image generation
models. To unify various spatial-aware image editing abilities into one
framework, we adopt the concept of layers from the design domain to manipulate
objects flexibly with various operations. The key insight is to transform the
spatial-aware image editing task into a combination of two sub-tasks:
multi-layered latent decomposition and multi-layered latent fusion. First, we
segment the latent representations of the source images into multiple layers,
which include several object layers and one incomplete background layer that
necessitates reliable inpainting. To avoid extra tuning, we further explore the
inner inpainting ability within the self-attention mechanism. We introduce a
key-masking self-attention scheme that can propagate the surrounding context
information into the masked region while mitigating its impact on the regions
outside the mask. Second, we propose an instruction-guided latent fusion that
pastes the multi-layered latent representations onto a canvas latent. We also
introduce an artifact suppression scheme in the latent space to enhance the
inpainting quality. Due to the inherent modular advantages of such
multi-layered representations, we can achieve accurate image editing, and we
demonstrate that our approach consistently surpasses the latest spatial editing
methods, including Self-Guidance and DiffEditor. Last, we show that our
approach is a unified framework that supports more than six different accurate
image editing tasks. |
This paper presents a training-free, forward-only, unified framework for accurate spatial-aware image editing tasks by decomposing the source image into multiple latent layers for independent manipulation and then fusing them into the target image. |
Existing text-to-image generation models struggle with precise spatial arrangements and previous image editing methods lack the flexibility for complex multi-object manipulation. This method aims to bridge this gap, offering more control and precision in image editing. |
The approach involves 1) Multi-layered latent decomposition: segmenting source image latent representations into object layers and a background layer, utilizing a key-masking self-attention scheme for accurate object removal and background inpainting. 2) Multi-layered latent fusion: pasting manipulated latent representations onto a canvas latent following user instructions or GPT-4V guidance, and refining the result with a harmonization process and an artifact suppression scheme. |
Outperforms state-of-the-art methods like Self-Guidance and DiffEditor in image quality and editing fidelity based on user studies.
Achieves high-quality object removal comparable to specifically trained inpainting models without requiring finetuning.
Successfully unifies various spatial-aware image editing tasks, including object removal, resizing, movement, flipping, camera panning, zooming out, and cross-image composition, demonstrating strong generalizability. |
The resolution difference between image and latent space can cause detail loss when resizing objects.
Future work can explore further applications of the framework for more complex editing tasks. |
image editing, latent diffusion models, spatial-aware editing, multi-layered representation, gpt-4v |
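One plausible reading of the key-masking self-attention described in the DesignEdit entry above is that keys inside the masked region are excluded from the softmax. Queries inside the mask then gather only surrounding context, and queries outside it are not influenced by the masked content. The sketch below illustrates that reading with plain Python lists; the function name and argument layout are hypothetical, and the paper's actual scheme inside a diffusion U-Net may differ.

```python
import math


def key_masked_attention(queries, keys, values, key_is_masked):
    """Scaled dot-product attention that ignores masked keys.

    queries, keys, values: lists of equal-length float vectors.
    key_is_masked: list of booleans, True for keys inside the masked region.
    """
    if all(key_is_masked):
        raise ValueError("at least one key must be unmasked")
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = []
        for k, masked in zip(keys, key_is_masked):
            if masked:
                # Masked keys get -inf so their softmax weight is exactly 0.
                scores.append(float("-inf"))
            else:
                dot = sum(qi * ki for qi, ki in zip(q, k))
                scores.append(dot / math.sqrt(dim))
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    result = key_masked_attention(
        queries=[[1.0, 0.0]],
        keys=[[1.0, 0.0], [0.0, 1.0]],
        values=[[10.0], [20.0]],
        key_is_masked=[False, True],
    )
    print(result)
```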
2403.14468
Report |
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks |
Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen |
Video-to-video editing involves editing a source video along with additional
control (such as text prompts, subjects, or styles) to generate a new video
that aligns with the source video and the provided control. Traditional methods
have been constrained to certain editing types, limiting their ability to meet
the wide range of user demands. In this paper, we introduce AnyV2V, a novel
training-free framework designed to simplify video editing into two primary
steps: (1) employing an off-the-shelf image editing model (e.g.
InstructPix2Pix, InstantID, etc) to modify the first frame, (2) utilizing an
existing image-to-video generation model (e.g. I2VGen-XL) for DDIM inversion
and feature injection. In the first stage, AnyV2V can plug in any existing
image editing tools to support an extensive array of video editing tasks.
Beyond the traditional prompt-based editing methods, AnyV2V can also support
novel video editing tasks, including reference-based style transfer,
subject-driven editing, and identity manipulation, which were unattainable by
previous methods. In the second stage, AnyV2V can plug in any existing
image-to-video models to perform DDIM inversion and intermediate feature
injection to maintain the appearance and motion consistency with the source
video. On the prompt-based editing, we show that AnyV2V can outperform the
previous best approach by 35% on prompt alignment, and 25% on human
preference. On the three novel tasks, we show that AnyV2V also achieves a high
success rate. We believe AnyV2V will continue to thrive due to its ability to
seamlessly integrate the fast-evolving image editing methods. Such
compatibility can help AnyV2V to increase its versatility to cater to diverse
user demands. |
AnyV2V is a novel training-free, plug-and-play framework that simplifies video editing into two stages: (1) first-frame image editing using off-the-shelf models and (2) image-to-video generation via DDIM inversion and feature injection. |
Existing video editing methods are limited to specific editing types and often require retraining or complex feature extraction. AnyV2V addresses these limitations by enabling a wide range of editing tasks within a unified, efficient, and user-friendly framework. |
AnyV2V leverages pre-trained image editing and image-to-video generation models. It edits the first frame using an image editing model, then uses an I2V model to propagate the edit through the video while maintaining consistency with the source video's appearance and motion through feature injection. |
AnyV2V outperforms the previous best approach in prompt-based editing by 35% on prompt alignment and 25% on human preference.
AnyV2V demonstrates compatibility with various image editing models, enabling diverse tasks such as style transfer, subject-driven editing, and identity manipulation.
Ablation studies confirm the importance of DDIM inversion, spatial and temporal feature injection for maintaining consistency and structure in edited videos. |
The performance of AnyV2V depends on the accuracy of the initial first-frame edit, which can be limited by the capabilities of existing image editing models.
AnyV2V's ability to handle fast or complex motion is constrained by the limitations of current I2V models. |
video editing, diffusion models, plug-and-play, image-to-video generation, ddim inversion |
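The DDIM inversion mentioned in the AnyV2V entry above can be illustrated with the standard deterministic DDIM update, run from low noise to high noise instead of the usual direction. The sketch below works on a single scalar "latent" and uses a toy noise predictor that depends only on the step index, which makes the round trip exact. In a real model the predictor is a network that also depends on the current latent, so inversion is only approximate. All names and the `alpha_bars` schedule are hypothetical.

```python
import math


def ddim_step(x, eps, alpha_bar_from, alpha_bar_to):
    """Deterministic DDIM update between two noise levels.

    alpha_bar_* are cumulative signal fractions in (0, 1]; a larger value
    means less noise.
    """
    x0_pred = (x - math.sqrt(1.0 - alpha_bar_from) * eps) / math.sqrt(alpha_bar_from)
    return math.sqrt(alpha_bar_to) * x0_pred + math.sqrt(1.0 - alpha_bar_to) * eps


def toy_eps(t):
    """Hypothetical stand-in for a noise-prediction network."""
    return 0.1 * (t + 1)


def invert(x0, alpha_bars):
    """Map a clean latent to a noisy one by stepping toward more noise."""
    x = x0
    for t in range(len(alpha_bars) - 1):
        x = ddim_step(x, toy_eps(t), alpha_bars[t], alpha_bars[t + 1])
    return x


def denoise(x_noisy, alpha_bars):
    """Map the noisy latent back by stepping toward less noise."""
    x = x_noisy
    for t in reversed(range(len(alpha_bars) - 1)):
        x = ddim_step(x, toy_eps(t), alpha_bars[t + 1], alpha_bars[t])
    return x


if __name__ == "__main__":
    schedule = [0.999, 0.9, 0.5, 0.1]  # decreasing signal fraction
    noisy = invert(0.7, schedule)
    print(noisy, denoise(noisy, schedule))
```

Starting generation from an inverted latent rather than random noise is what lets the output stay close to the source video's structure.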
2403.14376
Report |
InfNeRF: Towards Infinite Scale NeRF Rendering with O(log n) Space Complexity |
Jiabin Liang, Lanqing Zhang, Zhuoran Zhao, Xiangyu Xu |
The conventional mesh-based Level of Detail (LoD) technique, exemplified by
applications such as Google Earth and many game engines, exhibits the
capability to holistically represent a large scene, even the Earth, and achieves
rendering with a space complexity of O(log n). This constrained data
requirement not only enhances rendering efficiency but also facilitates dynamic
data fetching, thereby enabling a seamless 3D navigation experience for users.
In this work, we extend this proven LoD technique to Neural Radiance Fields
(NeRF) by introducing an octree structure to represent the scenes in different
scales. This innovative approach provides a mathematically simple and elegant
representation with a rendering space complexity of O(log n), aligned with the
efficiency of mesh-based LoD techniques. We also present a novel training
strategy that maintains a complexity of O(n). This strategy allows for parallel
training with minimal overhead, ensuring the scalability and efficiency of our
proposed method. Our contribution is not only in extending the capabilities of
existing techniques but also in establishing a foundation for scalable and
efficient large-scale scene representation using NeRF and octree structures. |
Presents InfNeRF, a novel Neural Radiance Field (NeRF) framework utilizing an octree structure for efficient large-scale scene representation and rendering. |
Addresses the limitations of existing large-scale NeRF methods in handling bird's-eye views and aliasing artifacts, aiming for scalable and memory-efficient rendering. |
Constructs an LoD octree where each node encapsulates a NeRF representing a specific region at a certain scale, enabling anti-aliasing rendering by querying appropriate nodes based on sampling point location and radius. Employs tree pruning for model sparsity and introduces a distributed training strategy for efficiency. |
Achieves a rendering memory complexity of O(log n), significantly reducing memory footprint compared to baseline methods.
Demonstrates superior rendering quality with over 2.4dB improvement in PSNR due to inherent anti-aliasing properties.
Presents an efficient and scalable distributed training strategy, reducing VRAM consumption and communication overhead. |
Reconstruction time and computational burden still need optimization compared to traditional photogrammetry methods.
Exploring the fusion of octrees from diverse image sources and scales for reconstructing even larger scenes. |
neural radiance fields, large-scale scene reconstruction, level of detail, octree, anti-aliasing |
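The node lookup described in the InfNeRF method summary above (querying a node by sampling point location and radius) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the level-selection rule, the `grid_res` parameter, and the function names `select_level` and `node_path` are assumptions introduced here. It shows why only one root-to-node path, whose length grows with the tree depth, has to be resident for a given sample.

```python
import math

def select_level(radius, root_size=1.0, grid_res=64, max_level=10):
    """Deepest octree level whose voxel size is still >= the sample radius.

    Assumed rule (hypothetical): a node at level l covers a cube of side
    root_size / 2**l and resolves detail down to side / grid_res.
    """
    ratio = root_size / (grid_res * radius)
    level = math.floor(math.log2(ratio)) if ratio >= 1.0 else 0
    return max(0, min(max_level, level))

def node_path(point, level, root_size=1.0):
    """Child indices (0..7) visited from the root down to `level`."""
    origin = [0.0, 0.0, 0.0]
    half = root_size / 2.0
    path = []
    for _ in range(level):
        index = 0
        for axis in range(3):
            # Set the bit for this axis when the point lies in the upper half.
            if point[axis] >= origin[axis] + half:
                index |= 1 << axis
                origin[axis] += half
        path.append(index)
        half /= 2.0
    return path

# A coarse sample (large radius) stays at the root; a fine one descends further.
coarse = select_level(radius=1 / 64)    # level 0
fine = select_level(radius=1 / 1024)    # level 4
path = node_path((0.8, 0.1, 0.6), fine)
```

Under these assumptions, distant samples with a large footprint are answered by coarse nodes and nearby samples by fine ones, which is the mechanism the summary credits for anti-aliasing.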
2403.14291
Report |
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models |
Pablo Marcos-Manchón, Roberto Alcover-Couso, Juan C. SanMiguel, Jose M. Martínez |
Diffusion models represent a new paradigm in text-to-image generation. Beyond
generating high-quality images from text prompts, models such as Stable
Diffusion have been successfully extended to the joint generation of semantic
segmentation pseudo-masks. However, current extensions primarily rely on
extracting attentions linked to prompt words used for image synthesis. This
approach limits the generation of segmentation masks derived from word tokens
not contained in the text prompt. In this work, we introduce Open-Vocabulary
Attention Maps (OVAM), a training-free method for text-to-image diffusion models
that enables the generation of attention maps for any word. In addition, we
propose a lightweight optimization process based on OVAM for finding tokens
that generate accurate attention maps for an object class with a single
annotation. We evaluate these tokens within existing state-of-the-art Stable
Diffusion extensions. The best-performing model improves its mIoU from 52.1 to
86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized
tokens are an efficient way to improve the performance of existing methods
without architectural changes or retraining. |
Introduces Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models enabling the generation of attention maps and semantic segmentation masks from open-vocabulary descriptions, independent of the image generation prompt. |
Existing methods for generating semantic segmentation masks from diffusion models are primarily limited by the tokens present in the text prompt used for image synthesis, restricting their flexibility and open-vocabulary capabilities. |
OVAM leverages cross-attention maps from diffusion models, using an independent 'attribution prompt' to generate attention maps for arbitrary words. It also introduces a token optimization process to learn accurate attention maps for specific object classes with just one annotation per class. |
OVAM with token optimization outperforms existing training-free methods and achieves comparable or superior results to methods requiring additional training data.
Token optimization through OVAM significantly improves the performance of existing Stable Diffusion-based segmentation methods without requiring architectural changes or retraining.
Synthetic data generated using OVAM with token optimization effectively trains semantic segmentation models, achieving competitive results on standard benchmarks. |
The current implementation of OVAM relies on a fixed threshold for binarizing attention maps, which could be further improved.
Future work will explore extending OVAM to generate multi-class segmentation masks from a single attention map. |
semantic segmentation, diffusion models, open-vocabulary, stable diffusion, attention maps |
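The cross-attention step in the OVAM summary above can be illustrated with a toy example. This is a sketch under stated assumptions, not the OVAM code: the pixel queries and token keys are hand-written 2-D vectors rather than Stable Diffusion features, and the function names are hypothetical. The point is that the keys come from an attribution prompt chosen independently of the generation prompt, and that a fixed threshold turns a token's attention map into a mask.

```python
import math

def cross_attention_maps(pixel_queries, token_keys):
    """Return maps[j][i]: attention of pixel i to token j (softmax over tokens)."""
    d = len(token_keys[0])
    maps = [[0.0] * len(pixel_queries) for _ in token_keys]
    for i, q in enumerate(pixel_queries):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in token_keys]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for j, e in enumerate(exps):
            maps[j][i] = e / total
    return maps

def binarize(attention_map, threshold=0.5):
    """Fixed-threshold binarization of one token's attention map."""
    return [1 if a >= threshold else 0 for a in attention_map]

# Four "pixels"; the first two align with token 0, the last two with token 1.
queries = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
attribution_keys = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for attribution-prompt tokens
maps = cross_attention_maps(queries, attribution_keys)
mask_token0 = binarize(maps[0])
```

Swapping `attribution_keys` for keys derived from a different word changes the maps without regenerating the image, which is the open-vocabulary property the summary describes.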
2403.14270
Report |
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection |
Tim Salzmann, Markus Ryll, Alex Bewley, Matthias Minderer |
Visual relationship detection aims to identify objects and their
relationships in images. Prior methods approach this task by adding separate
relationship modules or decoders to existing object detection architectures.
This separation increases complexity and hinders end-to-end training, which
limits performance. We propose a simple and highly efficient decoder-free
architecture for open-vocabulary visual relationship detection. Our model
consists of a Transformer-based image encoder that represents objects as tokens
and models their relationships implicitly. To extract relationship information,
we introduce an attention mechanism that selects object pairs likely to form a
relationship. We provide a single-stage recipe to train this model on a mixture
of object and relationship detection data. Our approach achieves
state-of-the-art relationship detection performance on Visual Genome and on the
large-vocabulary GQA benchmark at real-time inference speeds. We provide
analyses of zero-shot performance, ablations, and real-world qualitative
examples. |
This paper introduces Scene-Graph ViT, an efficient, end-to-end, open-vocabulary model for visual relationship detection using a Transformer-based encoder-only architecture and a novel Relationship Attention mechanism. |
VRD facilitates structured scene understanding, crucial for robotics, image retrieval, and grounding language models. Existing methods are complex and hinder end-to-end training. |
The model leverages a pretrained vision-language model, adds a Relationship Attention layer to extract object pairs likely to form a relationship, and is trained jointly on object and relationship datasets. |
Achieves state-of-the-art relationship detection performance on Visual Genome (29.5% mR@100) and GQA benchmarks.
Demonstrates strong performance in open-vocabulary and zero-shot settings, benefiting from large-scale pretraining and multi-dataset training.
Maintains real-time inference speeds comparable to pure object detectors due to efficient top-k selection in the Relationship Attention layer. |
Performance on specialized human-object interaction detection is on par with prior models, potentially limited by task-specific training data.
Zero-shot generalization to unseen objects and predicates, a challenge for open-vocabulary VRD models, shows room for improvement. |
visual relationship detection, scene graph generation, vision transformer, open vocabulary, encoder-only architecture |
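The top-k pair selection mentioned in the Scene-Graph ViT summary above can be sketched as follows. This is an illustrative simplification: it assumes each object token has separate subject and object embeddings and that a pair is scored by their dot product; the function name and toy vectors are hypothetical and not taken from the paper.

```python
import heapq

def top_k_pairs(subject_emb, object_emb, k):
    """Score every (subject, object) pair by a dot product and keep the k best."""
    scored = []
    for s, s_vec in enumerate(subject_emb):
        for o, o_vec in enumerate(object_emb):
            score = sum(a * b for a, b in zip(s_vec, o_vec))
            scored.append((score, s, o))
    # heapq.nlargest returns the k highest-scoring tuples in descending order.
    return [(s, o) for _, s, o in heapq.nlargest(k, scored)]

subject_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
object_emb = [[0.0, 2.0], [3.0, 0.0], [0.5, 0.5]]
pairs = top_k_pairs(subject_emb, object_emb, k=2)  # [(2, 1), (0, 1)]
```

Only the selected pairs would be passed on for predicate classification, so later stages scale with k rather than with the number of all possible pairs.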
2403.14244
Report |
Isotropic Gaussian Splatting for Real-Time Radiance Field Rendering |
Yuanhao Gong, Lantao Yu, Guanghui Yue |
The 3D Gaussian splatting method has drawn a lot of attention, thanks to its
high performance in training and high quality of the rendered image. However,
it uses anisotropic Gaussian kernels to represent the scene. Although such
anisotropic kernels have advantages in representing the geometry, they lead to
difficulties in terms of computation, such as splitting or merging two kernels.
In this paper, we propose to use isotropic Gaussian kernels to avoid such
difficulties in the computation, leading to a higher performance method. The
experiments confirm that the proposed method is about 100X faster without
losing the geometry representation accuracy. The proposed method can be applied
in a large range of applications where the radiance field is needed, such as 3D
reconstruction, view synthesis, and dynamic object modeling. |
This paper proposes using scale-adaptive isotropic Gaussian kernels for signal representation, leading to a faster 3D Gaussian splatting method. |
While anisotropic Gaussian kernels are better at representing geometry in 3D Gaussian splatting, they lead to computational difficulties. Isotropic kernels offer a more efficient alternative. |
The method uses a two-stage approach: 1) initialization with a QuadTree/Octree structure to organize particles carrying color and opacity, and 2) optimization of a loss function that combines reconstruction error and SSIM. |
Isotropic Gaussian kernels can achieve high rendering quality with fewer artifacts.
The proposed method is significantly faster (around 100 times) in the training process compared to using anisotropic kernels.
The use of a tree structure for initialization enables efficient particle management. |
The paper focuses on 2D image experiments; further validation is needed for 3D scenarios.
Exploring different optimization strategies beyond backpropagation and evolutionary algorithms could be beneficial. |
3d gaussian splatting, isotropic gaussian kernels, radiance field, rendering, particle representation |
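The computational appeal of isotropic kernels noted in the summary above can be seen in a small sketch. An isotropic Gaussian needs only a squared distance and a scalar `sigma`, with no covariance matrix to invert or rotate. The blending rule in `shade` (an opacity-weighted average of particle colors) is an assumption made for illustration, not the paper's renderer.

```python
import math

def isotropic_weight(point, center, sigma):
    """Gaussian falloff that depends only on the distance to the center."""
    sq_dist = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def shade(point, particles):
    """Opacity-weighted blend of particle colors at `point` (assumed rule)."""
    numerator = 0.0
    denominator = 0.0
    for center, sigma, color, opacity in particles:
        w = opacity * isotropic_weight(point, center, sigma)
        numerator += w * color
        denominator += w
    return numerator / denominator if denominator > 0.0 else 0.0

# Two identical particles with grayscale colors 0 and 1, one unit apart.
particles = [((0.0, 0.0), 0.5, 0.0, 1.0), ((1.0, 0.0), 0.5, 1.0, 1.0)]
midpoint_color = shade((0.5, 0.0), particles)
```

Because the weight is rotation invariant, splitting or merging particles only has to reason about centers and radii, which is the simplification the abstract points to.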
2403.14186
Report |
StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN |
Jongwoo Choi, Kwanggyoon Seo, Amirsaman Ashtari, Junyong Noh |
We propose a method that can generate cinemagraphs automatically from a still
landscape image using a pre-trained StyleGAN. Inspired by the success of recent
unconditional video generation, we leverage a powerful pre-trained image
generator to synthesize high-quality cinemagraphs. Unlike previous approaches
that mainly utilize the latent space of a pre-trained StyleGAN, our approach
utilizes its deep feature space for both GAN inversion and cinemagraph
generation. Specifically, we propose multi-scale deep feature warping (MSDFW),
which warps the intermediate features of a pre-trained StyleGAN at different
resolutions. By using MSDFW, the generated cinemagraphs are of high resolution
and exhibit plausible looping animation. We demonstrate the superiority of our
method through user studies and quantitative comparisons with state-of-the-art
cinemagraph generation methods and a video generation method that uses a
pre-trained StyleGAN. |
This paper introduces StyleCineGAN, a novel method for generating high-resolution (1024x1024) cinemagraphs from single landscape images using a pre-trained StyleGAN. |
Creating cinemagraphs is typically a manual, time-consuming process. Existing automatic methods are either limited in resolution, require reference videos, or necessitate extensive training of deep generative models. StyleCineGAN addresses these limitations. |
The method leverages the deep feature space of a pre-trained StyleGAN for GAN inversion and cinemagraph generation. It employs a multi-scale deep feature warping (MSDFW) technique, applying motion generated from the input image to the StyleGAN's intermediate features at different resolutions. This allows for plausible looping animations while preserving image quality and content. |
StyleCineGAN outperforms state-of-the-art cinemagraph generation methods in both qualitative and quantitative comparisons, demonstrating superior static consistency and motion quality.
It also surpasses unconditional video generation methods using pre-trained StyleGANs in terms of content preservation, making it particularly suitable for cinemagraph creation.
User studies confirm the effectiveness of StyleCineGAN, with participants rating its generated cinemagraphs significantly higher in overall quality. |
The automatic motion prediction can be ambiguous for certain images, requiring additional user guidance for accurate motion generation.
Isolating the motion of thin structures within animated regions remains challenging due to the multi-scale nature of feature warping. |
cinemagraph generation, stylegan, deep feature warping, unconditional video generation, content preservation |
2403.14166
Report |
Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians |
Guangchi Fang, Bing Wang |
In this study, we explore the challenge of efficiently representing scenes
with a constrained number of Gaussians. Our analysis shifts from traditional
graphics and 2D computer vision to the perspective of point clouds,
highlighting the inefficient spatial distribution of Gaussian representation as
a key limitation in model performance. To address this, we introduce strategies
for densification including blur split and depth reinitialization, and
simplification through intersection preserving and sampling. These techniques
reorganize the spatial positions of the Gaussians, resulting in significant
improvements across various datasets and benchmarks in terms of rendering
quality, resource consumption, and storage compression. Our Mini-Splatting
integrates seamlessly with the original rasterization pipeline, providing a
strong baseline for future research in Gaussian-Splatting-based works.
Code is available at https://github.com/fatPeter/mini-splatting. |
This paper presents Mini-Splatting, a novel method to efficiently represent scenes with a constrained number of Gaussians for 3D Gaussian Splatting (3DGS). |
3DGS shows great potential in novel view synthesis; however, the large number of Gaussians used can lead to inefficiencies and limit rendering quality and speed. |
The authors analyze the spatial distribution of Gaussians and propose densification (blur split and depth reinitialization) and simplification (intersection preserving and sampling) strategies to reorganize Gaussians for a more efficient representation. |
Mini-Splatting-D achieves better rendering quality than the baseline 3DGS and even surpasses state-of-the-art neural rendering algorithm Zip-NeRF on some metrics.
Mini-Splatting maintains comparable rendering quality to 3DGS while using significantly fewer Gaussians (7x fewer).
Mini-Splatting demonstrates significant speed-up in both training and rendering with reduced memory usage. |
The depth-based reinitialization strategy may fail in areas without a certain depth value like the sky.
Finding the minimal number of Gaussians while maintaining high quality rendering remains a challenge and could benefit from further investigation of uncertainty. |
gaussian splatting, point clouds, scene representation, densification, simplification |
2403.14155
Report |
Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization |
Yeji Song, Jimyeong Kim, Wonhark Park, Wonsik Shin, Wonjong Rhee, Nojun Kwak |
In a surge of text-to-image (T2I) models and their customization methods that
generate new images of a user-provided subject, current works focus on
alleviating the costs incurred by a lengthy per-subject optimization. These
zero-shot customization methods encode the image of a specified subject into a
visual embedding which is then utilized alongside the textual embedding for
diffusion guidance. The visual embedding incorporates intrinsic information
about the subject, while the textual embedding provides a new, transient
context. However, the existing methods often 1) are significantly affected by
the input images, e.g., generating images with the same pose, and 2) exhibit
deterioration in the subject's identity. We first pin down the problem and show
that redundant pose information in the visual embedding interferes with the
textual embedding containing the desired pose information. To address this
issue, we propose orthogonal visual embedding which effectively harmonizes with
the given textual embedding. We also adopt the visual-only embedding and inject
the subject's clear features utilizing a self-attention swap. Our results
demonstrate the effectiveness and robustness of our method, which offers highly
flexible zero-shot generation while effectively maintaining the subject's
identity. |
This paper introduces a novel method to address the challenges of pose variation and identity preservation in zero-shot text-to-image customization, aiming for more diverse and flexible subject-driven generation. |
Existing zero-shot customization methods, while effective in separating subject identity from background, struggle with disentangling pose from identity in visual embeddings, leading to pose bias and identity loss when modifying subject poses. |
The proposed method employs two key techniques: (1) **Contextual Embedding Orchestration**: orthogonalizes the visual embedding to the textual embedding subspace, reducing interference and enabling pose variation guided by text prompts. (2) **Self-attention Swap**: integrates clean identity information from a visual-only guided denoising process into the main generation process, preserving subject identity amidst pose modifications. |
The proposed method significantly improves text alignment and pose variation compared to baseline models, as demonstrated qualitatively and quantitatively on a newly introduced 'Deformable Subject Set' and the DreamBooth dataset.
It effectively addresses both pose bias and identity loss, generating images that faithfully follow text prompts regarding pose while maintaining subject identity.
User study confirms the effectiveness, with the proposed method preferred for both text and image alignment compared to baselines. |
The method might struggle with handling multiple, potentially conflicting text prompts simultaneously due to the orthogonalization process.
Future work could explore extending the method to address complex compositions involving multiple subjects and intricate interactions. |
text-to-image synthesis, zero-shot learning, subject-driven generation, pose variation, identity preservation |
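The orthogonalization step in the method summary above (making the visual embedding orthogonal to the textual embedding subspace) can be sketched with plain Gram-Schmidt projection. This is a minimal sketch under assumptions: embeddings are short Python lists, the textual subspace is the span of the given token vectors, and the function names are hypothetical rather than the authors' code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(visual, textual_tokens, tol=1e-12):
    """Remove from `visual` every component lying in span(textual_tokens)."""
    basis = []
    for token in textual_tokens:
        residual = list(token)
        for b in basis:
            c = dot(residual, b)
            residual = [r - c * x for r, x in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:  # skip tokens already inside the current span
            basis.append([r / norm for r in residual])
    out = list(visual)
    for b in basis:
        c = dot(out, b)
        out = [o - c * x for o, x in zip(out, b)]
    return out

textual = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # spans the x-y plane
visual = [1.0, 2.0, 3.0]
harmonized = orthogonalize(visual, textual)   # only the z component remains
```

After projection the visual embedding carries no component along the textual directions, so, in the summary's terms, it no longer competes with the pose information supplied by the text.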
2403.14148
Report |
Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition |
Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar |
Video diffusion models have recently made great progress in generation
quality, but are still limited by the high memory and computational
requirements. This is because current video diffusion models often attempt to
process high-dimensional videos directly. To tackle this issue, we propose
content-motion latent diffusion model (CMD), a novel efficient extension of
pretrained image diffusion models for video generation. Specifically, we
propose an autoencoder that succinctly encodes a video as a combination of a
content frame (like an image) and a low-dimensional motion latent
representation. The former represents the common content, and the latter
represents the underlying motion in the video, respectively. We generate the
content frame by fine-tuning a pretrained image diffusion model, and we
generate the motion latent representation by training a new lightweight
diffusion model. A key innovation here is the design of a compact latent space
that can directly utilize a pretrained image diffusion model, which has not
been done in previous latent video diffusion models. This leads to considerably
better quality generation and reduced computational costs. For instance, CMD
can sample a video 7.7x faster than prior approaches by generating a
video of 512x1024 resolution and length 16 in 3.1 seconds. Moreover, CMD
achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous
state-of-the-art of 292.4. |
This paper introduces CMD (Content-Motion Latent Diffusion Model), an efficient method for video generation that leverages pre-trained image diffusion models. |
Existing video diffusion models struggle with high computational costs and memory requirements due to processing high-dimensional videos directly. CMD addresses these limitations. |
CMD uses an autoencoder to compress videos into a content frame (similar to an image) and a low-dimensional motion latent representation. A pre-trained image diffusion model generates the content frame, and a lightweight diffusion model generates the motion latent representation. |
CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.
It generates a 512x1024 resolution video of 16 frames in 3.1 seconds, 7.7x faster than prior approaches.
CMD demonstrates significant efficiency in terms of FLOPs and memory consumption during both training and sampling compared to other methods. |
The paper mainly focuses on generating videos of fixed length, limiting its applicability to longer videos.
The quality of the autoencoder could be further improved, particularly for videos containing highly dynamic motion. |
video generation, diffusion models, latent space, computational efficiency, text-to-video generation |
2403.14141
Report |
Empowering Segmentation Ability to Multi-modal Large Language Models |
Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao, Jinwei Chen, Bo Li |
Multi-modal large language models (MLLMs) can understand image-language
prompts and demonstrate impressive reasoning ability. In this paper, we extend
MLLMs' output by empowering MLLMs with the segmentation ability. The extended
MLLMs can both output language responses to the image-language prompts and
segment the regions that the complex question or query in the language prompts
focuses on. To this end, the existing work, LISA, enlarges the original word
embeddings with an additional segment token and fine-tunes dialogue generation
and query-focused segmentation together, where the feature of the segment token
is used to prompt the segment-anything model. Although they achieve superior
segmentation performance, we observe that the dialogue ability decreases by a
large margin compared to the original MLLMs. To maintain the original MLLMs'
dialogue ability, we propose a novel MLLMs framework, coined as LLaVASeg, which
leverages a chain-of-thought prompting strategy to instruct the MLLMs to
segment the target region queried by the user. The MLLMs are first prompted to
reason about the simple description of the target region from the complicated
user query, then extract the visual attributes of the target region according
to the understanding of MLLMs to the image. These visual attributes, such as
color and relative locations, are utilized to prompt the downstream
segmentation model. Experiments show that the proposed method keeps the
original dialogue ability and equips the MLLMs' model with strong reasoning
segmentation ability. The code is available at
https://github.com/YuqiYang213/LLaVASeg. |
This paper proposes LLaVASeg, a novel framework that empowers Multi-modal Large Language Models (MLLMs) with segmentation abilities while preserving their conversational and reasoning skills, unlike previous fine-tuning approaches that often degrade these abilities. |
Extending MLLMs to possess segmentation capabilities similar to human perception can significantly enhance their understanding and interaction with visual information, allowing them to both comprehend complex queries and pinpoint relevant regions in images. |
LLaVASeg employs a chain-of-thought prompting strategy that guides MLLMs to generate image-specific textual attributes (e.g., color, shape, relative location) for the target region. These attributes are then used to prompt a multi-scale promptable segmentation model that segments the target. |
LLaVASeg achieves state-of-the-art segmentation performance on the ReasonSeg dataset, surpassing previous methods like LISA.
The proposed chain-of-thought prompting strategy proves highly effective in extracting relevant visual attributes for segmentation.
Unlike fine-tuning approaches, LLaVASeg maintains the original dialogue and reasoning capabilities of the MLLMs, as demonstrated by its superior performance on CIDEr and ROUGE-L metrics. |
The current framework only supports a single query per interaction, limiting its applicability to more complex scenarios.
While LLaVASeg uses off-the-shelf MLLMs, its performance could be further enhanced by incorporating instruction tuning with high-quality chain-of-thought instruction pairs. |
multi-modal large language models, reasoning segmentation, chain-of-thought prompting, visual attributes, multi-scale prompting |
2403.13951
Report |
ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On |
Jeffrey Zhang, Kedan Li, Shao-Yu Chang, David Forsyth |
Virtual Try-on (VTON) involves generating images of a person wearing selected
garments. Diffusion-based methods, in particular, can create high-quality
images, but they struggle to maintain the identities of the input garments. We
identified that this problem stems from the specifics of the training formulation
for diffusion. To address this, we propose a unique training scheme that limits
the scope in which diffusion is trained. We use a control image that perfectly
aligns with the target image during training. In turn, this accurately
preserves garment details during inference. We demonstrate our method not only
effectively conserves garment details but also allows for layering, styling,
and shoe try-on. Our method runs multi-garment try-on in a single inference
cycle and can support high-quality zoomed-in generations without training in
higher resolutions. Finally, we show our method surpasses prior methods in
accuracy and quality. |
This paper introduces ACDG-VTON, a novel virtual try-on method leveraging diffusion models while preserving garment details by aligning garment features during training and using a novel zoom-in generation process. |
Current virtual try-on systems struggle to balance garment accuracy, generation quality, and user controllability. This method offers a solution by improving accuracy and quality while allowing for multi-garment layering, styling variations, and shoe try-on. |
ACDG-VTON uses a warp-then-diffuse pipeline. It generates a control image with aligned garment features and employs a ControlNet architecture with a modified training process. For high-resolution zoom-in, it crops and upsamples specific regions, leveraging the diffusion model's ability to accurately copy details. |
ACDG-VTON accurately preserves garment details like logos, text, textures, and patterns, outperforming existing diffusion-based methods.
User studies confirm that ACDG-VTON surpasses previous methods in accurately replicating garment details, both in full-body and zoomed-in views.
The method improves the visual quality of generated images compared to GAN-based approaches while maintaining garment accuracy and user controllability, as demonstrated through qualitative examples and user studies. |
The method's accuracy depends on the performance of the pre-trained warper and layout generator.
The system may struggle with garment types or poses not well-represented in the training dataset, such as garments with transparency or complex drape. |
virtual try-on, diffusion models, accuracy, controllability, image generation |
2403.13826
Report |
Measuring Diversity in Co-creative Image Generation |
Francisco Ibarrola, Kazjon Grace |
Quality and diversity have been proposed as reasonable heuristics for
assessing content generated by co-creative systems, but to date there has been
little agreement around what constitutes the latter or how to measure it.
Proposed approaches for assessing generative models in terms of diversity have
limitations in that they compare the model's outputs to a ground truth that in
the era of large pre-trained generative models might not be available, or
entail an impractical number of computations. We propose an alternative based
on entropy of neural network encodings for comparing diversity between sets of
images that does not require ground-truth knowledge and is easy to compute. We
also compare two pre-trained networks and show how the choice relates to the
notion of diversity that we want to evaluate. We conclude with a discussion of
the potential applications of these measures for ideation in interactive
systems, model evaluation, and more broadly within computational creativity. |
This paper proposes novel, computationally inexpensive methods for estimating within-set diversity of images generated by text-to-image AI systems, using either Truncated Inception Entropy (TIE) or Truncated CLIP Entropy (TCE). |
Diversity in generated images is crucial for interactive AI systems to support creative exploration and problem reframing, but current methods lack practicality or require ground truth data, which is often unavailable for large pre-trained models. |
The methods involve calculating the entropy of the empirical distribution of a set of generated images in a latent space derived from pre-trained networks (InceptionV3 for TIE and CLIP for TCE). |
TIE and TCE successfully differentiate between sets of images generated with varying degrees of diversity in prompt and style.
TCE, based on a model trained on both text and images, is more sensitive to semantic variations in images compared to TIE.
Preliminary experiments suggest TCE could be applicable to assessing text diversity as well. |
The proposed measures require further validation through human perception studies to confirm their alignment with human judgment of diversity.
Future work will explore the use of other pre-trained networks and layers, potentially leading to measures with different biases (e.g., more sensitive to visual textures). |
computational creativity, image generation, diversity measures, co-creative systems, generative ai |
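The entropy-based measure in the summary above can be illustrated with a simplified stand-in. The summary does not give the exact TIE/TCE formula, so this sketch assumes a Gaussian model with a diagonal covariance and keeps only the `k` largest per-dimension variances; the function name and the toy embeddings are hypothetical, and real use would take encodings from a pre-trained network such as InceptionV3 or CLIP.

```python
import math
from statistics import pvariance

def truncated_gaussian_entropy(embeddings, k, eps=1e-12):
    """Differential entropy of a diagonal Gaussian fit, using the top-k variances."""
    dims = list(zip(*embeddings))  # transpose: one tuple of values per dimension
    variances = sorted((pvariance(d) for d in dims), reverse=True)[:k]
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * (v + eps)) for v in variances)

tight = [[0.0, 0.0], [0.1, 0.1], [0.0, 0.1], [0.1, 0.0]]
wide = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
entropy_tight = truncated_gaussian_entropy(tight, k=2)
entropy_wide = truncated_gaussian_entropy(wide, k=2)
```

The more spread-out set scores higher, and no ground-truth reference set is needed, which matches the two properties the abstract asks of a diversity measure.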
2403.13807
Report |
Editing Massive Concepts in Text-to-Image Diffusion Models |
Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu |
Text-to-image diffusion models suffer from the risk of generating outdated,
copyrighted, incorrect, and biased content. While previous methods have
mitigated the issues on a small scale, it is essential to handle them
simultaneously in larger-scale real-world scenarios. We propose a two-stage
method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage
performs memory optimization for each individual concept with dual
self-distillation from text alignment loss and diffusion noise prediction loss.
The second stage conducts massive concept editing with multi-layer, closed form
model editing. We further propose a comprehensive benchmark, named ImageNet
Concept Editing Benchmark (ICEB), for evaluating massive concept editing for
T2I models with two subtasks, free-form prompts, massive concept categories,
and extensive evaluation metrics. Extensive experiments conducted on our
proposed benchmark and previous benchmarks demonstrate the superior scalability
of EMCID for editing up to 1,000 concepts, providing a practical approach for
fast adjustment and re-deployment of T2I diffusion models in real-world
applications. |
This paper proposes EMCID, a two-stage method for editing a large number of concepts in text-to-image diffusion models. |
Text-to-image diffusion models can generate outdated, biased, or incorrect content. Editing concepts within these models offers a practical solution without requiring expensive retraining. |
EMCID uses dual self-distillation to optimize concept representations in the first stage. In the second stage, it uses a closed-form solution for multi-layer model editing, enabling large-scale concept updates. |
EMCID successfully edits up to 1,000 concepts while preserving the generation quality for non-edited concepts.
A new comprehensive benchmark, ICEB, is introduced to evaluate large-scale concept editing in T2I models.
EMCID outperforms previous methods in terms of scalability, generalization ability, and specificity, particularly for editing a large number of concepts. |
EMCID faces limitations in erasing NSFW content due to the complexity of visual concepts and potential associations.
Future work can explore combining EMCID with methods targeting specific aspects like NSFW content removal for a more comprehensive solution. |
text-to-image generation, diffusion models, concept editing, model editing, benchmarking |
2403.13806
Report |
RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS |
Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, Federico Tombari |
Recent advances in view synthesis and real-time rendering have achieved
photorealistic quality at impressive rendering speeds. While Radiance
Field-based methods achieve state-of-the-art quality in challenging scenarios
such as in-the-wild captures and large-scale scenes, they often suffer from
excessively high compute requirements linked to volumetric rendering. Gaussian
Splatting-based methods, on the other hand, rely on rasterization and naturally
achieve real-time rendering but suffer from brittle optimization heuristics
that underperform on more challenging scenes. In this work, we present
RadSplat, a lightweight method for robust real-time rendering of complex
scenes. Our main contributions are threefold. First, we use radiance fields as
a prior and supervision signal for optimizing point-based scene
representations, leading to improved quality and more robust optimization.
Next, we develop a novel pruning technique reducing the overall point count
while maintaining high quality, leading to smaller and more compact scene
representations with faster inference speeds. Finally, we propose a novel
test-time filtering approach that further accelerates rendering and allows
scaling to larger, house-sized scenes. We find that our method enables
state-of-the-art synthesis of complex captures at 900+ FPS. |
Introduces RadSplat, a method that combines radiance fields and Gaussian Splatting for robust real-time rendering of complex scenes. |
Real-time, high-quality rendering of complex scenes requires overcoming the limitations of both radiance field methods (computationally expensive) and Gaussian Splatting methods (brittle optimization). |
1. Train a radiance field (ZipNeRF) as a robust prior. 2. Initialize and supervise a point-based 3DGS representation using the radiance field. 3. Introduce a ray contribution-based pruning technique for compact scene representation. 4. Perform viewpoint-based visibility filtering to accelerate rendering. |
Achieves state-of-the-art view synthesis quality, outperforming previous real-time methods and even surpassing offline method ZipNeRF in some metrics.
RadSplat renders at speeds exceeding 900 FPS, significantly faster than prior works.
Demonstrates robustness in handling complex real-world captures with lighting and exposure variations. |
Training time is longer compared to single-representation models.
A small quality gap to ZipNeRF remains on large-scale scenes. |
real-time rendering, gaussian splatting, neural fields, view synthesis, 3d reconstruction |
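The ray contribution-based pruning step in the RadSplat entry above can be illustrated with a minimal Python sketch. It assumes that a point's importance is the largest blending weight it contributes to any training ray and that points scoring below a threshold are dropped. The function name, the input layout, and the threshold value are hypothetical and are not taken from the authors' code.

```python
def prune_by_ray_contribution(contributions, threshold):
    """Return the ids of points worth keeping.

    contributions: dict mapping a point id to the list of blending weights
                   that point contributed along the rays it touched
                   (hypothetical layout, assumed for illustration).
    threshold:     minimum importance a point needs to survive pruning.
    """
    kept = []
    for point_id, weights in contributions.items():
        # Importance = strongest contribution to any single ray (assumption).
        importance = max(weights, default=0.0)
        if importance >= threshold:
            kept.append(point_id)
    return kept


# Toy usage: three points, one of which never touches a ray.
toy = {"a": [0.5, 0.01], "b": [0.005, 0.008], "c": []}
survivors = prune_by_ray_contribution(toy, threshold=0.01)
```

Scoring by the maximum rather than the mean keeps a point that matters for even one ray, which is one plausible way to shrink the point count without hurting rarely seen regions.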
2403.13788
Report |
DepthFM: Fast Monocular Depth Estimation with Flow Matching |
Ming Gui, Johannes S. Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer |
Monocular depth estimation is crucial for numerous downstream vision tasks
and applications. Current discriminative approaches to this problem are limited
due to blurry artifacts, while state-of-the-art generative methods suffer from
slow sampling due to their SDE nature. Rather than starting from noise, we seek
a direct mapping from input image to depth map. We observe that this can be
effectively framed using flow matching, since its straight trajectories through
solution space offer efficiency and high quality. Our study demonstrates that a
pre-trained image diffusion model can serve as an adequate prior for a flow
matching depth model, allowing efficient training on only synthetic data to
generalize to real images. We find that an auxiliary surface normals loss
further improves the depth estimates. Due to the generative nature of our
approach, our model reliably predicts the confidence of its depth estimates. On
standard benchmarks of complex natural scenes, our lightweight approach
exhibits state-of-the-art performance at favorable low computational cost
despite only being trained on little synthetic data. |
Presents DepthFM, a flow matching model for fast monocular depth estimation achieving state-of-the-art results with low computational cost. |
Monocular depth estimation is crucial for various vision tasks, but existing discriminative methods produce blurry depth maps and generative methods are slow. |
Leverages pre-trained image diffusion models as prior and employs data-dependent flow matching to learn a direct mapping from input image to depth, incorporating an auxiliary surface normals loss for enhanced geometric accuracy. |
Achieves state-of-the-art performance on standard benchmarks using only synthetic training data.
Significantly faster than diffusion-based methods due to its one-step inference capability.
Provides reliable confidence estimates, unlike discriminative approaches. |
Relies on accurate camera intrinsics for surface normal estimation.
Limited exploration of different pre-trained diffusion models as priors. |
depth estimation, flow matching, generative model, zero-shot learning, confidence estimation |
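The straight-trajectory idea behind flow matching in the DepthFM entry above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: plain lists stand in for image and depth latents, the velocity function is an oracle rather than a trained network, and all names are hypothetical. It does not reproduce the paper's model, loss weighting, or surface normals term.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path; the regression target."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps=1):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: the "image latent" is the start point, the "depth latent" the end.
image_latent = [0.0, 2.0]
depth_latent = [1.0, -2.0]


def oracle(x, t):
    # Stand-in for a trained network (assumption): returns the exact velocity.
    return target_velocity(image_latent, depth_latent)


one_step = euler_sample(image_latent, oracle, num_steps=1)
```

Because the path is a straight line with constant velocity, a single Euler step with an exact velocity reaches the endpoint, which is the intuition behind the fast, few-step inference described in the entry.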
2403.13745
Report |
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation |
Fu-Yun Wang, Xiaoshi Wu, Zhaoyang Huang, Xiaoyu Shi, Dazhong Shen, Guanglu Song, Yu Liu, Hongsheng Li |
Video outpainting is a challenging task, aiming at generating video content
outside the viewport of the input video while maintaining inter-frame and
intra-frame consistency. Existing methods fall short in either generation
quality or flexibility. We introduce MOTIA (Mastering Video Outpainting Through
Input-Specific Adaptation), a diffusion-based pipeline that leverages both the
intrinsic data-specific patterns of the source video and the image/video
generative prior for effective outpainting. MOTIA comprises two main phases:
input-specific adaptation and pattern-aware outpainting. The input-specific
adaptation phase involves conducting efficient and effective pseudo outpainting
learning on the single-shot source video. This process encourages the model to
identify and learn patterns within the source video, as well as bridging the
gap between standard generative processes and outpainting. The subsequent
phase, pattern-aware outpainting, is dedicated to the generalization of these
learned patterns to generate outpainting outcomes. Additional strategies
including spatial-aware insertion and noise travel are proposed to better
leverage the diffusion model's generative prior and the acquired video patterns
from source videos. Extensive evaluations underscore MOTIA's superiority,
outperforming existing state-of-the-art methods in widely recognized
benchmarks. Notably, these advancements are achieved without necessitating
extensive, task-specific tuning. |
Introduces MOTIA, a diffusion-based video outpainting pipeline that leverages both intrinsic data-specific patterns of source videos and the image/video generative prior for effective outpainting. |
Video outpainting is crucial for adapting videos to various aspect ratios and screen sizes seamlessly while preserving temporal and spatial consistency, which is challenging for existing methods. |
Employs a two-stage process: 1) input-specific adaptation by conducting pseudo outpainting learning on the source video itself and 2) pattern-aware outpainting by combining learned patterns with diffusion models, incorporating spatial-aware insertion and noise regret strategies. |
Significantly outperforms state-of-the-art methods in quantitative metrics (PSNR, SSIM, LPIPS, FVD) on DAVIS and YouTube-VOS benchmarks.
Demonstrates superior visual quality and realism in qualitative comparisons, effectively handling both foreground and background outpainting.
Showcases flexibility in handling various mask types, video resolutions and lengths, and arbitrary styles, surpassing previous limitations. |
Struggles with outpainting videos containing limited source information.
Future work could explore better utilization of temporal information for enhanced consistency. |
video outpainting, diffusion models, input-specific adaptation, pattern-aware outpainting, spatial-aware insertion |
2403.13600
Report |
VL-Mamba: Exploring State Space Models for Multimodal Learning |
Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu |
Multimodal large language models (MLLMs) have attracted widespread interest
and have rich applications. However, the attention mechanism inherent in their
Transformer structure has quadratic complexity and results in expensive
computational overhead. Therefore, in this work, we propose VL-Mamba, a
multimodal large language model based on state space models, which have been
shown to have great potential for long-sequence modeling with fast inference
and linear scaling in sequence length. Specifically, we first replace the
transformer-based backbone language model such as LLaMA or Vicuna with the
pre-trained Mamba language model. Then, we empirically explore how to
effectively apply the 2D vision selective scan mechanism for multimodal
learning and the combinations of different vision encoders and variants of
pretrained Mamba language models. Extensive experiments on diverse
multimodal benchmarks show competitive performance, confirming the
effectiveness of VL-Mamba and demonstrating the great potential of applying
state space models to multimodal learning tasks. |
This paper introduces VL-Mamba, the first exploration of using the state space model 'Mamba' for multimodal learning tasks, aiming to leverage its efficiency for handling long sequences in vision and language understanding. |
Existing multimodal large language models (MLLMs) heavily rely on Transformers, which suffer from quadratic complexity in attention mechanisms, making them computationally expensive for long sequences. VL-Mamba addresses this limitation by employing the Mamba model known for its linear scaling in sequence length. |
VL-Mamba comprises a pre-trained Mamba language model, a vision encoder (Vision Transformer), and a novel MultiModal Connector (MMC). The MMC, incorporating a 2D vision selective scan mechanism, bridges the gap between non-causal image data and the causal modeling of SSMs. Two scan mechanisms, Bidirectional and Cross Scanning, are explored within the MMC. |
VL-Mamba achieves competitive performance on eight multimodal benchmarks, comparable to state-of-the-art MLLMs despite having fewer parameters and less training data.
The study shows that VL-Mamba outperforms some larger models, highlighting the efficiency of SSMs for multimodal learning.
Ablation studies confirm the effectiveness of different components, including language model variants, vision encoders, MMC architectures, and scan mechanisms. |
The paper primarily focuses on the 2D selective scan mechanism in the MMC, leaving the exploration of higher-quality training data for future work.
Future research could investigate incorporating the training data used by top-performing MLLMs to potentially enhance VL-Mamba's performance further. |
multimodal learning, large language models, state space models, vision and language, mamba |
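The two scan mechanisms named in the VL-Mamba entry above turn a 2D grid of patch features into 1D sequences that a causal state space model can consume. The Python sketch below shows one plausible reading of them. The exact scan orders are an assumption (row-major plus its reverse for the bidirectional scan; row-major, column-major, and both reverses for the cross scan), and the function names are hypothetical.

```python
def bidirectional_scan(grid):
    """Flatten a 2D patch grid row by row, forward and backward."""
    forward = [patch for row in grid for patch in row]
    return [forward, forward[::-1]]


def cross_scan(grid):
    """Flatten a 2D patch grid along four directions (assumed orders)."""
    rows, cols = len(grid), len(grid[0])
    row_major = [patch for row in grid for patch in row]
    col_major = [grid[r][c] for c in range(cols) for r in range(rows)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


# Toy usage: a 2x2 grid whose entries stand in for patch features.
patches = [[1, 2],
           [3, 4]]
two_sequences = bidirectional_scan(patches)
four_sequences = cross_scan(patches)
```

Each resulting sequence would be processed causally and the outputs merged, so that every patch can receive context from patches on all sides even though each individual scan only looks backward.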
2403.13589
Report |
ReGround: Improving Textual and Spatial Grounding at No Cost |
Yuseung Lee, Minhyuk Sung |
When an image generation process is guided by both a text prompt and spatial
cues, such as a set of bounding boxes, do these elements work in harmony, or
does one dominate the other? Our analysis of a pretrained image diffusion model
that integrates gated self-attention into the U-Net reveals that spatial
grounding often outweighs textual grounding due to the sequential flow from
gated self-attention to cross-attention. We demonstrate that such bias can be
significantly mitigated without sacrificing accuracy in either grounding by
simply rewiring the network architecture, changing from sequential to parallel
for gated self-attention and cross-attention. This surprisingly simple yet
effective solution does not require any fine-tuning of the network but
significantly reduces the trade-off between the two groundings. Our experiments
demonstrate significant improvements from the original GLIGEN to the rewired
version in the trade-off between textual grounding and spatial grounding. |
This paper introduces ReGround, a method to improve textual grounding in layout-guided image generation by rewiring the attention mechanism in GLIGEN from sequential to parallel. |
Existing methods like GLIGEN, while enabling spatial grounding with bounding boxes, often overlook textual details in prompts, leading to a trade-off between textual and spatial accuracy. |
The authors propose a simple rewiring of the network architecture in GLIGEN, changing the relationship between gated self-attention (spatial grounding) and cross-attention (textual grounding) from sequential to parallel. |
ReGround significantly reduces the trade-off between textual and spatial grounding, achieving higher CLIP scores (textual grounding) while maintaining comparable YOLO scores (spatial grounding) to GLIGEN.
The improvement is consistent across different datasets, including MS-COCO and a newly introduced NSR-1K-GPT dataset.
ReGround's effectiveness extends to other frameworks that use GLIGEN as a backbone, such as BoxDiff, demonstrating its broad applicability. |
The study primarily focuses on GLIGEN and its application with bounding box layouts, potentially limiting its generalizability to other spatial grounding techniques.
Further investigation into the impact of rewiring on more complex and diverse layout representations could be beneficial. |
textual grounding, spatial grounding, image generation, diffusion models, network rewiring |
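The sequential-to-parallel rewiring described in the ReGround entry above can be sketched as a change in how two residual updates are composed. The Python below is a toy illustration: the two callables stand in for gated self-attention (spatial grounding) and cross-attention (textual grounding), and the residual form is an assumption made for illustration rather than the authors' implementation.

```python
def add(*vectors):
    """Element-wise sum of equally sized lists."""
    return [sum(values) for values in zip(*vectors)]


def sequential_block(x, gated_self_attn, cross_attn):
    """Original wiring: cross-attention sees features already altered
    by the spatial-grounding update."""
    h = add(x, gated_self_attn(x))
    return add(h, cross_attn(h))


def parallel_block(x, gated_self_attn, cross_attn):
    """Rewired version: both modules read the same input and their
    updates are summed."""
    return add(x, gated_self_attn(x), cross_attn(x))


# Toy stand-ins for the two attention modules (hypothetical).
def toy_spatial(x):
    return [2.0 * v for v in x]


def toy_text(x):
    return [v + 1.0 for v in x]


features = [1.0]
seq_out = sequential_block(features, toy_spatial, toy_text)
par_out = parallel_block(features, toy_spatial, toy_text)
```

In the parallel form the textual update no longer depends on the output of the spatial update, and no weights change, which matches the entry's claim that the fix needs no fine-tuning.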
2403.13551
Report |
Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing |
Hangeol Chang, Jinho Chang, Jong Chul Ye |
Despite recent advancements in text-to-image diffusion models facilitating
various image editing techniques, complex text prompts often lead to an
oversight of some requests due to a bottleneck in processing text information.
To tackle this challenge, we present Ground-A-Score, a simple yet powerful
model-agnostic image editing method by incorporating grounding during score
distillation. This approach ensures a precise reflection of intricate prompt
requirements in the editing outcomes, taking into account the prior knowledge
of the object locations within the image. Moreover, the selective application
with a new penalty coefficient and contrastive loss helps to precisely target
editing areas while preserving the integrity of the objects in the source
image. Both qualitative assessments and quantitative analyses confirm that
Ground-A-Score successfully adheres to the intricate details of extended and
multifaceted prompts, ensuring high-quality outcomes that respect the original
image attributes. |
Presents Ground-A-Score, a model-agnostic image editing method using grounding during score distillation for multi-attribute editing, improving accuracy and detail in complex prompts. |
Existing score distillation methods struggle to accurately reflect complex prompts with multiple editing requirements, often overlooking specific objects or compositions. |
Breaks down complex prompts into subtasks, calculates score gradients separately, aggregates them with grounding information, and introduces a null-text penalty to prevent object distortion during optimization. |
Successfully edits multiple image attributes according to complex prompts, outperforming baseline models in qualitative assessments.
Quantitative analyses confirm Ground-A-Score achieves higher image quality (lower LPIPS) and better prompt adherence (higher masked CLIP score).
User study confirms Ground-A-Score produces edits more aligned with user intent, preserving original features while ensuring high overall image quality. |
Reliance on pre-trained models (T2I diffusion, grounding, LLM) may inherit their limitations.
Performance may vary across diverse image domains and with highly complex or ambiguous prompts. |
image editing, diffusion models, score distillation, multi-attribute editing, grounding |
2403.13535
Report |
IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models |
Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, Ziyong Feng |
Leveraging Stable Diffusion for the generation of personalized portraits has
emerged as a powerful and noteworthy tool, enabling users to create
high-fidelity, custom character avatars based on their specific prompts.
However, existing personalization methods face challenges, including test-time
fine-tuning, the requirement of multiple input images, low preservation of
identity, and limited diversity in generated outcomes. To overcome these
challenges, we introduce IDAdapter, a tuning-free approach that enhances the
diversity and identity preservation in personalized image generation from a
single face image. IDAdapter integrates a personalized concept into the
generation process through a combination of textual and visual injections and a
face identity loss. During the training phase, we incorporate mixed features
from multiple reference images of a specific identity to enrich
identity-related content details, guiding the model to generate images with
more diverse styles, expressions, and angles compared to previous works.
Extensive evaluations demonstrate the effectiveness of our method, achieving
both diversity and identity fidelity in generated images. |
This paper presents IDAdapter, a tuning-free method for personalizing text-to-image synthesis models using a single face image, achieving high diversity in generated images without test-time fine-tuning. |
Existing personalization methods struggle with challenges like test-time fine-tuning, needing multiple input images, low identity preservation, and limited output diversity. IDAdapter addresses these limitations by enabling diverse and high-fidelity image generation from a single face image. |
IDAdapter integrates mixed features from multiple reference images during training to enrich identity information, guiding the model to generate images with diverse styles, expressions, and angles. It employs textual and visual injections to incorporate a personalized concept and uses a face identity loss to preserve identity. |
IDAdapter outperforms existing methods in generating diverse and high-fidelity personalized images.
It successfully decouples identity and non-identity features, allowing for variations in expression, pose, and style while maintaining facial fidelity.
The use of mixed facial features from multiple reference images significantly improves diversity and identity preservation compared to using a single image. |
The model's performance might be influenced by the quality and diversity of the training dataset.
Future work could explore extending the method to handle more complex personalization scenarios, such as full-body generation with diverse clothing and accessories. |
text-to-image synthesis, personalization, diffusion models, face generation, tuning-free |
2403.13524
Report |
Compress3D: a Compressed Latent Space for 3D Generation from a Single Image |
Bowen Zhang, Tianyu Yang, Yu Li, Lei Zhang, Xi Zhao |
3D generation has witnessed significant advancements, yet efficiently
producing high-quality 3D assets from a single image remains challenging. In
this paper, we present a triplane autoencoder, which encodes 3D models into a
compact triplane latent space to effectively compress both the 3D geometry and
texture information. Within the autoencoder framework, we introduce a 3D-aware
cross-attention mechanism, which utilizes low-resolution latent representations
to query features from a high-resolution 3D feature volume, thereby enhancing
the representation capacity of the latent space. Subsequently, we train a
diffusion model on this refined latent space. In contrast to solely relying on
image embedding for 3D generation, our proposed method advocates for the
simultaneous utilization of both image embedding and shape embedding as
conditions. Specifically, the shape embedding is estimated via a diffusion
prior model conditioned on the image embedding. Through comprehensive
experiments, we demonstrate that our method outperforms state-of-the-art
algorithms, achieving superior performance while requiring less training data
and time. Our approach enables the generation of high-quality 3D assets in
merely 7 seconds on a single A100 GPU. |
This paper introduces Compress3D, a novel two-stage diffusion model for generating high-quality 3D models from single images using a compressed latent space. |
Efficiently generating high-quality 3D models from single images is crucial for various applications, but remains challenging due to limitations in data size and computational efficiency. |
The method employs a triplane autoencoder with a 3D-aware cross-attention mechanism to compress 3D models into a compact latent space. It then utilizes a diffusion prior model to estimate shape embeddings from image embeddings, and a triplane diffusion model generates 3D models conditioned on both shape and image embeddings. |
Compress3D outperforms state-of-the-art methods in terms of FID and CLIP similarity, indicating superior generation quality.
It requires significantly less training data and time compared to previous approaches.
The method enables fast generation of high-quality 3D assets in approximately 7 seconds on a single A100 GPU. |
The model's performance might be further improved by exploring alternative 3D representations beyond FlexiCubes.
Investigating the generalization ability of Compress3D on more diverse datasets with complex scenes and objects could be beneficial. |
3d generation, diffusion model, triplane representation, latent space compression, shape embedding |
2403.13447
Report |
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models |
Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang |
Recent advancements indicate that scaling up Multimodal Large Language Models
(MLLMs) effectively enhances performance on downstream multimodal tasks. The
prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into
text-like tokens using a static vision-language mapper, thereby enabling
static LLMs to develop the capability to comprehend visual information
through visual instruction tuning. Although promising, the static tuning
strategy (i.e., a trained model with static parameters) that shares the same
parameters may constrain performance across
different downstream multimodal tasks. In light of this, we introduce
HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters,
in conjunction with a dynamic visual expert and language expert, respectively.
These experts are derived from HyperNetworks, which generate adaptive
parameter shifts through visual and language guidance, enabling dynamic
projector and LLM modeling in two-stage training.
Our experiments demonstrate that our solution significantly surpasses LLaVA
on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and
LLaVA-Bench. Our project is available at
https://github.com/DCDmllm/HyperLLaVA. |
This paper introduces HyperLLaVA, an enhanced Multimodal Large Language Model (MLLM) that adaptively tunes both projector and LLM parameters using dynamic visual and language experts derived from HyperNetworks. |
Existing MLLMs often rely on static tuning, limiting their flexibility and performance across diverse multimodal tasks. HyperLLaVA addresses this limitation by dynamically adapting to visual and language inputs, resulting in superior performance. |
HyperLLaVA employs a two-stage training process: 1) Visual-language alignment: A visual expert dynamically adjusts the projector's output based on visual features. 2) Multimodal instruction tuning: A language expert dynamically models LLM layers guided by intermediate LLM outputs. |
HyperLLaVA significantly outperforms LLaVA and other state-of-the-art MLLMs on 11 out of 12 benchmarks, including VQA, image captioning, and visual reasoning tasks.
The dynamic tuning approach in HyperLLaVA proves more effective than static tuning, demonstrating its ability to generate adaptive visual tokens and instruction-specific features.
HyperLLaVA's language expert functions as a parameter-efficient fine-tuning method, achieving comparable performance to traditional methods while updating fewer parameters. |
The impact of varying the size of the visual and language experts on performance needs further investigation.
Exploring the application of dynamic tuning to other MLLM architectures and pretraining objectives could be a promising future direction. |
multimodal large language model, hypernetwork, dynamic tuning, parameter-efficient fine-tuning, vision-language alignment |
2403.13438
Report |
See, Imagine, Plan: Discovering and Hallucinating Tasks from a Single Image |
Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, Andrew Markham |
Humans can not only recognize and understand the world in its current state
but also envision future scenarios that extend beyond immediate perception. To
emulate this profound human capacity, we introduce zero-shot task
hallucination -- given a single RGB image of any scene comprising unknown
environments and objects, our model can identify potential tasks and imagine
their execution in a vivid narrative, realized as a video. We develop a modular
pipeline that progressively enhances scene decomposition, comprehension, and
reconstruction, incorporating VLM for dynamic interaction and 3D motion
planning for object trajectories. Our model can discover diverse tasks, with
the generated task videos demonstrating realistic and compelling visual
outcomes that are understandable by both machines and humans. Project Page:
https://dannymcy.github.io/zeroshot_task_hallucination/ |
This paper introduces 'zero-shot task hallucination,' enabling a model to identify potential tasks from a single RGB image of an unknown scene and generate a video visualizing the task execution. |
This work mimics the human ability to envision and plan future scenarios from visual perception, potentially leading to applications like robotic task discovery and interactive visual guidance. |
The paper proposes a modular pipeline combining VLM for task discovery, 2D/3D scene reconstruction, a novel axes-constrained 3D planning approach for object trajectory generation, and rendering for video creation. |
The model discovers diverse and contextually relevant tasks within various scenes.
Generated videos demonstrate realistic object manipulation aligned with task descriptions.
Human evaluation confirms the visual quality and interpretability of the generated task videos. |
The quality of generated videos can be influenced by the performance of individual components, such as segmentation or 3D reconstruction.
The current approach primarily focuses on rigid object manipulation, with future work exploring deformable objects and more complex interactions. |
task hallucination, vision-language models, 3d scene reconstruction, motion planning, video generation |
2403.13408
Report |
S2DM: Sector-Shaped Diffusion Models for Video Generation |
Haoran Lang, Yuxuan Ge, Zheng Tian |
Diffusion models have achieved great success in image generation. However,
when leveraging this idea for video generation, we face significant challenges
in maintaining the consistency and continuity across video frames. This is
mainly caused by the lack of an effective framework to align frames of videos
with desired temporal features while preserving consistent semantic and
stochastic features. In this work, we propose a novel Sector-Shaped Diffusion
Model (S2DM) whose sector-shaped diffusion region is formed by a set of
ray-shaped reverse diffusion processes starting at the same noise point. S2DM
can generate a group of intrinsically related data sharing the same semantic
and stochastic features while varying on temporal features with appropriate
guided conditions. We apply S2DM to video generation tasks, and explore the use
of optical flow as temporal conditions. Our experimental results show that S2DM
outperforms many existing methods in the task of video generation without any
temporal-feature modelling modules. For text-to-video generation tasks where
temporal conditions are not explicitly given, we propose a two-stage generation
strategy which can decouple the generation of temporal features from
semantic-content features. We show that, without additional training, our model
integrated with another temporal conditions generative model can still achieve
comparable performance with existing works. Our results can be viewed at
https://s2dm.github.io/S2DM/. |
This paper introduces S2DM, a novel Sector-Shaped Diffusion Model for generating videos with high consistency and coherence by modeling the generation process as a sector-shaped diffusion region. |
Generating consistent and continuous videos using diffusion models is challenging due to the difficulty in aligning video frames with desired temporal features while preserving semantic and stochastic features. |
S2DM employs a sector-shaped diffusion region formed by multiple ray-shaped reverse diffusion processes starting from the same noise point. Each process is guided by identical semantic conditions and varying temporal conditions to ensure consistency and temporal alignment. |
S2DM outperforms existing methods in optical flow-guided video generation tasks on MHAD and MUG datasets.
A two-stage text-to-video generation strategy using S2DM achieves comparable results to state-of-the-art methods.
Ablation studies confirm the effectiveness of the shared noise assumption in S2DM for maintaining video consistency. |
The current method of incorporating semantic and temporal conditions could be improved for better control.
Exploring additional temporal conditions beyond optical flow would further demonstrate the generality of S2DM. |
video generation, diffusion models, optical flow, text-to-video, consistency |
2403.13352
Report |
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation |
Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Bohua Chen, Yemin Shi, Chengwei Pan |
Text-to-Image (T2I) diffusion models have achieved remarkable success in
image generation. Despite their progress, challenges remain in
prompt-following ability and image quality, along with a lack of high-quality
datasets, which are essential for refining these models. As acquiring labeled data is
costly, we introduce AGFSync, a framework that enhances T2I diffusion models
through Direct Preference Optimization (DPO) in a fully AI-driven approach.
AGFSync utilizes Vision-Language Models (VLM) to assess image quality across
style, coherence, and aesthetics, generating feedback data within an AI-driven
loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and
SDXL, our extensive experiments on the TIFA dataset demonstrate notable
improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2
benchmark, consistently outperforming the base models. AGFSync's method of
refining T2I diffusion models paves the way for scalable alignment techniques. |
AGFSync is a novel framework that leverages AI-generated feedback and Direct Preference Optimization (DPO) to improve the quality of text-to-image generation. |
Existing methods for enhancing text-to-image generation often rely on expensive human-labeled data and may not fully capture the nuances of image quality across different aspects like style, coherence, and aesthetics. |
The framework uses LLMs to generate diverse textual prompts and corresponding question-answer pairs. Then, it uses a VQA model, CLIP score, and aesthetic scoring model to evaluate the generated images. Finally, it applies DPO to fine-tune the diffusion model based on the constructed preference pairs. |
Significantly improves image quality across different models and benchmarks, as demonstrated by higher VQA, CLIP, and aesthetic scores.
Generates images that are more faithful to the input prompts and exhibit better coherence with real-world rules.
Achieves 100% data conversion efficiency, compared with lower rates in methods such as DreamSync. |
The performance of AGFSync is dependent on the capabilities and potential biases of the LLMs and aesthetic scoring models used.
The introduction of random noise for image diversity might sometimes lead to reduced consistency between some images and their prompts. |
text-to-image generation, diffusion models, direct preference optimization, ai feedback, image quality evaluation |
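For the AGFSync entry above, the step that turns per-image scores into DPO preference pairs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the score ranges, and the `WEIGHTS` values are all assumptions, since the summary does not say how the VQA, CLIP, and aesthetic scores are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    vqa: float        # fraction of generated QA pairs answered correctly (assumed in [0, 1])
    clip: float       # CLIP text-image similarity (assumed rescaled to [0, 1])
    aesthetic: float  # aesthetic score (assumed rescaled to [0, 1])


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"vqa": 0.5, "clip": 0.25, "aesthetic": 0.25}


def combined_score(c: Candidate) -> float:
    """Weighted sum of the three evaluation signals."""
    return (WEIGHTS["vqa"] * c.vqa
            + WEIGHTS["clip"] * c.clip
            + WEIGHTS["aesthetic"] * c.aesthetic)


def build_preference_pair(prompt, candidates):
    """Return a DPO-style record: best candidate is 'chosen', worst is 'rejected'."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates per prompt")
    ranked = sorted(candidates, key=combined_score, reverse=True)
    return {"prompt": prompt,
            "chosen": ranked[0].image_id,
            "rejected": ranked[-1].image_id}


candidates = [
    Candidate("img_a", vqa=0.9, clip=0.7, aesthetic=0.6),
    Candidate("img_b", vqa=0.5, clip=0.6, aesthetic=0.8),
    Candidate("img_c", vqa=0.7, clip=0.8, aesthetic=0.7),
]
pair = build_preference_pair("a red cube on a blue sphere", candidates)
```

The resulting records (prompt, chosen, rejected) are the usual input shape for a DPO fine-tuning loop; the fine-tuning step itself is outside this sketch.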
2403.13304
Report |
DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception |
Yibo Wang, Ruiyuan Gao, Kai Chen, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, Kai Zhang |
Current perceptive models heavily depend on resource-intensive datasets,
prompting the need for innovative solutions. Leveraging recent advances in
diffusion models, synthetic data, by constructing image inputs from various
annotations, proves beneficial for downstream tasks. While prior methods have
separately addressed generative and perceptive models, DetDiffusion, for the
first time, harmonizes both, tackling the challenges in generating effective
data for perceptive models. To enhance image generation with perceptive models,
we introduce perception-aware loss (P.A. loss) through segmentation, improving
both quality and controllability. To boost the performance of specific
perceptive models, our method customizes data augmentation by extracting and
utilizing perception-aware attribute (P.A. Attr) during generation.
Experimental results from the object detection task highlight DetDiffusion's
superior performance, establishing a new state-of-the-art in layout-guided
generation. Furthermore, image syntheses from DetDiffusion can effectively
augment training data, significantly enhancing downstream detection
performance. |
This paper introduces DetDiffusion, a novel framework that leverages the synergy between generative and perceptive models to enhance controlled image generation and improve the performance of downstream perception tasks. |
Existing perceptive models rely heavily on large, labeled datasets, which are expensive to obtain. DetDiffusion addresses this by generating synthetic data tailored for perception tasks, potentially improving data efficiency and model performance. |
DetDiffusion integrates perception-aware attributes (P.A. Attr) extracted from a pre-trained detector and a perception-aware loss (P.A. loss) based on segmentation into a geometric-aware diffusion model. |
DetDiffusion achieves state-of-the-art performance in layout-guided image generation, surpassing previous methods in FID and YOLO score.
Synthetic data generated by DetDiffusion effectively augments training data, leading to significant improvements in downstream object detection performance.
The framework demonstrates control over the difficulty of generated images by manipulating the perception-aware attributes, enabling the generation of challenging examples for improved training. |
Currently, DetDiffusion primarily focuses on object detection tasks. Expanding its applicability to other perception tasks is a potential direction for future research.
Further exploration of how to generate high-quality, human-aligned images while mitigating harmful or toxic content is crucial for practical applications. |
generative models, perceptive models, diffusion models, synthetic data generation, object detection |
2403.13187
Report |
Evolutionary Optimization of Model Merging Recipes |
Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha |
We present a novel application of evolutionary algorithms to automate the
creation of powerful foundation models. While model merging has emerged as a
promising approach for LLM development due to its cost-effectiveness, it
currently relies on human intuition and domain knowledge, limiting its
potential. Here, we propose an evolutionary approach that overcomes this
limitation by automatically discovering effective combinations of diverse
open-source models, harnessing their collective intelligence without requiring
extensive additional training data or compute. Our approach operates in both
parameter space and data flow space, allowing for optimization beyond just the
weights of the individual models. This approach even facilitates cross-domain
merging, generating models like a Japanese LLM with Math reasoning
capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art
performance on a variety of established Japanese LLM benchmarks, even
surpassing models with significantly more parameters, despite not being
explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM
generated through our approach demonstrates its effectiveness in describing
Japanese culture-specific content, outperforming previous Japanese VLMs. This
work not only contributes new state-of-the-art models back to the open-source
community, but also introduces a new paradigm for automated model composition,
paving the way for exploring alternative, efficient approaches to foundation
model development. |
The paper introduces Evolutionary Model Merge, a novel approach using evolutionary algorithms to automatically discover optimal combinations of open-source foundation models, creating new models with enhanced capabilities without extensive training. |
Model merging, while promising for its cost-effectiveness, currently relies on human intuition and domain knowledge, limiting its potential. This paper presents an automated approach to overcome this limitation and democratize foundation model development. |
The method leverages evolutionary algorithms to optimize model merging in both parameter space (e.g., using DARE-TIES for weight merging) and data flow space (e.g., evolving the inference path through model layers), enabling exploration of a wider range of model combinations. |
Generated a Japanese LLM with Math reasoning abilities, achieving state-of-the-art performance on Japanese LLM benchmarks, surpassing even larger models.
Created a culturally-aware Japanese VLM that excels in describing Japanese culture-specific content, outperforming existing Japanese VLMs.
Demonstrated the effectiveness of combining parameter space and data flow space merging for enhanced model capabilities. |
The merged models inherit limitations of the source models, such as potential for logical inconsistencies.
The study does not include instruction fine-tuning or alignment, which could lead to factually flawed outputs. |
evolutionary algorithms, model merging, foundation models, language models, vision-language models |
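For the Evolutionary Model Merge entry above, the idea of searching over parameter-space merge coefficients can be sketched with a toy example. Several simplifications are assumed here: the merge is plain per-layer linear interpolation rather than DARE-TIES, the optimizer is a small elitist evolution strategy rather than whatever the paper uses, and the `fitness` function is a stand-in for a benchmark score. The "models" are just nested lists of floats.

```python
import random


def merge(model_a, model_b, alphas):
    """Per-layer linear interpolation: alpha * A + (1 - alpha) * B."""
    return [
        [a * wa + (1.0 - a) * wb for wa, wb in zip(layer_a, layer_b)]
        for a, layer_a, layer_b in zip(alphas, model_a, model_b)
    ]


def evolve(model_a, model_b, fitness, generations=40, population=16,
           elite=4, sigma=0.2, seed=0):
    """Elitist evolution strategy over one mixing coefficient per layer."""
    rng = random.Random(seed)
    n_layers = len(model_a)
    pop = [[rng.random() for _ in range(n_layers)] for _ in range(population)]

    def score(alphas):
        return fitness(merge(model_a, model_b, alphas))

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[:elite]  # keep the best recipes unchanged
        children = []
        while len(parents) + len(children) < population:
            parent = rng.choice(parents)
            # Gaussian mutation, clipped so coefficients stay in [0, 1]
            children.append([min(1.0, max(0.0, a + rng.gauss(0.0, sigma)))
                             for a in parent])
        pop = parents + children
    return max(pop, key=score)


# Toy setup: model A is "good" at layer 0, model B is "good" at layer 1.
model_a = [[1.0, 1.0], [0.0, 0.0]]
model_b = [[0.0, 0.0], [1.0, 1.0]]
target = [[1.0, 1.0], [1.0, 1.0]]


def fitness(model):
    """Stand-in for a benchmark score: negative squared distance to a target."""
    return -sum((w - t) ** 2
                for layer, t_layer in zip(model, target)
                for w, t in zip(layer, t_layer))


best_alphas = evolve(model_a, model_b, fitness)
best_score = fitness(merge(model_a, model_b, best_alphas))
uniform_score = fitness(merge(model_a, model_b, [0.5, 0.5]))
```

In this toy problem the search should move toward taking layer 0 from model A and layer 1 from model B, which a single global mixing ratio cannot express.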
2403.13163
Report |
DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring |
Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu |
Blurry images may contain local and global non-uniform artifacts, which
complicate the deblurring process and make it more challenging to achieve
satisfactory results. Recently, Transformers have produced better deblurring
outcomes than existing CNN architectures. However, the large model size and
long inference time are still two bothersome issues which have not been fully
explored. To this end, we propose DeblurDiNAT, a compact encoder-decoder
Transformer which efficiently restores clean images from real-world blurry
ones. We adopt an alternating dilation factor structure with the aim of
global-local feature learning. Also, we observe that simply using
self-attention layers in networks does not always produce good deblurred
results. To solve this problem, we propose a channel modulation self-attention
(CMSA) block, where a cross-channel learner (CCL) is utilized to capture
channel relationships. In addition, we present a divide and multiply
feed-forward network (DMFN) allowing fast feature propagation. Moreover, we
design a lightweight gated feature fusion (LGFF) module, which performs
controlled feature merging. Comprehensive experimental results show that the
proposed model, named DeblurDiNAT, provides a favorable performance boost
without introducing noticeable computational costs over the baseline, and
achieves state-of-the-art (SOTA) performance on several image deblurring
datasets. Compared to nearest competitors, our space-efficient and time-saving
method demonstrates a stronger generalization ability with 3%-68% fewer
parameters and produces deblurred images that are visually closer to the ground
truth. |
This paper presents DeblurDiNAT, a lightweight and effective Transformer for image deblurring, which leverages dilated neighborhood attention and channel modulation to capture global-local blur information efficiently. |
Existing Transformer-based image deblurring methods often struggle to balance computational efficiency with deblurring accuracy, making it challenging to achieve optimal results without high computational costs. |
The proposed DeblurDiNAT utilizes an alternating dilation factor structure (ADFS) with dilated neighborhood attention for capturing both global and local blur patterns. It introduces a channel modulation self-attention block (CMSA) to capture cross-channel relationships effectively. Additionally, it employs a divide and multiply feed-forward network (DMFN) for fast feature propagation and a lightweight gated feature fusion (LGFF) module for efficient feature aggregation. |
DeblurDiNAT-L achieves state-of-the-art performance on GoPro and HIDE datasets while being significantly faster and requiring less memory than competitors.
The proposed method demonstrates superior generalization ability, outperforming existing models on real-world datasets RealBlur-R and RealBlur-J.
Ablation studies confirm the effectiveness of each proposed component (ADFS, CMSA, DMFN, LGFF) in improving deblurring performance and efficiency. |
The current implementation of DeblurDiNAT focuses on single-image deblurring.
Exploring the potential of DeblurDiNAT for video deblurring could be a promising direction. |
image deblurring, transformer, dilated neighborhood attention, channel modulation, lightweight model |
2403.13064
Report |
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model |
Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, Vasileios Balntas |
We introduce SceneScript, a method that directly produces full scene models
as a sequence of structured language commands using an autoregressive,
token-based approach. Our proposed scene representation is inspired by recent
successes in transformers & LLMs, and departs from more traditional methods
which commonly describe scenes as meshes, voxel grids, point clouds or radiance
fields. Our method infers the set of structured language commands directly from
encoded visual data using a scene language encoder-decoder architecture. To
train SceneScript, we generate and release a large-scale synthetic dataset
called Aria Synthetic Environments consisting of 100k high-quality indoor
scenes, with photorealistic and ground-truth annotated renders of egocentric
scene walkthroughs. Our method gives state-of-the-art results in architectural
layout estimation, and competitive results in 3D object detection. Lastly, we
explore an advantage for SceneScript, which is the ability to readily adapt to
new commands via simple additions to the structured language, which we
illustrate for tasks such as coarse 3D object part reconstruction. |
Introduces SceneScript, a method for reconstructing 3D scenes by predicting a sequence of structured language commands from egocentric videos. |
Provides a compact, editable, and interpretable scene representation that can be readily extended to new tasks, bridging the gap between 3D reconstruction and language models. |
Uses an encoder-decoder architecture with different encoder options (point cloud, posed images, combined) and a transformer decoder to predict a sequence of language commands describing walls, doors, windows, bounding boxes, and more. |
Achieves state-of-the-art architectural layout estimation on the proposed Aria Synthetic Environments (ASE) dataset.
Shows competitive 3D object detection performance on ASE and ScanNet by simply adding a bounding box command.
Demonstrates extensibility by incorporating commands for coarse 3D object parts, curved entities, entity compositions, and object states. |
Structured language commands are currently manually defined.
Capturing fine-grained geometric details with high precision remains challenging due to the high-level nature of the commands. |
3d reconstruction, scene representation, structured language, egocentric vision, synthetic datasets |
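For the SceneScript entry above, the notion of a scene as a sequence of structured language commands can be illustrated with a small parser. The command names and parameter keys used here (`make_wall`, `make_door`, `make_bbox`, `a_x`, and so on) are hypothetical placeholders chosen for illustration and may not match the paper's actual grammar.

```python
import math


def parse_command(line):
    """Split 'name, key=value, ...' into (name, {key: float})."""
    name, *pairs = [part.strip() for part in line.split(",")]
    params = {}
    for pair in pairs:
        key, value = pair.split("=")
        params[key.strip()] = float(value)  # ids are stored as floats too, for simplicity
    return name, params


def parse_scene(text):
    """Parse one command per non-empty line."""
    return [parse_command(line) for line in text.strip().splitlines() if line.strip()]


def wall_length(params):
    """Length of a wall given its two floor-plane corner points."""
    return math.hypot(params["b_x"] - params["a_x"], params["b_y"] - params["a_y"])


scene_text = """
make_wall, id=0, a_x=0.0, a_y=0.0, b_x=4.0, b_y=3.0, height=2.5
make_door, id=1, wall_id=0, position_x=1.0, width=0.9, height=2.0
make_bbox, id=2, center_x=2.0, center_y=1.0, size_x=1.0, size_y=0.5
"""

commands = parse_scene(scene_text)
walls = [params for name, params in commands if name == "make_wall"]
```

Because each entity is a named command with keyword parameters, supporting a new entity type in this sketch only means emitting a new command name; the parser itself does not change, which mirrors the extensibility the entry describes.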
2403.13044
Report |
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos |
Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi |
We propose a generative model that, given a coarsely edited image,
synthesizes a photorealistic output that follows the prescribed layout. Our
method transfers fine details from the original image and preserves the
identity of its parts. Yet, it adapts it to the lighting and context defined by
the new layout. Our key insight is that videos are a powerful source of
supervision for this task: objects and camera motions provide many observations
of how the world changes with viewpoint, lighting, and physical interactions.
We construct an image dataset in which each sample is a pair of source and
target frames extracted from the same video at randomly chosen time intervals.
We warp the source frame toward the target using two motion models that mimic
the expected test-time user edits. We supervise our model to translate the
warped image into the ground truth, starting from a pretrained diffusion model.
Our model design explicitly enables fine detail transfer from the source frame
to the generated image, while closely following the user-specified layout. We
show that by using simple segmentations and coarse 2D manipulations, we can
synthesize a photorealistic edit faithful to the user's input while addressing
second-order effects like harmonizing the lighting and physical interactions
between edited objects. |
This paper introduces Magic Fixup, a novel diffusion-based image editing method that allows users to create photorealistic edits through a simple 'cut-and-transform' interface. |
Existing image editing tools often require extensive manual work or struggle to preserve realism and faithfulness to user input. This method seeks to bridge this gap by combining intuitive user controls with the power of generative models. |
The approach leverages a dual diffusion model setup. A 'detail extractor' model processes the original image to capture fine-grained details, while a 'synthesizer' model generates the final output, guided by the user's coarse edit and the extracted details. The models are trained on a dataset of paired video frames, where the input frame is automatically warped to match the target frame using flow-based and piecewise affine motion models. |
Magic Fixup demonstrates superior performance in preserving object identity and generating realistic details compared to existing editing tools, as evidenced by qualitative results and a user study.
The use of video data and the proposed motion models is crucial for training a model capable of realistic and faithful image recomposition and reposing.
The cross-attention mechanism for detail transfer significantly improves the model's ability to harmonize edits and maintain realism. |
The model's ability to handle out-of-domain images (e.g., paintings) is limited by the video-based training data.
The method inherits the limitations of the underlying diffusion models, particularly in areas like generating hands and faces. |
image editing, diffusion models, generative models, video data, user interface |
2403.13043
Report |
When Do We Not Need Larger Vision Models? |
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell |
Scaling up the size of vision models has been the de facto standard to obtain
more powerful visual representations. In this work, we discuss the point beyond
which larger vision models are not necessary. First, we demonstrate the power
of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision
model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform
larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth
estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation.
Notably, S$^2$ achieves state-of-the-art performance in detailed understanding
of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the
conditions under which S$^2$ is a preferred scaling approach compared to
scaling on model size. While larger models have the advantage of better
generalization on hard examples, we show that features of larger vision models
can be well approximated by those of multi-scale smaller models. This suggests
most, if not all, of the representations learned by current large pre-trained
models can also be obtained from multi-scale smaller models. Our results show
that a multi-scale smaller model has comparable learning capacity to a larger
model, and pre-training smaller models with S$^2$ can match or even exceed the
advantage of larger models. We release a Python package that can apply S$^2$ on
any vision model with one line of code:
https://github.com/bfshi/scaling_on_scales. |
This paper challenges the assumption that larger vision models are always better, proposing "Scaling on Scales" (S^2) where a smaller model is run on multiple image scales instead of increasing model size. |
Scaling model size, while effective, is resource-intensive. S^2 offers a potentially more efficient way to achieve comparable or better visual understanding. |
The authors introduce "S^2-Wrapper," a mechanism to apply multi-scale processing to any pre-trained vision model without additional parameters. They compare S^2 with traditional model size scaling across tasks like image classification, segmentation, depth estimation, MLLM benchmarks, and robotic manipulation. |
Smaller models with S^2 often match or outperform larger models on various tasks, achieving state-of-the-art on MLLM visual detail understanding (V* benchmark).
Larger models show advantage on hard examples, but their features can be largely approximated by those of multi-scale smaller models.
Pre-training with S^2 further improves smaller models, suggesting comparable learning capacity to larger counterparts. |
The optimal balance between model size and image scales needs further exploration for different pre-trained models.
Future work includes exploring scale-selective processing and parallel processing of single images with S^2. |
multi-scale representation learning, vision transformer, model scaling, multimodal learning, robotic manipulation |
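For the Scaling on Scales entry above, the multi-scale wrapper idea (run one frozen model on several image scales, then combine the features) can be sketched with plain Python lists. This is a toy under stated assumptions: `toy_model` stands in for a frozen vision backbone and returns a single feature vector per tile rather than a spatial feature map, nearest-neighbor upsampling is used for simplicity, and none of the function names correspond to the released package's API.

```python
def upsample_nearest(img, factor):
    """Nearest-neighbor upsampling of a 2-D list by an integer factor."""
    return [[img[r // factor][c // factor] for c in range(len(img[0]) * factor)]
            for r in range(len(img) * factor)]


def split_tiles(img, tile):
    """Split a 2-D list into non-overlapping tile x tile sub-images."""
    tiles = []
    for r in range(0, len(img), tile):
        for c in range(0, len(img[0]), tile):
            tiles.append([row[c:c + tile] for row in img[r:r + tile]])
    return tiles


def toy_model(tile):
    """Stand-in for a frozen backbone: returns the feature vector [mean, max]."""
    flat = [v for row in tile for v in row]
    return [sum(flat) / len(flat), max(flat)]


def multiscale_features(img, model, scales=(1, 2)):
    """Run `model` on base-size tiles at each scale and concatenate pooled features."""
    base = len(img)
    features = []
    for s in scales:
        scaled = upsample_nearest(img, s)
        tile_feats = [model(t) for t in split_tiles(scaled, base)]
        # Average-pool per-tile features so every scale contributes the same size
        pooled = [sum(f[i] for f in tile_feats) / len(tile_feats)
                  for i in range(len(tile_feats[0]))]
        features.extend(pooled)  # channel-wise concatenation across scales
    return features


image = [[0, 1],
         [2, 3]]
feats = multiscale_features(image, toy_model, scales=(1, 2))
```

The model is only ever called on base-size inputs, so no new parameters are introduced; the feature dimension grows with the number of scales instead.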
2403.12966
Report |
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models |
Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu |
In the realm of vision-language understanding, the proficiency of models in
interpreting and reasoning over visual content has become a cornerstone for
numerous applications. However, it is challenging for the visual encoder in
Large Vision-Language Models (LVLMs) to extract useful features tailored to
questions that aid the language model's response. Furthermore, a common
practice among existing LVLMs is to utilize lower-resolution images, which
restricts the ability for visual recognition. Our work introduces the
Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel
approach that enhances feature extraction by focusing on key regions of
interest (ROI) within the image, corresponding to the posed questions or
instructions. This technique allows LVLMs to access more detailed visual
information without altering the original image resolution, thereby offering
multi-granularity image features. By integrating Chain-of-Spot with
instruction-following LLaVA-1.5 models, the process of image reasoning
consistently improves performance across a wide range of multimodal datasets
and benchmarks without bells and whistles and achieves new state-of-the-art
results. Our empirical findings demonstrate a significant improvement in LVLMs'
ability to understand and reason about visual content, paving the way for more
sophisticated visual instruction-following applications. Code and models are
available at https://github.com/dongyh20/Chain-of-Spot |
The paper introduces Chain-of-Spot (CoS), a novel interactive reasoning approach for large vision-language models (LVLMs) that improves visual understanding by guiding models to focus on key regions of interest (ROI) within an image. |
Existing LVLMs often struggle to extract useful features tailored to specific questions and are limited by the use of lower-resolution images. Chain-of-Spot addresses these issues by providing multi-granularity image features and enabling more focused analysis. |
CoS uses a relevance map between language tokens and image features to identify the ROI. During inference, the model first identifies the ROI and then uses both the global and cropped ROI features to generate the response. |
CoS significantly improves the performance of LLaVA-1.5 on various visual question answering and multimodal benchmarks.
The method achieves state-of-the-art results on multiple datasets, including VQAv2, GQA, VizWiz, SEEDBench, MMBench, and MM-Vet.
Analysis shows that CoS effectively guides the model's focus to relevant image regions, improving reasoning and accuracy. |
One limitation is the potential for insufficient training data to adequately guide ROI identification.
Future work could explore expanding the training dataset and investigating the ethical implications of enhanced LVLMs. |
large vision-language models, interactive reasoning, chain-of-spot, region of interest, multimodal learning |
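For the Chain-of-Spot entry above, the step of turning a relevance map into a region of interest can be sketched as below. The thresholding rule (keep cells at or above a fraction of the peak relevance, then take their bounding box) is an assumption made for illustration; the entry only states that a relevance map between language tokens and image features identifies the ROI.

```python
def roi_from_relevance(relevance, threshold=0.5):
    """Bounding box (top, left, bottom, right) of cells >= threshold * peak.

    Bottom and right are exclusive, so the box can be used directly in slices.
    """
    peak = max(max(row) for row in relevance)
    cutoff = threshold * peak
    rows = [r for r, row in enumerate(relevance) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(relevance[0]))
            if any(row[c] >= cutoff for row in relevance)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Crop a 2-D list to the given (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


relevance = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.2, 0.0],
]
image = [[r * 4 + c for c in range(4)] for r in range(4)]

box = roi_from_relevance(relevance)
roi = crop(image, box)
```

At inference time, the global image and the cropped `roi` would both be encoded and passed to the language model, which is how the entry describes obtaining multi-granularity features without changing the input resolution.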
2403.12965
Report |
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment |
Mengting Chen, Xi Chen, Zhonghua Zhai, Chen Ju, Xuewen Hong, Jinsong Lan, Shuai Xiao |
This paper introduces a novel framework for virtual try-on, termed
Wear-Any-Way. Different from previous methods, Wear-Any-Way is a customizable
solution. Besides generating high-fidelity results, our method lets users
precisely manipulate the wearing style. To achieve this goal, we first
construct a strong pipeline for standard virtual try-on, supporting
single/multiple garment try-on and model-to-model settings in complicated
scenarios. To make it manipulable, we propose sparse correspondence alignment
which involves point-based control to guide the generation for specific
locations. With this design, Wear-Any-Way achieves state-of-the-art performance for
the standard setting and provides a novel interaction form for customizing the
wearing style. For instance, users can drag the sleeve to roll it up,
drag the coat to open it, and use clicks to control the
style of tuck. Wear-Any-Way enables more liberated and flexible
expressions of the attires, holding profound implications in the fashion
industry. |
This paper presents Wear-Any-Way, a novel framework for virtual try-on that not only generates high-fidelity results but also allows users to customize wearing styles. |
Existing virtual try-on methods often lack detail fidelity and controllability over garment wearing style, limiting their application in fashion. |
The proposed approach leverages a dual-branch diffusion model with a reference U-Net for detail preservation and a sparse correspondence alignment module for point-based manipulation. |
Wear-Any-Way achieves state-of-the-art performance on standard virtual try-on benchmarks, outperforming existing methods in fidelity and detail.
The method supports flexible customization, enabling users to control garment features like sleeve rolls, coat openness, and tuck styles through click-and-drag interactions.
A novel point-pair collection pipeline based on a Siamese diffusion model is proposed to effectively learn garment-person correspondence. |
The method might generate artifacts for fine details like hands, especially in lower resolutions.
Future work includes exploring higher resolution models and addressing the challenge of generating complex garment interactions (e.g., multiple layers of clothing). |
virtual try-on, customizable generation, diffusion model, point-based control, sparse correspondence alignment |
2403.12963
Report |
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis |
Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li |
In this study, we delve into the generation of high-resolution images from
pre-trained diffusion models, addressing persistent challenges, such as
repetitive patterns and structural distortions, that emerge when models are
applied beyond their trained resolutions. To address this issue, we introduce
an innovative, training-free approach FouriScale from the perspective of
frequency domain analysis. We replace the original convolutional layers in
pre-trained diffusion models by incorporating a dilation technique along with a
low-pass operation, intending to achieve structural consistency and scale
consistency across resolutions, respectively. Further enhanced by a
padding-then-crop strategy, our method can flexibly handle text-to-image
generation of various aspect ratios. By using the FouriScale as guidance, our
method successfully balances the structural integrity and fidelity of generated
images, achieving an astonishing capacity of arbitrary-size, high-resolution,
and high-quality generation. With its simplicity and compatibility, our method
can provide valuable insights for future explorations into the synthesis of
ultra-high-resolution images. The code will be released at
https://github.com/LeonHLJ/FouriScale. |
This paper introduces FouriScale, a training-free method to generate high-resolution images from pre-trained diffusion models by addressing the issue of repetitive patterns and structural distortions often seen in upscaling. |
Existing diffusion models are typically trained at limited resolutions, and applying them to higher resolutions often leads to undesirable artifacts and inconsistencies. FouriScale offers a way to overcome these limitations without needing to retrain the model. |
FouriScale analyzes the problem in the frequency domain and introduces two key operations: 1) dilated convolution to maintain structural consistency, and 2) low-pass filtering to ensure scale consistency across resolutions. A padding-then-crop strategy is used for arbitrary aspect ratios, and a guidance mechanism further improves image quality. |
FouriScale outperforms existing training-free methods in quantitative metrics like FID and KID, showing better image quality and diversity at higher resolutions.
The method effectively reduces repetitive patterns and preserves structural details even with significant upscaling factors (up to 16x).
FouriScale is shown to be compatible with various pre-trained models like SD 1.5, SD 2.1, and SDXL, and can be integrated with techniques like LoRA. |
While effective at high resolutions, FouriScale still faces challenges with ultra-high resolutions (e.g., 4096x4096) where artifacts might occur.
The current implementation primarily focuses on convolutional layers, limiting its application to purely transformer-based diffusion models. |
diffusion model, training-free, high-resolution synthesis, frequency domain analysis, text-to-image generation |
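For the FouriScale entry above, the two operations (dilated convolution and low-pass filtering) can be illustrated in one dimension. This is a toy sketch under assumptions: the method operates on 2-D feature maps inside a diffusion U-Net, whereas this example uses 1-D lists, nearest-neighbor upsampling, and a moving-average filter as a simple stand-in for a frequency-domain low-pass.

```python
def conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with `dilation` spacing between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


def upsample_nearest(signal, factor):
    """Repeat every sample `factor` times."""
    return [v for v in signal for _ in range(factor)]


def box_lowpass(signal, width):
    """Moving-average low-pass filter."""
    return conv1d(signal, [1.0 / width] * width)


base = [1, 4, 2, 8, 5]
kernel = [1, 0, -1]  # a simple edge detector

# Kernel applied at the resolution it was "trained" on
low_res_out = conv1d(base, kernel)

# Same kernel, dilated by the upscaling factor, applied at 2x resolution
high_res = upsample_nearest(base, 2)
high_res_out = conv1d(high_res, kernel, dilation=2)

# Low-pass filtering removes the fastest oscillation in a signal
smoothed = box_lowpass([0, 2, 0, 2, 0, 2], width=2)
```

Subsampling `high_res_out` by the upscaling factor reproduces `low_res_out`, which is the sense in which dilation lets a kernel see the same structure at a higher resolution.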
2403.12962
Report |
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation |
Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy |
The remarkable efficacy of text-to-image diffusion models has motivated
extensive exploration of their potential application in video domains.
Zero-shot methods seek to extend image diffusion models to videos without
necessitating model training. Recent methods mainly focus on incorporating
inter-frame correspondence into attention mechanisms. However, the soft
constraint imposed on determining where to attend to valid features can
sometimes be insufficient, resulting in temporal inconsistency. In this paper,
we introduce FRESCO, which uses intra-frame correspondence alongside inter-frame
correspondence to establish a more robust spatial-temporal constraint. This
enhancement ensures a more consistent transformation of semantically similar
content across frames. Beyond mere attention guidance, our approach involves an
explicit update of features to achieve high spatial-temporal consistency with
the input video, significantly improving the visual coherence of the resulting
translated videos. Extensive experiments demonstrate the effectiveness of our
proposed framework in producing high-quality, coherent videos, marking a
notable improvement over existing zero-shot methods. |
This paper introduces FRESCO, a novel zero-shot diffusion framework that leverages both inter-frame and intra-frame correspondences for coherent and flexible video translation. |
Existing zero-shot video translation methods, while promising, struggle with temporal inconsistencies, particularly in scenarios with occlusion or rapid motion. This work addresses these limitations by introducing intra-frame spatial correspondence as a key constraint. |
FRESCO adapts a pre-trained image diffusion model for videos using two key mechanisms: 1) FRESCO-aware feature optimization, which directly optimizes decoder features to align with the spatial-temporal coherence of the input video. 2) FRESCO-guided attention, which incorporates spatial and temporal cues to guide the attention mechanism in the U-Net. |
FRESCO effectively addresses temporal inconsistencies observed in previous methods, producing significantly more coherent results.
The framework's modular design allows for independent analysis of spatial and temporal adaptations, demonstrating their individual contributions to overall performance.
FRESCO exhibits high compatibility with existing image diffusion techniques, enabling its application in other video editing tasks such as colorization. |
While effective, FRESCO's reliance on optical flow from the original video may limit its ability to handle large shape deformations.
Future work could explore adaptive combinations with pixel-level alignment methods and incorporate learned motion priors for handling larger deformations. |
video translation, diffusion models, zero-shot learning, temporal consistency, spatial correspondence |
2403.12960
Report |
FaceXFormer: A Unified Transformer for Facial Analysis |
Kartik Narayan, Vibashan VS, Rama Chellappa, Vishal M. Patel |
In this work, we introduce FaceXformer, an end-to-end unified transformer
model for a comprehensive range of facial analysis tasks such as face parsing,
landmark detection, head pose estimation, attributes recognition, and
estimation of age, gender, race, and landmarks visibility. Conventional methods
in face analysis have often relied on task-specific designs and preprocessing
techniques, which limit their approach to a unified architecture. Unlike these
conventional methods, our FaceXformer leverages a transformer-based
encoder-decoder architecture where each task is treated as a learnable token,
enabling the integration of multiple tasks within a single framework. Moreover,
we propose a parameter-efficient decoder, FaceX, which jointly processes face
and task tokens, thereby learning generalized and robust face representations
across different tasks. To the best of our knowledge, this is the first work to
propose a single model capable of handling all these facial analysis tasks
using transformers. We conducted a comprehensive analysis of effective
backbones for unified face task processing and evaluated different task queries
and the synergy between them. We conduct experiments against state-of-the-art
specialized models and previous multi-task models in both intra-dataset and
cross-dataset evaluations across multiple benchmarks. Additionally, our model
effectively handles images "in-the-wild," demonstrating its robustness and
generalizability across eight different tasks, all while maintaining the
real-time performance of 37 FPS. |
This paper introduces *FaceXformer*, a unified transformer-based model for eight facial analysis tasks: face parsing, landmark detection, head pose estimation, attributes recognition, age estimation, gender estimation, race estimation, and landmarks visibility prediction. |
Existing facial analysis models are often task-specific, limiting their applicability to multiple tasks and hindering the development of a single unified model. A unified model offers several advantages: learning robust and generalized face representations, modeling intra-task relationships, and enhancing overall performance through task synergy. |
FaceXformer uses a transformer-based encoder-decoder architecture. It leverages multi-scale features from the input face image and fuses them into a unified representation. Each facial analysis task is treated as a unique, learnable token processed by a parameter-efficient decoder (FaceX) to interact with the unified face representation. Task-specific predictions are then generated from the refined task tokens. |
*FaceXformer* achieves state-of-the-art performance in face parsing and attributes recognition.
It demonstrates competitive performance in landmark detection and head pose estimation compared to leading methods.
The model effectively handles in-the-wild images, showing robustness and generalization across all eight tasks while maintaining real-time performance (37 FPS). |
While *FaceXformer* supports tokens for various tasks, it lacks full interactivity and promptability.
It does not achieve state-of-the-art performance in tasks like landmark detection and head pose estimation because it does not use auxiliary information or advanced representations. |
facial analysis, transformer, multi-task learning, computer vision, deep learning |
2403.12957
Report |
GVGEN: Text-to-3D Generation with Volumetric Representation |
Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, Tong He |
In recent years, 3D Gaussian splatting has emerged as a powerful technique
for 3D reconstruction and generation, known for its fast and high-quality
rendering capabilities. To address these shortcomings, this paper introduces a
novel diffusion-based framework, GVGEN, designed to efficiently generate 3D
Gaussian representations from text input. We propose two innovative
techniques: (1) Structured Volumetric Representation. We first arrange
disorganized 3D Gaussian points as a structured form GaussianVolume. This
transformation allows the capture of intricate texture details within a volume
composed of a fixed number of Gaussians. To better optimize the representation
of these details, we propose a unique pruning and densifying method named the
Candidate Pool Strategy, enhancing detail fidelity through selective
optimization. (2) Coarse-to-fine Generation Pipeline. To simplify the
generation of GaussianVolume and empower the model to generate instances with
detailed 3D geometry, we propose a coarse-to-fine pipeline. It initially
constructs a basic geometric structure, followed by the prediction of complete
Gaussian attributes. Our framework, GVGEN, demonstrates superior performance in
qualitative and quantitative assessments compared to existing 3D generation
methods. Simultaneously, it maintains a fast generation speed (~7
seconds), effectively striking a balance between quality and efficiency. |
This paper proposes GVGEN, a novel diffusion-based framework for generating 3D Gaussian representations directly from text descriptions. |
Generating 3D models from text descriptions is important for various industries. Existing methods either lack diversity, require long inference times, or produce low-resolution assets. This work aims to overcome these limitations by directly generating 3D Gaussians from text. |
The proposed method utilizes a two-stage approach: 1) **GaussianVolume Fitting:** Organizes 3D Gaussian points into a structured volumetric form (GaussianVolume) using a novel Candidate Pool Strategy for pruning and densification. 2) **Text-to-3D Generation:** Employs a coarse-to-fine pipeline. First, a diffusion model generates a coarse geometry volume (Gaussian Distance Field). Then, a 3D U-Net predicts detailed Gaussian attributes based on the generated geometry and text input. |
GVGEN demonstrates superior performance in qualitative and quantitative assessments compared to existing 3D generation methods.
The method achieves a fast generation speed (approximately 7 seconds).
GVGEN effectively balances generation quality and efficiency. |
The performance of GVGEN is limited when presented with text inputs that significantly deviate from the training data domain.
Scaling up the model to handle millions of objects for increased diversity presents a challenge due to the time-consuming nature of fitting GaussianVolume for each object. |
text-to-3d generation, 3d gaussian splatting, diffusion models, volumetric representation, deep learning |
2403.12915
Report |
Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model |
Jiajie Yang |
We introduce the Pyramid Diffusion Model (PDM), a novel architecture designed
for ultra-high-resolution image synthesis. PDM utilizes a pyramid latent
representation, providing a broader design space that enables more flexible,
structured, and efficient perceptual compression, which enables the AutoEncoder and
Diffusion Network to use branches and deeper layers. To enhance PDM's
capabilities for generative tasks, we propose the integration of
Spatial-Channel Attention and Res-Skip Connection, along with the utilization
of Spectral Norm and Decreasing Dropout Strategy for the Diffusion Network and
AutoEncoder. In summary, PDM achieves the synthesis of images with a 2K
resolution for the first time, demonstrated on two new datasets comprising
images of sizes 2048x2048 pixels and 2048x1024 pixels respectively. We believe
that this work offers an alternative approach to designing scalable image
generative models, while also providing incremental reinforcement for existing
frameworks. |
The paper introduces Pyramid Diffusion Model (PDM), a novel architecture for ultra-high-resolution image synthesis using a pyramid latent representation, enabling efficient perceptual compression and flexible design. |
Existing models struggle to synthesize ultra-high-resolution images due to limitations in latent representation and network design. PDM addresses these limitations to enable 2K resolution image generation. |
PDM replaces the single latent in LDMs with a pyramid latent structure, utilizes a Pyramid UNet with branches for each latent scale, and incorporates Spatial-Channel Attention, Res-Skip Connections, Spectral Norm, and a Decreasing Dropout Strategy. |
Achieved synthesis of 2K resolution images for the first time.
Introduced two new datasets, SCAPES2K and PEOPLE2K, containing images with 2048x2048 and 2048x1024 pixels.
Visualization of pyramid latent representations shows that different resolutions contribute to distinct image aspects (global concept, local concept, details). |
Limited evaluation of FID scores on benchmark datasets.
Further research is needed on Concept Aliasing and its impact on generative models. |
diffusion model, image synthesis, high-resolution images, pyramid latent representation, spatial-channel attention |
2403.12906
Report |
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation |
Yufei Liu, Junwei Zhu, Junshu Tang, Shijie Zhang, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yunsheng Wu, Dongjin Huang |
Texturing 3D humans with semantic UV maps remains a challenge due to the
difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D
advancements in supervising multi-view renderings using large text-to-image
(T2I) models, issues persist with generation speed, text consistency, and
texture quality, resulting in data scarcity among existing datasets. We present
TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture
generation model. Utilizing an efficient texture adaptation finetuning
strategy, we adapt a large T2I model to a semantic UV structure while preserving
its original generalization capability. Leveraging a novel feature translator
module, the trained model is capable of generating high-fidelity 3D human
textures from either text or image within seconds. Furthermore, we introduce
ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1024 X 1024)
3D human texture dataset which contains 50k high-fidelity textures with text
descriptions. |
This paper presents TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model, which textures 3D humans from text or image inputs. |
Existing methods for generating 3D human textures are limited by generation speed, consistency, and quality, leading to data scarcity in existing datasets. |
TexDreamer utilizes a two-step training strategy: 1) Text-to-UV (T2UV) adapts a large text-to-image model to a semantic UV structure with an efficient texture adaptation finetuning strategy, and 2) Image-to-UV (I2UV) translates image features to textual features using a novel feature translator module, enabling texture prediction from images in the T2UV's text feature space. The model is trained on a novel dataset called ATLAS, the largest high-resolution 3D human texture dataset. |
TexDreamer outperforms state-of-the-art methods in generating high-fidelity textures from both text and image inputs.
The model demonstrates high text consistency, effectively capturing identity and clothing details from textual descriptions.
TexDreamer enables efficient texture editing and integration with complex 3D human meshes. |
I2UV's performance on real-life cases may be limited due to its reliance on semantic features rather than precise 2D image segmentation.
The realistic texture generation capability raises ethical concerns about potential misuse, such as creating deepfakes. |
human texture, multimodal, texture synthesis, text-to-3d, image-to-uv |
2403.12803
Report |
DreamDA: Generative Data Augmentation with Diffusion Models |
Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu |
The acquisition of large-scale, high-quality data is a resource-intensive and
time-consuming endeavor. Compared to conventional Data Augmentation (DA)
techniques (e.g. cropping and rotation), exploiting prevailing diffusion models
for data generation has received scant attention in classification tasks.
Existing generative DA methods either inadequately bridge the domain gap
between real-world and synthesized images, or inherently suffer from a lack of
diversity. To solve these issues, this paper proposes a new
classification-oriented framework DreamDA, which enables data synthesis and
label generation by way of diffusion models. DreamDA generates diverse samples
that adhere to the original data distribution by considering training images in
the original data as seeds and perturbing their reverse diffusion process. In
addition, since the labels of the generated data may not align with the labels
of their corresponding seed images, we introduce a self-training paradigm for
generating pseudo labels and training classifiers using the synthesized data.
Extensive experiments across four tasks and five datasets demonstrate
consistent improvements over strong baselines, revealing the efficacy of
DreamDA in synthesizing high-quality and diverse images with accurate labels.
Our code will be available at https://github.com/yunxiangfu2001/DreamDA. |
This paper proposes DreamDA, a novel data augmentation framework that leverages pre-trained diffusion models to generate diverse images adhering to the real data distribution for improved image classification. |
High-quality, large-scale data collection is crucial for deep learning but costly. DreamDA addresses this by synthesizing diverse and reliable training data, enhancing model performance. |
DreamDA perturbs the reverse diffusion process of pre-trained diffusion models by injecting noise into the U-Net bottleneck. It introduces AMST, a self-training paradigm using multiple classifiers to generate reliable pseudo labels for synthesized data, improving label accuracy. |
DreamDA consistently outperforms conventional and diffusion-based data augmentation techniques, demonstrating superior performance on multiple datasets and tasks.
DreamDA effectively mitigates the domain gap between synthetic and real data, achieving excellent FID and MMD scores.
The paper provides extensive ablation studies, demonstrating the effectiveness of individual components, such as latent perturbation and AMST. |
The paper acknowledges the computational cost of data generation and suggests exploring faster sampling techniques in future work.
The authors emphasize the need to carefully consider ethical implications when applying generative data augmentation in real-world scenarios. |
data augmentation, diffusion models, image classification, self-training, generative models |
2403.12760
Report |
WaveFace: Authentic Face Restoration with Efficient Frequency Recovery |
Yunqi Miao, Jiankang Deng, Jungong Han |
Although diffusion models are rising as a powerful solution for blind face
restoration, they are criticized for two problems: 1) slow training and
inference speed, and 2) failure in preserving identity and recovering
fine-grained facial details. In this work, we propose WaveFace to solve the
problems in the frequency domain, where low- and high-frequency components
decomposed by wavelet transformation are considered individually to maximize
authenticity as well as efficiency. The diffusion model is applied to recover
the low-frequency component only, which presents general information of the
original image but 1/16 in size. To preserve the original identity, the
generation is conditioned on the low-frequency component of low-quality images
at each denoising step. Meanwhile, high-frequency components at multiple
decomposition levels are handled by a unified network, which recovers complex
facial details in a single step. Evaluations on four benchmark datasets show
that: 1) WaveFace outperforms state-of-the-art methods in authenticity,
especially in terms of identity preservation, and 2) authentic images are
restored 10x faster than existing diffusion model-based BFR
methods. |
This paper proposes WaveFace, an efficient blind face restoration approach that restores authentic images by recovering their frequency components individually. |
Existing diffusion models for BFR are computationally expensive and often fail to preserve identity and fine-grained facial details. This work addresses these limitations by operating in the frequency domain. |
The method uses Discrete Wavelet Transform (DWT) to decompose images. It then leverages a Low-frequency Conditional Denoising (LCD) module with a conditional diffusion model for the low-frequency component and a High-Frequency Recovery (HFR) module for high-frequency components at multiple levels. |
WaveFace outperforms state-of-the-art methods in authenticity, particularly in identity preservation.
It achieves up to 10x faster restoration speeds compared to existing diffusion model-based BFR methods.
The method effectively balances efficiency and restoration quality by carefully selecting the DWT decomposition level. |
There's a significant difference between simulated and real-world degradations, impacting performance on real images.
Future work will focus on simulating more realistic degradations and exploring better evaluation metrics for BFR. |
blind face restoration, diffusion models, frequency domain, wavelet transform, identity preservation |
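The wavelet decomposition described in the WaveFace entry above can be illustrated with a small NumPy sketch. It assumes a Haar wavelet and two decomposition levels, neither of which is stated in this record; the function names `haar_dwt2` and `haar_idwt2` are hypothetical and not taken from the authors' code. With two levels, the low-frequency band holds 1/16 of the original pixels, which is consistent with the "1/16 in size" figure in the abstract.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar DWT for an array with even height and width.

    Returns the low-frequency band and a tuple of three high-frequency bands.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0      # scaled block average
    high_h = (a - b + c - d) / 2.0   # differences between left and right columns
    high_v = (a + b - c - d) / 2.0   # differences between top and bottom rows
    high_d = (a - b - c + d) / 2.0   # diagonal differences
    return low, (high_h, high_v, high_d)

def haar_idwt2(low, highs):
    """Invert haar_dwt2 exactly."""
    high_h, high_v, high_d = highs
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + high_h + high_v + high_d) / 2.0
    x[0::2, 1::2] = (low - high_h + high_v - high_d) / 2.0
    x[1::2, 0::2] = (low + high_h - high_v - high_d) / 2.0
    x[1::2, 1::2] = (low - high_h - high_v + high_d) / 2.0
    return x

# Two-level decomposition of a random stand-in "image".
rng = np.random.default_rng(0)
img = rng.random((256, 256))
low1, highs1 = haar_dwt2(img)
low2, highs2 = haar_dwt2(low1)

# Fraction of pixels a model would have to process if it only handled low2.
ratio = low2.size / img.size

# The high-frequency bands carry everything needed to undo the decomposition.
recon = haar_idwt2(haar_idwt2(low2, highs2), highs1)
```

In this sketch only `low2` would be handed to the diffusion model, while `highs1` and `highs2` correspond to the multi-level high-frequency components that the record says are recovered by a separate network.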
2403.12722
Report |
HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting |
Hongyu Zhou, Jiahao Shao, Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao |
Holistic understanding of urban scenes based on RGB images is a challenging
yet important problem. It encompasses understanding both the geometry and
appearance to enable novel view synthesis, parsing semantic labels, and
tracking moving objects. Despite considerable progress, existing approaches
often focus on specific aspects of this task and require additional inputs such
as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we
introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic
urban scene understanding. Our main idea involves the joint optimization of
geometry, appearance, semantics, and motion using a combination of static and
dynamic 3D Gaussians, where moving object poses are regularized via physical
constraints. Our approach offers the ability to render new viewpoints in
real-time, yielding 2D and 3D semantic information with high accuracy, and
reconstruct dynamic scenes, even in scenarios where 3D bounding box detections
are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2
demonstrate the effectiveness of our approach. |
Introduces HUGS, a novel pipeline leveraging 3D Gaussian Splatting for holistic urban scene understanding from posed RGB images. |
Enables holistic urban scene representation for applications like autonomous driving simulation, encompassing novel view synthesis, semantic parsing, and dynamic object tracking, without relying on expensive LiDAR or annotations. |
Decomposes scenes into static and dynamic 3D Gaussians, modeling moving objects' motion with a physically-constrained unicycle model. Jointly optimizes geometry, appearance, semantics, and motion using RGB images, noisy 2D semantic labels, and optical flow. |
Achieves state-of-the-art novel view synthesis on dynamic scenes, even with noisy 3D bounding box inputs.
Enables high-quality novel view semantic synthesis, achieving comparable performance to state-of-the-art on KITTI-360.
Allows for accurate 3D semantic reconstruction, outperforming Semantic Nerfacto in terms of geometric quality and semantic accuracy. |
Limited rotation capability for reconstructed dynamic objects.
Lacks control over aspects like lighting editing. |
3d scene understanding, gaussian splatting, novel view synthesis, semantic reconstruction, dynamic scenes |
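The HUGS entry above mentions a physically-constrained unicycle model for moving objects. As a rough illustration, the standard planar unicycle kinematics can be written as a forward Euler update. This is a textbook sketch only: how HUGS turns the model into a regularizer during optimization is not described in this record, and the names `unicycle_step` and `rollout` are hypothetical.

```python
import math

def unicycle_step(state, v, omega, dt):
    """Advance a planar unicycle state (x, y, heading) by one Euler step.

    v is the forward speed and omega is the yaw rate.
    """
    x, y, theta = state
    return (
        x + v * math.cos(theta) * dt,
        y + v * math.sin(theta) * dt,
        theta + omega * dt,
    )

def rollout(state, v, omega, dt, steps):
    """Return the list of visited states, including the start state."""
    states = [state]
    for _ in range(steps):
        state = unicycle_step(state, v, omega, dt)
        states.append(state)
    return states

# Straight-line motion and a constant left turn from the same start pose.
straight = rollout((0.0, 0.0, 0.0), v=1.0, omega=0.0, dt=0.1, steps=10)
turning = rollout((0.0, 0.0, 0.0), v=1.0, omega=0.5, dt=0.1, steps=10)
```

Because the object can only move along its heading, a trajectory that fits this model cannot slide sideways, which is the kind of physical constraint that helps when per-frame box detections are noisy.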
2403.12706
Report |
AnimateDiff-Lightning: Cross-Model Diffusion Distillation |
Shanchuan Lin, Xiao Yang |
We present AnimateDiff-Lightning for lightning-fast video generation. Our
model uses progressive adversarial diffusion distillation to achieve new
state-of-the-art in few-step video generation. We discuss our modifications to
adapt it for the video modality. Furthermore, we propose to simultaneously
distill the probability flow of multiple base diffusion models, resulting in a
single distilled motion module with broader style compatibility. We are pleased
to release our distilled AnimateDiff-Lightning model for the community's use. |
Presents AnimateDiff-Lightning, a lightning-fast video generation model using progressive adversarial diffusion distillation for few-step video generation, and introduces cross-model diffusion distillation to enhance the generalization ability of the distilled motion module across diverse stylized base models. |
Addresses the speed limitations of video generation models, particularly AnimateDiff, to make them more practical and widely adoptable by reducing the time and computational cost of the generation process. |
Adapts progressive adversarial diffusion distillation to the video modality by simultaneously distilling the probability flow of multiple base diffusion models (Stable Diffusion, RealisticVision, epiCRealism, ToonYou, IMP, Counterfeit) using a shared motion module, and employs a flow-conditional video discriminator to ensure sharp and flow-preserving predictions. |
Achieves better quality video generation in fewer inference steps compared to prior video distillation methods, particularly AnimateLCM.
Demonstrates superior generalization ability to unseen stylized base models due to cross-model distillation.
Retains compatibility with key AnimateDiff features, including Motion LoRAs, different aspect ratios, and video-to-video generation with ControlNet. |
Experiences heavy noise artifacts in 1-step generation and brightness flickers in 2-step generation due to limitations in the epsilon formulation.
Shows a higher probability of generating bad cases when the aspect ratio deviates significantly from the square aspect ratio used during distillation training. |
video generation, diffusion models, model distillation, cross-model distillation, animatediff |
2403.12658
Report |
Tuning-Free Image Customization with Image and Text Guidance |
Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng |
Despite significant advancements in image customization with diffusion
models, current methods still have several limitations: 1) unintended changes
in non-target areas when regenerating the entire image; 2) guidance solely by a
reference image or text descriptions; and 3) time-consuming fine-tuning, which
limits their practical application. In response, we introduce a tuning-free
framework for simultaneous text-image-guided image customization, enabling
precise editing of specific image regions within seconds. Our approach
preserves the semantic features of the reference image subject while allowing
modification of detailed attributes based on text descriptions. To achieve
this, we propose an innovative attention blending strategy that blends
self-attention features in the UNet decoder during the denoising process. To
our knowledge, this is the first tuning-free method that concurrently utilizes
text and image guidance for image customization in specific regions. Our
approach outperforms previous methods in both human and quantitative
evaluations, providing an efficient solution for various practical
applications, such as image synthesis, design, and creative photography. |
This paper proposes a novel tuning-free framework for image customization that utilizes both text and reference images to edit specific regions within an image. |
Current image customization methods have limitations such as unintended changes in non-target areas, reliance on a single guidance modality (text or image), and time-consuming fine-tuning. This work addresses these limitations by enabling precise region-based editing with dual guidance in a tuning-free manner. |
The method utilizes a three-stream denoising architecture with a self-attention blending strategy. It inverts a collage of the target region and reference subject to obtain latent codes. Then, it blends features from reconstruction, text-guided, and noise-injected streams during denoising to generate the customized image. |
The proposed method outperforms existing single-modality and two-step methods in both qualitative and quantitative comparisons.
It achieves high fidelity to reference subjects while enabling text-driven attribute editing.
User studies confirm the effectiveness of the approach, showing superior performance in fidelity, quality, and text alignment. |
The method faces challenges in editing scenes with significant perspective changes or non-rigid motion.
Future work could explore incorporating perspective and motion guidance for more complex editing scenarios. |
image editing, image customization, diffusion model, text-image guidance, tuning-free |
2403.12585
Report |
LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing |
Yazeed Alharbi, Peter Wonka |
We present a novel, training-free approach for textual editing of real images
using diffusion models. Unlike prior methods that rely on computationally
expensive finetuning, our approach leverages LAtent SPatial Alignment (LASPA)
to efficiently preserve image details. We demonstrate how the diffusion process
is amenable to spatial guidance using a reference image, leading to
semantically coherent edits. This eliminates the need for complex optimization
and costly model finetuning, resulting in significantly faster editing compared
to previous methods. Additionally, our method avoids the storage requirements
associated with large finetuned models. These advantages make our approach
particularly well-suited for editing on mobile devices and applications
demanding rapid response times. While simple and fast, our method achieves
62-71% preference in a user study and significantly better model-based editing
strength and image preservation scores. |
This paper presents LASPA, a novel training-free method for single-image editing using text-to-image diffusion models that leverages latent spatial alignment for fast and efficient editing. |
Existing single-image editing methods using diffusion models are computationally expensive, requiring finetuning or complex optimization, making them impractical for real-time applications and resource-constrained devices. |
LASPA leverages the spatial latent of diffusion models by aligning it with the reference image features during the reverse diffusion process. This allows preserving image details while incorporating textual edits without modifying the model's parameters. |
LASPA achieves significantly faster editing speeds compared to previous methods (under 6 seconds).
Qualitative and quantitative evaluations demonstrate superior image preservation and editing strength compared to state-of-the-art methods.
The method is shown to be versatile and promising for various applications such as video editing, facial editing, and editing with faster diffusion models. |
LASPA can benefit from parameter tuning for specific edits and seed selection.
Achieving large pose changes remains a challenge. |
text-to-image, diffusion models, single-image editing, latent spatial alignment, fast editing |
2403.12550
Report |
RGBD GS-ICP SLAM |
Seongbo Ha, Jiung Yeon, Hyeonwoo Yu |
Simultaneous Localization and Mapping (SLAM) with dense representation plays
a key role in robotics, Virtual Reality (VR), and Augmented Reality (AR)
applications. Recent advancements in dense representation SLAM have highlighted
the potential of leveraging neural scene representation and 3D Gaussian
representation for high-fidelity spatial representation. In this paper, we
propose a novel dense representation SLAM approach with a fusion of Generalized
Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS). In contrast
to existing methods, we utilize a single Gaussian map for both tracking and
mapping, resulting in mutual benefits. Through the exchange of covariances
between tracking and mapping processes with scale alignment techniques, we
minimize redundant computations and achieve an efficient system. Additionally,
we enhance tracking accuracy and mapping quality through our keyframe selection
methods. Experimental results demonstrate the effectiveness of our approach,
showing an incredibly fast speed of up to 107 FPS (for the entire system) and
superior quality of the reconstructed map. |
This paper proposes RGBD GS-ICP SLAM, a novel real-time dense representation SLAM that integrates Generalized Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS) for accurate and efficient tracking and mapping. |
Existing dense SLAM methods using neural scene representation or 3D Gaussian representation struggle to balance speed and accuracy, often relying on computationally expensive rendering or decoupled approaches. This paper addresses this limitation. |
The method leverages the shared representation of 3D Gaussians between G-ICP tracking and 3DGS mapping. It directly utilizes 3D information from G-ICP for tracking, eliminates redundant covariance computations, and introduces scale alignment techniques for smooth information transfer between the two processes. Additionally, it employs dynamic keyframe selection for both tracking and mapping to optimize performance. |
The method achieves state-of-the-art camera pose estimation accuracy on the Replica dataset, outperforming previous methods by over 50%.
It demonstrates incredibly fast system speed, up to 107 FPS, while maintaining high-quality map reconstruction, significantly surpassing existing methods in speed.
The paper provides comprehensive ablation studies, validating the contribution of each proposed component (scale regularization, scale alignment, keyframe selection, and local minima avoidance) to the overall performance. |
The method heavily relies on depth information, making it potentially susceptible to noise in real-world scenarios with low-quality depth sensors.
Future work includes exploring the trade-off between speed and robustness by incorporating RGB information to enhance performance in challenging environments. |
slam, 3d gaussian splatting, g-icp, dense representation, real-time |
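For readers unfamiliar with G-ICP, the tracking objective referenced in the entry above can be sketched as follows. This is the standard Generalized-ICP cost over matched point pairs, each with its own covariance. It does not reproduce the paper's covariance sharing with the 3DGS map, its scale alignment, or its keyframe selection, and the function name `gicp_cost` is hypothetical.

```python
import numpy as np

def gicp_cost(src, tgt, cov_src, cov_tgt, R, t):
    """Generalized-ICP cost for already-matched point pairs.

    src, tgt: (N, 3) matched points; cov_src, cov_tgt: (N, 3, 3) covariances.
    R, t: candidate rotation (3, 3) and translation (3,) mapping src to tgt.
    """
    d = tgt - (src @ R.T + t)              # residual per pair, shape (N, 3)
    M = cov_tgt + R @ cov_src @ R.T        # combined covariance, shape (N, 3, 3)
    # Solve M_i y_i = d_i for every pair instead of inverting M_i explicitly.
    y = np.linalg.solve(M, d[..., None])[..., 0]
    return float(np.sum(d * y))            # sum of d_i^T M_i^{-1} d_i

# Synthetic check: the true pose should give (near) zero cost.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.1, -0.2, 0.05])
tgt = src @ R_true.T + t_true
cov = np.tile(0.01 * np.eye(3), (50, 1, 1))

cost_true = gicp_cost(src, tgt, cov, cov, R_true, t_true)
cost_identity = gicp_cost(src, tgt, cov, cov, np.eye(3), np.zeros(3))
```

The per-point covariances in this cost are the same kind of quantity a 3D Gaussian map stores, which is presumably why the record describes sharing them between tracking and mapping.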
2403.12532
Report |
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All |
Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, Lin Wang |
We present UniBind, a flexible and efficient approach that learns a unified
representation space for seven diverse modalities -- images, text, audio, point
cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat
the image as the central modality and build an image-centered representation
space; however, the space may be sub-optimal as it leads to an unbalanced
representation space among all modalities. Moreover, the category names are
directly used to extract text embeddings for the downstream tasks, making it
hardly possible to represent the semantics of multi-modal data. The
'out-of-the-box' insight of our UniBind is to make the alignment center
modality-agnostic and further learn a unified and balanced representation
space, empowered by the large language models (LLMs). UniBind is superior in
its flexible application to all CLIP-style models and delivers remarkable
performance boosts. To make this possible, we 1) construct a knowledge base of
text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build
LLM-augmented class-wise embedding center on top of the knowledge base and
encoded visual embeddings; 3) align all the embeddings to the LLM-augmented
embedding center via contrastive learning to achieve a unified and balanced
representation space. UniBind shows strong zero-shot recognition performance
gains over prior arts by an average of 6.36%. Finally, we achieve new
state-of-the-art performance, e.g., a 6.75% gain on ImageNet, on the multi-modal
fine-tuning setting while reducing 90% of the learnable parameters. |
Presents UniBind, a novel approach for multi-modal learning that uses LLM-augmented contrastive learning and modality-agnostic embedding centers to achieve a unified and balanced representation space. |
Existing methods often rely on image-centric representation spaces, leading to unbalanced performance across modalities. Additionally, using only category names as embedding centers fails to fully capture the semantic richness of multi-modal data. |
1) Constructs a knowledge base of text descriptions using LLMs and multi-modal LLMs for each category and multi-modal data. 2) Adaptively builds class-wise embedding centers by selecting the most relevant text embeddings from the knowledge base. 3) Aligns multi-modal embeddings to these embedding centers via contrastive learning. |
Achieves significant performance improvements on zero-shot recognition tasks across seven modalities, averaging +6.27% gain in top-1 accuracy.
Outperforms supervised methods on 10 out of 12 benchmarks for fine-tuning recognition, particularly excelling in datasets with many categories.
Demonstrates substantial improvement in cross-modal retrieval tasks, with +17.96% gain on top-20 recall for event-to-image retrieval. |
The robustness of the LLM-augmented method requires further investigation and enhancement.
Future work will explore leveraging LLMs to enhance the robustness of the modality-agnostic representation space. |
multi-modal learning, representation learning, contrastive learning, large language models, knowledge base |
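The third step of the UniBind method above (aligning embeddings to class-wise embedding centers via contrastive learning) can be illustrated with a minimal sketch. It assumes an InfoNCE-style cross-entropy over cosine similarities; the function name `center_contrastive_loss`, the temperature value 0.07, and the toy 2D vectors are illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center_contrastive_loss(embedding, centers, target, temperature=0.07):
    """Cross-entropy over cosine similarities between one embedding and all
    class centers (hypothetical helper; temperature is an assumed value)."""
    z = l2_normalize(embedding)
    logits = [dot(z, l2_normalize(c)) / temperature for c in centers]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]

# Toy centers for two classes; in UniBind these would come from the
# LLM-augmented knowledge base rather than being fixed by hand.
centers = [[1.0, 0.0], [0.0, 1.0]]
aligned = center_contrastive_loss([0.9, 0.1], centers, target=0)
misaligned = center_contrastive_loss([0.9, 0.1], centers, target=1)
```

The loss is small when the embedding already points toward its own class center and large otherwise, which is the pull toward a shared, modality-agnostic center that the record describes.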
2403.12510
Report |
Generalized Consistency Trajectory Models for Image Manipulation |
Beomsu Kim, Jaemin Kim, Jeongsol Kim, Jong Chul Ye |
Diffusion-based generative models excel in unconditional generation, as well
as on applied tasks such as image editing and restoration. The success of
diffusion models lies in the iterative nature of diffusion: diffusion breaks
down the complex process of mapping noise to data into a sequence of simple
denoising tasks. Moreover, we are able to exert fine-grained control over the
generation process by injecting guidance terms into each denoising step.
However, the iterative process is also computationally intensive, often taking
from tens up to thousands of function evaluations. Although consistency
trajectory models (CTMs) enable traversal between any time points along the
probability flow ODE (PFODE) and score inference with a single function
evaluation, CTMs only allow translation from Gaussian noise to data. Thus, this
work aims to unlock the full potential of CTMs by proposing generalized CTMs
(GCTMs), which translate between arbitrary distributions via ODEs. We discuss
the design space of GCTMs and demonstrate their efficacy in various image
manipulation tasks such as image-to-image translation, restoration, and
editing. Code: https://github.com/1202kbs/GCTM |
The paper proposes Generalized Consistency Trajectory Models (GCTMs), which extend Consistency Trajectory Models (CTMs) to enable one-step translation between arbitrary distributions via ODEs. |
Diffusion models, while powerful, are computationally intensive. CTMs offer fast sampling but are limited to Gaussian noise to data transformations. GCTMs overcome this by learning ODEs between any two distributions, enabling various image manipulation tasks efficiently. |
The paper leverages Flow Matching theory to generalize CTMs. It proposes a new parametrization for the FM ODE solution, enabling traversal between arbitrary distributions. GCTMs are trained by minimizing a combination of distillation and denoising score-matching losses. |
GCTMs with Optimal Transport coupling significantly accelerate training convergence in unconditional generation.
In image-to-image translation, GCTMs achieve superior performance with NFE=1, outperforming SDE-based methods and GANs in terms of image quality and faithfulness.
GCTMs excel in image restoration, surpassing DPS and CM in zero-shot settings, and achieving a good balance between perception and distortion metrics in supervised settings. |
GCTMs haven't yet reached state-of-the-art performance in unconditional generation.
Further hyperparameter tuning, particularly inspired by iCMs, is suggested as future work to potentially boost performance. |
diffusion models, flow matching, consistency models, image manipulation, fast sampling |
2403.12488
Report |
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM |
Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Jian Wu, Philip Torr |
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot
object detection ability of multimodal large language models (MLLMs), such as
GPT-4V and Gemini. Our approach consists of a detection prompting toolkit
inspired by high-precision detection priors and a new Chain-of-Thought to
implement these prompts. Specifically, the prompts in the toolkit are designed
to guide the MLLM to focus on regional information (e.g., zooming in), read
coordinates according to measure standards (e.g., overlaying rulers and
compasses), and infer from the contextual information (e.g., overlaying scene
graphs). Building upon these tools, the new detection chain-of-thought can
automatically decompose the task into simple subtasks, diagnose the
predictions, and plan for progressive box refinements. The effectiveness of our
framework is demonstrated across a spectrum of detection tasks, especially hard
cases. Compared to existing state-of-the-art methods, GPT-4V with our
DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on MS
COCO Novel class set for open-vocabulary detection, +24.23% Acc on RefCOCO val
set for zero-shot referring expression comprehension, +14.5% AP on D-cube
described object detection FULL setting. |
DetToolChain, a novel prompting paradigm using visual and reasoning prompts with a chain-of-thought approach, is proposed to unleash the zero-shot object detection ability of MLLMs. |
Existing methods for detection with MLLMs rely on finetuning, which is computationally expensive and infeasible for closed-source models. This work explores the potential of MLLMs as zero-shot detectors through prompting. |
The methodology involves: (1) Visual processing prompts (regional amplifier, spatial measurement standard, scene image parser) to pre-process images, (2) Detection reasoning prompts for result diagnosis and next prompt selection, and (3) A multimodal detection Chain-of-Thought (Det-CoT) to manage the detection process. |
DetToolChain significantly improves GPT-4V and Gemini performance on open-vocabulary detection, outperforming SOTA methods by a large margin (e.g., +21.5% AP50 on COCO Novel class set).
It achieves state-of-the-art performance on described object detection (+14.5% AP on D-cube FULL set) and referring expression comprehension (+24.23% Acc on RefCOCO val set) tasks.
Ablation studies demonstrate the effectiveness of individual visual prompting tools and highlight the superiority of Det-CoT over other CoT methods. |
The sequential processing of prompts in DetToolChain limits parallel computation, impacting efficiency.
The framework's reliance on large-scale MLLMs and extensive message histories raises concerns about scalability and cost. |
multimodal large language model, prompting, object detection, chain-of-thought, zero-shot learning |
2403.12431
Report |
Geometric Constraints in Deep Learning Frameworks: A Survey |
Vibhas K Vats, David J Crandall |
Stereophotogrammetry is an emerging technique of scene understanding. Its
origins go back to at least the 1800s when people first started to investigate
using photographs to measure the physical properties of the world. Since then,
thousands of approaches have been explored. The classic geometric technique of
Shape from Stereo is built on using geometry to define constraints on scene and
camera geometry and then solving the non-linear systems of equations. More
recent work has taken an entirely different approach, using end-to-end deep
learning without any attempt to explicitly model the geometry. In this survey,
we explore the overlap between geometry-based and deep learning-based frameworks.
We compare and contrast geometry enforcing constraints integrated into a deep
learning framework for depth estimation or other closely related problems. We
present a new taxonomy for prevalent geometry enforcing constraints used in
modern deep learning frameworks. We also present insightful observations and
potential future research directions. |
This paper surveys the use of geometric constraints in deep learning frameworks for depth estimation and related problems. It introduces a new taxonomy for these constraints and discusses their integration into various frameworks. |
While deep learning has advanced depth estimation, most methods rely heavily on supervised learning and large datasets. This paper explores how integrating geometric constraints can enhance structural consistency and reduce reliance on ground truth data. |
The paper reviews a range of geometric constraints, categorizing them and describing their mathematical formulations. It examines their application in different frameworks, including supervised, self-supervised, stereo, multi-view stereo, and monocular depth estimation. |
Explicitly modeling geometric constraints, along with supervision signals, enforces structural and occlusion reasoning and cross-view consistency.
The integration of geometric constraints can potentially improve depth estimation accuracy, particularly in challenging scenarios like featureless regions or varying lighting conditions.
The survey reveals a taxonomy of geometric constraints applicable to deep learning depth estimation, providing a valuable resource for researchers. |
The paper primarily focuses on summarizing existing work, with limited discussion on quantitative comparisons of different methods.
Further research is needed to explore the optimal combination and integration of various geometric constraints for specific depth estimation tasks. |
depth estimation, geometric constraints, multi-view stereo, self-supervised learning, deep learning |
2403.12409
Report |
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance |
Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, Ziwei Liu |
Generating high-quality 3D assets from a given image is highly desirable in
various applications such as AR/VR. Recent advances in single-image 3D
generation explore feed-forward models that learn to infer the 3D model of an
object without optimization. Though promising results have been achieved in
single object generation, these methods often struggle to model complex 3D
assets that inherently contain multiple objects. In this work, we present
ComboVerse, a 3D generation framework that produces high-quality 3D assets with
complex compositions by learning to combine multiple models. 1) We first
perform an in-depth analysis of this "multi-object gap" from both model and
data perspectives. 2) Next, with reconstructed 3D models of different objects,
we seek to adjust their sizes, rotation angles, and locations to create a 3D
asset that matches the given image. 3) To automate this process, we apply
spatially-aware score distillation sampling (SSDS) from pretrained diffusion
models to guide the positioning of objects. Our proposed framework emphasizes
spatial alignment of objects, compared with standard score distillation
sampling, and thus achieves more accurate results. Extensive experiments
validate ComboVerse achieves clear improvements over existing methods in
generating compositional 3D assets. |
ComboVerse is a two-stage 3D generation framework that creates complex 3D assets by composing multiple objects, addressing the limitations of existing single-object models. |
Current single-image 3D generation methods struggle to model complex assets with multiple objects due to dataset bias and limitations in handling object interactions. |
1. Single-object reconstruction: Objects in the input image are segmented, inpainted, and reconstructed individually. 2. Multi-object combination: Objects are automatically combined by optimizing their scale, rotation, and translation, guided by a spatially-aware score distillation sampling (SSDS) loss from pretrained diffusion models. |
Outperforms state-of-the-art methods in generating compositional 3D assets from single images.
Effectively handles multiple objects, occlusion, and varying camera settings.
Achieves better spatial object placement compared to standard SDS methods, as demonstrated by both qualitative and quantitative evaluations. |
Faces challenges in creating highly complex scenes with numerous objects.
Relies on the quality of the backbone image-to-3D method used for single-object reconstruction. |
3d generation, compositional generation, diffusion models, score distillation sampling, spatial awareness |
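The multi-object combination stage above optimizes each object's scale, rotation, and translation. A minimal sketch of that parameterization follows; it only applies a given placement to a point set and does not implement the SSDS loss or the optimization. The function name `place_object` and the restriction to a single yaw angle about the z-axis are simplifying assumptions made for illustration.

```python
import math

def place_object(points, scale, yaw, translation):
    """Apply p' = scale * R_z(yaw) p + t to each 3D point.

    Hypothetical helper: a full pipeline would use a general 3D rotation
    and would update these parameters by gradient descent on a guidance loss.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    placed = []
    for x, y, z in points:
        # Rotate about the z-axis, then scale, then translate.
        rx, ry = c * x - s * y, s * x + c * y
        placed.append((scale * rx + tx, scale * ry + ty, scale * z + tz))
    return placed

unit_x = [(1.0, 0.0, 0.0)]
moved = place_object(unit_x, scale=2.0, yaw=math.pi / 2, translation=(0.0, 0.0, 1.0))
```

Here a point on the x-axis is rotated onto the y-axis, doubled in size, and lifted by one unit, so only seven numbers per object (under this simplification, five) need to be optimized rather than the geometry itself.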
2403.12365
Report |
GaussianFlow: Splatting Gaussian Dynamics for 4D Content Creation |
Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, Ulrich Neumann |
Creating 4D fields of Gaussian Splatting from images or videos is a
challenging task due to its under-constrained nature. While the optimization
can draw photometric reference from the input videos or be regulated by
generative models, directly supervising Gaussian motions remains underexplored.
In this paper, we introduce a novel concept, Gaussian flow, which connects the
dynamics of 3D Gaussians and pixel velocities between consecutive frames. The
Gaussian flow can be efficiently obtained by splatting Gaussian dynamics into
the image space. This differentiable process enables direct dynamic supervision
from optical flow. Our method significantly benefits 4D dynamic content
generation and 4D novel view synthesis with Gaussian Splatting, especially for
content with rich motion that is hard for existing methods to handle. The
common color drifting issue that happens in 4D generation is also resolved with
improved Gaussian dynamics. Superior visual quality on extensive experiments
demonstrates our method's effectiveness. Quantitative and qualitative
evaluations show that our method achieves state-of-the-art results on both
tasks of 4D generation and 4D novel view synthesis. Project page:
https://zerg-overmind.github.io/GaussianFlow.github.io/ |
This paper introduces Gaussian flow, a differentiable method for directly supervising the dynamics of 3D Gaussians in 4D Gaussian Splatting using optical flow. |
Creating 4D Gaussian Splatting fields from images or videos is challenging due to under-constrained scene dynamics, especially from sparse-view or monocular videos. Existing methods lack direct supervision of Gaussian motions, leading to temporal inconsistencies and artifacts. |
Gaussian flow connects 3D Gaussian dynamics with 2D pixel velocities. It leverages the rendering process of 3D Gaussian Splatting to splat Gaussian dynamics onto the image plane, enabling direct supervision by matching Gaussian flow with pre-computed optical flow. |
Gaussian flow significantly improves 4D content generation and 4D novel view synthesis with Gaussian Splatting.
The method excels at handling scenes with rich and fast motions, outperforming existing approaches.
Color drifting artifacts common in 4D generation are resolved due to the improved accuracy of Gaussian dynamics. |
The current implementation focuses on short-term flow supervision between consecutive frames; exploring long-term supervision could further enhance temporal consistency.
The paper primarily focuses on single-view supervision; future work could explore multi-view flow supervision. |
4d generation, 4d novel view synthesis, 3d gaussian splatting, dynamic scene, optical flow |
2403.12326
Report |
Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts |
Anh Bui, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung |
Generative models have demonstrated remarkable potential in generating
visually impressive content from textual descriptions. However, training these
models on unfiltered internet data poses the risk of learning and subsequently
propagating undesirable concepts, such as copyrighted or unethical content. In
this paper, we propose a novel method to remove undesirable concepts from
text-to-image generative models by incorporating a learnable prompt into the
cross-attention module. This learnable prompt acts as additional memory to
transfer the knowledge of undesirable concepts into it and reduce the
dependency of these concepts on the model parameters and corresponding textual
inputs. Because of this knowledge transfer into the prompt, erasing these
undesirable concepts is more stable and has minimal negative impact on other
concepts. We demonstrate the effectiveness of our method on the Stable
Diffusion model, showcasing its superiority over state-of-the-art erasure
methods in terms of removing undesirable content while preserving other
unrelated elements. |
This paper introduces KPOP, a novel method using learnable parameter prompts in cross-attention layers to remove undesirable concepts from text-to-image generative models while minimizing impact on other concepts. |
Training on unfiltered data risks generative models learning and propagating undesirable, unethical or copyrighted content. Existing erasure methods often degrade model performance on related concepts. |
KPOP uses a two-step process: 1) **Knowledge Transfer**: Train the prompt to mimic generation of the undesirable concept. 2) **Knowledge Removal**: Fine-tune the model to erase the concept, using the prompt to regularize the process and minimize impact on other concepts. |
KPOP demonstrates superior performance in erasing object-related concepts while preserving unrelated ones compared to baselines.
KPOP effectively mitigates NSFW content generation, achieving lower ratios of exposed body parts in images compared to baselines.
KPOP successfully erases artistic style concepts according to CLIP alignment scores, outperforming baselines in erasing while comparably preserving content. |
Larger prompt sizes, while improving erasure, can negatively impact the model's ability to preserve unrelated concepts due to softmax normalization.
Exploration of alternative prompting mechanisms, such as amortizing the prompt or injecting it before the text encoder, is left for future work. |
concept erasure, text-to-image generation, stable diffusion, cross-attention, prompt tuning |
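The limitation noted above, that larger prompts can hurt preservation because of softmax normalization, can be seen in a small sketch. It assumes the learnable prompt is appended as extra keys alongside the text-token keys in cross-attention; that layout, the function name `text_attention_mass`, and the toy vectors are assumptions for illustration, not the paper's exact design.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_attention_mass(query, text_keys, prompt_keys):
    """Fraction of one query's attention that lands on text tokens when
    prompt keys are appended to the key set (hypothetical helper)."""
    keys = text_keys + prompt_keys
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return sum(weights[:len(text_keys)])

query = [1.0, 0.0]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
prompt_key = [0.5, 0.5]  # stand-in for a learned prompt token
mass_no_prompt = text_attention_mass(query, text_keys, [])
mass_small = text_attention_mass(query, text_keys, [prompt_key] * 1)
mass_large = text_attention_mass(query, text_keys, [prompt_key] * 8)
```

Because the attention weights must sum to one, every added prompt token takes mass away from the text tokens, which is consistent with the reported trade-off between erasure strength and preservation of unrelated concepts.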
2403.12042
Report |
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation |
Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua |
In this paper, we explore the visual representations produced from a
pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We hypothesize that the latent representation learned from a pretrained
generative T2V model encapsulates rich semantics and coherent temporal
correspondences, thereby naturally facilitating video understanding. Our
hypothesis is validated through the classic referring video object segmentation
(R-VOS) task. We introduce a novel framework, termed "VD-IT", tailored with
dedicatedly designed components built upon a fixed pretrained T2V model.
Specifically, VD-IT uses textual information as a conditional input, ensuring
semantic consistency across time for precise temporal instance matching. It
further incorporates image tokens as supplementary textual inputs, enriching
the feature set to generate detailed and nuanced masks. Besides, instead of
using the standard Gaussian noise, we propose to predict the video-specific
noise with an extra noise prediction module, which can help preserve the
feature fidelity and elevate segmentation quality. Through extensive
experiments, we surprisingly observe that fixed generative T2V diffusion
models, unlike commonly used video backbones (e.g., Video Swin Transformer)
pretrained with discriminative image/video pre-tasks, exhibit better potential
to maintain semantic alignment and temporal consistency. On existing standard
benchmarks, our VD-IT achieves highly competitive results, surpassing many
existing state-of-the-art methods. The code will be available at
https://github.com/buxiangzhiren/VD-IT |
This paper explores the potential of pre-trained text-to-video (T2V) diffusion models for video understanding, specifically for the task of Referring Video Object Segmentation (R-VOS). It introduces a novel framework, VD-IT, built upon a fixed pre-trained T2V model, incorporating text-guided image projection and video-specific noise prediction for enhanced feature extraction. |
The paper investigates whether the latent representations learned by generative T2V models, which excel in capturing temporal consistency, can benefit video understanding tasks like R-VOS. This exploration aims to advance the understanding and application of generative models in discriminative tasks. |
The VD-IT framework utilizes a pre-trained T2V model for feature extraction, employing two key innovations: (1) Text-Guided Image Projection, combining referring text and visual tokens as prompts to enhance feature richness and temporal consistency. (2) Video-Specific Noise Prediction, replacing standard Gaussian noise with predicted video-correlated noise to preserve feature fidelity. |
VD-IT achieves state-of-the-art results on four R-VOS benchmarks, demonstrating significant improvements over existing methods, particularly in maintaining temporal consistency.
Analysis shows that visual features extracted using VD-IT exhibit better temporal semantic consistency and spatial smoothness compared to those from discriminatively fine-tuned video backbones.
Experiments confirm that the use of referring text in feature extraction, coupled with video-specific noise prediction, significantly contributes to enhanced performance. |
The current implementation of VD-IT is limited by its computational cost, primarily due to the T2V diffusion model.
The framework focuses on single-object R-VOS, requiring further exploration for multi-object scenarios. |
referring video object segmentation, text-to-video diffusion models, video understanding, temporal consistency, generative models for discriminative tasks |
2403.12038
Report |
Zero-Shot Image Feature Consensus with Deep Functional Maps |
Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas |
Correspondences emerge from large-scale vision models trained for generative
and discriminative tasks. This has been revealed and benchmarked by computing
correspondence maps between pairs of images, using nearest neighbors on the
feature grids. Existing work has attempted to improve the quality of these
correspondence maps by carefully mixing features from different sources, such
as by combining the features of different layers or networks. We point out that
a better correspondence strategy is available, which directly imposes structure
on the correspondence field: the functional map. Wielding this simple
mathematical tool, we lift the correspondence problem from the pixel space to
the function space and directly optimize for mappings that are globally
coherent. We demonstrate that our technique yields correspondences that are not
only smoother but also more accurate, with the possibility of better reflecting
the knowledge embedded in the large-scale vision models that we are studying.
Our approach sets a new state-of-the-art on various dense correspondence tasks.
We also demonstrate our effectiveness in keypoint correspondence and affordance
map transfer. |
The paper presents a zero-shot framework for image correspondence that leverages functional maps to improve the coherence and accuracy of matches derived from pre-trained large-scale vision models. |
Existing methods based on nearest neighbor search in feature space often lack global structure awareness, leading to distortions and discontinuities in the correspondence maps. This paper addresses this limitation by representing correspondences as functional maps, which capture global deformations more effectively. |
The method utilizes two sets of features from pre-trained networks. It constructs a graph Laplacian from one set to define a function basis and optimizes a functional map on this basis using the second set as a regularizer. The optimization incorporates descriptor preservation, compactness, and bijectivity constraints. |
The framework outperforms previous zero-shot methods on dense correspondence benchmarks, demonstrating both improved accuracy and smoothness.
It effectively fuses features from different networks and layers, outperforming simple concatenation approaches.
The method shows promising results in applications like keypoint matching and affordance transfer. |
The current framework is better suited for object-centric images than complex scenes, as it relies on the manifold assumption.
Future work could explore extending the method to handle complex scenes by incorporating segmentation or exploring matches between quotient spaces. |
functional map, zero-shot image matching, dense correspondence, emergent feature property, feature fusion |
2403.12036
Report |
One-Step Image Translation with Text-to-Image Models |
Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu |
In this work, we address two limitations of existing conditional diffusion
models: their slow inference speed due to the iterative denoising process and
their reliance on paired data for model fine-tuning. To tackle these issues, we
introduce a general method for adapting a single-step diffusion model to new
tasks and domains through adversarial learning objectives. Specifically, we
consolidate various modules of the vanilla latent diffusion model into a single
end-to-end generator network with small trainable weights, enhancing its
ability to preserve the input image structure while reducing overfitting. We
demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms
existing GAN-based and diffusion-based methods for various scene translation
tasks, such as day-to-night conversion and adding/removing weather effects like
fog, snow, and rain. We extend our method to paired settings, where our model
pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and
Edge2Image, but with a single-step inference. This work suggests that
single-step diffusion models can serve as strong backbones for a range of GAN
learning objectives. Our code and models are available at
https://github.com/GaParmar/img2img-turbo. |
This paper introduces a novel one-step image translation method using text-to-image diffusion models, achieving efficient adaptation to new tasks and domains through adversarial learning objectives. |
This approach addresses limitations of existing conditional diffusion models, namely slow inference speed and reliance on paired training data. |
The method leverages a pre-trained one-step diffusion model (SD-Turbo), adapting it via: 1) Direct conditioning input to the noise encoder, 2) Consolidating encoder, UNet, and decoder into a single trainable architecture with LoRA, 3) Incorporating skip connections for detail preservation. |
Outperforms GAN-based and diffusion-based methods in unpaired image translation tasks (e.g., day-night conversion, weather effects).
Achieves comparable results to ControlNet in paired settings (e.g., Sketch2Photo, Edge2Image) with single-step inference.
Enables diverse output generation by interpolating between noise maps and encoder outputs. |
Lacks control over guidance strength due to the absence of classifier-free guidance in the backbone model.
Memory intensive training due to cycle-consistency loss and high-capacity generators. |
image translation, diffusion models, text-to-image synthesis, adversarial learning, one-step inference |
2403.12035
Report |
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility |
Bojia Zi, Shihao Zhao, Xianbiao Qi, Jianan Wang, Yukai Shi, Qianyu Chen, Bin Liang, Kam-Fai Wong, Lei Zhang |
Recent advancements in video generation have been remarkable, yet many
existing methods struggle with issues of consistency and poor text-video
alignment. Moreover, the field lacks effective techniques for text-guided video
inpainting, a stark contrast to the well-explored domain of text-guided image
inpainting. To this end, this paper proposes a novel text-guided video
inpainting model that achieves better consistency, controllability and
compatibility. Specifically, we introduce a simple but efficient motion capture
module to preserve motion consistency, and design an instance-aware region
selection instead of a random region selection to obtain better textual
controllability, and utilize a novel strategy to inject some personalized
models into our CoCoCo model and thus obtain better model compatibility.
Extensive experiments show that our model can generate high-quality video
clips. Meanwhile, our model shows better motion consistency, textual
controllability and model compatibility. More details are shown in
cococozibojia.github.io. |
This paper proposes CoCoCo, a novel text-guided video inpainting model that improves upon existing methods by enhancing consistency, controllability, and compatibility. |
Existing video generation methods struggle with maintaining consistency across frames, aligning generated content with text prompts, and integrating personalized text-to-image models. CoCoCo addresses these limitations to improve text-guided video inpainting. |
CoCoCo introduces a motion capture module with damped global attention and textual cross-attention, employs an instance-aware region selection strategy, and utilizes a task vector combination approach to adapt personalized text-to-image models. |
CoCoCo demonstrates superior background preservation and temporal consistency compared to baselines.
The instance-aware region selection and textual cross-attention significantly improve text-alignment capabilities, as evidenced by CLIP score.
The proposed method successfully integrates personalized text-to-image models, allowing for customized content generation within inpainted regions. |
The optimal parameters for integrating personalized models may vary depending on the specific models used.
Further research can explore extending the compatibility to a wider range of pretrained models. |
video inpainting, text-guided synthesis, motion consistency, text-video alignment, personalized models |
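The task vector combination mentioned above can be sketched under the usual task-arithmetic reading: the difference between a personalized model and its base model is added to the target model's weights. The exact formula CoCoCo uses is not given in this record, so the function name `combine_task_vector`, the weight `alpha`, and the toy parameter dictionaries are all assumptions.

```python
def combine_task_vector(target, base, personalized, alpha=1.0):
    """Return target + alpha * (personalized - base), parameter by parameter.

    Hypothetical helper: parameters are plain lists of floats keyed by name,
    standing in for the tensors of a real checkpoint.
    """
    return {
        name: [
            t + alpha * (p - b)
            for t, b, p in zip(target[name], base[name], personalized[name])
        ]
        for name in target
    }

base = {"w": [1.0, 2.0]}          # shared text-to-image base weights
personalized = {"w": [1.5, 1.0]}  # personalized variant of the base
inpainting = {"w": [0.0, 4.0]}    # target inpainting model weights
merged = combine_task_vector(inpainting, base, personalized, alpha=0.5)
```

The `alpha` weight corresponds to the kind of integration parameter that, per the limitations above, may need to be tuned for each personalized model.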
2403.12034
Report |
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models |
Junlin Han, Filippos Kokkinos, Philip Torr |
This paper presents a novel paradigm for building scalable 3D generative
models utilizing pre-trained video diffusion models. The primary obstacle in
developing foundation 3D generative models is the limited availability of 3D
data. Unlike images, texts, or videos, 3D data are not readily accessible and
are difficult to acquire. This results in a significant disparity in scale
compared to the vast quantities of other types of data. To address this issue,
we propose using a video diffusion model, trained with extensive volumes of
text, images, and videos, as a knowledge source for 3D data. By unlocking its
multi-view generative capabilities through fine-tuning, we generate a
large-scale synthetic multi-view dataset to train a feed-forward 3D generative
model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view
data, can generate a 3D asset from a single image in seconds and achieves
superior performance when compared to current SOTA feed-forward 3D generative
models, with users preferring our results over 70% of the time. |
Presents VFusion3D, a novel paradigm for building scalable 3D generative models by leveraging pre-trained video diffusion models as 3D data generators. |
Addresses the obstacle of limited 3D data availability by utilizing the vast knowledge base of video diffusion models trained on extensive text, image, and video data. |
1. Fine-tunes a video diffusion model (EMU Video) with rendered multi-view videos from a 3D dataset to generate 3D-consistent multi-view sequences. 2. Creates a large-scale synthetic multi-view dataset using text prompts and the fine-tuned EMU Video. 3. Trains a feed-forward 3D generative model (VFusion3D) using the synthetic dataset and fine-tunes it with the original 3D data. |
VFusion3D generates high-quality 3D assets from a single image in seconds.
Outperforms state-of-the-art feed-forward 3D generative models in user studies and automated metrics.
Demonstrates the scalability of learning 3D generative models from synthetic multi-view data generated by video diffusion models. |
Limited performance of the fine-tuned video diffusion model in generating multi-view sequences for certain object categories like vehicles and text.
Future work includes exploring stronger video diffusion models, larger and more diverse 3D datasets, and advancements in feed-forward 3D generative model architectures. |
3d generative models, video diffusion models, synthetic data generation, multi-view synthesis, large-scale training |
2403.12032
Report |
Generic 3D Diffusion Adapter Using Controlled Multi-View Editing |
Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas |
Open-domain 3D object synthesis has been lagging behind image synthesis due
to limited data and higher computational complexity. To bridge this gap, recent
works have investigated multi-view diffusion but often fall short in either 3D
consistency, visual quality, or efficiency. This paper proposes MVEdit, which
functions as a 3D counterpart of SDEdit, employing ancestral sampling to
jointly denoise multi-view images and output high-quality textured meshes.
Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency
through a training-free 3D Adapter, which lifts the 2D views of the last
timestep into a coherent 3D representation, then conditions the 2D views of the
next timestep using rendered views, without compromising visual quality. With
an inference time of only 2-5 minutes, this framework achieves a better trade-off
between quality and speed than score distillation. MVEdit is highly versatile
and extendable, with a wide range of applications including text/image-to-3D
generation, 3D-to-3D editing, and high-quality texture synthesis. In
particular, evaluations demonstrate state-of-the-art performance in both
image-to-3D and text-guided texture generation tasks. Additionally, we
introduce a method for fine-tuning 2D latent diffusion models on small 3D
datasets with limited resources, enabling fast low-resolution text-to-3D
initialization. |
This paper introduces MVEdit, a generic framework for adapting pre-trained 2D image diffusion models to enable 3D-aware diffusion for high-quality textured mesh generation. |
Open-domain 3D object synthesis lags behind image synthesis due to limited data and high computational complexity. Existing multi-view diffusion methods often fall short in 3D consistency, visual quality, or efficiency. |
MVEdit employs a novel training-free 3D Adapter within an ancestral sampling process. This adapter fuses multi-view 2D images into a coherent 3D representation, using either NeRF or mesh, to control subsequent 2D denoising steps for 3D consistency without sacrificing image quality. |
MVEdit achieves state-of-the-art results in both image-to-3D and text-guided texture generation, outperforming previous methods in visual quality and efficiency.
The 3D Adapter effectively resolves 3D inconsistencies in multi-view images, leading to more accurate and detailed 3D reconstructions.
The authors also introduce StableSSDNeRF, a fast text-to-3D diffusion model fine-tuned from Stable Diffusion, which can be used to initialize MVEdit for efficient domain-specific generation. |
The 3D-to-3D editing pipeline can still suffer from the Janus problem, especially when the degree of editing is high.
The off-the-shelf ControlNets used in the 3D Adapter may introduce minor inconsistencies or biases. |
diffusion models, 3d generation, texture synthesis, multi-view consistency, 3d editing |
2403.12028
Report |
Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail |
Mingjin Chen, Junhao Chen, Xiaojun Ye, Huan-ang Gao, Xiaoxue Chen, Zhaoxin Fan, Hao Zhao |
3D human body reconstruction has been a challenge in the field of computer
vision. Previous methods are often time-consuming and struggle to capture the
detailed appearance of the human body. In this paper, we propose a new method
called Ultraman for fast reconstruction of textured 3D human models from
a single image. Compared to existing techniques, Ultraman greatly
improves the reconstruction speed and accuracy while preserving high-quality
texture details. We present a set of new frameworks for human reconstruction
consisting of three parts, geometric reconstruction, texture generation and
texture mapping. Firstly, a mesh reconstruction framework is used, which
accurately extracts 3D human shapes from a single image. At the same time, we
propose a method to generate a multi-view consistent image of the human body
based on a single image. This is finally combined with a novel texture mapping
method to optimize texture details and ensure color consistency during
reconstruction. Through extensive experiments and evaluations, we demonstrate
the superior performance of Ultraman on various standard datasets. In
addition, Ultraman outperforms state-of-the-art methods in terms of
human rendering quality and speed. Upon acceptance of the article, we will make
the code and data publicly available. |
Ultraman, a novel 3D human reconstruction framework that reconstructs high-quality body meshes with detailed textures from single front-view images. |
Existing methods are time-consuming and struggle to capture detailed appearance, especially for clothed humans. Ultraman addresses these limitations by achieving faster and more detailed reconstruction. |
The framework consists of three modules: 1) Mesh Reconstruction: Generates a 3D human mesh from the input image. 2) Multi-view Image Generation: Uses a diffusion-based model to synthesize consistent images from unobserved viewpoints guided by depth, text prompts, and the input image. 3) Texturing: Projects the generated multi-view images onto the mesh's texture space, ensuring consistency and smoothing seams. |
Ultraman reconstructs high-quality 3D human models with detailed textures in 20-30 minutes, outperforming state-of-the-art methods in terms of speed (93% faster) and visual quality.
The multi-view image generation module, guided by VQA prompts and depth information, effectively synthesizes realistic textures for unseen areas, improving consistency between front and back views.
Quantitative evaluations on standard datasets demonstrate Ultraman's superiority in capturing geometric details and generating high-fidelity textures. |
The current view selection strategy might not fully cover all details for complex poses.
Exploring alternative texturing techniques to further enhance texture quality and reduce artifacts. |
3d human reconstruction, single-image reconstruction, diffusion models, texture synthesis, multi-view consistency |
2403.12019
Report |
LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation |
Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, Chen Change Loy |
The field of neural rendering has witnessed significant progress with
advancements in generative models and differentiable rendering techniques.
Though 2D diffusion has achieved success, a unified 3D diffusion pipeline
remains unsettled. This paper introduces a novel framework called LN3Diff to
address this gap and enable fast, high-quality, and generic conditional 3D
generation. Our approach harnesses a 3D-aware architecture and variational
autoencoder (VAE) to encode the input image into a structured, compact, and 3D
latent space. The latent is decoded by a transformer-based decoder into a
high-capacity 3D neural field. Through training a diffusion model on this
3D-aware latent space, our method achieves state-of-the-art performance on
ShapeNet for 3D generation and demonstrates superior performance in monocular
3D reconstruction and conditional 3D generation across various datasets.
Moreover, it surpasses existing 3D diffusion methods in terms of inference
speed, requiring no per-instance optimization. Our proposed LN3Diff presents a
significant advancement in 3D generative modeling and holds promise for various
applications in 3D vision and graphics tasks. |
This paper introduces LN3Diff, a novel framework for fast and generic conditional 3D generation that utilizes a 3D-aware variational autoencoder (VAE) to encode images into a compact latent space for efficient 3D diffusion learning. |
Existing methods for 3D diffusion face challenges in scalability, efficiency, and generalizability due to reliance on high-dimensional neural fields and limitations in handling conditional generation. |
LN3Diff employs a 3D-aware VAE to compress input images into a lower-dimensional latent space. A transformer-based decoder then reconstructs high-capacity 3D neural fields from this latent space. A diffusion model is trained on this compact latent space, enabling efficient conditional 3D generation. |
LN3Diff achieves state-of-the-art 3D generation performance on ShapeNet, outperforming GAN-based and other 3D diffusion methods.
It exhibits superior performance in monocular 3D reconstruction and conditional generation across ShapeNet, FFHQ, and Objaverse datasets.
LN3Diff surpasses existing 3D diffusion approaches in inference speed, achieving 3x faster generation without per-instance optimization. |
The monocular encoder struggles with challenging 3D scenes, suggesting the need for a multi-view encoder.
The reliance on volume rendering poses memory constraints; exploring more efficient 3D representations like 3DGS is a potential future direction. |
3d generation, 3d reconstruction, latent diffusion model, neural rendering, variational autoencoder |
2403.12015
Report |
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation |
Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach |
Diffusion models are the main driver of progress in image and video
synthesis, but suffer from slow inference speed. Distillation methods, like the
recently introduced adversarial diffusion distillation (ADD), aim to shift the
model from many-shot to single-step inference, albeit at the cost of expensive
and difficult optimization due to its reliance on a fixed pretrained DINOv2
discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a
novel distillation approach overcoming the limitations of ADD. In contrast to
pixel-based ADD, LADD utilizes generative features from pretrained latent
diffusion models. This approach simplifies training and enhances performance,
enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to
Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the
performance of state-of-the-art text-to-image generators using only four
unguided sampling steps. Moreover, we systematically investigate its scaling
behavior and demonstrate LADD's effectiveness in various applications such as
image editing and inpainting. |
This paper presents Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach for diffusion models that utilizes generative features from pretrained latent diffusion models, enabling high-resolution multi-aspect ratio image synthesis. |
Diffusion models, while powerful for image and video synthesis, suffer from slow inference speed. LADD addresses this by enabling fast, single-step inference while maintaining high image quality. |
LADD operates in latent space, unifying the discriminator and teacher model, and leverages synthetic data for training, simplifying the distillation process and enhancing performance. |
SD3-Turbo, a fast, distilled version of Stable Diffusion 3, achieves state-of-the-art text-to-image generation quality in just four sampling steps.
LADD demonstrates stable scaling behavior, with student model size significantly affecting performance.
The versatility of LADD is demonstrated in image editing and inpainting tasks, achieving comparable results to the teacher model in a single step. |
While achieving fast inference, SD3-Turbo exhibits a slight reduction in prompt alignment compared to the teacher model.
In image editing, the lack of adjustable image and text guidance strengths limits controllability. |
diffusion models, image synthesis, model distillation, adversarial training, latent space |
2403.12010
Report |
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model |
Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang |
Generating multi-view images based on text or single-image prompts is a
critical capability for the creation of 3D content. Two fundamental questions
on this topic are what data we use for training and how to ensure multi-view
consistency. This paper introduces a novel framework that makes fundamental
contributions to both questions. Rather than leveraging images from 2D diffusion
models for training, we propose a dense consistent multi-view generation model
that is fine-tuned from off-the-shelf video generative models. Images from
video generative models are more suitable for multi-view generation because the
underlying network architecture that generates them employs a temporal module
to enforce frame consistency. Moreover, the video data sets used to train these
models are abundant and diverse, leading to a reduced train-finetuning domain
gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising
Sampling, which first employs a feed-forward reconstruction module to get an
explicit global 3D model, and then adopts a sampling strategy that effectively
involves images rendered from the global 3D model into the denoising sampling
loop to improve the multi-view consistency of the final images. As a
by-product, this module also provides a fast way to create 3D assets
represented by 3D Gaussians within a few seconds. Our approach can generate 24
dense views and converges much faster in training than state-of-the-art
approaches (4 GPU hours versus many thousand GPU hours) with comparable visual
quality and consistency. By further fine-tuning, our approach outperforms
existing state-of-the-art methods in both quantitative metrics and visual
effects. Our project page is aigc3d.github.io/VideoMV. |
This paper proposes VideoMV, a method for consistent dense multi-view image generation by fine-tuning pre-trained video generative models and introducing 3D-Aware Denoising Sampling. |
Creating multi-view consistent images is crucial for 3D content creation, but existing methods struggle with efficiency, consistency, or generalizability. This paper leverages the inherent temporal consistency in video generation models to improve upon these limitations. |
The method consists of three stages: 1) Fine-tuning a pre-trained video generative model on rendered multi-view images with camera pose conditioning. 2) Training a feed-forward network to reconstruct 3D models from noisy multi-view images. 3) Applying 3D-Aware Denoising Sampling which incorporates rendered views from the reconstructed 3D model into the denoising loop. |
VideoMV achieves state-of-the-art results on text-based and image-based multi-view generation benchmarks, outperforming existing methods in image quality, consistency, and efficiency.
The method can generate 24 consistent views in just 5 seconds, enabling applications like dense view reconstruction and distillation-based 3D generation.
Experiments demonstrate that VideoMV generalizes well to unseen prompts and web images. |
The reconstruction module currently uses a sparse view setup due to computational constraints, limiting its ability to fully leverage dense view information.
Further exploration is needed to optimize the distillation sampling pipeline for dense views, potentially leading to even higher-quality 3D reconstructions. |
multi-view image generation, 3d-aware denoising, video generative models, 3d reconstruction, novel view synthesis |
2403.12008
Report |
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion |
Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani |
We present Stable Video 3D (SV3D) -- a latent video diffusion model for
high-resolution, image-to-multi-view generation of orbital videos around a 3D
object. Recent works on 3D generation propose techniques to adapt 2D generative
models for novel view synthesis (NVS) and 3D optimization. However, these
methods have several disadvantages due to either limited views or inconsistent
NVS, thereby affecting the performance of 3D object generation. In this work,
we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view
synthesis and 3D generation, thereby leveraging the generalization and
multi-view consistency of the video models, while further adding explicit
camera control for NVS. We also propose improved 3D optimization techniques to
use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental
results on multiple datasets with 2D and 3D metrics as well as a user study
demonstrate SV3D's state-of-the-art performance on NVS as well as 3D
reconstruction compared to prior works. |
Presents Stable Video 3D (SV3D), a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object, enabling novel view synthesis (NVS) and 3D generation. |
Addresses limitations in existing 3D generation methods, which suffer from limited views or inconsistent NVS, by adapting a high-resolution, image-conditioned video diffusion model for multi-view consistency and generalization. |
Finetunes Stable Video Diffusion (SVD) to generate orbital videos conditioned on a single image and camera poses, utilizing static and dynamic orbits, triangular CFG scaling, and a two-stage 3D optimization process with a disentangled illumination model and masked score distillation sampling (SDS) loss. |
SV3D achieves state-of-the-art performance on NVS, demonstrating high multi-view consistency, generalization to real-world images, and camera pose controllability.
The proposed 3D generation pipeline produces high-quality meshes with intricate geometric and texture details.
Ablation studies confirm the benefits of progressive finetuning, dynamic orbits, disentangled illumination, and masked SDS loss. |
SV3D is currently limited to two degrees of freedom (elevation and azimuth) in camera control.
The model exhibits inconsistency for mirror-like reflective surfaces, and the shading model doesn't account for such surfaces. |
novel view synthesis, 3d generation, video diffusion models, score distillation sampling, multi-view consistency |
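The orbital camera setup and triangular CFG scaling mentioned in the SV3D summary can be illustrated with a small sketch. This is not the authors' code: the function names, the coordinate convention (z up), the radius, and the guidance range are all assumptions chosen for illustration, and the exact shape of the schedule in the paper may differ.

```python
import math

def static_orbit(num_frames: int, elevation_deg: float, radius: float = 1.0):
    """Camera positions evenly spaced in azimuth at one fixed elevation (hypothetical helper)."""
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_frames):
        azim = 2.0 * math.pi * i / num_frames
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)  # height above the object; z-up is an assumption
        positions.append((x, y, z))
    return positions

def triangular_scale(num_frames: int, low: float, high: float):
    """Guidance scale that rises linearly from `low` at frame 0 to `high` at the
    middle frame, then falls again (values are illustrative, not from the paper)."""
    half = num_frames / 2.0
    return [low + (high - low) * (1.0 - abs(i - half) / half) for i in range(num_frames)]

cams = static_orbit(num_frames=8, elevation_deg=30.0, radius=2.0)
scales = triangular_scale(num_frames=4, low=1.0, high=3.0)
```

A dynamic orbit would vary the elevation per frame instead of holding it fixed; only the two degrees of freedom named in the limitations (elevation and azimuth) are modeled here.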
2403.12002
Report |
DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing |
Hyeonho Jeong, Jinho Chang, Geon Yeong Park, Jong Chul Ye |
Text-driven diffusion-based video editing presents a unique challenge not
encountered in image editing literature: establishing real-world motion. Unlike
existing video editing approaches, here we focus on score distillation sampling
to circumvent the standard reverse diffusion process and initiate optimization
from videos that already exhibit natural motion. Our analysis reveals that
while video score distillation can effectively introduce new content indicated
by target text, it can also cause significant structure and motion deviation.
To counteract this, we propose to match space-time self-similarities of the
original video and the edited video during the score distillation. Thanks to
the use of score distillation, our approach is model-agnostic and can be
applied to both cascaded and non-cascaded video diffusion frameworks. Through
extensive comparisons with leading methods, our approach demonstrates its
superiority in altering appearances while accurately preserving the original
structure and motion. |
DreamMotion presents a novel approach for zero-shot video editing that leverages score distillation sampling from pre-trained text-to-video diffusion models to inject target appearances into videos while preserving the original structure and motion. |
Existing video editing methods struggle to balance introducing new content while maintaining realistic and temporally consistent motion. DreamMotion addresses this challenge by directly optimizing on real video data, bypassing the limitations of traditional denoising processes. |
DreamMotion utilizes Video Delta Denoising Score (V-DDS) gradients to gradually inject target appearances while employing a space-time self-similarity regularization technique. This regularization minimizes structural deviations by aligning spatial self-similarities and prevents temporal artifacts via temporal self-similarity matching. |
DreamMotion successfully injects target appearances while accurately preserving the structure and motion of the source video.
The method is model-agnostic, demonstrating effectiveness in both cascaded and non-cascaded video diffusion frameworks.
Quantitative and qualitative evaluations, including a user study, confirm that DreamMotion outperforms existing state-of-the-art approaches. |
DreamMotion is primarily designed for edits that preserve the overall structure of the original video, limiting its applicability in scenarios requiring significant structural alterations.
Future work could explore extending the approach to incorporate more sophisticated masking techniques or investigate alternative self-similarity measures for enhanced performance. |
video editing, diffusion models, score distillation sampling, self-similarity, zero-shot learning |
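The self-similarity matching idea in the DreamMotion record can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the features are plain NumPy arrays of shape (N, C), cosine similarity is assumed as the similarity measure, and how features are extracted from the video diffusion model is not shown.

```python
import numpy as np

def self_similarity(feats: np.ndarray) -> np.ndarray:
    """Cosine self-similarity matrix (N, N) for features of shape (N, C)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-8, None)  # avoid division by zero
    return unit @ unit.T

def self_similarity_loss(src: np.ndarray, edit: np.ndarray) -> float:
    """Mean squared difference between the two self-similarity matrices."""
    diff = self_similarity(src) - self_similarity(edit)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
src_feats = rng.normal(size=(6, 4))    # e.g. 6 spatial tokens or 6 frames (assumed)
other_feats = rng.normal(size=(6, 4))  # unrelated structure

same_structure = self_similarity_loss(src_feats, 3.0 * src_feats)
different_structure = self_similarity_loss(src_feats, other_feats)
```

Applied to the spatial tokens of one frame, the loss penalizes structural deviation; applied to per-frame features along time, it penalizes changes in motion. Because the matrix depends only on relations between features, a uniform rescaling of the features leaves the loss at zero.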
2403.11999
Report |
HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs |
Ting Yao, Yehao Li, Yingwei Pan, Tao Mei |
The hybrid deep models of Vision Transformer (ViT) and Convolution Neural
Network (CNN) have emerged as a powerful class of backbones for vision tasks.
Scaling up the input resolution of such hybrid backbones naturally strengthens
model capacity, but inevitably suffers from heavy computational cost that
scales quadratically. Instead, we present a new hybrid backbone with
HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades prevalent four-stage
ViT to five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built
upon the seminal idea of decomposing the typical CNN operations into two
parallel CNN branches in a cost-efficient manner. One high-resolution branch
directly takes primary high-resolution features as inputs, but uses fewer
convolution operations. The other low-resolution branch first performs
down-sampling and then utilizes more convolution operations over such
low-resolution features. Experiments on both a recognition task (ImageNet-1K
dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the
superiority of HIRI-ViT. More remarkably, under comparable computational cost
(~5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date
of 84.3% on ImageNet with 448 x 448 inputs, an absolute improvement of 0.9%
over the 83.4% of iFormer-S with 224 x 224 inputs. |
HIRI-ViT, a novel five-stage Vision Transformer backbone tailored for high-resolution inputs, decomposing typical CNN operations into two parallel branches to achieve cost-efficient scaling. |
Scaling up input resolution enhances model capacity but suffers from heavy computational cost in existing ViT backbones. |
A five-stage ViT structure with two-branch design (high-resolution branch with fewer convolutions and low-resolution branch with more convolutions) is proposed, coupled with inverted residual downsampling and EMA distillation. |
HIRI-ViT achieves state-of-the-art performance on ImageNet-1K with high-resolution inputs, surpassing existing backbones under comparable computational costs.
HIRI-ViT demonstrates superior generalizability in downstream tasks like object detection, instance segmentation, and semantic segmentation on COCO and ADE20K datasets.
Ablation studies validate the effectiveness of the proposed five-stage structure, two-branch design, and EMA distillation strategy. |
Limited improvement observed with six-stage structure.
Scaling up Video Vision Transformer with high-resolution inputs remains a challenge. |
vision transformer, high-resolution inputs, cnn+vit hybrid backbone, image recognition, dense prediction tasks |
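The quadratic cost growth that motivates HIRI-ViT can be made concrete with a short calculation. This sketch assumes a plain ViT with non-overlapping 16 x 16 patches and global self-attention; the patch size and function names are illustrative assumptions, and HIRI-ViT itself is a hybrid design that does not work this way at every stage.

```python
def token_count(resolution: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for a square input."""
    return (resolution // patch) ** 2

def attention_pair_count(resolution: int, patch: int = 16) -> int:
    """Token-pair interactions in one global self-attention layer."""
    n = token_count(resolution, patch)
    return n * n

low, high = 224, 448  # the two input resolutions compared in the abstract
token_ratio = token_count(high) / token_count(low)
pair_ratio = attention_pair_count(high) / attention_pair_count(low)
print(token_ratio, pair_ratio)
```

Doubling the side length quadruples the token count and multiplies the pairwise attention interactions by sixteen, which is why reaching 448 x 448 inputs at a budget comparable to 224 x 224 models requires an architectural change rather than simple scaling.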
2403.11990
Report |
GetMesh: A Controllable Model for High-quality Mesh Generation and Manipulation |
Zhaoyang Lyu, Ben Fei, Jinyi Wang, Xudong Xu, Ya Zhang, Weidong Yang, Bo Dai |
Mesh is a fundamental representation of 3D assets in various industrial
applications, and is widely supported by professional software. However, due
to its irregular structure, mesh creation and manipulation are often
time-consuming and labor-intensive. In this paper, we propose a highly
controllable generative model, GetMesh, for mesh generation and manipulation
across different categories. By taking a varying number of points as the latent
representation, and re-organizing them as triplane representation, GetMesh
generates meshes with rich and sharp details, outperforming both
single-category and multi-category counterparts. Moreover, it also enables
fine-grained control over the generation process that previous mesh generative
models cannot achieve, where changing global/local mesh topologies,
adding/removing mesh parts, and combining mesh parts across categories can be
intuitively, efficiently, and robustly accomplished by adjusting the number,
positions or features of latent points. Project page is
https://getmesh.github.io. |
This paper introduces GetMesh, a novel controllable generative model for high-quality mesh generation and manipulation across different categories. |
Creating and editing meshes is currently time-consuming and labor-intensive due to their irregular structure. GetMesh addresses this by enabling intuitive and efficient generation and manipulation of meshes. |
GetMesh utilizes a varying number of points as the latent representation, re-organized as a triplane representation. Two diffusion models, one for point positions and another for features, learn the data distribution. A triplane-based decoder with a refinement module reconstructs high-quality meshes from the latent representation. |
GetMesh generates meshes with rich details, outperforming both single-category and multi-category counterparts.
GetMesh allows intuitive control over mesh generation, enabling changes to topology, addition/removal of parts, and combination of parts across categories.
GetMesh can be seamlessly combined with off-the-shelf material generation methods for textured mesh generation. |
Training GetMesh requires expensive ground-truth 3D data.
GetMesh's scalability is validated only on the ShapeNet dataset. |
3d generation, controllable generation, diffusion model, mesh generation, mesh manipulation |
2403.11956
Report |
Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment |
Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu |
With the rapid development of generative models, Artificial
Intelligence-Generated Content (AIGC) has increased exponentially in daily
life. In particular, Text-to-Video (T2V) generation has received widespread
attention. Though many T2V models have been released for generating high
perceptual quality videos, there is still a lack of methods to evaluate the
quality of these videos quantitatively. To solve this issue, we establish the
largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The
dataset is composed of 10,000 videos generated by 9 different T2V models. We
also conduct a subjective study to obtain each video's corresponding mean
opinion score. Based on T2VQA-DB, we propose a novel transformer-based model
for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model
extracts features from text-video alignment and video fidelity perspectives,
then it leverages the ability of a large language model to give the prediction
score. Experimental results show that T2VQA outperforms existing T2V metrics
and SOTA video quality assessment models. Quantitative analysis indicates that
T2VQA is capable of giving subjectively aligned predictions, validating its
effectiveness. The dataset and code will be released at
https://github.com/QMME/T2VQA. |
This paper introduces T2VQA-DB, the largest subjective text-to-video dataset to date, and proposes T2VQA, a novel transformer-based model for subjective-aligned text-to-video quality assessment. |
Existing T2V datasets lack scale and comprehensive human annotations, while current metrics inadequately capture the nuances of human perception, particularly text-video alignment. |
T2VQA-DB is built with 10,000 videos from 9 T2V models and 1,000 prompts, annotated with MOS from 27 subjects. T2VQA leverages BLIP and Swin-T for text-video alignment and video fidelity feature extraction, fuses them with cross-attention, and employs an LLM for quality regression. |
T2VQA-DB surpasses existing T2V datasets in scale and annotation comprehensiveness.
T2VQA outperforms existing T2V metrics and SOTA VQA models on T2VQA-DB, demonstrating its effectiveness.
Qualitative analysis reveals T2VQA's superior ability to align with subjective human judgments on video quality. |
T2VQA-DB may not fully represent the capabilities of state-of-the-art models like Sora due to resolution and length limitations.
Further cross-dataset validation is needed to confirm T2VQA's generalization to other T2V datasets. |
text-to-video dataset, video quality assessment, text-to-video generation, multi-modal learning, large language models |
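The mean opinion score (MOS) annotation described for T2VQA-DB amounts to averaging subject ratings per video. The sketch below shows that step, plus a rank correlation that is commonly used to compare a metric against MOS. The rank-correlation step is an assumption about the evaluation protocol rather than something stated in the record, and the toy ratings are made up for illustration.

```python
import numpy as np

def mean_opinion_scores(ratings: np.ndarray) -> np.ndarray:
    """ratings has shape (num_subjects, num_videos); returns one MOS per video."""
    return ratings.mean(axis=0)

def spearman_rank_correlation(pred, target) -> float:
    """Spearman correlation via ranks; this simple version does not handle ties."""
    def rank(values):
        return np.argsort(np.argsort(values))
    return float(np.corrcoef(rank(pred), rank(target))[0, 1])

# Toy data: 3 subjects rating 3 videos (the dataset itself uses 27 subjects).
ratings = np.array([[1, 3, 5],
                    [3, 3, 5],
                    [2, 3, 5]])
mos = mean_opinion_scores(ratings)
agreement = spearman_rank_correlation([0.1, 0.4, 0.9], mos)
disagreement = spearman_rank_correlation([0.9, 0.4, 0.1], mos)
```

A metric that orders the videos the same way human raters do scores close to 1, and one that reverses the order scores close to -1.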
2403.11929
Report |
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model |
Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu |
Despite the success of diffusion-based generative models in generating
high-quality images from arbitrary text prompts, prior works generate the entire
image directly and cannot provide object-wise manipulation capability. In real
applications such as professional graphic design and digital artistry,
images are frequently created and manipulated in multiple layers to offer
greater flexibility and control. Therefore in this paper, we propose a
layer-collaborative diffusion model, named LayerDiff, specifically designed for
text-guided, multi-layered, composable image synthesis. The composable image
consists of a background layer, a set of foreground layers, and associated mask
layers for each foreground element. To enable this, LayerDiff introduces a
layer-based generation paradigm incorporating multiple layer-collaborative
attention modules to capture inter-layer patterns. Specifically, an inter-layer
attention module is designed to encourage information exchange and learning
between layers, while a text-guided intra-layer attention module incorporates
layer-specific prompts to direct the specific-content generation for each
layer. A layer-specific prompt-enhanced module better captures detailed textual
cues from the global prompt. Additionally, a self-mask guidance sampling
strategy further unleashes the model's ability to generate multi-layered
images. We also present a pipeline that integrates existing perceptual and
generative models to produce a large dataset of high-quality, text-prompted,
multi-layered images. Extensive experiments demonstrate that our LayerDiff
model can generate high-quality multi-layered images with performance
comparable to conventional whole-image generation methods. Moreover, LayerDiff
enables a broader range of controllable generative applications, including
layer-specific image editing and style transfer. |
Introduces LayerDiff, a layer-collaborative diffusion model for text-guided, multi-layered, and composable image synthesis. |
Existing text-to-image models lack object-wise manipulation capability, limiting their use in applications like graphic design where layered compositions are crucial. |
LayerDiff employs layer-collaborative attention blocks for inter- and intra-layer information exchange, a layer-specific prompt enhancer to refine content generation using global textual cues, and a self-mask guidance sampling strategy for high-quality multi-layered images. |
LayerDiff generates high-fidelity multi-layered images with performance comparable to traditional whole-image generation methods.
LayerDiff enables versatile control for various generative applications, including layer-wise composable image manipulation and style transfer.
A new data construction pipeline generates high-quality, multi-layered composable images for training LayerDiff, integrating state-of-the-art techniques in image captioning, object localization, segmentation, and inpainting. |
Existing multi-layer training data generation pipelines are inefficient, limiting the ability to produce large-scale training data and impacting model performance.
The model's performance on three and four-layered images is limited by the availability of training data. |
multi-layered composable image synthesis, layer-collaborative diffusion model, layer-specific image editing, text-to-image synthesis, controllable image generation |
2403.11909
Report |
RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF |
Sibi Catley-Chandar, Richard Shaw, Gregory Slabaugh, Eduardo Perez-Pellitero |
Recent advances in neural rendering have enabled highly photorealistic 3D
scene reconstruction and novel view synthesis. Despite this progress, current
state-of-the-art methods struggle to reconstruct high frequency detail, due to
factors such as a low-frequency bias of radiance fields and inaccurate camera
calibration. One approach to mitigate this issue is to enhance images
post-rendering. 2D enhancers can be pre-trained to recover some detail but are
agnostic to scene geometry and do not easily generalize to new distributions of
image degradation. Conversely, existing 3D enhancers are able to transfer
detail from nearby training images in a generalizable manner, but suffer from
inaccurate camera calibration and can propagate errors from the geometry into
rendered images. We propose a neural rendering enhancer, RoGUENeRF, which
exploits the best of both paradigms. Our method is pre-trained to learn a
general enhancer while also leveraging information from nearby training images
via robust 3D alignment and geometry-aware fusion. Our approach restores
high-frequency textures while maintaining geometric consistency and is also
robust to inaccurate camera calibration. We show that RoGUENeRF substantially
enhances the rendering quality of a wide range of neural rendering baselines,
e.g. improving the PSNR of MipNeRF360 by 0.63dB and Nerfacto by 1.34dB on the
real world 360v2 dataset. |
This paper introduces RoGUENeRF, a geometry-consistent NeRF enhancer that improves the image quality of NeRF renderings while being robust to inaccurate camera calibration. |
Current NeRF models struggle to reconstruct high-frequency details due to factors like low-frequency bias and inaccurate camera calibration. Existing enhancement methods are either 2D (geometry-agnostic) or 3D (sensitive to calibration errors). RoGUENeRF leverages both paradigms for improved quality and robustness. |
RoGUENeRF uses a 3D+2D alignment with depth maps and camera poses, refined by an optical flow network. A geometry-aware attention module regulates misaligned regions. It's pre-trained on render-GT image pairs and fine-tuned on novel scenes. |
RoGUENeRF consistently improves PSNR, SSIM, and LPIPS across six NeRF baselines and three datasets (LLFF, DTU, 360v2).
It shows significant qualitative improvements, especially in high-frequency regions like foliage and text.
It exhibits robustness to inaccurate camera calibration, outperforming other methods in noisy settings. |
A limitation is the storage requirement for training images, potentially prohibitive for large scenes.
While faster than baselines, it doesn't yet achieve real-time inference.
Future work includes exploring more efficient architectures and larger-scale pre-training datasets. |
neural rendering, nerf, image enhancement, 3d vision, robustness |
2403.11887
Report |
SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules |
Xiangyu Chen, Jing Liu, Ye Wang, Pu Perry Wang, Matthew Brand, Guanghui Wang, Toshiaki Koike-Akino |
Low-rank adaptation (LoRA) and its variants are widely employed in
fine-tuning large models, including large language models for natural language
processing and diffusion models for computer vision. This paper proposes a
generalized framework called SuperLoRA that unifies and extends different LoRA
variants, which can be realized under different hyper-parameter settings.
Introducing grouping, folding, shuffling, projecting, and tensor factoring,
SuperLoRA offers high flexibility compared with other LoRA variants and
demonstrates superior performance for transfer learning tasks especially in the
extremely few-parameter regimes. |
This paper proposes SuperLoRA, a generalized framework unifying and extending LoRA variants for parameter-efficient fine-tuning of large models. |
SuperLoRA addresses limitations of existing LoRA methods by introducing grouping, folding, shuffling, and projection, enabling high flexibility and superior performance in transfer learning, especially with extremely few parameters. |
SuperLoRA concatenates weight updates across layers, divides them into groups, reshapes them into regular tensors, applies low-rank decomposition (LoRA, LoNKr, or LoRTA), and projects the results with a fixed mapping function (e.g., fastfood projection). |
SuperLoRA achieves 3-10x parameter efficiency compared to LoRA in image classification and generation tasks.
Reshaping weight updates to regular tensors significantly improves performance, allowing higher rank usage with fewer parameters.
Fixed random projection enables further parameter reduction while maintaining competitive accuracy. |
Exploring more efficient projection functions for extremely low-parameter regimes.
Applying and evaluating SuperLoRA to various large models (e.g., LLMs) and transfer learning tasks. |
low-rank adaptation, parameter-efficient fine-tuning, transfer learning, tensor rank decomposition, efficient ai |
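The grouping and reshaping steps in the SuperLoRA summary above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `grouped_lowrank_update`, the square reshaping, and the initialization are assumptions, and the shuffling, projection, and tensor-factorization (LoNKr, LoRTA) variants are left out.

```python
import numpy as np

def grouped_lowrank_update(shapes, num_groups, rank, seed=0):
    """Build per-layer weight updates from per-group low-rank factors.

    shapes: list of (rows, cols) weight shapes, treated as one flat vector.
    Returns the list of per-layer updates and the trainable parameter count.
    """
    rng = np.random.default_rng(seed)
    total = sum(int(np.prod(s)) for s in shapes)
    group_len = -(-total // num_groups)        # ceiling division
    side = int(np.ceil(np.sqrt(group_len)))    # reshape each group to a square matrix

    flat_parts, n_params = [], 0
    for _ in range(num_groups):
        A = rng.normal(size=(side, rank)) * 0.01
        B = np.zeros((rank, side))             # zero init: the update starts at 0, as in LoRA
        n_params += A.size + B.size
        # Low-rank product, flattened and trimmed to the group's length.
        flat_parts.append((A @ B).reshape(-1)[:group_len])

    flat = np.concatenate(flat_parts)[:total]

    # Split the flat vector back into the original layer shapes.
    updates, offset = [], 0
    for s in shapes:
        n = int(np.prod(s))
        updates.append(flat[offset:offset + n].reshape(s))
        offset += n
    return updates, n_params

updates, n_params = grouped_lowrank_update([(8, 8), (8, 16)], num_groups=2, rank=2)
```

Because the groups are reshaped independently of the original layer boundaries, the rank and the number of groups can be chosen freely, which is the source of the parameter savings in this toy setting.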
2403.11882
Report |
ReGenNet: Towards Human Action-Reaction Synthesis |
Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng |
Humans constantly interact with their surrounding environments. Current
human-centric generative models mainly focus on synthesizing humans plausibly
interacting with static scenes and objects, while the dynamic human
action-reaction synthesis for ubiquitous causal human-human interactions is
less explored. Human-human interactions can be regarded as asymmetric with
actors and reactors in atomic interaction periods. In this paper, we
comprehensively analyze the asymmetric, dynamic, synchronous, and detailed
nature of human-human interactions and propose the first multi-setting human
action-reaction synthesis benchmark to generate human reactions conditioned on
given human actions. To begin with, we propose to annotate the actor-reactor
order of the interaction sequences for the NTU120, InterHuman, and Chi3D
datasets. Based on them, a diffusion-based generative model with a Transformer
decoder architecture called ReGenNet together with an explicit distance-based
interaction loss is proposed to predict human reactions in an online manner,
where the future states of actors are unavailable to reactors. Quantitative and
qualitative results show that our method can generate instant and plausible
human reactions compared to the baselines, and can generalize to unseen actor
motions and viewpoint changes. |
This paper introduces the first multi-setting human action-reaction synthesis benchmark and proposes ReGenNet, a diffusion-based model, to generate plausible and instant human reactions. |
Modeling human-human interactions, crucial for applications like AR/VR and gaming, is challenging due to its asymmetric, dynamic, synchronous, and detailed nature, which previous works have not addressed holistically. |
The authors annotate actor-reactor order in existing datasets (NTU120, Chi3D, InterHuman) and propose ReGenNet, a diffusion model with a Transformer decoder architecture. ReGenNet uses an explicit distance-based interaction loss to model the relative distances of interacted body poses, orientations, and translations. |
ReGenNet outperforms baselines in FID, demonstrating closer proximity to real human reaction distributions.
The model shows strong generalization ability to unseen actor motions and viewpoint changes.
ReGenNet is modular and can be customized for various settings like offline and intention-aware reaction generation. |
Current benchmark focuses on atomic action periods and can be extended to handle longer interactions with role transitions.
Dataset quality can be improved with less noisy motion capture and more natural facial expressions. |
human action-reaction synthesis, human motion generation, diffusion models, transformer decoders, human-human interaction |
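One way to read the distance-based interaction loss in the ReGenNet summary above is as a penalty on the mismatch between actor-reactor joint distances in the predicted and the ground-truth reaction. The sketch below assumes that reading and covers joint positions only; the function names and the mean-squared form are assumptions, and the orientation and translation terms mentioned in the summary are omitted.

```python
import numpy as np

def pairwise_joint_distances(actor, reactor):
    """actor, reactor: arrays of shape (T, J, 3). Returns (T, J, J) distances."""
    diff = actor[:, :, None, :] - reactor[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

def interaction_loss(actor, reactor_pred, reactor_gt):
    """Mean squared error between predicted and ground-truth actor-reactor distances."""
    d_pred = pairwise_joint_distances(actor, reactor_pred)
    d_gt = pairwise_joint_distances(actor, reactor_gt)
    return float(np.mean((d_pred - d_gt) ** 2))

# Toy data: 2 frames, 3 joints per person.
actor = np.zeros((2, 3, 3))
reactor_gt = np.ones((2, 3, 3))
loss_same = interaction_loss(actor, reactor_gt, reactor_gt)
loss_shifted = interaction_loss(actor, reactor_gt + 1.0, reactor_gt)
```

A loss of this kind depends only on relative geometry between the two people, so it is unchanged if both are moved together by the same rigid transform.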
2403.11878
Report |
InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting |
Jiaxiang Tang, Ruijie Lu, Xiaokang Chen, Xiang Wen, Gang Zeng, Ziwei Liu |
Text-to-texture synthesis has become a new frontier in 3D content creation
thanks to the recent advances in text-to-image models. Existing methods
primarily adopt a combination of pretrained depth-aware diffusion and
inpainting models, yet they exhibit shortcomings such as 3D inconsistency and
limited controllability. To address these challenges, we introduce InTeX, a
novel framework for interactive text-to-texture synthesis. 1) InTeX includes a
user-friendly interface that facilitates interaction and control throughout the
synthesis process, enabling region-specific repainting and precise texture
editing. 2) Additionally, we develop a unified depth-aware inpainting model
that integrates depth information with inpainting cues, effectively mitigating
3D inconsistencies and improving generation speed. Through extensive
experiments, our framework has proven to be both practical and effective in
text-to-texture synthesis, paving the way for high-quality 3D content creation. |
Introduces InTeX, an interactive text-to-texture synthesis framework using a unified depth-aware inpainting model. |
Addresses limitations in existing methods like 3D inconsistency, limited controllability, and lack of user interaction in texture synthesis. |
Trains a unified depth-aware inpainting prior model on 3D datasets and employs an iterative texture synthesis algorithm with a user-friendly GUI for interaction. |
Generates high-quality textures with enhanced detail and 3D consistency compared to previous methods.
Enables interactive visualization, inpainting, and repainting of textures through a user-friendly GUI.
Significantly faster (30 seconds per instance) than previous iterative inpainting methods. |
Single-view rendering can lead to 3D inconsistencies in the iterative inpainting process.
Reliance on auto-generated UV maps when artist-created ones are unavailable can impact texture symmetry. |
text-to-texture synthesis, 3d content creation, diffusion models, depth-aware inpainting, interactive design |
2403.11868
Report |
View-Consistent 3D Editing with Gaussian Splatting |
Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang |
The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing,
offering efficient, high-fidelity rendering and enabling precise local
manipulations. Currently, diffusion-based 2D editing models are harnessed to
modify multi-view rendered images, which then guide the editing of 3DGS models.
However, this approach faces a critical issue of multi-view inconsistency,
where the guidance images exhibit significant discrepancies across views,
leading to mode collapse and visual artifacts of 3DGS. To this end, we
introduce View-consistent Editing (VcEdit), a novel framework that seamlessly
incorporates 3DGS into image editing processes, ensuring multi-view consistency
in edited guidance images and effectively mitigating mode collapse issues.
VcEdit employs two innovative consistency modules: the Cross-attention
Consistency Module and the Editing Consistency Module, both designed to reduce
inconsistencies in edited images. By incorporating these consistency modules
into an iterative pattern, VcEdit proficiently resolves the issue of multi-view
inconsistency, facilitating high-quality 3DGS editing across a diverse range of
scenes. Further code and video results are released at
http://yuxuanw.me/vcedit/. |
Introduces View-consistent Editing (VcEdit), a framework for high-quality 3D Gaussian Splatting (3DGS) editing that ensures multi-view consistency in guidance images to address mode collapse issues. |
Image-guided 3DGS editing often suffers from multi-view inconsistency in edited guidance images, leading to mode collapse and visual artifacts. |
VcEdit incorporates two novel consistency modules: the Cross-attention Consistency Module (CCM) harmonizes attention maps across views, and the Editing Consistency Module (ECM) calibrates editing outputs using 3DGS. These modules operate within an iterative pattern to refine editing quality. |
Effectively addresses multi-view inconsistency in edited images, resulting in superior 3DGS editing quality.
Outperforms state-of-the-art methods in both qualitative and quantitative evaluations, including CLIP similarity and user studies.
Demonstrates strong adaptability in handling diverse scenes and prompts, ranging from facial details to large-scale scene modifications. |
Performance depends on the quality of 2D image editing models, which can sometimes struggle with complex prompts.
Limitations in handling non-rigid editing scenarios with drastic shape changes due to high inconsistency in 2D editing outputs. |
3d gaussian splatting, 3d editing, multi-view consistency, text-guided image editing, diffusion models |
2403.11835
Report |
Agent3D-Zero: An Agent for Zero-shot 3D Understanding |
Sha Zhang, Di Huang, Jiajun Deng, Shixiang Tang, Wanli Ouyang, Tong He, Yanyong Zhang |
The ability to understand and reason about the 3D real world is a crucial milestone
towards artificial general intelligence. The current common practice is to
finetune Large Language Models (LLMs) with 3D data and texts to enable 3D
understanding. Despite their effectiveness, these approaches are inherently
limited by the scale and diversity of the available 3D data. Alternatively, in
this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework
addressing the 3D scene understanding in a zero-shot manner. The essence of our
approach centers on reconceptualizing the challenge of 3D scene perception as a
process of understanding and synthesizing insights from multiple images,
inspired by how human beings attempt to understand 3D scenes. By
consolidating this idea, we propose a novel way to make use of a Large Visual
Language Model (VLM) via actively selecting and analyzing a series of
viewpoints for 3D understanding. Specifically, given an input 3D scene,
Agent3D-Zero first processes a bird's-eye view image with custom-designed
visual prompts, then iteratively chooses the next viewpoints to observe and
summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is
the introduction of novel visual prompts, which significantly unleash the VLMs'
ability to identify the most informative viewpoints and thus facilitate
observing 3D scenes. Extensive experiments demonstrate the effectiveness of the
proposed framework in understanding diverse and previously unseen 3D
environments. |
This paper introduces Agent3D-Zero, an agent framework that leverages Vision-Language Models (VLMs) for zero-shot 3D scene understanding using only multi-view images, eliminating the need for explicit 3D data. |
Collecting and annotating 3D data is resource-intensive, limiting the scalability of existing 3D scene understanding methods that rely on 3D data. This work explores a zero-shot approach using VLMs to overcome this limitation. |
Agent3D-Zero employs an iterative viewpoint selection process guided by a novel visual prompting technique called Set-of-Line Prompting (SoLP) to enhance the VLM's understanding of spatial relationships within a scene. SoLP uses a bird's-eye view image with superimposed grid lines to aid in viewpoint selection. |
Agent3D-Zero outperforms previous state-of-the-art methods on the ScanQA dataset for 3D question answering, demonstrating its effectiveness in zero-shot 3D scene understanding.
The method shows promising results in other 3D tasks such as task decomposition, 3D-assisted dialog, and 3D scene captioning, indicating its potential as a general framework for 3D scene analysis.
Ablation studies confirm the importance of viewpoint selection and the effectiveness of SoLP in improving the model's performance on 3D understanding tasks. |
The current implementation of Agent3D-Zero exhibits limitations in precise and mathematical pose estimation due to constraints in the VLM's ability to interpret highly dense visual prompts.
Future research will focus on enhancing the agent's navigation capabilities and extending its application to a wider array of real-world scenarios, further bridging the gap between language models and 3D scene understanding. |
3d scene understanding, vision-language models, zero-shot learning, viewpoint selection, visual prompting |
2403.11831
Report |
BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting |
Lingzhe Zhao, Peng Wang, Peidong Liu |
While neural rendering has demonstrated impressive capabilities in 3D scene
reconstruction and novel view synthesis, it heavily relies on high-quality
sharp images and accurate camera poses. Numerous approaches have been proposed
to train Neural Radiance Fields (NeRF) with motion-blurred images, commonly
encountered in real-world scenarios such as low-light or long-exposure
conditions. However, the implicit representation of NeRF struggles to
accurately recover intricate details from severely motion-blurred images and
cannot achieve real-time rendering. In contrast, recent advancements in 3D
Gaussian Splatting achieve high-quality 3D scene reconstruction and real-time
rendering by explicitly optimizing point clouds as Gaussian spheres.
In this paper, we introduce a novel approach, named BAD-Gaussians (Bundle
Adjusted Deblur Gaussian Splatting), which leverages explicit Gaussian
representation and handles severe motion-blurred images with inaccurate camera
poses to achieve high-quality scene reconstruction. Our method models the
physical image formation process of motion-blurred images and jointly learns
the parameters of Gaussians while recovering camera motion trajectories during
exposure time.
In our experiments, we demonstrate that BAD-Gaussians not only achieves
superior rendering quality compared to previous state-of-the-art deblur neural
rendering methods on both synthetic and real datasets but also enables
real-time rendering capabilities.
Our project page and source code are available at
https://lingzhezhao.github.io/BAD-Gaussians/ |
This paper introduces BAD-Gaussians, a novel method for reconstructing high-quality 3D scenes from motion-blurred images with inaccurate camera poses, leveraging the explicit representation of 3D Gaussian Splatting and achieving real-time rendering. |
Existing neural rendering methods, including NeRF and 3D Gaussian Splatting, struggle to handle motion-blurred images due to the violation of sharp image assumptions and difficulties in accurate camera pose estimation. This hinders their application in real-world scenarios with motion blur. |
BAD-Gaussians models the physical image formation process of motion blur and jointly optimizes Gaussian parameters and camera motion trajectories within exposure time. It represents camera trajectories using spline functions and synthesizes blurred images by averaging virtual sharp images rendered from interpolated camera poses along the trajectory. The optimization is achieved by minimizing the photometric error between synthesized and input blurred images. |
BAD-Gaussians outperforms previous state-of-the-art deblurring neural rendering methods on both synthetic and real datasets in terms of rendering quality.
The method achieves real-time rendering capabilities, surpassing the limitations of implicit neural rendering techniques.
BAD-Gaussians effectively recovers accurate camera poses from motion-blurred images, demonstrating robustness against pose inaccuracies. |
The performance of BAD-Gaussians can be affected by the accuracy of the initial camera poses and sparse point clouds obtained from COLMAP.
The assumption of short exposure time may limit the generalizability of the method to scenarios with very long exposures. |
3d gaussian splatting, deblurring, bundle adjustment, differentiable rendering, motion blur |
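The blur formation model in the BAD-Gaussians summary above (averaging virtual sharp images rendered along the camera trajectory, then comparing with the observed blurry image) can be sketched as follows. This is a toy illustration under strong assumptions: poses are reduced to 3D positions and interpolated linearly rather than with a spline, and `toy_render` is a hypothetical stand-in for a Gaussian Splatting renderer.

```python
import numpy as np

def interpolate_positions(p_start, p_end, n):
    """Linearly interpolate n camera positions across the exposure interval."""
    return [(1 - t) * p_start + t * p_end for t in np.linspace(0.0, 1.0, n)]

def synthesize_blur(render, p_start, p_end, n=5):
    """Average the virtual sharp images rendered along the trajectory."""
    frames = [render(p) for p in interpolate_positions(p_start, p_end, n)]
    return np.mean(frames, axis=0)

def photometric_loss(blur_pred, blur_obs):
    """Mean squared error between the synthesized and observed blurry images."""
    return float(np.mean((blur_pred - blur_obs) ** 2))

def toy_render(position, width=8):
    """Hypothetical renderer: a 1D image with one bright pixel at the camera's x offset."""
    img = np.zeros(width)
    img[int(round(position[0])) % width] = 1.0
    return img

blur = synthesize_blur(toy_render, np.array([0.0, 0.0, 0.0]), np.array([4.0, 0.0, 0.0]), n=5)
```

In the toy example the single bright pixel is smeared evenly over the five positions visited during the exposure. In the method described above, the trajectory endpoints and the scene parameters would both be optimized to reduce the photometric loss.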
2403.11796
Report |
OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation |
Haochen Jiang, Yueming Xu, Yihan Zeng, Hang Xu, Wei Zhang, Jianfeng Feng, Li Zhang |
3D reconstruction has been widely used in autonomous navigation for
mobile robotics. However, prior research can only provide the basic
geometric structure without the capability of open-world scene understanding,
limiting advanced tasks like human interaction and visual navigation. Moreover,
traditional 3D scene understanding approaches rely on expensive labeled 3D
datasets to train a model for a single task with supervision. Thus, geometric
reconstruction with zero-shot scene understanding, i.e., open-vocabulary 3D
understanding and reconstruction, is crucial for the future development of
mobile robots. In this paper, we propose OpenOcc, a novel framework unifying
the 3D scene reconstruction and open vocabulary understanding with neural
radiance fields. We model the geometric structure of the scene with occupancy
representation and distill the pre-trained open vocabulary model into a 3D
language field via volume rendering for zero-shot inference. Furthermore, a
novel semantic-aware confidence propagation (SCP) method has been proposed to
relieve the issue of language field representation degeneracy caused by
inconsistent measurements in distilled features. Experimental results show that
our approach achieves competitive performance in 3D scene understanding tasks,
especially for small and long-tail objects. |
This paper presents OpenOcc, a novel framework that unifies 3D scene reconstruction and open-vocabulary understanding using neural radiance fields, enabling zero-shot semantic segmentation. |
Existing 3D reconstruction methods often lack semantic understanding, while traditional 3D scene understanding approaches struggle with open-world scenarios and require extensive labeled data. This work addresses these limitations by integrating both aspects into a single framework. |
OpenOcc employs an occupancy representation for efficient geometric reconstruction and distills pre-trained open-vocabulary 2D segmentation features into a 3D language field. A novel semantic-aware confidence propagation (SCP) method mitigates inconsistencies in the language field arising from multi-view observations. |
OpenOcc achieves competitive performance on 3D semantic segmentation benchmarks, particularly for small and long-tail objects.
The method demonstrates superior accuracy in reconstructing shapes and contours of objects compared to baseline methods.
OpenOcc enables efficient open-vocabulary 3D understanding with reduced memory and computational requirements compared to traditional approaches. |
The reconstruction quality is limited by the quality of input depth data, which can be noisy or incomplete, especially on datasets like ScanNet.
Future work could explore incorporating temporal information and object-level reasoning for improved scene understanding and dynamic scene reconstruction. |
3d reconstruction, open vocabulary, semantic segmentation, neural radiance fields, robotic visual navigation |
2403.11781
Report |
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm |
Yi Wu, Ziqiang Li, Heliang Zheng, Chaoyue Wang, Bin Li |
Drawing on recent advancements in diffusion models for text-to-image
generation, identity-preserved personalization has made significant progress in
accurately capturing specific identities with just a single reference image.
However, existing methods primarily integrate reference images within the text
embedding space, leading to a complex entanglement of image and text
information, which poses challenges for preserving both identity fidelity and
semantic consistency. To tackle this challenge, we propose Infinite-ID, an
ID-semantics decoupling paradigm for identity-preserved personalization.
Specifically, we introduce identity-enhanced training, incorporating an
additional image cross-attention module to capture sufficient ID information
while deactivating the original text cross-attention module of the diffusion
model. This ensures that the image stream faithfully represents the identity
provided by the reference image while mitigating interference from textual
input. Additionally, we introduce a feature interaction mechanism that combines
a mixed attention module with an AdaIN-mean operation to seamlessly merge the
two streams. This mechanism not only enhances the fidelity of identity and
semantic consistency but also enables convenient control over the styles of the
generated images. Extensive experimental results on both raw photo generation
and style image generation demonstrate the superior performance of our proposed
method. |
This paper introduces Infinite-ID, a novel identity-preserved personalization method for text-to-image generation that maintains high fidelity to a reference image while ensuring consistency with the text prompt. |
Existing methods struggle to balance identity fidelity and semantic consistency due to the entanglement of image and text information. Infinite-ID addresses this challenge to enable diverse applications like personalized AI portraits. |
The authors propose an ID-semantics decoupling paradigm. It uses identity-enhanced training with a dedicated image cross-attention module to capture identity information without text interference. A mixed attention mechanism then merges identity and text features during inference. An AdaIN-mean operation further refines style control. |
Infinite-ID outperforms state-of-the-art methods in preserving identity fidelity while maintaining semantic consistency.
The method demonstrates robust performance across various image resolutions and enables the mixing of multiple identities.
It excels in both raw photo generation and style image generation, showcasing its versatility and effectiveness. |
The method currently lacks multi-object personalization capabilities.
Artifacts may arise when the face is small in the input image, highlighting a limitation inherited from the base diffusion model. |
text-to-image generation, identity-preserved personalization, stable diffusion, diffusion models, attention mechanisms |
2403.11703
Report |
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images |
Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang |
Visual encoding constitutes the basis of large multimodal models (LMMs) in
understanding the visual world. Conventional LMMs process images in fixed sizes
and limited resolutions, while recent explorations in this direction are
limited in adaptivity, efficiency, and even correctness. In this work, we first
take GPT-4V and LLaVA-1.5 as representative examples and expose systematic
flaws rooted in their visual encoding strategy. To address the challenges, we
present LLaVA-UHD, a large multimodal model that can efficiently perceive
images in any aspect ratio and high resolution. LLaVA-UHD includes three key
components: (1) An image modularization strategy that divides native-resolution
images into smaller variable-sized slices for efficient and extensible
encoding, (2) a compression module that further condenses image tokens from
visual encoders, and (3) a spatial schema to organize slice tokens for LLMs.
Comprehensive experiments show that LLaVA-UHD outperforms established LMMs
trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our
model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088)
resolution images using only 94% inference computation, and achieves a 6.4
accuracy improvement on TextVQA. Moreover, the model can be efficiently trained
in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of
LLaVA-1.5). We make the data and code publicly available at
https://github.com/thunlp/LLaVA-UHD. |
This paper introduces LLaVA-UHD, a large multimodal model capable of efficiently processing images of any aspect ratio and high resolution. |
Current LMMs struggle with varied aspect ratios and high-resolution images, limiting their understanding of fine-grained details and increasing hallucination errors. This paper aims to address these limitations. |
LLaVA-UHD utilizes (1) an image modularization strategy to divide images into smaller variable-sized slices for efficient encoding, (2) a compression module to condense visual tokens, and (3) a spatial schema to organize slice tokens for LLM processing. |
LLaVA-UHD outperforms existing LMMs on 9 benchmarks, including those trained with significantly more data.
Compared to the LLaVA-1.5 backbone, LLaVA-UHD achieves a 6.4 accuracy improvement on TextVQA and supports 6 times larger resolution images with less computation.
The model demonstrates superior performance on images with extreme aspect ratios and excels in fine-grained recognition tasks. |
Current implementation is limited to a maximum resolution of 672x1008; future work will explore higher resolutions.
Image slices are currently encoded independently; future research will focus on establishing connections between slices for enhanced global information interaction. |
large multimodal models, visual encoding, high-resolution image understanding, image modularization, llava-uhd |
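The image modularization step in the LLaVA-UHD summary above can be illustrated with a small slice-grid calculation. This is a simplified sketch: the function name `choose_grid` and the selection rule (pick the factorization of the slice count whose aspect ratio is closest to the image's in log space) are assumptions, not the paper's exact scoring function. The 336x336 base size comes from the LLaVA-1.5 encoder resolution mentioned in the abstract.

```python
import math

def choose_grid(width, height, base=336):
    """Return (cols, rows) of slices for an image, given the encoder's base resolution."""
    # Number of base-resolution slices needed to cover the image area.
    n = max(1, math.ceil(width * height / (base * base)))
    target = math.log(width / height)
    best = None
    for cols in range(1, n + 1):
        if n % cols:
            continue  # only exact factorizations of n
        rows = n // cols
        score = abs(math.log(cols / rows) - target)
        if best is None or score < best[0]:
            best = (score, cols, rows)
    return best[1], best[2]

grid = choose_grid(672, 1008)
```

For a 672x1008 image the area is exactly six times 336x336, and the grid with the closest aspect ratio is 2 columns by 3 rows, i.e. (2 x 336) by (3 x 336).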
2403.11697
Report |
Urban Scene Diffusion through Semantic Occupancy Map |
Junge Zhang, Qihang Zhang, Li Zhang, Ramana Rao Kompella, Gaowen Liu, Bolei Zhou |
Generating unbounded 3D scenes is crucial for large-scale scene understanding
and simulation. Urban scenes, unlike natural landscapes, consist of various
complex man-made objects and structures such as roads, traffic signs, vehicles,
and buildings. To create a realistic and detailed urban scene, it is crucial to
accurately represent the geometry and semantics of the underlying objects,
going beyond their visual appearance. In this work, we propose UrbanDiffusion,
a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and
generates an urban scene with geometry and semantics in the form of semantic
occupancy map. Our model introduces a novel paradigm that learns the data
distribution of scene-level structures within a latent space and further
enables the expansion of the synthesized scene into an arbitrary scale. After
training on real-world driving datasets, our model can generate a wide range of
diverse urban scenes given the BEV maps from the held-out set and also
generalize to the synthesized maps from a driving simulator. We further
demonstrate its application to scene image synthesis with a pretrained image
generator as a prior. |
This paper proposes Urban Scene Diffusion through Semantic Occupancy Map (UrbanDiff), a novel 3D diffusion model for generating unbounded 3D urban scenes using semantic occupancy maps, conditioned on Bird's-Eye View (BEV) maps. |
Generating large-scale urban scenes with accurate geometry and semantics is crucial for applications like scene simulation and autonomous driving. Existing methods struggle to achieve this while preserving controllability and scalability. |
UrbanDiff employs a 3D VQVAE to encode semantic occupancy maps into a latent space, where a BEV-conditioned diffusion model learns the data distribution. A scene extension module enables the generation of large-scale scenes by aggregating single-frame outputs while maintaining temporal consistency. |
UrbanDiff generates diverse and realistic urban scenes from real-world and simulator-generated BEV maps.
Quantitative evaluation demonstrates superior performance over baseline methods in terms of V-FID, MMD, and human evaluation.
The generated scenes benefit downstream tasks like point cloud segmentation and can be used as a prior for scene image synthesis with promising results. |
The visual quality of synthesized scene images can be further improved.
Future work will focus on incorporating object instance information for enhanced realism. |
3d scene generation, diffusion models, semantic occupancy maps, bird's-eye view, urban scene synthesis |
2403.11679
Report |
NEDS-SLAM: A Novel Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting |
Yiming Ji, Yang Liu, Guanghu Xie, Boyu Ma, Zongwu Xie |
We propose NEDS-SLAM, an Explicit Dense semantic SLAM system based on 3D
Gaussian representation, that enables robust 3D semantic mapping, accurate
camera tracking, and high-quality rendering in real-time. In the system, we
propose a Spatially Consistent Feature Fusion model to reduce the effect of
erroneous estimates from pre-trained segmentation head on semantic
reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we
employ a lightweight encoder-decoder to compress the high-dimensional semantic
features into a compact 3D Gaussian representation, mitigating the burden of
excessive memory consumption. Furthermore, we leverage the advantage of 3D
Gaussian splatting, which enables efficient and differentiable novel view
rendering, and propose a Virtual Camera View Pruning method to eliminate
outlier GS points, thereby effectively enhancing the quality of scene
representations. Our NEDS-SLAM method demonstrates competitive performance over
existing dense semantic SLAM methods in terms of mapping and tracking accuracy
on Replica and ScanNet datasets, while also showing excellent capabilities in
3D dense semantic mapping. |
Proposes NEDS-SLAM, an explicit dense semantic SLAM system using 3D Gaussian Splatting for robust 3D semantic mapping, camera tracking, and real-time rendering. |
Addresses limitations in existing semantic SLAM methods that rely on accurate semantic pre-segmentation and suffer from inconsistent semantic feature estimation. |
Combines semantic and appearance features with a fusion module for spatial consistency, compresses semantic features with an encoder-decoder, and employs a virtual camera view pruning method to remove noisy Gaussians. |
Achieves competitive mapping and tracking accuracy compared to existing dense semantic SLAM methods on Replica and ScanNet datasets.
Demonstrates robust semantic reconstruction by mitigating the impact of inconsistent semantic features from pre-trained models.
Improves scene representation quality by effectively eliminating outlier Gaussian points through the virtual view pruning method. |
Virtual view pruning increases computational load and may affect real-time performance.
Future work includes optimizing the virtual view method and extending semantic reconstruction to dynamic scenes. |
3d gaussian splatting, dense semantic mapping, neural slam, 3d reconstruction, semantic feature fusion |
2403.11627
Report |
LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models |
Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, Wei Liu |
Customization generation techniques have significantly advanced the synthesis
of specific concepts across varied contexts. Multi-concept customization
emerges as the challenging task within this domain. Existing approaches often
rely on training a Low-Rank Adaptations (LoRA) fusion matrix of multiple LoRA
to merge various concepts into a single image. However, we identify this
straightforward method faces two major challenges: 1) concept confusion, which
occurs when the model cannot preserve distinct individual characteristics, and
2) concept vanishing, where the model fails to generate the intended subjects.
To address these issues, we introduce LoRA-Composer, a training-free framework
designed for seamlessly integrating multiple LoRAs, thereby enhancing the
harmony among different concepts within generated images. LoRA-Composer
addresses concept vanishing through Concept Injection Constraints, enhancing
concept visibility via an expanded cross-attention mechanism. To combat concept
confusion, Concept Isolation Constraints are introduced, refining the
self-attention computation. Furthermore, Latent Re-initialization is proposed
to effectively stimulate concept-specific latent within designated regions. Our
extensive testing showcases a notable enhancement in LoRA-Composer's
performance compared to standard baselines, especially when eliminating the
image-based conditions like canny edge or pose estimations. Code is released at
https://github.com/Young98CN/LoRA_Composer. |
This paper introduces LoRA-Composer, a training-free framework for multi-concept image customization that seamlessly integrates multiple pre-trained concepts encoded as LoRAs. |
Existing multi-concept customization methods face challenges like concept confusion and concept vanishing, particularly without relying on image-based conditions like sketches or poses. LoRA-Composer addresses these limitations, offering more flexibility and accuracy. |
LoRA-Composer introduces a novel LoRA-Composer Block within the Stable Diffusion U-Net. It employs Concept Injection Constraints with Region-Aware LoRA Injection and Concept Enhancement to mitigate concept vanishing. It utilizes Concept Isolation Constraints with a concept region mask and Region Perceptual Restriction to address concept confusion. Finally, Latent Re-initialization enhances layout generation by refining the latent space. |
Outperforms baselines in image similarity across anime and realistic styles, demonstrating effective concept representation.
Exhibits robustness even without image-based conditions, unlike methods like Mix-of-Show.
User study confirms preference for LoRA-Composer, especially for its text-to-image and image-to-image alignment accuracy. |
Concept boundaries can disappear when concepts are too close due to down-sampling.
Foreground pixels might exceed layout boundaries due to Stable Diffusion's inherent design. Future work will focus on refining the attention mechanism and optimizing inference efficiency. |
multi-concept customization, lora integration, training-free, controllable generation, diffusion models |
2403.11589
Report |
UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling |
Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan |
Reconstructing photo-realistic drivable human avatars from multi-view image
sequences has been a popular and challenging topic in the field of computer
vision and graphics. While existing NeRF-based methods can achieve high-quality
novel view rendering of human models, both training and inference processes are
time-consuming. Recent approaches have utilized 3D Gaussians to represent the
human body, enabling faster training and rendering. However, they underestimate the
importance of the mesh guidance and directly predict Gaussians in 3D space with
coarse mesh guidance. This hinders the learning procedure of the Gaussians and
tends to produce blurry textures. Therefore, we propose UV Gaussians, which
models the 3D human body by jointly learning mesh deformations and 2D UV-space
Gaussian textures. We utilize the embedding of UV map to learn Gaussian
textures in 2D space, leveraging the capabilities of powerful 2D networks to
extract features. Additionally, through an independent Mesh network, we
optimize pose-dependent geometric deformations, thereby guiding Gaussian
rendering and significantly enhancing rendering quality. We collect and process
a new dataset of human motion, which includes multi-view images, scanned
models, parametric model registration, and corresponding texture maps.
Experimental results demonstrate that our method achieves state-of-the-art
synthesis of novel view and novel pose. The code and data will be made
available on the homepage https://alex-jyj.github.io/UV-Gaussians/ once the
paper is accepted. |
This paper introduces UV Gaussians, a novel method combining 3D Gaussian Splatting and mesh deformation to reconstruct photo-realistic and animatable human avatars from multi-view images. |
Existing NeRF-based methods for human avatar modeling are computationally expensive, while recent 3D Gaussian-based methods overlook the importance of accurate mesh guidance for high-quality rendering. |
UV Gaussians jointly learns pose-dependent mesh deformations using a Mesh U-Net and 2D UV-space Gaussian textures using a Gaussian U-Net. It then uses the refined mesh to guide the animation of 3D Gaussians for rendering. |
Achieves state-of-the-art performance in novel view synthesis, outperforming NeRF-based and other 3DGS-based methods.
Exhibits superior quality in novel pose synthesis, accurately capturing clothing wrinkles and texture details.
Demonstrates the effectiveness of mesh guidance and UV space representation for high-fidelity human avatar modeling. |
Reliance on scanned mesh data limits applicability to scenarios without such information.
Limited evaluation on extremely loose clothing types like long skirts. |
human modeling, neural rendering, gaussian splatting, 3d avatars, mesh deformation |
2403.11568
Report |
EffiVED: Efficient Video Editing via Text-instruction Diffusion Models |
Zhenghao Zhang, Zuozhuo Dai, Long Qin, Weizhi Wang |
Large-scale text-to-video models have shown remarkable abilities, but their
direct application in video editing remains challenging due to limited
available datasets. Current video editing methods commonly require per-video
fine-tuning of diffusion models or specific inversion optimization to ensure
high-fidelity edits. In this paper, we introduce EffiVED, an efficient
diffusion-based model that directly supports instruction-guided video editing.
To achieve this, we present two efficient workflows to gather video editing
pairs, utilizing augmentation and fundamental vision-language techniques. These
workflows transform vast image editing datasets and open-world videos into a
high-quality dataset for training EffiVED. Experimental results reveal that
EffiVED not only generates high-quality editing videos but also executes
rapidly. Finally, we demonstrate that our data collection method significantly
improves editing performance and can potentially tackle the scarcity of video
editing data. The datasets will be made publicly available upon publication. |
This paper introduces EffiVED, an efficient diffusion-based model for instruction-guided video editing that does not require per-video fine-tuning. |
Current video editing methods are computationally expensive, often requiring per-video fine-tuning or inversion optimization. |
The authors propose two workflows to generate a video editing dataset from: 1) image editing datasets using data augmentation to simulate camera movements, and 2) open-world videos using LLM (ChatGPT) to generate editing instructions and CoDeF to create edited videos. EffiVED is trained on this dataset using a 3D U-Net architecture with decoupled classifier-free guidance for text and video conditions. |
EffiVED achieves comparable editing quality to state-of-the-art methods like CoDeF.
EffiVED is significantly faster than previous methods, achieving a speedup of 6 to 28 times.
The proposed data collection method effectively converts existing resources into a high-quality video editing dataset, addressing the data scarcity issue. |
The quality of generated videos can be further improved, especially for complex motion editing.
The model's ability to generalize to unseen editing instructions and video domains needs further exploration. |
video editing, diffusion models, text-guided synthesis, data augmentation, large language models |
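The decoupled classifier-free guidance named in the EffiVED entry above can be illustrated with a minimal sketch. It assumes the common two-condition formulation (the one popularized for instruction-based image editing), in which the video condition and the text condition each get their own guidance scale. The entry does not state EffiVED's exact formula, so the function name `decoupled_cfg`, its parameters, and the combination rule are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence


def decoupled_cfg(
    eps_uncond: Sequence[float],
    eps_video: Sequence[float],
    eps_full: Sequence[float],
    video_scale: float,
    text_scale: float,
) -> List[float]:
    """Combine three noise predictions with separate guidance scales.

    eps_uncond: prediction with neither condition (hypothetical input)
    eps_video:  prediction with the video condition only
    eps_full:   prediction with both the video and the text condition

    Assumed rule (not confirmed by the source):
        eps = eps_uncond
              + video_scale * (eps_video - eps_uncond)
              + text_scale  * (eps_full  - eps_video)
    """
    if not (len(eps_uncond) == len(eps_video) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [
        u + video_scale * (v - u) + text_scale * (f - v)
        for u, v, f in zip(eps_uncond, eps_video, eps_full)
    ]


if __name__ == "__main__":
    # Toy two-element "noise predictions" standing in for full tensors.
    uncond = [0.0, 1.0]
    video_only = [1.0, 1.0]
    video_and_text = [1.0, 3.0]
    print(decoupled_cfg(uncond, video_only, video_and_text, 2.0, 0.5))
```

With both scales set to 1 the rule reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, which is a quick sanity check on any variant of this formula.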
2403.11535
Report |
EchoReel: Enhancing Action Generation of Existing Video Diffusion Models |
Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song |
Recent large-scale video datasets have facilitated the generation of diverse
open-domain videos by Video Diffusion Models (VDMs). Nonetheless, the efficacy
of VDMs in assimilating complex knowledge from these datasets remains
constrained by their inherent scale, leading to suboptimal comprehension and
synthesis of numerous actions. In this paper, we introduce EchoReel, a novel
approach to augment the capability of VDMs in generating intricate actions by
emulating motions from pre-existing videos, which are readily accessible from
databases or online repositories. EchoReel seamlessly integrates with existing
VDMs, enhancing their ability to produce realistic motions without compromising
their fundamental capabilities. Specifically, the Action Prism (AP) is
introduced to distill motion information from reference videos, which requires
training on only a small dataset. Leveraging the knowledge from pre-trained
VDMs, EchoReel incorporates new action features into VDMs through the
additional layers, eliminating the need for any further fine-tuning of
untrained actions. Extensive experiments demonstrate that EchoReel is not
merely replicating the whole content from references, and it significantly
improves the generation of realistic actions, even in situations where existing
VDMs might directly fail. |
This paper introduces EchoReel, a novel framework that enhances the ability of existing Video Diffusion Models (VDMs) to generate complex human actions by leveraging readily available videos as references in an in-context learning approach. |
Existing VDMs struggle to learn and synthesize a wide range of actions due to limitations in model scale and data diversity. EchoReel addresses this by enabling VDMs to learn and imitate intricate actions from reference videos, even those not encountered during training. |
EchoReel consists of two main components: (1) Action Prism: Extracts motion-related features from reference videos using a transformer-based architecture with spatial and temporal self-attention and spatial cross-attention. (2) Action Integration: Integrates extracted motion features into the VDM through newly added temporal cross-attention layers, guiding action generation without altering pre-trained layers. |
EchoReel significantly improves action generation quality in pre-trained VDMs, as evidenced by substantial reductions in FVD scores and improvements in text-visual alignment and frame consistency.
The framework generalizes well to multiple reference videos and shows promising results in image-to-video generation tasks.
Ablation studies confirm the importance of each component and design choice within EchoReel, highlighting the effectiveness of the proposed action extraction and integration mechanisms. |
EchoReel currently faces limitations in improving the generation of objects involved in actions, particularly when the base VDM struggles with synthesizing those objects.
Future work will focus on addressing this limitation by exploring methods to enhance the generation of both actions and related objects. |
video generation, in-context learning, diffusion model, action recognition, motion imitation |
2403.11503
Report |
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors |
Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong |
We propose a novel image editing technique that enables 3D manipulations on
single images, such as object rotation and translation. Existing 3D-aware image
editing approaches typically rely on synthetic multi-view datasets for training
specialized models, thus constraining their effectiveness on open-domain images
featuring significantly more varied layouts and styles. In contrast, our method
directly leverages powerful image diffusion models trained on a broad spectrum
of text-image pairs and thus retains their exceptional generalization abilities.
This objective is realized through the development of an iterative novel view
synthesis and geometry alignment algorithm. The algorithm harnesses diffusion
models for dual purposes: they provide appearance prior by predicting novel
views of the selected object using estimated depth maps, and they act as a
geometry critic by correcting misalignments in 3D shapes across the sampled
views. Our method can generate high-quality 3D-aware image edits with large
viewpoint transformations and high appearance and shape consistency with the
input image, pushing the boundaries of what is possible with single-image
3D-aware editing. |
This paper introduces a novel single-image 3D-aware editing method that leverages pre-trained diffusion models, enabling 3D object manipulations (e.g., rotation, translation) on open-domain images without requiring specialized training datasets. |
Existing 3D-aware editing techniques often rely on synthetic datasets, limiting their effectiveness on real-world images with diverse styles and layouts. This method addresses this limitation by utilizing the powerful generalization capabilities of large-scale, pre-trained image diffusion models. |
The method employs an iterative algorithm with three phases: (1) View synthesis using depth-based warping and layered diffusion inpainting, (2) Undistortion to correct geometric imperfections using diffusion models as geometry critics, and (3) Shape alignment to refine object shapes using dense image correspondences. |
The method generates high-quality 3D edits with large viewpoint transformations while maintaining appearance and shape consistency with the input image.
It outperforms previous methods, including OBJect-3DIT and Zero123, in terms of layout plausibility, image quality, and appearance consistency, as demonstrated by visual comparisons and quantitative metrics.
A user study confirms the superiority of the method, with participants significantly preferring its editing results over other approaches. |
The method's ability to preserve extremely fine details is limited by the capabilities of the pre-trained diffusion models.
Handling large transformations where minimal object regions are visible in the target view remains challenging, requiring further research to enhance robustness. |
diffusion models, 3d-aware image editing, tuning-free editing, novel view synthesis, geometry correction |
2403.11481
Report |
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding |
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li |
We explore how reconciling several foundation models (large language models
and vision-language models) with a novel unified memory mechanism could tackle
the challenging video understanding problem, especially capturing the long-term
temporal relations in lengthy videos. In particular, the proposed multimodal
agent VideoAgent: 1) constructs a structured memory to store both the generic
temporal event descriptions and object-centric tracking states of the video; 2)
given an input task query, it employs tools including video segment
localization and object memory querying along with other visual foundation
models to interactively solve the task, utilizing the zero-shot tool-use
ability of LLMs. VideoAgent demonstrates impressive performance on several
long-horizon video understanding benchmarks, with an average increase of 6.6% on
NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between
open-sourced models and private counterparts including Gemini 1.5 Pro. |
Proposes VideoAgent, an LLM-powered multimodal tool-use agent for video understanding that leverages a novel unified memory mechanism. |
Addresses the limitations of current end-to-end video-language models in handling long-form videos with complex temporal dependencies, which suffer from high computational cost and attention limitations. |
Constructs a unified memory consisting of a temporal memory storing segment-level descriptions and an object memory tracking object states. It utilizes a minimal set of tools (caption retrieval, segment localization, visual question answering, object memory querying) to interact with this memory and solve tasks. |
Achieves state-of-the-art performance on EgoSchema, outperforming baselines by up to 26% and approaching the accuracy of Gemini 1.5 Pro.
Demonstrates strong performance on Ego4D NLQ, exceeding supervised baselines like 2D-TAN and VSLNet in a zero-shot setting.
Outperforms other methods on NExT-QA, particularly excelling in causal questions that demand robust temporal reasoning, and shows significant improvement over using individual tools like Video-LLaVA alone. |
Limited exploration of real-world applications.
Potential for further investigation into incorporating more sophisticated tools and reasoning mechanisms. |
video understanding, llms, tool-use, multimodal agents, unified memory |
2403.11453
Report |
Bridging 3D Gaussian and Mesh for Freeview Video Rendering |
Yuting Xiao, Xuan Wang, Jiafei Li, Hongrui Cai, Yanbo Fan, Nan Xue, Minghui Yang, Yujun Shen, Shenghua Gao |
This is only a preview version of GauMesh. Recently, primitive-based
rendering has been proven to achieve convincing results in solving the problem
of modeling and rendering the 3D dynamic scene from 2D images. Despite this, in
the context of novel view synthesis, each type of primitive has its inherent
defects in terms of representation ability. It is difficult to exploit the mesh
to depict the fuzzy geometry. Meanwhile, the point-based splatting (e.g. the 3D
Gaussian Splatting) method usually produces artifacts or blurry pixels in the
area with smooth geometry and sharp textures. As a result, it is difficult,
if not impossible, to represent the complex and dynamic scene with a single
type of primitive. To this end, we propose a novel approach, GauMesh, to bridge
the 3D Gaussian and Mesh for modeling and rendering the dynamic scenes. Given a
sequence of tracked mesh as initialization, our goal is to simultaneously
optimize the mesh geometry, color texture, opacity maps, a set of 3D Gaussians,
and the deformation field. At a specific time, we perform $\alpha$-blending on
the RGB and opacity values based on the merged and re-ordered z-buffers from
mesh and 3D Gaussian rasterizations. This produces the final rendering, which
is supervised by the ground-truth image. Experiments demonstrate that our
approach adapts the appropriate type of primitives to represent the different
parts of the dynamic scene and outperforms all the baseline methods in both
quantitative and qualitative comparisons without losing render speed. |
Presents GauMesh, a novel approach for freeview video rendering that bridges the strengths of 3D Gaussian splatting and triangle meshes in a hybrid representation. |
Addresses limitations of using a single primitive type for representing complex dynamic scenes, aiming to leverage the advantages of each type for improved visual quality and rendering efficiency. |
Employs a hybrid differentiable rendering pipeline that blends 3D Gaussians and textured meshes. Uses a grid-based deformation field for 3D Gaussians and a mesh tracking approach initialized from keyframes. |
Achieves state-of-the-art performance on the Multiface dataset, demonstrating superior visual quality compared to baselines.
Effectively reconstructs both complex geometry (e.g., hair) and fine color details on smooth surfaces (e.g., facial features).
Maintains fast rendering capabilities due to the use of rasterization-based rendering for both 3D Gaussians and meshes. |
Could explore more advanced mesh deformation techniques beyond simple vertex translation.
Further investigate compression methods for the deformation field to improve storage efficiency. |
freeview video, primitive-based rendering, novel view synthesis, 3d gaussian splatting, hybrid representation |
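The blending step described in the GauMesh entry above (alpha-blending RGB and opacity over merged, depth-ordered fragments from the mesh and Gaussian rasterizers) can be sketched for a single pixel as follows. This is a simplified illustration under stated assumptions: each rasterizer is assumed to yield per-pixel fragments of the form `(depth, rgb, alpha)`, and standard front-to-back compositing is used. The function name `composite_fragments` and the fragment layout are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Tuple

# Hypothetical fragment layout: (depth, (r, g, b), alpha)
Fragment = Tuple[float, Tuple[float, float, float], float]


def composite_fragments(
    mesh_frags: Iterable[Fragment],
    gaussian_frags: Iterable[Fragment],
) -> Tuple[Tuple[float, float, float], float]:
    """Merge two fragment lists, sort near-to-far, and alpha-blend them.

    Returns the blended RGB color and the accumulated opacity of the pixel.
    """
    # Merge both z-buffers and re-order by depth (smallest depth is nearest).
    merged: List[Fragment] = sorted(
        list(mesh_frags) + list(gaussian_frags), key=lambda frag: frag[0]
    )
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for _depth, rgb, alpha in merged:
        weight = transmittance * alpha
        color = [c + weight * channel for c, channel in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return (color[0], color[1], color[2]), 1.0 - transmittance


if __name__ == "__main__":
    # An opaque red mesh surface behind a half-transparent blue Gaussian.
    mesh = [(2.0, (1.0, 0.0, 0.0), 1.0)]
    gaussians = [(1.0, (0.0, 0.0, 1.0), 0.5)]
    print(composite_fragments(mesh, gaussians))
```

Sorting the merged list is what lets a semi-transparent Gaussian in front of a mesh surface attenuate it correctly, instead of compositing each primitive type in a separate pass.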
2403.11451
Report |
CasSR: Activating Image Power for Real-World Image Super-Resolution |
Haolan Chen, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Wei Hu |
The objective of image super-resolution is to generate clean and
high-resolution images from degraded versions. Recent advancements in diffusion
modeling have led to the emergence of various image super-resolution techniques
that leverage pretrained text-to-image (T2I) models. Nevertheless, due to the
prevalent severe degradation in low-resolution images and the inherent
characteristics of diffusion models, achieving high-fidelity image restoration
remains challenging. Existing methods often exhibit issues including semantic
loss, artifacts, and the introduction of spurious content not present in the
original image. To tackle this challenge, we propose Cascaded diffusion for
Super-Resolution, CasSR, a novel method designed to produce highly detailed
and realistic images. In particular, we develop a cascaded controllable
diffusion model that aims to optimize the extraction of information from
low-resolution images. This model generates a preliminary reference image to
facilitate initial information extraction and degradation mitigation.
Furthermore, we propose a multi-attention mechanism to enhance the T2I model's
capability in maximizing the restoration of the original image content. Through
a comprehensive blend of qualitative and quantitative analyses, we substantiate
the efficacy and superiority of our approach. |
This paper introduces CasSR, a novel cascaded diffusion model designed for real-world image super-resolution, emphasizing image guidance over semantic information for enhanced fidelity and detail. |
Existing diffusion-based super-resolution methods often struggle with semantic loss, artifacts, and spurious content, particularly when handling severely degraded images. CasSR addresses these limitations by maximizing the extraction and utilization of information from the low-resolution input itself. |
CasSR employs a two-stage approach: (1) an image activation module (e.g., SCEdit) enhances the input image, generating a reference image with reduced degradation. (2) a multiple attention module integrates information from both the original and enhanced images, guiding a pre-trained Stable Diffusion model for high-fidelity restoration. |
CasSR consistently outperforms or achieves competitive results against state-of-the-art methods on both real-world and synthetic benchmarks.
The method excels in perceptual metrics (MUSIQ, MANIQA), indicating superior image quality and detail restoration.
Ablation studies highlight the effectiveness of the image activation and multiple attention modules, demonstrating the importance of image guidance over relying solely on semantic information (text prompts). |
The performance of CasSR may be slightly impacted when input images are cropped, resulting in information loss.
Future work could explore alternative image activation techniques for even richer reference image generation. |
image super-resolution, diffusion models, image restoration, text-to-image models, image activation |
2403.11447
Report |
Motion-aware 3D Gaussian Splatting for Efficient Dynamic Scene Reconstruction |
Zhiyang Guo, Wengang Zhou, Li Li, Min Wang, Houqiang Li |
3D Gaussian Splatting (3DGS) has become an emerging tool for dynamic scene
reconstruction. However, existing methods focus mainly on extending static 3DGS
into a time-variant representation, while overlooking the rich motion
information carried by 2D observations, thus suffering from performance
degradation and model redundancy. To address the above problem, we propose a
novel motion-aware enhancement framework for dynamic scene reconstruction,
which mines useful motion cues from optical flow to improve different paradigms
of dynamic 3DGS. Specifically, we first establish a correspondence between 3D
Gaussian movements and pixel-level flow. Then a novel flow augmentation method
is introduced with additional insights into uncertainty and loss collaboration.
Moreover, for the prevalent deformation-based paradigm that presents a harder
optimization problem, a transient-aware deformation auxiliary module is
proposed. We conduct extensive experiments on both multi-view and monocular
scenes to verify the merits of our work. Compared with the baselines, our
method shows significant superiority in both rendering quality and efficiency. |
This paper introduces a motion-aware enhancement framework for dynamic 3D Gaussian Splatting, improving reconstruction quality and efficiency by leveraging optical flow priors. |
Existing dynamic 3DGS methods often overlook rich motion information in 2D sequences, leading to performance degradation and model redundancy. |
The framework establishes a cross-dimensional correspondence between 3D Gaussian movements and pixel-level optical flow. It features uncertainty-aware flow augmentation and a transient-aware deformation auxiliary module for enhanced optimization. |
The method outperforms baselines in multi-view and monocular dynamic scene benchmarks, achieving higher PSNR, SSIM, and lower LPIPS.
Motion-aware regularization reduces Gaussian and motion redundancy, enabling more efficient dynamic modeling, especially in monocular settings.
The framework exhibits robustness under sparser viewpoints for multi-view scenarios, demonstrating potential for wider application. |
Motion blur remains a challenge as the model might overfit blurred regions, impacting temporal consistency.
Exploring additional priors beyond optical flow could further mitigate motion uncertainty, particularly in monocular scenes. |
3d gaussian splatting, dynamic scene reconstruction, optical flow, motion awareness, neural rendering |
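The correspondence between 3D Gaussian movements and pixel-level flow mentioned in the entry above can be illustrated with a deliberately simplified sketch: project a Gaussian center at two time steps through a pinhole camera and take the difference of the image positions. It assumes a static camera, points already expressed in camera coordinates, and only the Gaussian center (no footprint or opacity weighting). The paper's actual formulation is not given in the entry, so `project`, `gaussian_flow`, and their parameters are hypothetical names.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point: Point3D, focal: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection of a camera-space point onto the image plane."""
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


def gaussian_flow(
    center_t: Point3D,
    center_t1: Point3D,
    focal: float,
    cx: float,
    cy: float,
) -> Pixel:
    """2D displacement induced by moving a Gaussian center between two frames."""
    u0, v0 = project(center_t, focal, cx, cy)
    u1, v1 = project(center_t1, focal, cx, cy)
    return (u1 - u0, v1 - v0)


if __name__ == "__main__":
    # A center at depth 2 moving 0.5 units along x, seen with focal length 100.
    print(gaussian_flow((0.0, 0.0, 2.0), (0.5, 0.0, 2.0), 100.0, 0.0, 0.0))
```

A flow vector computed this way could be compared against an optical-flow estimate at the same pixel to form a supervision signal; how the comparison is weighted by uncertainty is left out of this sketch.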
2403.11423
Report |
VmambaIR: Visual State Space Model for Image Restoration |
Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, Wenming Yang |
Image restoration is a critical task in low-level computer vision, aiming to
restore high-quality images from degraded inputs. Various models, such as
convolutional neural networks (CNNs), generative adversarial networks (GANs),
transformers, and diffusion models (DMs), have been employed to address this
problem with significant impact. However, CNNs have limitations in capturing
long-range dependencies. DMs require large prior models and computationally
intensive denoising steps. Transformers have powerful modeling capabilities but
face challenges due to quadratic complexity with input image size. To address
these challenges, we propose VmambaIR, which introduces State Space Models
(SSMs) with linear complexity into comprehensive image restoration tasks. We
utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS)
blocks, consisting of an OSS module and an Efficient Feed-Forward Network
(EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional
modeling limitation of SSMs by efficiently modeling image information flows in
all six directions. Furthermore, we conducted a comprehensive evaluation of our
VmambaIR across multiple image restoration tasks, including image deraining,
single image super-resolution, and real-world image super-resolution. Extensive
experimental results demonstrate that our proposed VmambaIR achieves
state-of-the-art (SOTA) performance with far fewer computational resources and
parameters. Our research highlights the potential of state space models as
promising alternatives to the transformer and CNN architectures in serving as
foundational frameworks for next-generation low-level visual tasks. |
This paper introduces VmambaIR, a novel image restoration network leveraging state space models (SSMs) with linear complexity for tasks like image deraining and super-resolution. |
Existing methods like CNNs, GANs, and Transformers face limitations in capturing long-range dependencies, high computational costs, or quadratic complexity. SSMs offer a promising alternative with linear complexity and efficient high-frequency modeling capabilities. |
VmambaIR incorporates a Unet architecture with Omni Selective Scan (OSS) blocks. The OSS block consists of an OSS module for comprehensive information flow modeling from six directions and an Efficient Feed-Forward Network (EFFN) for information flow regulation across hierarchical levels. |
VmambaIR achieves state-of-the-art performance on single image super-resolution, outperforming existing methods in both PSNR and LPIPS metrics.
In real-world image super-resolution, VmambaIR achieves superior results with only 26% of the computational cost compared to previous SOTA methods.
VmambaIR demonstrates superior performance in image deraining, exceeding previous methods in PSNR and SSIM while maintaining lower complexity. |
The current design of selective scan operations in OSS involves significant data type and dimension conversions, leading to slower speeds compared to vanilla convolution despite similar computational complexity.
Future work includes exploring the application of VmambaIR to video processing and other low-level vision tasks. |
state space models, image restoration, super-resolution, image deraining, omni selective scan |
2403.11415
Report |
DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation |
Jeongsol Kim, Geon Yeong Park, Jong Chul Ye |
Reverse sampling and score-distillation have emerged as main workhorses in
recent years for image manipulation using latent diffusion models (LDMs). While
reverse diffusion sampling often requires adjustments of LDM architecture or
feature engineering, score distillation offers a simple yet powerful
model-agnostic approach, but it is often prone to mode-collapsing. To address
these limitations and leverage the strengths of both approaches, here we
introduce a novel framework called DreamSampler, which seamlessly
integrates these two distinct approaches through the lens of regularized latent
optimization. Similar to score-distillation, DreamSampler is a model-agnostic
approach applicable to any LDM architecture, but it allows both distillation
and reverse sampling with additional guidance for image editing and
reconstruction. Through experiments involving image editing, SVG reconstruction,
and more, we demonstrate the competitive performance of DreamSampler compared to
existing approaches, while providing new applications. |
DreamSampler is a novel framework for image manipulation that unifies diffusion sampling and score distillation via regularized latent optimization. |
Reverse diffusion sampling often requires architectural adjustments or feature engineering. Score distillation, while model-agnostic, is prone to mode collapse. DreamSampler addresses these limitations, leveraging the strengths of both approaches. |
DreamSampler interprets latent optimization during reverse diffusion as a proximal update, allowing integration of regularization terms. It shows that the proximal update loss can be conceptualized as the score distillation loss, enabling their unification. |
DreamSampler enables novel applications like image vectorization from blurry input with semantic text guidance, outperforming multi-stage baseline approaches.
For real image editing, DreamSampler effectively modifies images according to text prompts while preserving image fidelity and outperforming or being on par with existing methods.
In text-guided image inpainting, DreamSampler generates semantically consistent content within masked regions while maintaining high fidelity to the original image, surpassing baseline methods in reconstruction quality. |
DreamSampler's performance is reliant on the quality of the pre-trained diffusion model.
Further exploration of time scheduling and regularization functions could improve DreamSampler's efficacy. |
latent diffusion model, image manipulation, score distillation, reverse diffusion sampling, image generation |
2403.11401
Report |
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning |
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong |
This paper introduces Scene-LLM, a 3D-visual-language model that enhances
embodied agents' abilities in interactive 3D indoor environments by integrating
the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a
hybrid 3D visual feature representation that incorporates dense spatial
information and supports scene state updates. The model employs a projection
layer to efficiently project these features into the pre-trained textual
embedding space, enabling effective interpretation of 3D visual information.
Unique to our approach is the integration of both scene-level and ego-centric
3D information. This combination is pivotal for interactive planning, where
scene-level data supports global planning and ego-centric data is important for
localization. Notably, we use ego-centric 3D frame features for feature
alignment, an efficient technique that enhances the model's ability to align
features of small objects within the scene. Our experiments with Scene-LLM
demonstrate its strong capabilities in dense captioning, question answering,
and interactive planning. We believe Scene-LLM advances the field of 3D visual
understanding and reasoning, offering new possibilities for sophisticated agent
interactions in indoor settings. |
Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs) with both egocentric and scene-level 3D information. |
Existing visual-language models often struggle to handle persistent 3D spatial information and scene changes in interactive environments, limiting their effectiveness in tasks like indoor planning. |
The model employs a hybrid 3D visual feature representation, integrating both egocentric and scene-level information. It uses a projection layer to align these features with pre-trained textual embeddings. A two-stage training strategy first aligns conceptual features and then fine-tunes with instruction-following annotations. |
Scene-LLM achieves state-of-the-art results on ScanQA and SQA3D benchmarks for 3D visual question answering, demonstrating strong 3D scene understanding and reasoning.
The model effectively handles scene changes and performs well on the Alfred benchmark for interactive planning, highlighting its ability in dynamic environments.
Ablation studies show the effectiveness of the hybrid representation, the importance of egocentric and scene-level updates, and the benefit of using frame data for concept alignment. |
Current limitations include a dependence on the maximum token length of the LLM, posing challenges for processing high-resolution 3D scenes.
The model currently lacks an explicit state detection mechanism for complex dynamic scenes, potentially hindering performance in such environments. |
3d visual language model, interactive planning, egocentric and scene-level understanding, hybrid 3d feature representation, large language models |
2403.11324
Report |
GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering |
Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, Federico Tombari |
During the Gaussian Splatting optimization process, the scene's geometry can
gradually deteriorate if its structure is not deliberately preserved,
especially in non-textured regions such as walls, ceilings, and furniture
surfaces. This degradation significantly affects the rendering quality of novel
views that deviate significantly from the viewpoints in the training data. To
mitigate this issue, we propose a novel approach called GeoGaussian. Based on
the smoothly connected areas observed from point clouds, this method introduces
a novel pipeline to initialize thin Gaussians aligned with the surfaces, where
the characteristic can be transferred to new generations through a carefully
designed densification strategy. Finally, the pipeline ensures that the scene's
geometry and texture are maintained through constrained optimization processes
with explicit geometry constraints. Benefiting from the proposed architecture,
the generative ability of 3D Gaussians is enhanced, especially in structured
regions. Our proposed pipeline achieves state-of-the-art performance in novel
view synthesis and geometric reconstruction, as evaluated qualitatively and
quantitatively on public datasets. |
GeoGaussian is a novel geometry-aware Gaussian Splatting method that enhances 3D scene representation and novel view synthesis, especially in low-textured regions. |
Gaussian Splatting methods often prioritize image clarity over geometric fidelity, leading to degradation in rendering performance for novel views, particularly in non-textured areas. |
The method leverages thin ellipsoid Gaussian parameterization initialized based on surface normals, employs a constrained densification strategy to ensure new Gaussians align with smooth surfaces, and introduces a geometrically consistent loss function during optimization. |
GeoGaussian achieves state-of-the-art performance in novel view synthesis, outperforming 3DGS and LightGS on Replica and TUM RGB-D datasets, especially in sparse view scenarios.
The method demonstrates improved geometry accuracy compared to 3DGS, as evidenced by better alignment of point clouds with ground truth mesh models.
GeoGaussian shows faster convergence and enhanced robustness during training due to accurate initialization and constrained densification strategies. |
Reliance on accurate point cloud normals for initialization.
Limited performance in non-structured scenes where accurate normal estimation is challenging. |
gaussian splatting, novel view synthesis, 3d reconstruction, geometry-aware densification, thin ellipsoid gaussian |
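The thin-ellipsoid initialization described in the GeoGaussian method summary can be illustrated with a minimal sketch. It assumes a Gaussian is built from a surface normal by aligning its shortest axis with that normal; the function names and the two scale values are hypothetical choices for illustration, not taken from the paper or its code.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def thin_gaussian_covariance(normal, tangent_scale=0.05, normal_scale=0.001):
    """Covariance R diag(s^2) R^T of a flat Gaussian lying on a surface.

    The scales are illustrative assumptions: wide in the two tangent
    directions, very thin along the surface normal.
    """
    n = normalize(normal)
    # Pick a helper axis that is not parallel to the normal.
    helper = [1.0, 0.0, 0.0] if abs(n[0]) < 0.9 else [0.0, 1.0, 0.0]
    t1 = normalize(cross(n, helper))   # first tangent direction
    t2 = cross(n, t1)                  # second tangent direction (unit length)
    axes = [t1, t2, n]
    scales = [tangent_scale, tangent_scale, normal_scale]
    return [[sum(scales[k] ** 2 * axes[k][i] * axes[k][j] for k in range(3))
             for j in range(3)]
            for i in range(3)]

cov = thin_gaussian_covariance([0.0, 0.0, 2.0])
```

For a normal along the z-axis, the variance along z is the small `normal_scale ** 2`, while the x and y variances equal `tangent_scale ** 2`, which is what makes the Gaussian hug the surface.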
2403.11262
Report |
Understanding Diffusion Models by Feynman's Path Integral |
Yuji Hirono, Akinori Tanaka, Kenji Fukushima |
Score-based diffusion models have proven effective in image generation and
have gained widespread usage; however, the underlying factors contributing to
the performance disparity between stochastic and deterministic (i.e., the
probability flow ODEs) sampling schemes remain unclear. We introduce a novel
formulation of diffusion models using Feynman's path integral, which is a
formulation originally developed for quantum physics. We find that this formulation
provides comprehensive descriptions of score-based generative models, and
demonstrate the derivation of backward stochastic differential equations and
loss functions. The formulation accommodates an interpolating parameter
connecting stochastic and deterministic sampling schemes, and we identify this
parameter as a counterpart of Planck's constant in quantum physics. This
analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a
well-established technique in quantum physics, for evaluating the negative
log-likelihood to assess the performance disparity between stochastic and
deterministic sampling schemes. |
This paper presents a novel formulation of diffusion models using Feynman's path integral, a framework originating from quantum physics. The formulation offers a unified perspective on various aspects of score-based generative models and provides a new method for scrutinizing the role of noise in the sampling process. |
This formulation is important because it allows for a deeper understanding of diffusion models by connecting them to well-established techniques in quantum physics. It also provides a way to calculate the negative log-likelihood for stochastic sampling processes, which was previously elusive. |
The authors apply path integral techniques to derive the time-reversed stochastic differential equations and loss functions for diffusion models. They introduce an interpolating parameter linking stochastic and deterministic sampling schemes and use the Wentzel–Kramers–Brillouin (WKB) expansion to evaluate the negative log-likelihood for stochastic processes. |
The path integral formulation provides an alternative derivation of the time-reversed SDE.
The interpolating parameter plays an analogous role to Planck's constant in quantum physics, and the limit of zero noise corresponds to the classical limit.
The WKB expansion enables a perturbative evaluation of the negative log-likelihood, quantifying the impact of noise on the sampling process. |
The current experiments do not include actual image data due to limitations in evaluating NLLs for high-dimensional data.
The numerical error in the computed NLLs might be underestimated. |
diffusion models, path integral, wkb expansion, negative log-likelihood, stochastic sampling |
2403.11247
Report |
Compact 3D Gaussian Splatting For Dense Visual SLAM |
Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Danwei Wang, Weidong Chen |
Recent work has shown that 3D Gaussian-based SLAM enables high-quality
reconstruction, accurate pose estimation, and real-time rendering of scenes.
However, these approaches are built on a tremendous number of redundant 3D
Gaussian ellipsoids, leading to high memory and storage costs, and slow
training speed. To address this limitation, we propose a compact 3D Gaussian
Splatting SLAM system that reduces the number and the parameter size of
Gaussian ellipsoids. A sliding window-based masking strategy is first proposed
to reduce the redundant ellipsoids. Then we observe that the covariance matrices
(geometry) of most 3D Gaussian ellipsoids are extremely similar, which
motivates a novel geometry codebook to compress 3D Gaussian geometric
attributes, i.e., the parameters. Robust and accurate pose estimation is
achieved by a global bundle adjustment method with reprojection loss. Extensive
experiments demonstrate that our method achieves faster training and rendering
speed while maintaining the state-of-the-art (SOTA) quality of the scene
representation. |
This paper introduces a novel 3D Gaussian Splatting-based SLAM system that compresses scene representation to enhance speed, storage efficiency, and rendering while maintaining high-quality reconstruction. |
Existing 3D Gaussian-based SLAM methods, while offering high-quality reconstruction, suffer from high memory and storage costs and slow training speeds due to a large number of redundant 3D Gaussian ellipsoids. |
The proposed system employs a three-pronged approach: 1) a sliding window-based online masking method to remove redundant 3D Gaussian ellipsoids, 2) a codebook-based method to compress the geometric attributes of the remaining ellipsoids, and 3) a global bundle adjustment method with reprojection loss for accurate and robust camera pose estimation. |
The system achieves a nearly 176% increase in rendering speed compared to existing GS-based SLAM methods.
It achieves over 1.97x compression in memory usage compared to existing GS-based SLAM methods.
The system maintains state-of-the-art quality of scene representation despite the significant reduction in the number of Gaussian ellipsoids. |
The system's performance relies heavily on the quality of depth information, which might be limited in real-world scenarios with noisy or incomplete depth data.
Future work could explore incorporating semantic information into the scene representation to further enhance the system's capabilities and performance in complex environments. |
slam, 3d gaussian splatting, scene representation, compression, real-time rendering |
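The codebook idea in the compact SLAM summary can be sketched as plain vector quantization: each Gaussian stores an index into a small shared table instead of its own geometric parameters. The sketch below assumes a fixed, hand-written codebook and nearest-neighbour lookup; the paper describes a learned geometry codebook, so this is only a simplified stand-in, and all names and values are hypothetical.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def compress(vectors, codebook):
    """Replace each geometry vector by the index of its nearest code."""
    return [nearest_code(v, codebook) for v in vectors]

def decompress(indices, codebook):
    """Recover approximate geometry vectors from stored indices."""
    return [codebook[i] for i in indices]

# Hypothetical per-Gaussian scale vectors and a two-entry codebook.
codebook = [[0.1, 0.1, 0.01], [0.5, 0.5, 0.05]]
scales = [[0.11, 0.09, 0.01], [0.48, 0.52, 0.05], [0.10, 0.12, 0.02]]
indices = compress(scales, codebook)
approx = decompress(indices, codebook)
```

Because many ellipsoids share nearly identical geometry, storing one small integer per Gaussian plus the shared table can take far less space than storing every parameter vector, at the cost of a small approximation error.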
2403.11207
Report |
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data |
Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, Tanishq Mathew Abraham |
Reconstructions of visual perception from brain activity have improved
tremendously, but the practical utility of such methods has been limited. This
is because such models are trained independently per subject where each subject
requires dozens of hours of expensive fMRI training data to attain high-quality
results. The present work showcases high-quality reconstructions using only 1
hour of fMRI training data. We pretrain our model across 7 subjects and then
fine-tune on minimal data from a new subject. Our novel functional alignment
procedure linearly maps all brain data to a shared-subject latent space,
followed by a shared non-linear mapping to CLIP image space. We then map from
CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP
latents as inputs instead of text. This approach improves out-of-subject
generalization with limited training data and also attains state-of-the-art
image retrieval and reconstruction metrics compared to single-subject
approaches. MindEye2 demonstrates how accurate reconstructions of perception
are possible from a single visit to the MRI facility. All code is available on
GitHub. |
MindEye2 reconstructs visual perception from fMRI data using only one hour of training data per subject, achieving comparable quality to previous approaches that require dozens of hours. |
This advancement holds the potential to revolutionize clinical assessment and brain-computer interfaces by enabling practical reconstruction of perception from minimal fMRI data. |
The approach pretrains a shared-subject model on data from multiple subjects, then fine-tunes it on limited data from a new subject. It maps fMRI activity to a shared latent space using ridge regression, then to CLIP image space using an MLP backbone and diffusion prior. Finally, a fine-tuned Stable Diffusion XL model generates images from the predicted CLIP embeddings. |
Achieves state-of-the-art performance on image retrieval and reconstruction metrics when trained on the full Natural Scenes Dataset.
Maintains competitive decoding performance with only 2.5% of a subject's full dataset (one hour of scanning data).
Outperforms previous methods in subjective human evaluations of reconstruction quality, even with limited training data. |
fMRI's sensitivity to movement and task compliance can affect decoding accuracy.
The model's current focus on natural scenes may require additional data or specialized models for other image distributions. Future work could explore expanding to other image types or real-time applications. |
neuroai, fmri, computational neuroscience, visual perception, deep learning |
2403.11197
Report |
TAG: Guidance-free Open-Vocabulary Semantic Segmentation |
Yasufumi Kawano, Yoshimitsu Aoki |
Semantic segmentation is a crucial task in computer vision, where each pixel
in an image is classified into a category. However, traditional methods face
significant challenges, including the need for pixel-level annotations and
extensive training. Furthermore, because supervised learning uses a limited set
of predefined categories, models typically struggle with rare classes and
cannot recognize new ones. Unsupervised and open-vocabulary segmentation,
proposed to tackle these issues, faces challenges, including the inability to
assign specific class labels to clusters and the necessity of user-provided
text queries for guidance. In this context, we propose a novel approach, TAG,
which achieves Training, Annotation, and Guidance-free open-vocabulary semantic
segmentation. TAG utilizes pre-trained models such as CLIP and DINO to segment
images into meaningful categories without additional training or dense
annotations. It retrieves class labels from an external database, providing
flexibility to adapt to new scenarios. Our TAG achieves state-of-the-art
results on PascalVOC, PascalContext and ADE20K for open-vocabulary segmentation
without given class names, e.g., an improvement of +15.3 mIoU on PascalVOC. All
code and data will be released at https://github.com/Valkyrja3607/TAG. |
TAG, a novel Training, Annotation, and Guidance-free method for open-vocabulary semantic segmentation, retrieves segment categories from an external database using CLIP and DINOv2. |
TAG addresses limitations of traditional semantic segmentation methods: reliance on costly pixel-level annotations, predefined categories, and the need for user-provided text queries in open-vocabulary settings. |
1. Identifies segment candidates using per-pixel features from DINOv2. 2. Obtains representative segment embeddings using CLIP's per-pixel features. 3. Assigns categories by retrieving closest matching sentences from an external database. |
Achieves state-of-the-art results on PascalVOC, PascalContext, and ADE20K for open-vocabulary segmentation without given class names.
Shows significant improvement (+15.3 mIoU) over previous zero-guidance segmentation methods on PascalVOC.
Successfully segments and labels images containing general objects, specific categories like 'joker,' and proper nouns. |
Performance depends on the choice of database, posing challenges for unknown domains.
It does not differentiate between levels of class granularity, so it may predict a broader category than desired. |
semantic segmentation, open-vocabulary segmentation, zero-guidance segmentation, clip, dinov2 |
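The third step of the TAG method summary, assigning a category by retrieving the closest sentence from an external database, amounts to a nearest-neighbour search in embedding space. The sketch below assumes cosine similarity as the matching score and uses tiny hand-made vectors in place of real CLIP embeddings; the function names and database contents are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_label(segment_embedding, database):
    """Return the database sentence whose embedding best matches the segment.

    database is a list of (sentence, embedding) pairs; in practice the
    embeddings would come from a text encoder, which is not modelled here.
    """
    best_sentence, _ = max(database,
                           key=lambda item: cosine(segment_embedding, item[1]))
    return best_sentence

# Toy 3-D embeddings standing in for real text and segment features.
database = [
    ("a photo of a dog", [0.9, 0.1, 0.0]),
    ("a photo of a car", [0.0, 0.2, 0.9]),
]
label = retrieve_label([0.8, 0.2, 0.1], database)
```

Swapping the database changes the label vocabulary without retraining anything, which is also why the summary lists database choice as a limitation.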
2403.11194
Report |
MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation |
Yasufumi Kawano, Yoshimitsu Aoki |
Semantic segmentation is essential in computer vision for various
applications, yet traditional approaches face significant challenges, including
the high cost of annotation and extensive training for supervised learning.
Additionally, due to the limited predefined categories in supervised learning,
models typically struggle with infrequent classes and are unable to predict
novel classes. To address these limitations, we propose MaskDiffusion, an
innovative approach that leverages pretrained frozen Stable Diffusion to
achieve open-vocabulary semantic segmentation without the need for additional
training or annotation, leading to improved performance compared to similar
methods. We also demonstrate the superior performance of MaskDiffusion in
handling open vocabularies, including fine-grained and proper noun-based
categories, thus expanding the scope of segmentation applications. Overall, our
MaskDiffusion shows significant qualitative and quantitative improvements in
contrast to other comparable unsupervised segmentation methods, e.g., on the
Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU
compared to DiffSeg). All code and data will be released at
https://github.com/Valkyrja3607/MaskDiffusion. |
This paper introduces MaskDiffusion, a novel method leveraging pre-trained Stable Diffusion models for open-vocabulary semantic segmentation without additional training or annotation. |
Semantic segmentation faces challenges such as annotation costs and limitations in predicting novel classes. MaskDiffusion addresses these issues by exploiting the rich semantic information embedded in diffusion models pre-trained on massive image-text datasets. |
MaskDiffusion extracts internal features and cross-attention maps from a frozen Stable Diffusion model. It then calculates representative internal features for each category using a weighted average based on cross-attention map values. Finally, it assigns classes to pixels by measuring the cosine similarity between pixel-wise internal features and representative features. |
MaskDiffusion outperforms previous state-of-the-art methods like MaskCLIP and GEM on datasets like Potsdam, Cityscapes, PascalVOC, and COCO-Stuff.
The method demonstrates robust open-vocabulary segmentation capabilities, successfully segmenting challenging concepts, rare classes, and proper nouns.
An unsupervised version, Unsupervised MaskDiffusion, utilizing spectral clustering on internal features, outperforms other unsupervised methods, including DiffSeg, on Cityscapes and COCO-Stuff datasets. |
The cross-attention map in MaskDiffusion shows limitations in accurately assigning internal features to classes.
The current method assumes prior knowledge of potential classes in the image. |
semantic segmentation, open-vocabulary segmentation, diffusion models, stable diffusion, unsupervised learning |
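The two steps in the MaskDiffusion method summary, an attention-weighted average per category followed by cosine-similarity assignment, can be sketched in a few lines. The features and attention values below are toy numbers rather than Stable Diffusion outputs, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def representative_feature(features, attention):
    """Average of per-pixel features, weighted by one class's attention map."""
    total = sum(attention)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(attention, features)) / total
            for d in range(dim)]

def assign_classes(features, attention_maps):
    """Label each pixel with the class whose representative feature is closest."""
    reps = {name: representative_feature(features, weights)
            for name, weights in attention_maps.items()}
    return [max(reps, key=lambda name: cosine(f, reps[name])) for f in features]

# Four pixels with 2-D features and two candidate classes (toy values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
attention_maps = {
    "sky": [0.9, 0.8, 0.1, 0.2],
    "road": [0.1, 0.2, 0.9, 0.8],
}
labels = assign_classes(features, attention_maps)
```

The attention map only needs to be roughly right: it selects which pixels dominate the average, and the final labels then come from feature similarity rather than from the attention values themselves.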
2403.11176
Report |
Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment |
Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini |
No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods
to measure image quality in alignment with human perception when a high-quality
reference image is unavailable. The reliance on annotated Mean Opinion Scores
(MOS) in the majority of state-of-the-art NR-IQA approaches limits their
scalability and broader applicability to real-world scenarios. To overcome this
limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based
self-supervised opinion-unaware method that does not require labeled MOS. In
particular, we introduce a quality-aware image-text alignment strategy to make
CLIP generate representations that correlate with the inherent quality of the
images. Starting from pristine images, we synthetically degrade them with
increasing levels of intensity. Then, we train CLIP to rank these degraded
images based on their similarity to quality-related antonym text prompts, while
guaranteeing consistent representations for images with comparable quality. Our
method achieves state-of-the-art performance on several datasets with authentic
distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms
supervised methods when their training dataset differs from the testing one,
thus proving to be more suitable for real-world scenarios. Furthermore, our
approach demonstrates greater robustness and improved explainability than
competing methods. The code and the model are publicly available at
https://github.com/miccunifi/QualiCLIP. |
This paper proposes QualiCLIP, a self-supervised and opinion-unaware No-Reference Image Quality Assessment (NR-IQA) method based on CLIP that does not require labeled Mean Opinion Scores (MOS). |
Existing NR-IQA methods are limited by their reliance on expensive and scale-limiting MOS labels, hindering their applicability to real-world scenarios. This paper addresses this challenge by leveraging the capabilities of CLIP. |
The method utilizes a quality-aware image-text alignment strategy. Pairs of pristine image crops are synthetically degraded with increasing levels of intensity. The CLIP image encoder is then fine-tuned to rank these degraded images based on their similarity to quality-related antonym text prompts, like 'Good photo' and 'Bad photo'. |
QualiCLIP achieves state-of-the-art performance on multiple IQA datasets with authentic distortions, outperforming existing opinion-unaware methods.
Despite not using MOS labels, QualiCLIP surpasses supervised methods in cross-dataset evaluations, demonstrating superior generalization ability for real-world applications.
QualiCLIP exhibits improved robustness compared to other methods, as shown by gMAD competition results, and showcases enhanced explainability through gradCAM visualization. |
The method relies on synthetic distortions during training, which may not fully represent the complexities of real-world image degradations.
Future work could explore the application of QualiCLIP’s quality-aware image representations to improve CLIP-based semantic tasks such as image retrieval. |
image quality assessment, clip, self-supervised learning, opinion-unaware, image-text alignment |
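The ranking idea in the QualiCLIP summary can be made concrete with a small sketch. It assumes a quality score computed as a softmax over an image's similarity to the 'Good photo' and 'Bad photo' prompts, and a margin ranking loss that asks less degraded images to score higher. Both choices are simplified stand-ins rather than the paper's exact training objective, and the function names, temperature and margin are hypothetical.

```python
import math

def quality_score(sim_good, sim_bad, temperature=0.1):
    """Softmax weight of the 'good' prompt given two image-text similarities."""
    return 1.0 / (1.0 + math.exp((sim_bad - sim_good) / temperature))

def ranking_loss(scores, margin=0.1):
    """Margin ranking loss over scores ordered from least to most degraded.

    Each score should exceed the next one by at least `margin`;
    violations are penalised linearly.
    """
    loss = 0.0
    for better, worse in zip(scores, scores[1:]):
        loss += max(0.0, margin - (better - worse))
    return loss

# Scores for one image at three increasing degradation levels (toy values).
well_ordered = ranking_loss([0.9, 0.6, 0.2])
mis_ordered = ranking_loss([0.2, 0.6])
```

A correctly ordered sequence incurs zero loss, while a sequence in which the more degraded image scores higher is penalised, so no human-labeled MOS is needed to produce a training signal.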
2403.11162
Report |
CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion |
Xiaoyu Wu, Yang Hua, Chumeng Liang, Jiaru Zhang, Hao Wang, Tao Song, Haibing Guan |
Diffusion Models (DMs) have evolved into advanced image generation tools,
especially for few-shot generation where a pretrained model is fine-tuned on a
small set of images to capture a specific style or object. Despite their
success, concerns exist about potential copyright violations stemming from the
use of unauthorized data in this process. In response, we present Contrasting
Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring
vivid visual representations for digital copyright authentication. Our approach
involves removing partial information of an image and recovering missing
details by exploiting conceptual differences between the pretrained and
fine-tuned models. We formulate the differences as KL divergence between latent
variables of the two models when given the same input image, which can be
maximized through Monte Carlo sampling and Projected Gradient Descent (PGD).
The similarity between original and recovered images serves as a strong
indicator of potential infringements. Extensive experiments on the WikiArt and
Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital
copyright authentication, surpassing alternative validation techniques. Code
implementation is available at https://github.com/Nicholas0228/Revelio. |
This paper presents CGI-DM, a novel method for digital copyright authentication in few-shot image generation using diffusion models (DMs). CGI-DM leverages the differences between pre-trained and fine-tuned models to recover missing image details and authenticate copyright. |
Few-shot image generation techniques, while powerful, raise concerns about copyright infringement. Existing methods struggle to provide robust visual evidence for legal action. This work addresses this by providing a robust and visual method for authenticating copyright in DM-generated images. |
CGI-DM removes partial information from an image and then leverages the conceptual differences between pre-trained and fine-tuned DMs to recover the missing details. It maximizes the KL divergence between the latent variable distributions of the two models through Monte Carlo sampling and Projected Gradient Descent (PGD). |
CGI-DM achieves high accuracy in distinguishing between images used for training and those not used, outperforming existing image generation and inpainting pipelines.
The method is robust across different DM architectures, training image numbers, and training steps.
CGI-DM remains effective even under various defense mechanisms, demonstrating its resilience against attempts to mask training data. |
The computational cost of CGI-DM increases with the number of Monte Carlo sampling steps.
Future work could explore combining CGI-DM with data watermarking techniques to create a more comprehensive copyright protection system. |
diffusion models, copyright authentication, few-shot image generation, gradient inversion, digital copyright |
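Projected Gradient Descent, which the CGI-DM summary names as the optimizer, can be sketched generically. The version below performs signed gradient ascent and projects back onto an L-infinity ball around the starting point; the norm, step size and toy objective are assumptions for illustration, and the gradient function stands in for the paper's KL-divergence objective, which is not modelled here.

```python
def pgd_maximize(x0, grad_fn, step=0.01, eps=0.1, iters=50):
    """Maximise an objective by signed gradient steps within an eps-ball of x0.

    grad_fn(x) must return the gradient of the objective at x.
    """
    x = list(x0)
    for _ in range(iters):
        grad = grad_fn(x)
        # Ascent step along the sign of the gradient.
        x = [xi + step * (1 if g > 0 else -1 if g < 0 else 0)
             for xi, g in zip(x, grad)]
        # Projection: clip every coordinate back into [x0 - eps, x0 + eps].
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x

# Toy objective -||x - target||^2 with gradient -2 (x - target).
target = [1.0, -1.0]

def toy_grad(x):
    return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]

result = pgd_maximize([0.0, 0.0], toy_grad)
```

In the toy run the iterate moves toward the target until it reaches the boundary of the allowed region and then stays there, which is the behaviour the projection step is meant to enforce.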
2403.11116
Report |
PhD: A Prompted Visual Hallucination Evaluation Dataset |
Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Xirong Li |
The rapid growth of Large Language Models (LLMs) has driven the development
of Large Vision-Language Models (LVLMs). The challenge of hallucination,
prevalent in LLMs, also emerges in LVLMs. However, most existing efforts mainly
focus on object hallucination in LVLMs, ignoring diverse types of LVLM
hallucinations. In this study, we delve into the Intrinsic Vision-Language
Hallucination (IVL-Hallu) issue, thoroughly analyzing different types of
IVL-Hallu on their causes and reflections. Specifically, we propose several
novel IVL-Hallu tasks and categorize them into four types: (a) object
hallucination, which arises from the misidentification of objects, (b)
attribute hallucination, which is caused by the misidentification of
attributes, (c) multi-modal conflicting hallucination, which derives from the
contradictions between textual and visual information, and (d)
counter-common-sense hallucination, which is due to the contradictions between
the LVLM knowledge and actual images. Based on these taxonomies, we propose a
more challenging benchmark named PhD to evaluate and explore IVL-Hallu. An
automated pipeline is proposed for generating different types of IVL-Hallu
data. Extensive experiments on five SOTA LVLMs reveal their inability to
effectively tackle our proposed IVL-Hallu tasks, with detailed analyses and
insights on the origins and possible solutions of these new challenging
IVL-Hallu tasks, facilitating future research on IVL-Hallu and LVLMs. The
benchmark can be accessed at https://github.com/jiazhen-code/IntrinsicHallu |
This paper introduces Intrinsic Vision-Language Hallucination (IVL-Hallu) and proposes a new benchmark called PhD to evaluate and analyze it in Large Vision-Language Models (LVLMs). |
Hallucination, a significant issue in LLMs, also affects LVLMs, and existing research primarily focuses on object hallucination. This work aims to comprehensively analyze diverse types of IVL-Hallu and their causes. |
The study categorizes IVL-Hallu into four types: object, attribute, multi-modal conflicting, and counter-common-sense hallucinations. It proposes PhD, a benchmark with over 53,000 questions across these categories, and an automated data generation pipeline. |
LVLMs struggle with identifying non-existent objects and mismatched attributes due to over-reliance on internal knowledge.
Absurd questions and misaligned text/image information expose the susceptibility of LVLMs to multi-modal conflicts, leading to hallucinations.
Counter-common-sense images reveal the fundamental challenge of LVLMs balancing internal knowledge with actual image content. |
The benchmark primarily focuses on intrinsic hallucinations, leaving extrinsic hallucinations for future exploration.
Addressing IVL-Hallu necessitates structural enhancements to LVLMs, balancing multi-modal inputs and internal knowledge with image content. |
large vision-language models, hallucination, benchmarking, multi-modal learning, vision and language |
2403.11111
Report |
3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models |
Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen |
In this work, we show that synthetic data created by generative models is
complementary to computer graphics (CG) rendered data for achieving remarkable
generalization performance on diverse real-world scenes for 3D human pose and
shape estimation (HPS). Specifically, we propose an effective approach based on
recent diffusion models, termed HumanWild, which can effortlessly generate
human images and corresponding 3D mesh annotations. We first collect a
large-scale human-centric dataset with comprehensive annotations, e.g., text
captions and surface normal images. Then, we train a customized ControlNet
model upon this dataset to generate diverse human images and initial
ground-truth labels. At the core of this step is that we can easily obtain
numerous surface normal images from a 3D human parametric model, e.g., SMPL-X,
by rendering the 3D mesh onto the image plane. As there exists inevitable noise
in the initial labels, we then apply an off-the-shelf foundation segmentation
model, i.e., SAM, to filter negative data samples. Our data generation pipeline
is flexible and customizable to facilitate different real-world tasks, e.g.,
ego-centric scenes and perspective-distortion scenes. The generated dataset
comprises 0.79M images with corresponding 3D annotations, covering versatile
viewpoints, scenes, and human identities. We train various HPS regressors on
top of the generated data and evaluate them on a wide range of benchmarks
(3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the
generated data. By exclusively employing generative models, we generate
large-scale in-the-wild human images and high-quality annotations, eliminating
the need for real-world data collection. |
This paper introduces HumanWild, an automatic and scalable pipeline for synthesizing realistic human images with 3D annotations using generative models, aiming to address the limitations of existing mocap and CG-based datasets in providing diverse and in-the-wild data for 3D human pose and shape estimation (HPS). |
Existing datasets for HPS, based on either indoor motion capture or computer graphics rendering, lack diversity in human identities and real-world scenes, hindering model generalization to in-the-wild scenarios. |
The pipeline leverages SMPL-X for human body parameterization, renders surface normal maps, uses ControlNet with tailored text prompts to generate images, and filters noisy labels using a pre-trained segmentation model (SAM). |
HumanWild effectively complements CG-rendered datasets, leading to improved performance on diverse HPS benchmarks.
The pipeline generates a large-scale dataset of 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities.
Analysis suggests that synthetic data generated by generative models, like HumanWild, is beneficial for HPS tasks due to its diversity and realism. |
Limitations in current diffusion models affect the accuracy of hand and facial annotations.
Future work involves exploring the pipeline's application to other 3D perception tasks, such as 3D animal pose estimation and human interaction reconstruction. |
synthetic data generation, 3d human pose and shape estimation, diffusion models, controllable image generation, computer vision |
2403.11105
Report |
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models |
Ruibin Li, Ruihuang Li, Song Guo, Lei Zhang |
Text-driven diffusion models have significantly advanced the image editing
performance by using text prompts as inputs. One crucial step in text-driven
image editing is to invert the original image into a latent noise code
conditioned on the source prompt. While previous methods have achieved
promising results by refactoring the image synthesizing process, the inverted
latent noise code is tightly coupled with the source prompt, limiting the image
editability by target text prompts. To address this issue, we propose a novel
method called Source Prompt Disentangled Inversion (SPDInv), which aims at
reducing the impact of source prompt, thereby enhancing the text-driven image
editing performance by employing diffusion models. To make the inverted noise
code as independent of the given source prompt as possible, we point out
that the iterative inversion process should satisfy a fixed-point constraint.
Consequently, we transform the inversion problem into a searching problem to
find the fixed-point solution, and utilize the pre-trained diffusion models to
facilitate the searching process. The experimental results show that our
proposed SPDInv method can effectively mitigate the conflicts between the
target editing prompt and the source prompt, leading to a significant decrease
in editing artifacts. In addition to text-driven image editing, with SPDInv we
can easily adapt customized image generation models to localized editing tasks
and produce promising performance. The source code is available at
https://github.com/leeruibin/SPDInv. |
This paper proposes SPDInv, a novel image inversion method for text-driven image editing that disentangles the inverted latent noise code from the source prompt, thereby enhancing editing performance by reducing artifacts and inconsistencies. |
Existing text-driven image editing methods rely on inversion techniques that tightly couple the inverted latent code with the source prompt, hindering editing flexibility and fidelity. |
SPDInv leverages the fixed-point constraint inherent in the DDIM sampling process. It reformulates the constraint as a loss function and utilizes pre-trained diffusion models to search for a fixed-point solution, minimizing the influence of the source prompt on the inverted noise. |
SPDInv effectively reduces the noise gap compared to DDIM inversion, indicating less entanglement with the source prompt.
Quantitative evaluations on PIE-Bench and TDE-Bench datasets demonstrate significant improvements in editing quality over state-of-the-art methods.
SPDInv successfully extends the capabilities of customized image generation methods, allowing for localized editing while preserving object identity and background consistency. |
SPDInv relies on existing editing engines like P2P, PNP, and MasaCtrl, inheriting their limitations in handling complex editing operations such as adding or dropping content.
While promising for various objects, SPDInv faces challenges in portrait editing, requiring further investigation. |
image editing, image inversion, diffusion models, text-driven editing, latent space manipulation |
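The SPDInv entry above describes recasting a fixed-point constraint as a loss and then searching for the fixed-point solution. A minimal sketch of that idea, under heavy simplification: the scalar map `f(z) = cos(z)` is a toy stand-in (an assumption) for the inversion step that SPDInv evaluates with a pre-trained diffusion model, and `fixed_point_search`, `lr`, and `iters` are hypothetical names.

```python
import math


def fixed_point_search(f, df, z0, lr=0.1, iters=200):
    """Search for z with f(z) = z by gradient descent on the residual loss (f(z) - z)^2."""
    z = z0
    for _ in range(iters):
        residual = f(z) - z
        # d/dz (f(z) - z)^2 = 2 * (f(z) - z) * (f'(z) - 1)
        grad = 2.0 * residual * (df(z) - 1.0)
        z -= lr * grad
    return z


# Toy map (assumption): f(z) = cos(z), whose fixed point is roughly 0.739085.
z_star = fixed_point_search(math.cos, lambda z: -math.sin(z), z0=0.0)
residual_at_solution = abs(math.cos(z_star) - z_star)
```

Driving the residual towards zero is what "satisfying the fixed-point constraint" means in this toy setting. In the real method the variable would be a latent noise code rather than a scalar.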
2403.11056
Report |
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration |
Zhihao Liang, Qi Zhang, Wenbo Hu, Ying Feng, Lei Zhu, Kui Jia |
3D Gaussian Splatting (3DGS) has recently gained popularity by combining
the advantages of both primitive-based and volumetric 3D representations,
resulting in improved quality and efficiency for 3D scene rendering. However,
3DGS is not alias-free, and its rendering at varying resolutions could produce
severe blurring or jaggies. This is because 3DGS treats each pixel as an
isolated, single point rather than as an area, causing insensitivity to changes
in the footprints of pixels. Consequently, this discrete sampling scheme
inevitably results in aliasing, owing to the restricted sampling bandwidth. In
this paper, we derive an analytical solution to address this issue. More
specifically, we use a conditioned logistic function as the analytic
approximation of the cumulative distribution function (CDF) in a
one-dimensional Gaussian signal and calculate the Gaussian integral by
subtracting the CDFs. We then introduce this approximation in the
two-dimensional pixel shading, and present Analytic-Splatting, which
analytically approximates the Gaussian integral within the 2D-pixel window area
to better capture the intensity response of each pixel. Moreover, we use the
approximated response of the pixel window integral area to participate in the
transmittance calculation of volume rendering, making Analytic-Splatting
sensitive to the changes in pixel footprint at different resolutions.
Experiments on various datasets validate that our approach has better
anti-aliasing capability that gives more details and better fidelity. |
This paper introduces Analytic-Splatting, a novel approach for anti-aliasing in 3D Gaussian Splatting (3DGS) using an analytical approximation of the Gaussian integral within the pixel window area. |
3DGS suffers from aliasing artifacts due to its discrete sampling scheme, especially when pixel footprints change drastically at different resolutions. This leads to blurry or jagged renderings. Analytic-Splatting aims to overcome these limitations by considering the entire pixel area for intensity response. |
The method utilizes a conditioned logistic function to approximate the cumulative distribution function (CDF) of a one-dimensional Gaussian signal. This approximation is then extended to two dimensions for pixel shading by diagonalizing the covariance matrix and rotating the integration domain to decouple correlations. |
Analytic-Splatting demonstrates superior anti-aliasing capabilities compared to 3DGS and other methods, producing renderings with better detail fidelity.
The proposed analytic approximation significantly reduces errors compared to discrete sampling and prefiltering techniques.
Experiments on multi-scale Blender Synthetic and Mip-NeRF 360 datasets validate the effectiveness of Analytic-Splatting in achieving state-of-the-art novel view synthesis results under multi-scale and super-resolution settings. |
The increased number of root and exponential operations in the shading module slightly reduces rendering speed compared to 3DGS and Mip-Splatting.
Future work could explore more efficient implementations and applications of the analytic approximation in other areas of neural rendering. |
3d gaussian splatting, anti-aliasing, view synthesis, cumulative distribution function (cdf), analytic approximation |
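The Analytic-Splatting entry above approximates the Gaussian CDF with a logistic-style function and obtains the integral over a pixel window by subtracting two CDF values. The sketch below illustrates this in one dimension and compares it with the exact result from `math.erf`. The coefficients 1.6 and 0.07 are an assumption (a commonly used polynomial-logistic approximation of the standard normal CDF) and are not stated in the text above. The function names and the normalized Gaussian are likewise illustrative choices.

```python
import math


def logistic_cdf(x):
    """Logistic-style approximation of the standard normal CDF (coefficients assumed)."""
    t = 1.6 * x + 0.07 * x ** 3
    if t < -700.0:
        # Avoid overflow in exp for very negative arguments; the CDF is ~0 there.
        return 0.0
    return 1.0 / (1.0 + math.exp(-t))


def exact_cdf(x):
    """Exact standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pixel_integral(u, mu, sigma, cdf):
    """Integral of a normalized 1D Gaussian N(mu, sigma^2) over [u - 0.5, u + 0.5]."""
    upper = (u + 0.5 - mu) / sigma
    lower = (u - 0.5 - mu) / sigma
    return cdf(upper) - cdf(lower)


approx = pixel_integral(u=3.0, mu=2.6, sigma=0.8, cdf=logistic_cdf)
exact = pixel_integral(u=3.0, mu=2.6, sigma=0.8, cdf=exact_cdf)
```

Because the window integral depends on the pixel's extent and not only on its center, the response changes when the pixel footprint changes, which is the property the entry highlights.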
2403.11053
Report |
OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization |
Ye Wang, Zili Yi, Rui Ma |
Personalized text-to-image (T2I) models not only produce lifelike and varied
visuals but also allow users to tailor the images to fit their personal taste.
These personalization techniques can grasp the essence of a concept through a
collection of images, or adjust a pre-trained text-to-image model with a
specific image input for subject-driven or attribute-aware guidance. Yet,
accurately capturing the distinct visual attributes of an individual image
poses a challenge for these methods. To address this issue, we introduce OSTAF,
a novel parameter-efficient one-shot fine-tuning method which only utilizes one
reference image for T2I personalization. A novel hypernetwork-powered
attribute-focused fine-tuning mechanism is employed to achieve the precise
learning of various attribute features (e.g., appearance, shape or drawing
style) from the reference image. Compared to existing image customization
methods, our method shows significant superiority in attribute identification
and application, and achieves a good balance between efficiency and
output quality. |
This paper introduces OSTAF, a one-shot fine-tuning method for attribute-focused text-to-image personalization using only one reference image. |
Current personalized T2I models struggle to accurately separate and replicate distinct visual attributes from a single image, limiting attribute-focused customization. |
The method analyzes how different parts of the diffusion U-net learn attributes and uses a lightweight hypernetwork to guide the fine-tuning of specific U-net components based on the desired attribute (appearance, shape, or style). |
OSTAF outperforms existing methods in quantitative metrics like CLIP-T, IoU, and Gram matrix distance, demonstrating superior attribute customization.
Qualitative results showcase OSTAF's ability to accurately identify and apply attributes across domains while maintaining text controllability.
User studies confirm that OSTAF generates customized images that better align with user preferences compared to other methods. |
While efficient in terms of data, fine-tuning time is comparable to other computationally intensive methods and could be improved.
The method is currently limited to image inputs and could be expanded to video for more dynamic attribute customization. |
text-to-image synthesis, image personalization, attribute customization, one-shot learning, hypernetworks |
2403.11027
Report |
Reward Guided Latent Consistency Distillation |
Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang |
Latent Consistency Distillation (LCD) has emerged as a promising paradigm for
efficient text-to-image synthesis. By distilling a latent consistency model
(LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates
the generation of high-fidelity images within merely 2 to 4 inference steps.
However, the LCM's efficient inference is obtained at the cost of the sample
quality. In this paper, we propose compensating the quality loss by aligning
LCM's output with human preference during training. Specifically, we introduce
Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM)
into the LCD process by augmenting the original LCD loss with the objective of
maximizing the reward associated with LCM's single-step generation. As
validated through human evaluation, when trained with the feedback of a good
RM, the 2-step generations from our RG-LCM are favored by humans over the
50-step DDIM samples from the teacher LDM, representing a 25 times inference
acceleration without quality loss.
As directly optimizing towards differentiable RMs can suffer from
over-optimization, we overcome this difficulty by proposing the use of a latent
proxy RM (LRM). This novel component serves as an intermediary, connecting our
LCM with the RM. Empirically, we demonstrate that incorporating the LRM into
our RG-LCD successfully avoids high-frequency noise in the generated images,
contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on
HPSv2's test set, surpassing those achieved by the baseline LCM. |
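This paper proposes Reward Guided Latent Consistency Distillation (RG-LCD), which integrates feedback from a reward model (RM) into LCD so that the distilled latent consistency model (LCM) is aligned with human preference. |
LCD enables high-fidelity text-to-image generation in only 2 to 4 inference steps, but this efficiency comes at the cost of sample quality. This work aims to compensate for that quality loss during training. |
RG-LCD augments the original LCD loss with the objective of maximizing the reward associated with the LCM's single-step generation. To avoid over-optimization against differentiable RMs, it introduces a latent proxy RM (LRM) that serves as an intermediary between the LCM and the RM. |
In human evaluation, 2-step generations from an RG-LCM trained with a good RM are favored over 50-step DDIM samples from the teacher LDM, a 25 times inference acceleration without quality loss.
Incorporating the LRM avoids high-frequency noise in the generated images.
RG-LCD with the LRM improves FID on MS-COCO and achieves a higher HPSv2.1 score on HPSv2's test set than the baseline LCM. |
Directly optimizing towards differentiable RMs can suffer from over-optimization, which motivates the LRM.
The reported human preference over the teacher LDM is obtained when training with the feedback of a good RM. |
latent consistency distillation, reward models, text-to-image synthesis, human preference alignment, diffusion models |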
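The RG-LCD abstract above describes augmenting the LCD loss with a reward-maximization term and reports a 25 times acceleration. One way to write this down is sketched here. The weight $\beta$, the single-step generation $\hat{x}_0$, and the text condition $c$ are notational assumptions introduced for illustration and are not taken from the text.

```latex
\mathcal{L}_{\text{RG-LCD}}
  = \mathcal{L}_{\text{LCD}}
  - \beta \, \mathbb{E}\big[ R(\hat{x}_0, c) \big],
\qquad
\text{speedup} = \frac{50 \text{ DDIM steps}}{2 \text{ RG-LCM steps}} = 25\times
```

Minimizing the second term is equivalent to maximizing the expected reward, and the speedup figure follows directly from the ratio of inference steps quoted in the abstract.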
2403.10983
Report |
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models |
Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, Wenhan Luo |
Personalization is an important topic in text-to-image generation, especially
the challenging multi-concept personalization. Current multi-concept methods
are struggling with identity preservation, occlusion, and the harmony between
foreground and background. In this work, we propose OMG, an occlusion-friendly
personalized generation framework designed to seamlessly integrate multiple
concepts within a single image. We propose a novel two-stage sampling solution.
The first stage takes charge of layout generation and visual comprehension
information collection for handling occlusions. The second one utilizes the
acquired visual comprehension information and the designed noise blending to
integrate multiple concepts while considering occlusions. We also observe that
the initiation denoising timestep for noise blending is the key to identity
preservation and layout. Moreover, our method can be combined with various
single-concept models, such as LoRA and InstantID without additional tuning.
In particular, LoRA models on civitai.com can be exploited directly. Extensive
experiments demonstrate that OMG exhibits superior performance in multi-concept
personalization. |
This paper proposes OMG, an occlusion-friendly personalized generation framework that integrates multiple concepts within a single image. |
Current multi-concept personalization methods struggle with identity preservation, occlusion, and the harmony between foreground and background. |
OMG uses two-stage sampling: the first stage generates the layout and collects visual comprehension information for handling occlusions, and the second stage uses that information together with noise blending to integrate multiple concepts. It can be combined with single-concept models such as LoRA and InstantID without additional tuning. |
Combining the method with ControlNet under various conditions (human pose, canny edge, depth maps) demonstrates its versatility.
The method effectively combines with different style LoRAs, showcasing its flexibility in style manipulation.
Layout preservation is crucial for maintaining image structure and quality during multi-concept customization. |
Generating high-quality small-face regions can be challenging due to information loss in VAE.
Computational intensity, particularly with noise fusion from multiple single-concept models, can lead to slower generation speed. |
image customization, multi-concept generation, layout preservation, identity preservation, controlnet |
2403.10953
Report |
Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription |
Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma |
Large image diffusion models have demonstrated zero-shot capability in novel
view synthesis (NVS). However, existing diffusion-based NVS methods struggle to
generate novel views that are accurately consistent with the corresponding
ground truth poses and appearances, even on the training set. This consequently
limits the performance of downstream tasks, such as image-to-multiview
generation and 3D reconstruction. We realize that such inconsistency is largely
due to the fact that it is difficult to enforce accurate pose and appearance
alignment directly in the diffusion training, as mostly done by existing
methods such as Zero123. To remedy this problem, we propose Ctrl123, a
closed-loop transcription-based NVS diffusion method that enforces alignment
between the generated view and ground truth in a pose-sensitive feature space.
Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks
of NVS and 3D reconstruction, achieving significant improvements in both
multiview-consistency and pose-consistency over existing methods. |
Introduces Ctrl123, a closed-loop transcription-based novel view synthesis diffusion model, to improve the pose and appearance consistency of generated views. |
Existing diffusion-based NVS methods struggle to generate views consistent with ground truth poses and appearances, limiting performance in tasks like 3D reconstruction. |
Extends open-loop NVS models to a closed-loop framework, measuring and minimizing the difference between generated and ground truth views in a pose-sensitive latent space using patch features. |
Significantly improves NVS pose and appearance consistency even with fewer training steps.
Achieves a 7-point increase in PSNR and substantial improvements in AA and IoU metrics compared to Zero123.
Demonstrates superior 3D reconstruction quality with smooth surfaces and detailed geometry. |
Exploring alternative latent space representations for enhanced consistency.
Investigating the generalization of the closed-loop framework for ensuring consistency in other attributes like object relations and shapes. |
novel view synthesis, diffusion models, closed-loop transcription, pose consistency, 3d reconstruction |
2403.10935
Report |
Understanding Robustness of Visual State Space Models for Image Classification |
Chengbin Du, Yanxi Li, Chang Xu |
Visual State Space Model (VMamba) has recently emerged as a promising
architecture, exhibiting remarkable performance in various computer vision
tasks. However, its robustness has not yet been thoroughly studied. In this
paper, we delve into the robustness of this architecture through comprehensive
investigations from multiple perspectives. Firstly, we investigate its
robustness to adversarial attacks, employing both whole-image and
patch-specific adversarial attacks. Results demonstrate superior adversarial
robustness compared to Transformer architectures while revealing scalability
weaknesses. Secondly, the general robustness of VMamba is assessed against
diverse scenarios, including natural adversarial examples, out-of-distribution
data, and common corruptions. VMamba exhibits exceptional generalizability with
out-of-distribution data but shows scalability weaknesses against natural
adversarial examples and common corruptions. Additionally, we explore VMamba's
gradients and back-propagation during white-box attacks, uncovering unique
vulnerabilities and defensive capabilities of its novel components. Lastly, the
sensitivity of VMamba to image structure variations is examined, highlighting
vulnerabilities associated with the distribution of disturbance areas and
spatial information, with increased susceptibility closer to the image center.
Through these comprehensive studies, we contribute to a deeper understanding of
VMamba's robustness, providing valuable insights for refining and advancing the
capabilities of deep neural networks in computer vision applications. |
This paper presents a comprehensive analysis of the robustness of the Visual State Space Model (VMamba), a promising architecture for visual representation learning. |
Despite its successes in various computer vision tasks, the robustness of VMamba, a novel architecture, has not been thoroughly studied. |
The authors investigate VMamba's robustness to adversarial attacks (both whole-image and patch-specific), its performance on various ImageNet datasets (A, R, and C), the behavior of its novel components (parameters A, B, C, and Δ) under white-box attacks, and its sensitivity to image structure variations. |
VMamba exhibits superior adversarial robustness compared to Transformer architectures but shows scalability weaknesses.
VMamba demonstrates exceptional generalizability with out-of-distribution data but shows scalability weaknesses against natural adversarial examples and common corruptions.
VMamba is highly sensitive to the spatial information and continuity of images, with increased susceptibility closer to the image center. |
The study primarily focuses on a limited set of VMamba and Transformer models, which may not fully represent the entire spectrum of model variations.
Future work can explore the development of specialized defense mechanisms tailored to the unique characteristics of VMamba's architecture, such as adaptive scanning strategies and robust feature extraction techniques. |
visual state space model, vmamba, robustness, adversarial attacks, image classification |
2403.10906
Report |
HourglassNeRF: Casting an Hourglass as a Bundle of Rays for Few-shot Neural Rendering |
Seunghyeon Seo, Yeonjin Chang, Jayeon Yoo, Seungwoo Lee, Hojun Lee, Nojun Kwak |
Recent advancements in the Neural Radiance Field (NeRF) have bolstered its
capabilities for novel view synthesis, yet its reliance on dense multi-view
training images poses a practical challenge. Addressing this, we propose
HourglassNeRF, an effective regularization-based approach with a novel
hourglass casting strategy. Our proposed hourglass is conceptualized as a
bundle of additional rays within the area between the original input ray and
its corresponding reflection ray, by featurizing the conical frustum via
Integrated Positional Encoding (IPE). This design expands the coverage of
unseen views and enables an adaptive high-frequency regularization based on
target pixel photo-consistency. Furthermore, we propose luminance consistency
regularization based on the Lambertian assumption, which is known to be
effective for training a set of augmented rays under the few-shot setting.
Leveraging the inherent property of a Lambertian surface, which retains
consistent luminance irrespective of the viewing angle, we assume our proposed
hourglass as a collection of flipped diffuse reflection rays and enhance the
luminance consistency between the original input ray and its corresponding
hourglass, resulting in a more physically grounded training framework and
performance improvement. Our HourglassNeRF outperforms its baseline and
achieves competitive results on multiple benchmarks with sharply rendered fine
details. The code will be available. |
This paper proposes HourglassNeRF, a novel regularization-based method for few-shot neural rendering that employs an hourglass casting strategy. |
Addresses the challenge of NeRF's reliance on dense multi-view training images by introducing a novel ray augmentation and regularization technique. |
1. Casts an hourglass as a bundle of additional rays in the area between the original input ray and its corresponding reflection ray, featurizing the conical frustum via Integrated Positional Encoding (IPE).
2. Applies adaptive high-frequency regularization based on target pixel photo-consistency.
3. Introduces luminance consistency regularization based on the Lambertian assumption. |
Outperforms baseline methods and achieves state-of-the-art results on Realistic Synthetic 360° dataset.
Renders sharper fine details from earlier training stages compared to methods relying on fixed high-frequency masking.
Demonstrates competitive performance on DTU dataset without relying on dataset-specific priors. |
Limited consideration of surface properties, assuming all reflections as diffuse even on shiny surfaces.
Future work could explore adaptive use of specular and diffuse reflections based on estimated surface texture. |
neural radiance field, few-shot neural rendering, ray augmentation, hourglass casting, luminance consistency |
2403.10854
Report |
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment |
Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang |
While Multimodal Large Language Models (MLLMs) have experienced significant
advancement in visual understanding and reasoning, their potential to serve as
powerful, flexible, interpretable, and text-driven models for Image Quality
Assessment (IQA) remains largely unexplored. In this paper, we conduct a
comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we
first investigate nine prompting systems for MLLMs as the combinations of three
standardized testing procedures in psychophysics (i.e., the single-stimulus,
double-stimulus, and multiple-stimulus methods) and three popular prompting
strategies in natural language processing (i.e., the standard, in-context, and
chain-of-thought prompting). We then present a difficult sample selection
procedure, taking into account sample diversity and uncertainty, to further
challenge MLLMs equipped with the respective optimal prompting systems. We
assess three open-source and one closed-source MLLMs on several visual
attributes of image quality (e.g., structural and textural distortions, color
differences, and geometric transformations) in both full-reference and
no-reference scenarios. Experimental results show that only the closed-source
GPT-4V provides a reasonable account for human perception of image quality, but
is weak at discriminating fine-grained quality variations (e.g., color
differences) and at comparing visual quality of multiple images, tasks humans
can perform effortlessly. |
This paper presents a comprehensive study of prompting Multimodal Large Language Models (MLLMs) for Image Quality Assessment (IQA), exploring different prompting strategies and their effectiveness in evaluating image quality. |
This study is important because it investigates the potential of MLLMs to serve as powerful, flexible, interpretable, and text-driven models for IQA, a task that traditional IQA methods struggle with. |
The authors systematically combine psychophysical testing procedures (single-stimulus, double-stimulus, and multiple-stimulus methods) with NLP prompting strategies (standard, in-context, and chain-of-thought prompting) to create nine prompting systems. They also propose a difficult sample selection procedure to challenge the MLLMs. |
The optimal prompting system varies between open-source and closed-source MLLMs.
Only the closed-source GPT-4V provides reasonable IQA performance, but still struggles with fine-grained quality variations and multiple-image comparison.
Chain-of-thought prompting consistently improves GPT-4V's performance across different testing protocols and visual attributes. |
The textual responses from MLLMs were not quantitatively assessed.
The study focuses on prompting and doesn't explore instruction tuning of MLLMs for enhanced IQA performance. |
image quality assessment, multimodal large language models, prompt engineering, psychophysics, benchmarking |
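The nine prompting systems in the IQA entry above are the Cartesian product of three psychophysical testing procedures and three prompting strategies. A short sketch that enumerates them follows. The dictionary layout and variable names are illustrative assumptions, and no actual prompt wording from the paper is reproduced.

```python
from itertools import product

# The three testing procedures and three prompting strategies named in the entry.
testing_procedures = ["single-stimulus", "double-stimulus", "multiple-stimulus"]
prompting_strategies = ["standard", "in-context", "chain-of-thought"]

# Each prompting system pairs one procedure with one strategy.
prompting_systems = [
    {"procedure": procedure, "strategy": strategy}
    for procedure, strategy in product(testing_procedures, prompting_strategies)
]

for system in prompting_systems:
    print(f"{system['procedure']} + {system['strategy']}")
```

Enumerating the grid this way makes it straightforward to evaluate every model under each of the nine combinations before selecting its best-performing one.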
2403.10801
Report |
Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples |
Ziqi Zhou, Minghui Li, Wei Liu, Shengshan Hu, Yechao Zhang, Wei Wan, Lulu Xue, Leo Yu Zhang, Dezhong Yao, Hai Jin |
With the evolution of self-supervised learning, the pre-training paradigm has
emerged as a predominant solution within the deep learning landscape. Model
providers furnish pre-trained encoders designed to function as versatile
feature extractors, enabling downstream users to harness the benefits of
expansive models with minimal effort through fine-tuning. Nevertheless, recent
works have exposed a vulnerability in pre-trained encoders, highlighting their
susceptibility to downstream-agnostic adversarial examples (DAEs) meticulously
crafted by attackers. The lingering question pertains to the feasibility of
fortifying the robustness of downstream models against DAEs, particularly in
scenarios where the pre-trained encoders are publicly accessible to the
attackers.
In this paper, we initially delve into existing defensive mechanisms against
adversarial examples within the pre-training paradigm. Our findings reveal that
the failure of current defenses stems from the domain shift between
pre-training data and downstream tasks, as well as the sensitivity of encoder
parameters. In response to these challenges, we propose Genetic
Evolution-Nurtured Adversarial Fine-tuning (Gen-AF), a two-stage adversarial
fine-tuning approach aimed at enhancing the robustness of downstream models.
Our extensive experiments, conducted across ten self-supervised training
methods and six datasets, demonstrate that Gen-AF attains high testing accuracy
and robust testing accuracy against state-of-the-art DAEs. |
Gen-AF, a novel genetic evolution-nurtured adversarial fine-tuning approach, enhances downstream model robustness against DAEs while preserving generalization ability. |
Pre-trained encoders are vulnerable to DAEs, jeopardizing downstream tasks. Existing defenses are ineffective due to domain shift and encoder sensitivity. |
Two-stage approach: 1) Genetic-driven dual-track adversarial fine-tuning with bilevel optimization and genetic regularization. 2) Evolutionary adaptability fine-tuning, targeting robust-redundant layers. |
Gen-AF achieves high robust testing accuracy against five SOTA DAEs across ten SSL methods, two pre-training datasets, and six downstream datasets.
Maintains or improves generalization compared to standard training.
Effectively defends against backdoor attacks targeting pre-trained encoders. |
Exploration of other types of adversarial examples.
Investigation of more efficient fine-tuning strategies to further reduce computational overhead. |
adversarial machine learning, deep learning, self-supervised learning, transfer learning, adversarial examples |
2403.10783
Report |
StableGarment: Garment-Centric Generation via Stable Diffusion |
Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, Peipei Li |
In this paper, we introduce StableGarment, a unified framework to tackle
garment-centric (GC) generation tasks, including GC text-to-image, controllable
GC text-to-image, stylized GC text-to-image, and robust virtual try-on. The
main challenge lies in retaining the intricate textures of the garment while
maintaining the flexibility of pre-trained Stable Diffusion. Our solution
involves the development of a garment encoder, a trainable copy of the
denoising UNet equipped with additive self-attention (ASA) layers. These ASA
layers are specifically devised to transfer detailed garment textures, also
facilitating the integration of stylized base models for the creation of
stylized images. Furthermore, the incorporation of a dedicated try-on
ControlNet enables StableGarment to execute virtual try-on tasks with
precision. We also build a novel data engine that produces high-quality
synthesized data to preserve the model's ability to follow prompts. Extensive
experiments demonstrate that our approach delivers state-of-the-art (SOTA)
results among existing virtual try-on methods and exhibits high flexibility
with broad potential applications in various garment-centric image generation. |
Proposed StableGarment, a unified framework tackling various garment-centric generation tasks, including text-to-image, controllable generation, stylized generation, and robust virtual try-on. |
Addresses limitations of existing virtual try-on methods and enables the creation of diverse product visuals (e.g., posters, display images) with accurate garment details and flexible image modifications. |
Leverages a garment encoder with additive self-attention for detailed texture transfer, a try-on ControlNet for precise virtual try-on, and a data engine producing synthesized data to enhance prompt following. |
Achieves state-of-the-art performance among virtual try-on methods.
Demonstrates high flexibility in garment-centric image generation with various text prompts, control signals, and stylized base models.
Outperforms existing methods in preserving intricate garment details, such as patterns and text. |
Limitations in VAE reconstruction affecting garment detail preservation.
Occasional generation of incorrect accessories due to inaccurate parsing conditions (garment masks, DensePose). |
virtual try-on, text-to-image synthesis, diffusion models, garment-centric generation, stable diffusion |
2403.10731
Report |
Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation |
Anton Pelykh, Ozge Mercanoglu Sincan, Richard Bowden |
Recent years have seen significant progress in human image generation,
particularly with the advancements in diffusion models. However, existing
diffusion methods encounter challenges when producing consistent hand anatomy
and the generated images often lack precise control over the hand pose. To
address this limitation, we introduce a novel approach to pose-conditioned
human image generation, dividing the process into two stages: hand generation
and subsequent body outpainting around the hands. We propose training the hand
generator in a multi-task setting to produce both hand images and their
corresponding segmentation masks, and employ the trained model in the first
stage of generation. An adapted ControlNet model is then used in the second
stage to outpaint the body around the generated hands, producing the final
result. A novel blending technique is introduced to preserve the hand details
during the second stage that combines the results of both stages in a coherent
way. This involves sequential expansion of the outpainted region while fusing
the latent representations, to ensure a seamless and cohesive synthesis of the
final image. Experimental evaluations demonstrate the superiority of our
proposed method over state-of-the-art techniques, in both pose accuracy and
image quality, as validated on the HaGRID dataset. Our approach not only
enhances the quality of the generated hands but also offers improved control
over hand pose, advancing the capabilities of pose-conditioned human image
generation. The source code of the proposed approach is available at
https://github.com/apelykh/hand-to-diffusion. |
This paper proposes a novel two-stage diffusion-based approach for human image generation that addresses the challenge of generating high-quality hands with precise pose control. |
Existing diffusion models often struggle to generate realistic and anatomically correct hands, particularly when precise pose control is desired. This limitation hinders their applicability in areas like advertising and game character creation. |
The proposed method first generates hands and their segmentation masks using a multi-task diffusion model. Then, it employs an adapted ControlNet model to outpaint the body around the generated hands, guided by the skeleton pose. A novel blending technique with sequential mask expansion ensures seamless integration of hands and body. |
The method achieves state-of-the-art results in pose accuracy, outperforming baselines by a significant margin in terms of DAP and MPJPE for both full body and hand keypoints.
Qualitative and quantitative evaluations on the HaGRID dataset demonstrate superior image quality with realistic and anatomically correct hands.
The sequential mask expansion blending strategy effectively preserves hand details while ensuring seamless transitions between the generated regions, as shown in the ablation study. |
The approach assumes connectivity between arms and wrists in the input pose, potentially leading to discontinuities if arm keypoints are missing.
Generating high-quality small hands is challenging due to the limited resolution of the latent space. |
human image generation, diffusion models, pose control, hand generation, multi-task learning |
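The blending step in this record fuses stage-one hand latents with stage-two body latents while the outpainted region grows step by step. The toy sketch below shows one possible reading of that idea on small 2D grids. The direction and schedule of the expansion, the 4-neighbourhood dilation, and the `denoise` callback are all assumptions made for illustration; the authors' implementation may differ.

```python
def dilate(mask):
    """Grow the 1-region of a binary mask by one pixel (4-neighbourhood)."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1:
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        out[ny][nx] = 1
    return out


def fuse(hand_latent, body_latent, outpaint_mask):
    """Take the body latent where the mask is 1 and the hand latent elsewhere."""
    return [
        [b if m == 1 else h for h, b, m in zip(hand_row, body_row, mask_row)]
        for hand_row, body_row, mask_row in zip(hand_latent, body_latent, outpaint_mask)
    ]


def blend_with_expansion(hand_latent, body_latent, outpaint_mask, num_steps,
                         denoise=lambda latent: latent):
    """Alternate a (hypothetical) denoising step with fusion under a growing mask."""
    mask = outpaint_mask
    fused = fuse(hand_latent, body_latent, mask)
    for _ in range(num_steps):
        body = denoise(fused)   # stand-in for one stage-two denoising step
        mask = dilate(mask)     # the outpainted region expands by one pixel
        fused = fuse(hand_latent, body, mask)
    return fused, mask


# Toy 5x5 example: the hand occupies the central 3x3 block.
hand = [[1.0] * 5 for _ in range(5)]
body = [[0.0] * 5 for _ in range(5)]
start_mask = [[0 if 1 <= y <= 3 and 1 <= x <= 3 else 1 for x in range(5)] for y in range(5)]
fused, final_mask = blend_with_expansion(hand, body, start_mask, num_steps=1)
```

With the default identity `denoise`, the fused values do not change; in a real pipeline the denoiser would rewrite the newly released border pixels so that the hand and body regions meet without a visible seam.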
2403.10701
Report |
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation |
Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga |
Generative object compositing emerges as a promising new avenue for
compositional image editing. However, the requirement of object identity
preservation poses a significant challenge, limiting practical usage of most
existing methods. In response, this paper introduces IMPRINT, a novel
diffusion-based generative model trained with a two-stage learning framework
that decouples learning of identity preservation from that of compositing. The
first stage is targeted for context-agnostic, identity-preserving pretraining
of the object encoder, enabling the encoder to learn an embedding that is both
view-invariant and conducive to enhanced detail preservation. The subsequent
stage leverages this representation to learn seamless harmonization of the
object composited to the background. In addition, IMPRINT incorporates a
shape-guidance mechanism offering user-directed control over the compositing
process. Extensive experiments demonstrate that IMPRINT significantly
outperforms existing methods and various baselines on identity preservation and
composition quality. |
Introduces IMPRINT, a two-stage diffusion-based generative model for object compositing that decouples identity preservation from compositing, enhancing object detail fidelity and background harmonization. |
Addresses the limitations of existing generative object compositing methods that struggle to balance identity preservation with seamless integration into backgrounds. |
Employs a two-stage learning framework: 1) context-agnostic, identity-preserving pretraining of an object encoder on multi-view data and 2) fine-tuning the model for compositing, leveraging the learned representations for harmonization. |
Significantly outperforms existing methods in identity preservation and composition quality on benchmark datasets.
Demonstrates superior appearance preservation through a novel context-agnostic training approach.
Incorporates a shape-guidance mechanism for user-directed control over the compositing process. |
Identity preservation may degrade with large viewpoint changes, requiring further exploration of 3D representations.
Consistency of small details like text and logos can be improved, potentially through higher resolution encoders and improved latent space representation. |
image compositing, generative models, diffusion models, identity preservation, shape guidance |
2403.10615
Report |
LightIt: Illumination Modeling and Control for Diffusion Models |
Peter Kocsis, Julien Philip, Kalyan Sunkavalli, Matthias Nießner, Yannick Hold-Geoffroy |
We introduce LightIt, a method for explicit illumination control for image
generation. Recent generative methods lack lighting control, which is crucial
to numerous artistic aspects of image generation such as setting the overall
mood or cinematic appearance. To overcome these limitations, we propose to
condition the generation on shading and normal maps. We model the lighting with
single bounce shading, which includes cast shadows. We first train a shading
estimation module to generate a dataset of real-world images and shading pairs.
Then, we train a control network using the estimated shading and normals as
input. Our method demonstrates high-quality image generation and lighting
control in numerous scenes. Additionally, we use our generated dataset to train
an identity-preserving relighting model, conditioned on an image and a target
shading. Our method is the first that enables the generation of images with
controllable, consistent lighting and performs on par with specialized
relighting state-of-the-art methods. |
Introduces LightIt, a method for explicit illumination control in image generation using single-bounce shading and normal maps as conditioning signals for diffusion models. |
Recent generative methods lack explicit lighting control, which is crucial for artistic aspects of image generation like mood and realism. |
Trains a shading estimation module to generate a paired image-shading dataset from panoramas, then trains a control network using estimated shading and normals to guide a pre-trained diffusion model (Stable Diffusion). |
Generates images with controllable and consistent lighting across diverse text prompts and styles.
Enables novel lighting scenarios for both real and generated images.
Outperforms specialized relighting methods in terms of generalization and realism. |
Assumes directional lighting, limiting applicability to outdoor scenes.
Relies on estimated lighting directions, hindering training on larger, unconstrained datasets. |
image generation, illumination control, diffusion models, shading estimation, relighting |
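Single-bounce shading under a directional light, as described in this record, can be illustrated per pixel with a Lambertian cosine term gated by shadow visibility. The sketch below assumes a Lambertian surface and takes visibility as a precomputed boolean (in practice it would come from a shadow test against scene geometry, which is not shown). The function names are hypothetical and this is not the authors' shading estimator.

```python
import math


def normalize(vector):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in vector))
    return tuple(c / length for c in vector)


def direct_shading(normal, light_dir, visible=True):
    """Lambertian direct shading: max(0, n . l), or 0 when the point is in cast shadow.

    `light_dir` points from the surface towards the light.
    """
    if not visible:
        return 0.0
    n = normalize(normal)
    l = normalize(light_dir)
    return max(0.0, sum(a * b for a, b in zip(n, l)))


facing = direct_shading((0.0, 0.0, 1.0), (0.0, 0.0, 2.0))
grazing = direct_shading((0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
shadowed = direct_shading((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), visible=False)
```

Evaluating this for every pixel of a normal map gives a shading map of the kind the record uses as a conditioning signal.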
2403.10520
Report |
Strong and Controllable Blind Image Decomposition |
Zeyu Zhang, Junlin Han, Chenhui Gou, Hongdong Li, Liang Zheng |
Blind image decomposition aims to decompose all components present in an
image, typically used to restore a multi-degraded input image. While fully
recovering the clean image is appealing, in some scenarios, users might want to
retain certain degradations, such as watermarks, for copyright protection. To
address this need, we add controllability to the blind image decomposition
process, allowing users to enter which types of degradation to remove or
retain. We design an architecture named controllable blind image decomposition
network. Inserted in the middle of U-Net structure, our method first decomposes
the input feature maps and then recombines them according to user instructions.
Advantageously, this functionality is implemented at minimal computational
cost: decomposition and recombination are all parameter-free. Experimentally,
our system excels in blind image decomposition tasks and can output partially
or fully restored images that well reflect user intentions. Furthermore, we
evaluate and configure different options for the network structure and loss
functions. This, combined with the proposed decomposition-and-recombination
method, yields an efficient and competitive system for blind image
decomposition, compared with current state-of-the-art methods. |
This paper introduces controllability to blind image decomposition (BID), enabling users to selectively remove or retain image components based on their needs. |
It addresses the limitations of existing BID methods that lack controllability and flexibility in handling user-specific preferences for image restoration. This makes image processing more aligned with real-world scenarios where users may want to keep certain degradations, like watermarks for copyright. |
The paper presents CBDNet, a U-Net-based architecture with a decomposition block, a controllability block, and a recombination block. The decomposition block splits the feature map into components. The controllability block predicts the components present and allows for user prompt input. The recombination block blends selected components based on the prompt. |
CBDNet achieves state-of-the-art performance on standard BID tasks, outperforming existing methods in both efficiency and accuracy.
CBDNet effectively performs controllable BID, removing or retaining components based on user prompts.
The authors create a new multi-domain degradation removal dataset to support research on controllable BID with nine degradation types across weather, lighting, and obstruction domains. |
CBDNet's in-painting capability has limitations, especially in heavily obscured areas.
Further research is needed to improve the robustness of the source classifier in corner cases. |
image decomposition, low-level vision, controllable image processing, image restoration, rain removal |
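The record above describes decomposition and recombination as parameter-free operations steered by a user prompt. One simple way to picture this, assumed here purely for illustration, is to split the feature channels into equal groups (one per component) and zero out the groups the user does not select. The grouping rule, the zeroing, and all names below are hypothetical; the actual CBDNet blocks may work differently.

```python
def decompose(features, component_names):
    """Split a flat list of channel values into equal groups, one per component."""
    group_size, remainder = divmod(len(features), len(component_names))
    if remainder != 0:
        raise ValueError("channel count must be divisible by the number of components")
    return {
        name: features[i * group_size:(i + 1) * group_size]
        for i, name in enumerate(component_names)
    }


def recombine(groups, keep):
    """Concatenate groups in order, zeroing every component not listed in `keep`."""
    output = []
    for name, channels in groups.items():
        output.extend(channels if name in keep else [0.0] * len(channels))
    return output


# Toy example: six channels, three components; the user keeps the watermark.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
groups = decompose(features, ["clean", "rain", "watermark"])
restored = recombine(groups, keep={"clean", "watermark"})
```

Neither function has learnable weights, which matches the "parameter-free" property stated in the abstract.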
2403.10427
Report |
SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians |
Hiba Dahmani, Moussab Bennehar, Nathan Piasco, Luis Roldao, Dzmitry Tsishkou |
Implicit neural representation methods have shown impressive advancements in
learning 3D scenes from unstructured in-the-wild photo collections but are
still limited by the large computational cost of volumetric rendering. More
recently, 3D Gaussian Splatting emerged as a much faster alternative with
superior rendering quality and training efficiency, especially for small-scale
and object-centric scenarios. Nevertheless, this technique suffers from poor
performance on unstructured in-the-wild data. To tackle this, we extend 3D
Gaussian Splatting to handle unstructured image collections. We achieve this by
modeling appearance to capture photometric variations in the rendered images.
Additionally, we introduce a new mechanism to train transient Gaussians to
handle the presence of scene occluders in an unsupervised manner. Experiments
on diverse photo collection scenes and multi-pass acquisition of outdoor
landmarks show the effectiveness of our method over prior works achieving
state-of-the-art results with improved efficiency. |
Presents SWAG, a novel 3D Gaussian Splatting (3DGS)-based method for 3D scene reconstruction from in-the-wild photo collections that effectively handles appearance variations and occluders. |
Existing implicit neural representation methods for 3D scene reconstruction struggle with the high computational cost of volumetric rendering, particularly in challenging in-the-wild scenarios. |
SWAG introduces image-dependent embeddings to modulate Gaussian colors, capturing appearance variations. It also learns image-dependent opacity variations for each Gaussian, allowing for unsupervised handling of transient objects. |
SWAG achieves state-of-the-art results on the Phototourism dataset and NeRF-OSR benchmark.
It significantly outperforms 3DGS in in-the-wild settings, with an average PSNR improvement of 5 dB.
SWAG maintains real-time rendering capabilities while significantly reducing training time compared to implicit methods. |
Transient object removal, while generally effective, can lead to minor artifacts in areas with frequent occlusions.
Future work could explore per-scene hyperparameter tuning and extension to dynamic scenes. |
3d gaussian splatting, unconstrained photo collection, novel view synthesis, appearance modeling, real-time rendering |
2403.10395
Report |
Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding |
Pengkun Liu, Yikai Wang, Fuchun Sun, Jiafang Li, Hang Xiao, Hongxiang Xue, Xinzhou Wang |
Encouraged by the growing availability of pre-trained 2D diffusion models,
image-to-3D generation by leveraging Score Distillation Sampling (SDS) is
making remarkable progress. Most existing methods combine novel-view lifting
from 2D diffusion models which usually take the reference image as a condition
while applying hard L2 image supervision at the reference view. Yet heavily
adhering to the image is prone to corrupting the inductive knowledge of the 2D
diffusion model, frequently leading to flat or distorted 3D generation. In this
work, we reexamine image-to-3D from a novel perspective and present Isotropic3D,
an image-to-3D generation pipeline that takes only an image CLIP embedding as
input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth
angle by solely resting on the SDS loss. The core of our framework lies in a
two-stage diffusion model fine-tuning. Firstly, we fine-tune a text-to-3D
diffusion model by substituting its text encoder with an image encoder, by
which the model preliminarily acquires image-to-image capabilities. Secondly,
we perform fine-tuning using our Explicit Multi-view Attention (EMA) which
combines noisy multi-view images with the noise-free reference image as an
explicit condition. CLIP embedding is sent to the diffusion model throughout
the whole process while reference images are discarded after fine-tuning.
As a result, with a single image CLIP embedding, Isotropic3D is capable of
generating multi-view mutually consistent images and also a 3D model with more
symmetrical and neat content, well-proportioned geometry, rich colored texture,
and less distortion compared with existing image-to-3D methods while still
preserving the similarity to the reference image to a large extent. The project
page is available at https://isotropic3d.github.io/. The code and models are
available at https://github.com/pkunliu/Isotropic3D. |
Isotropic3D is a novel image-to-3D generation pipeline that takes only an image CLIP embedding as input, allowing for isotropic optimization with respect to the azimuth angle using only the SDS loss, resulting in more symmetrical and neat 3D content. |
Existing image-to-3D methods heavily rely on reference images, leading to issues like 3D distortion, multi-face problems, and multi-view inconsistency. This work aims to leverage the power of 2D diffusion models without compromising the generation process by hard image supervision. |
Isotropic3D utilizes a two-stage fine-tuning of a text-to-3D diffusion model. Firstly, it substitutes the text encoder with an image encoder to enable image-to-image capabilities. Secondly, it introduces Explicit Multi-view Attention (EMA) to fine-tune the model using noisy multi-view images and a noise-free reference image, allowing the reference image to be discarded during the 3D generation stage. |
Isotropic3D generates high-quality 3D models with rich color and well-proportioned geometry from a single image CLIP embedding.
The method is robust to the object pose of the reference image.
Generated 3D content exhibits a high degree of consistency with the reference image. |
The resolution of the rendered 3D content is limited by the training data resolution.
The model's performance on faces requires further improvement. |
image-to-3d, clip embedding, multi-view attention, score distillation sampling, neural radiance fields |
2403.10336
Report |
How Powerful Potential of Attention on Image Restoration? |
Cong Wang, Jinshan Pan, Yeying Jin, Liyan Wang, Wei Wang, Gang Fu, Wenqi Ren, Xiaochun Cao |
Transformers have demonstrated their effectiveness in image restoration
tasks. Existing Transformer architectures typically comprise two essential
components: multi-head self-attention and feed-forward network (FFN). The
former captures long-range pixel dependencies, while the latter enables the
model to learn complex patterns and relationships in the data. Previous studies
have demonstrated that FFNs are key-value memories (Geva et al., 2020),
which are vital in modern Transformer architectures. In this paper, we conduct
an empirical study to explore the potential of attention mechanisms without
using FFN and provide novel structures to demonstrate that removing FFN is
flexible for image restoration. Specifically, we propose Continuous Scaling
Attention (CSAttn), a method that computes attention continuously in
three stages without using FFN. To achieve competitive performance, we propose
a series of key components within the attention. Our designs provide a closer
look at the attention mechanism and reveal that some simple operations can
significantly affect the model performance. We apply our CSAttn to
several image restoration tasks and show that our model can outperform
CNN-based and Transformer-based image restoration approaches. |
This paper proposes Continuous Scaling Attention (CSAttn), a novel attention mechanism for image restoration that achieves competitive performance without relying on feed-forward networks (FFN) typically found in Transformer architectures. |
Existing Transformer architectures heavily depend on FFN after the attention computation. This work challenges this norm by exploring the potential of solely using attention mechanisms for image restoration, aiming for a more efficient and potentially more effective solution. |
The CSAttn block employs three consecutive attention computations, enhanced by several key designs: Continuous Attention Learning, Spatial Scaling Learning, Value Nonlinear Transformation Adjustment, Nonlinear Activation Function, Intra Attention Aggregation, Intra Progressive More Heads, and Intra Residual Connections. Each of these components contributes to scaling up the attention capacity for achieving superior performance. |
CSAttn outperforms state-of-the-art approaches on image deraining, achieving an average PSNR improvement of 0.41 dB over the best competitor.
CSAttn demonstrates superior performance on image desnowing, surpassing previous state-of-the-art methods on both CSD and Snow100K benchmarks.
CSAttn achieves significant improvements on low-light image enhancement (LOL dataset) and real image dehazing (Dense-Haze and NH-Haze datasets), outperforming recent state-of-the-art methods. |
The study primarily focuses on exploring the potential of attention without FFN within a specific network architecture (similar to SFNet). Investigating its effectiveness when integrated with other architectures would be beneficial.
Further research on exploring the combination of continuous attention learning with other efficient designs could potentially lead to even better performance. |
image restoration, continuous scaling attention, transformer, attention mechanism, feed-forward network |
2403.10335
Report |
NECA: Neural Customizable Human Avatar |
Junjin Xiao, Qing Zhang, Zhan Xu, Wei-Shi Zheng |
Human avatars have become a novel type of 3D asset with various applications.
Ideally, a human avatar should be fully customizable to accommodate different
settings and environments. In this work, we introduce NECA, an approach capable
of learning versatile human representation from monocular or sparse-view
videos, enabling granular customization across aspects such as pose, shadow,
shape, lighting and texture. The core of our approach is to represent humans in
complementary dual spaces and predict disentangled neural fields of geometry,
albedo, shadow, as well as external lighting, from which we are able to
derive realistic rendering with high-frequency details via volumetric
rendering. Extensive experiments demonstrate the advantage of our method over
the state-of-the-art methods in photorealistic rendering, as well as various
editing tasks such as novel pose synthesis and relighting. The code is
available at https://github.com/iSEE-Laboratory/NECA. |
Presents NECA, a novel framework for learning fully customizable neural human avatars from monocular or sparse-view videos. |
Human avatars need to be fully editable for diverse applications in the metaverse, telepresence, and 3D games. Previous methods only offer limited editing capabilities. |
Represents humans in dual spaces (canonical and surface) to capture high-frequency details and geometry-aware characteristics. Predicts disentangled neural fields for geometry, albedo, shadow, and lighting for flexible control. Trained in a self-supervised manner with photometric losses and normal regularization. |
Outperforms state-of-the-art methods in novel pose synthesis and relighting on ZJU-MoCap, NeuMan, DeepCap, DynaCap, and a synthetic dataset.
Enables shape, texture, and shadow editing, including reshaping, retexturing, shadow removal, and local shadow transfer.
Achieves high-fidelity rendering and diverse customization by disentangling neural fields and optimizing lighting representation. |
Performance can be sensitive to the accuracy of estimated SMPL parameters.
Shadows under complex novel poses may be erroneous due to the lack of explicit visibility modeling.
Future work includes exploring more robust shape and texture editing, as well as generalizing the method to handle multiple humans. |
human avatar, neural rendering, disentangled representation, customization, relighting |
2403.10242
Report |
FDGaussian: Fast Gaussian Splatting from Single Image via Geometric-aware Diffusion Model |
Qijun Feng, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang |
Reconstructing detailed 3D objects from single-view images remains a
challenging task due to the limited information available. In this paper, we
introduce FDGaussian, a novel two-stage framework for single-image 3D
reconstruction. Recent methods typically utilize pre-trained 2D diffusion
models to generate plausible novel views from the input image, yet they
encounter issues with either multi-view inconsistency or lack of geometric
fidelity. To overcome these challenges, we propose an orthogonal plane
decomposition mechanism to extract 3D geometric features from the 2D input,
enabling the generation of consistent multi-view images. Moreover, we further
accelerate the state-of-the-art Gaussian Splatting incorporating epipolar
attention to fuse images from different viewpoints. We demonstrate that
FDGaussian generates images with high consistency across different views and
reconstructs high-quality 3D objects, both qualitatively and quantitatively.
More examples can be found at our website https://qjfeng.net/FDGaussian/. |
Presents FDGaussian, a novel two-stage framework for single-image 3D reconstruction using a geometric-aware diffusion model and accelerated Gaussian Splatting. |
Addresses the limitations of current single-view 3D reconstruction methods that struggle with multi-view inconsistency or lack of geometric fidelity. |
1. Employs an orthogonal plane decomposition mechanism to extract 3D geometric features from the input image for consistent multi-view image generation using a diffusion model. 2. Introduces epipolar attention to fuse the generated multi-view images during Gaussian Splatting, improving geometric reconstruction. 3. Proposes Gaussian Divergent Significance (GDS) to accelerate optimization by avoiding unnecessary split and clone operations. |
FDGaussian outperforms baseline methods in novel view synthesis and single-image 3D reconstruction on Objaverse and Google Scanned Objects datasets.
The orthogonal plane decomposition mechanism significantly improves multi-view consistency and geometric accuracy.
GDS accelerates the optimization process by up to 15 times without compromising reconstruction quality. |
The number of generated views is fixed, limiting potential efficiency gains for objects with varying topological symmetries.
The current framework is limited to single-object reconstruction and cannot handle complex scenes or multiple objects. |
3d reconstruction, gaussian splatting, diffusion model, multi-view consistency, single image |
2403.10211
Report |
BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution |
Feng Li, Yixuan Wu, Zichao Liang, Runmin Cong, Huihui Bai, Yao Zhao, Meng Wang |
Diffusion models (DM) have shown remarkable promise in image
super-resolution (SR). However, most of them are tailored to solving non-blind
inverse problems with fixed known degradation settings, limiting their
adaptability to real-world applications that involve complex unknown
degradations. In this work, we propose BlindDiff, a DM-based blind SR method to
tackle the blind degradation settings in SISR. BlindDiff seamlessly integrates
the MAP-based optimization into DMs, which constructs a joint distribution of
the low-resolution (LR) observation, high-resolution (HR) data, and degradation
kernels for the data and kernel priors, and solves the blind SR problem by
unfolding the MAP approach along with the reverse process. Unlike most DMs,
BlindDiff first presents a modulated conditional transformer (MCFormer) that
is pre-trained with noise and kernel constraints, further serving as a
posterior sampler to provide both priors simultaneously. Then, we plug a simple
yet effective kernel-aware gradient term between adjacent sampling iterations
that guides the diffusion model to learn degradation consistency knowledge.
This also enables joint refinement of the degradation model and the HR images by
observing the previous denoised sample. With the MAP-based reverse diffusion
process, we show that BlindDiff advocates alternate optimization for blur
kernel estimation and HR image restoration in a mutually reinforcing manner.
Experiments on both synthetic and real-world datasets show that BlindDiff
achieves the state-of-the-art performance with significant model complexity
reduction compared to recent DM-based methods. Code will be available at
https://github.com/lifengcs/BlindDiff |
This paper proposes BlindDiff, a novel diffusion model-based blind image super-resolution method that integrates MAP-based optimization with diffusion models for robust and efficient super-resolution under unknown degradation settings. |
Most existing diffusion model-based super-resolution methods assume known degradation settings, limiting their applicability to real-world scenarios with complex and unknown degradations. BlindDiff addresses this limitation by jointly estimating the blur kernel and the high-resolution image in a mutually reinforcing manner. |
BlindDiff formulates the blind super-resolution problem under a maximum a posteriori (MAP) framework and unfolds it along the reverse diffusion process. It introduces a modulated conditional transformer (MCFormer) as the denoising network, trained with noise and kernel constraints to provide data and kernel priors. A kernel-aware gradient term guides the model to learn degradation consistency knowledge, enabling alternate optimization of blur kernels and HR images during the reverse process. |
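The alternate optimization of blur kernel and HR image described above can be illustrated on a toy 1D problem. The following is a minimal sketch, not the BlindDiff implementation: it keeps only the data-fidelity term of a MAP objective (no diffusion prior, no MCFormer), and the function names, step sizes, and edge-clamped blur are assumptions made for illustration.

```python
def degrade(x, k, s):
    """Blur x with kernel k (edge-clamped), then keep every s-th sample."""
    r = len(k) // 2
    n = len(x)
    out = []
    for m in range(0, n, s):
        acc = 0.0
        for j, kj in enumerate(k):
            idx = min(max(m + j - r, 0), n - 1)
            acc += kj * x[idx]
        out.append(acc)
    return out


def data_loss(x, k, y, s):
    """Data-fidelity term 0.5 * ||degrade(x, k) - y||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(degrade(x, k, s), y))


def gradients(x, k, y, s):
    """Analytic gradients of data_loss with respect to x and k."""
    r = len(k) // 2
    n = len(x)
    gx = [0.0] * n
    gk = [0.0] * len(k)
    for m, (a, b) in enumerate(zip(degrade(x, k, s), y)):
        res = a - b
        for j, kj in enumerate(k):
            idx = min(max(m * s + j - r, 0), n - 1)
            gx[idx] += res * kj
            gk[j] += res * x[idx]
    return gx, gk


def alternate_refine(y, s, steps=50, lr_x=0.5, lr_k=0.01):
    """Alternate gradient steps on the HR signal x and the blur kernel k."""
    x = [v for v in y for _ in range(s)]  # nearest-neighbour upsampling as init
    k = [1 / 3, 1 / 3, 1 / 3]             # hypothetical uniform kernel init
    history = [data_loss(x, k, y, s)]
    for _ in range(steps):
        gx, _ = gradients(x, k, y, s)     # HR update with the kernel fixed
        x = [xi - lr_x * g for xi, g in zip(x, gx)]
        _, gk = gradients(x, k, y, s)     # kernel update with the HR signal fixed
        k = [ki - lr_k * g for ki, g in zip(k, gk)]
        history.append(data_loss(x, k, y, s))
    return x, k, history
```

Each update only uses the current estimate of the other variable, which is the "mutually reinforcing" structure in miniature. In the actual method the priors come from the pretrained denoiser, which this sketch omits.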
BlindDiff achieves state-of-the-art performance on benchmark datasets, significantly outperforming existing DM-based methods in terms of FID and LPIPS.
BlindDiff maintains high performance on both isotropic and anisotropic Gaussian blur kernels, demonstrating its robustness to different degradation types.
BlindDiff demonstrates promising results on real-world images with unknown degradations, indicating its practical applicability. |
The computational cost of BlindDiff, although lower than other DM-based methods, is still higher than some CNN-based methods.
Future work could focus on extending BlindDiff to handle more complex real-world degradation scenarios, such as spatially variant blur. |
blind super-resolution, diffusion models, map optimization, modulated conditional transformer, kernel estimation |
2403.10191
Report |
Generative Region-Language Pretraining for Open-Ended Object Detection |
Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai |
In recent research, significant attention has been devoted to the
open-vocabulary object detection task, aiming to generalize beyond the limited
number of classes labeled during training and detect objects described by
arbitrary category names at inference. Compared with conventional object
detection, open vocabulary object detection largely extends the object
detection categories. However, it relies on calculating the similarity between
image regions and a set of arbitrary category names with a pretrained
vision-and-language model. This implies that, despite its open-set nature, the
task still needs the predefined object categories during the inference stage.
This raises the question: What if we do not have exact knowledge of object
categories during inference? In this paper, we call this new setting
generative open-ended object detection, which is a more general and practical
problem. To address it, we formulate object detection as a generative problem
and propose a simple framework named GenerateU, which can detect dense objects
and generate their names in a free-form way. Particularly, we employ Deformable
DETR as a region proposal generator with a language model translating visual
regions to object names. To assess the free-form object detection task, we
introduce an evaluation method designed to quantitatively measure the
performance of generative outcomes. Extensive experiments demonstrate strong
zero-shot detection performance of our GenerateU. For example, on the LVIS
dataset, our GenerateU achieves comparable results to the open-vocabulary
object detection method GLIP, even though the category names are not seen by
GenerateU during inference. Code is available at:
https://github.com/FoundationVision/GenerateU. |
Introduces "generative open-ended object detection," a new object detection paradigm that eliminates the need for predefined object categories during inference by formulating it as a generative problem. |
Addresses limitations of existing open-vocabulary object detection methods that still require predefined categories during inference, aiming for a more general and practical approach. |
Proposes GenerateU, a novel end-to-end framework comprising an open-world object detector and a language model. Leverages a small set of human-annotated object-language paired data and scales up vocabulary size with massive image-text pairs, using a pseudo-labeling method to enrich label diversity. |
GenerateU achieves comparable results to open-vocabulary object detection methods on zero-shot LVIS, despite not seeing object categories during inference.
End-to-end training of both image encoder and language model is crucial for optimal performance in generative open-ended object detection.
Beam search significantly improves recognition of rare object categories, effectively addressing the long-tail problem. |
Future work includes investigating the impact of training data scale on performance.
Exploring more sophisticated pseudo-labeling methods beyond the naive approach used in the paper is another promising direction. |
open-ended object detection, generative object detection, zero-shot learning, multimodal learning, vision and language |
2403.10179
Report |
Animate Your Motion: Turning Still Images into Dynamic Videos |
Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars |
In recent years, diffusion models have made remarkable strides in
text-to-video generation, sparking a quest for enhanced control over video
outputs to more accurately reflect user intentions. Traditional efforts
predominantly focus on employing either semantic cues, like images or depth
maps, or motion-based conditions, like moving sketches or object bounding
boxes. Semantic inputs offer a rich scene context but lack detailed motion
specificity; conversely, motion inputs provide precise trajectory information
but miss the broader semantic narrative. For the first time, we integrate both
semantic and motion cues within a diffusion model for video generation, as
demonstrated in Fig 1. To this end, we introduce the Scene and Motion
Conditional Diffusion (SMCD), a novel methodology for managing multimodal
inputs. It incorporates a recognized motion conditioning module and
investigates various approaches to integrate scene conditions, promoting
synergy between different modalities. For model training, we separate the
conditions for the two modalities, introducing a two-stage training pipeline.
Experimental results demonstrate that our design significantly enhances video
quality, motion precision, and semantic coherence. |
The paper introduces Scene and Motion Conditional Diffusion (SMCD), a novel diffusion-based video generation model that leverages both scene and motion cues (images and bounding box sequences) alongside text prompts. |
Existing text-to-video generation methods often struggle to accurately reflect user intentions, relying solely on either semantic (images, depth maps) or motion-based (sketches, bounding boxes) conditions. SMCD addresses this by integrating both, allowing for more customized and controlled video generation. |
SMCD, built upon a pretrained text-to-video diffusion model, incorporates a motion integration module (MIM) for encoding box locations and a dual image integration module (DIIM) for embedding image conditions. It employs a two-stage training pipeline, focusing first on motion integration and then on image and temporal coherence. |
SMCD significantly outperforms existing methods in terms of video quality (FVD), demonstrating the effectiveness of incorporating both scene and motion conditions.
The model accurately grounds objects to their specified trajectories while preserving the semantic details of the input image.
Ablation studies highlight the importance of both MIM and DIIM, demonstrating that their synergistic integration within SMCD yields optimal results. |
Relying solely on bounding boxes for motion control can be insufficient as similar changes can result from camera movement, necessitating the incorporation of camera constraints in future work.
SMCD currently faces challenges in generating high-quality videos featuring humans, inheriting this limitation from the pretrained ModelScope backbone. |
video generation, controllable generation, diffusion models, multimodal learning, scene and motion conditioning |
2403.10166
Report |
SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation |
Peng Zheng, Tao Liu, Zili Yi, Rui Ma |
With the development of neural radiance fields and generative models,
numerous methods have been proposed for learning 3D human generation from 2D
images. These methods allow control over the pose of the generated 3D human and
enable rendering from different viewpoints. However, none of these methods
explore semantic disentanglement in human image synthesis, i.e., they cannot
disentangle the generation of different semantic parts, such as the body, tops,
and bottoms. Furthermore, existing methods are limited to synthesizing images at
$512^2$ resolution due to the high computational cost of neural radiance
fields. To address these limitations, we introduce SemanticHuman-HD, the first
method to achieve semantic disentangled human image synthesis. Notably,
SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis
at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution
module. By leveraging the depth maps and semantic masks as guidance for the
3D-aware super-resolution, we significantly reduce the number of sampling
points during volume rendering, thereby reducing the computational cost. Our
comparative experiments demonstrate the superiority of our method. The
effectiveness of each proposed component is also verified through ablation
studies. Moreover, our method opens up exciting possibilities for various
applications, including 3D garment generation, semantic-aware image synthesis,
controllable image synthesis, and out-of-domain image synthesis. |
This paper proposes SemanticHuman-HD, a novel method for high-resolution (1024 x 1024) 3D-aware human image synthesis with semantic disentanglement, allowing independent generation and manipulation of different semantic parts (e.g., body, tops, bottoms). |
Existing methods for 3D human image synthesis lack semantic disentanglement or are limited to lower resolutions, hindering applications like virtual try-on and garment generation. |
The method employs a two-stage training process: (1) synthesizing images, depth maps, semantic masks, and normal maps at 256 x 256 resolution using a semantic disentangled NeRF with local generators; (2) upsampling to 1024 x 1024 resolution using a novel 3D-aware super-resolution module guided by the depth and semantic information. |
SemanticHuman-HD achieves superior image quality (measured by FID and KID) compared to state-of-the-art methods at both 512 x 512 and 1024 x 1024 resolutions.
The method enables various applications, including semantic-aware virtual try-on, 3D garment generation, controllable image synthesis, and out-of-domain image synthesis.
The proposed 3D-aware super-resolution module significantly reduces computational cost by reducing the number of sampling points during volume rendering. |
The quality of synthesized results is limited by the diversity of poses and viewpoints in the training dataset.
Achieving realistic hand deformations remains a challenge. |
generative models, 3d human image synthesis, semantic disentanglement, neural radiance fields (nerf), super-resolution |
2403.10147
Report |
GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time |
Hao Li, Yuanyuan Gao, Chenming Wu, Dingwen Zhang, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han |
This paper presents GGRt, a novel approach to generalizable novel view
synthesis that alleviates the need for real camera poses, complexity in
processing high-resolution images, and lengthy optimization processes, thus
facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in
real-world scenarios. Specifically, we design a novel joint learning framework
that consists of an Iterative Pose Optimization Network (IPO-Net) and a
Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism,
the proposed framework can inherently estimate robust relative pose information
from the image observations and thus primarily alleviate the requirement of
real camera poses. Moreover, we implement a deferred back-propagation mechanism
that enables high-resolution training and inference, overcoming the resolution
constraints of previous methods. To enhance the speed and efficiency, we
further introduce a progressive Gaussian cache module that dynamically adjusts
during training and inference. As the first pose-free generalizable 3D-GS
framework, GGRt achieves inference at $\ge$ 5 FPS and real-time rendering at
$\ge$ 100 FPS. Through extensive experimentation, we demonstrate that our
method outperforms existing NeRF-based pose-free techniques in terms of
inference speed and effectiveness. It can also approach the real pose-based
3D-GS methods. Our contributions provide a significant leap forward for the
integration of computer vision and computer graphics into practical
applications, offering state-of-the-art results on LLFF, KITTI, and Waymo Open
datasets and enabling real-time rendering for immersive experiences. |
GGRt is the first pose-free generalizable 3D Gaussian splatting framework for novel view synthesis, achieving real-time rendering at over 100 FPS and inference speeds exceeding 5 FPS. |
Existing generalizable novel view synthesis methods suffer from limitations such as requiring real camera poses, struggling with high-resolution images, and lacking real-time rendering capabilities. This limits their applicability in real-world scenarios. |
GGRt consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model trained jointly. It utilizes a deferred back-propagation mechanism for high-resolution processing and a Gaussians cache module for efficiency. |
Outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness.
Achieves competitive performance compared to pose-based 3D-GS methods, even without camera pose prior.
Enables real-time rendering at over 100 FPS and inference at over 5 FPS, outperforming previous state-of-the-art. |
Relies on the assumption of static scenes, limiting its application in dynamic environments.
Future work includes exploring the integration of temporal information to handle dynamic objects. |
novel view synthesis, 3d gaussian splatting, pose-free, generalizable, real-time rendering |
2403.10133
Report |
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance |
Tianrui Huang, Pu Cao, Lu Yang, Chun Liu, Mengjie Hu, Zhiwei Liu, Qing Song |
Diffusion-based image editing is a composite process of preserving the source
image content and generating new content or applying modifications. While
current editing approaches have made improvements under text guidance, most of
them have only focused on preserving the information of the input image,
disregarding the importance of editability and alignment to the target prompt.
In this paper, we prioritize the editability by proposing a zero-shot image
editing method, named Enhance Editability for text-based
image Editing via Efficient CLIP guidance
(E4C), which only requires inference-stage optimization to explicitly
enhance the editability and text alignment. Specifically, we develop a unified
dual-branch feature-sharing pipeline that enables the preservation of the
structure or texture of the source image while allowing the other to be adapted
based on the editing task. We further integrate CLIP guidance into our pipeline
by utilizing our novel random-gateway optimization mechanism to efficiently
enhance the semantic alignment with the target prompt. Comprehensive
quantitative and qualitative experiments demonstrate that our method
effectively resolves the text alignment issues prevalent in existing methods
while maintaining the fidelity to the source image, and performs well across a
wide range of editing tasks. |
Introduces E4C, a zero-shot text-guided image editing method enhancing editability and text alignment via efficient CLIP guidance, addressing limitations in handling diverse editing tasks and text alignment in existing methods. |
Existing methods struggle to handle both structure-consistent and non-rigid editing tasks and often prioritize preserving source information over aligning new content with the target prompt. |
Employs a dual-branch feature-sharing pipeline for adaptive preservation of source image information, combined with a random-gateway optimization mechanism for efficient CLIP guidance to enhance text alignment. |
Achieves superior visual quality across various editing tasks compared to existing methods.
Demonstrates higher CLIP score, indicating better text alignment, while maintaining comparable image fidelity.
Exhibits effectiveness in handling hard samples, like multi-object scenarios and complex shape/pose changes. |
Exhibits limitations in the human face domain, especially with high-resolution images.
Ambiguous language descriptions can lead to unreasonable visual representations. |
diffusion model, text-based image editing, clip guidance, image manipulation, zero-shot learning |
2403.10098
Report |
DiffMAC: Diffusion Manifold Hallucination Correction for High Generalization Blind Face Restoration |
Nan Gao, Jia Li, Huaibo Huang, Zhi Zeng, Ke Shang, Shuwu Zhang, Ran He |
Blind face restoration (BFR) is a highly challenging problem due to the
uncertainty of degradation patterns. Current methods have low generalization
across photorealistic and heterogeneous domains. In this paper, we propose a
Diffusion-Information-Diffusion (DID) framework to tackle diffusion manifold
hallucination correction (DiffMAC), which achieves high-generalization face
restoration in diverse degraded scenes and heterogeneous domains. Specifically,
the first diffusion stage aligns the restored face with spatial feature
embedding of the low-quality face based on AdaIN, which synthesizes
degradation-removal results but with uncontrollable artifacts for some hard
cases. Based on Stage I, Stage II considers information compression using
manifold information bottleneck (MIB) and finetunes the first diffusion model
to improve facial fidelity. DiffMAC effectively fights against blind
degradation patterns and synthesizes high-quality faces with attribute and
identity consistencies. Experimental results demonstrate the superiority of
DiffMAC over state-of-the-art methods, with a high degree of generalization in
real-world and heterogeneous settings. The source code and models will be
public. |
Proposes DiffMAC, a Diffusion-Information-Diffusion (DID) framework for high-generalization blind face restoration (BFR) across photorealistic and heterogeneous domains. |
Current BFR methods struggle with generalization across diverse degraded scenes and heterogeneous domains, especially for severely degraded images. |
DID uses two stages: 1) Aligns restored face with LQ face features using AdaIN-based diffusion. 2) Applies Manifold Information Bottleneck (MIB) for information compression and finetunes the diffusion model for fidelity improvement with identity preservation. |
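For reference, the standard adaptive instance normalization (AdaIN) operation that Stage I builds on re-scales each content channel so that its mean and standard deviation match those of the corresponding style channel. The snippet below is a minimal sketch of that generic operation only; how DiffMAC wires it into the diffusion network is not shown, and the flat-list channel layout is an assumption for illustration.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Align per-channel statistics of `content` to those of `style`.

    content, style: lists of channels, each channel a flat list of floats.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        mu_c, sigma_c = fmean(c_chan), pstdev(c_chan)
        mu_s, sigma_s = fmean(s_chan), pstdev(s_chan)
        # Normalize the content channel, then apply the style statistics.
        out.append([sigma_s * (v - mu_c) / (sigma_c + eps) + mu_s for v in c_chan])
    return out
```

In the restoration setting, the low-quality face features would play the role of `style` and the features being restored the role of `content`.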
Achieves high-fidelity BFR in photorealistic and heterogeneous domains, outperforming state-of-the-art methods.
Effectively tackles diffusion manifold hallucination correction by disentangling restoration-relevant and irrelevant information.
Demonstrates the effectiveness of MIB with identity information injection for controllable and high-quality BFR. |
Challenges remain in handling BFR for unseen scenarios with severely degraded facial contours.
Inference time is longer than some methods due to the two-stage design with MIB; exploring efficient distillation of DDIM sampling is planned. |
blind face restoration, diffusion models, information bottleneck, generative adversarial networks, image restoration |
2403.10071
Report |
Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling |
Baoquan Zhang, Huaibin Wang, Luo Chuyao, Xutao Li, Liang Guotao, Yunming Ye, Xiaochen Qi, Yao He |
Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in
image synthesis, which aims to represent an image with a discrete token
sequence. Existing studies effectively address this problem by learning a
discrete codebook from scratch and in a code-independent manner to quantize
continuous representations into discrete tokens. However, learning a codebook
from scratch and in a code-independent manner is highly challenging, which may
be a key cause of codebook collapse, i.e., some code vectors are rarely
optimized, because the relationship between codes and good codebook priors is
disregarded, and eventually die off. In this paper, inspired by pretrained
language models, we find that these language models have actually pretrained a
superior codebook on large text corpora, but such information is
rarely exploited in VQIM. To this end, we propose a novel codebook transfer
framework with part-of-speech, called VQCT, which aims to transfer a
well-trained codebook from pretrained language models to VQIM for robust
codebook learning. Specifically, we first introduce a pretrained codebook from
language models and part-of-speech knowledge as priors. Then, we construct a
vision-related codebook with these priors for achieving codebook transfer.
Finally, a novel codebook transfer network is designed to exploit abundant
semantic relationships between codes contained in pretrained codebooks for
robust VQIM codebook learning. Experimental results on four datasets show that
our VQCT method achieves superior VQIM performance over previous
state-of-the-art methods. |
Proposes VQCT, a novel codebook transfer framework using part-of-speech, to improve Vector-Quantized Image Modeling (VQIM) by transferring pretrained codebooks from language models (e.g., CLIP) to enhance VQIM codebook learning and alleviate codebook collapse. |
VQIM suffers from codebook collapse where many code vectors remain unoptimized. Existing methods learn codebooks from scratch, neglecting potentially beneficial relationships between codes. This paper argues that leveraging pretrained language model codebooks can provide rich semantic information and relationships for more robust VQIM. |
1. Construct vision-related codebooks (adjective and noun) from pretrained language models using part-of-speech filtering. 2. Design a graph convolution-based codebook transfer network to transfer knowledge from these codebooks to VQIM. 3. Use the transferred codebooks for quantizing continuous image representations. |
VQCT outperforms state-of-the-art VQIM methods in image reconstruction tasks on four datasets.
VQCT demonstrates higher codebook utilization compared to baselines, indicating alleviation of codebook collapse.
VQCT shows promising results on downstream semantic image synthesis tasks. |
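The quantization step and the codebook utilization measure mentioned above can be sketched as follows. This is a generic nearest-neighbour vector quantizer, assuming squared Euclidean distance and a utilization defined as the fraction of codes selected at least once; it does not include the graph-convolution transfer network, and the function names are hypothetical.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest code vector."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_utilization(indices, codebook_size):
    """Fraction of codes that were selected at least once."""
    return len(set(indices)) / codebook_size
```

A utilization well below 1.0 over a large sample of encoder outputs is the symptom usually described as codebook collapse: most code vectors are never chosen and therefore never updated.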
VQCT's performance improvement depends on the quality and relevance of the chosen pretrained language model.
Further exploration of better strategies for transferring codebook knowledge from language to vision domain is needed. |
vqim, codebook transfer, pretrained language models, image synthesis, codebook collapse |
2403.10050
Report |
Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing |
Tian-Xing Xu, Wenbo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang |
3D Gaussian splatting, emerging as a groundbreaking approach, has drawn
increasing attention for its capabilities of high-fidelity reconstruction and
real-time rendering. However, it couples the appearance and geometry of the
scene within the Gaussian attributes, which hinders the flexibility of editing
operations, such as texture swapping. To address this issue, we propose a novel
approach, namely Texture-GS, to disentangle the appearance from the geometry by
representing it as a 2D texture mapped onto the 3D surface, thereby
facilitating appearance editing. Technically, the disentanglement is achieved
by our proposed texture mapping module, which consists of a UV mapping MLP to
learn the UV coordinates for the 3D Gaussian centers, a local Taylor expansion
of the MLP to efficiently approximate the UV coordinates for the ray-Gaussian
intersections, and a learnable texture to capture the fine-grained appearance.
Extensive experiments on the DTU dataset demonstrate that our method not only
facilitates high-fidelity appearance editing but also achieves real-time
rendering on consumer-level devices, e.g. a single RTX 2080 Ti GPU. |
Texture-GS disentangles geometry and texture for 3D Gaussian Splatting, enabling real-time appearance editing like texture swapping. |
3D Gaussian Splatting, despite its efficiency, entangles appearance and geometry, hindering flexible editing. Texture-GS overcomes this limitation. |
It uses a UV mapping MLP to learn UV coordinates for the 3D Gaussian centers, a local Taylor expansion of the MLP to efficiently approximate UV coordinates at ray-Gaussian intersections, and a learnable 2D texture to represent appearance. |
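The idea of replacing per-intersection MLP evaluations with a local Taylor expansion around each Gaussian center can be sketched as below. This is an illustration under stated assumptions: `uv_map` is a hypothetical smooth stand-in for the trained UV mapping MLP, the expansion is first-order, and the Jacobian is estimated with central finite differences rather than whatever the paper uses.

```python
import math


def uv_map(p):
    """Hypothetical stand-in for a trained UV mapping MLP (3D point -> UV)."""
    x, y, z = p
    return (math.tanh(0.5 * x + 0.2 * y - 0.1 * z),
            math.tanh(-0.3 * x + 0.4 * y + 0.6 * z))


def jacobian(f, p, h=1e-5):
    """Central finite-difference Jacobian; jac[i][o] = d f_o / d p_i."""
    jac = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        jac.append([(a - b) / (2 * h) for a, b in zip(f(hi), f(lo))])
    return jac


def taylor_uv(f, center, jac, point):
    """First-order approximation: f(point) ~ f(center) + J (point - center)."""
    base = f(center)
    delta = [q - c for q, c in zip(point, center)]
    return tuple(
        base[o] + sum(jac[i][o] * delta[i] for i in range(len(delta)))
        for o in range(len(base))
    )
```

The Jacobian is computed once per Gaussian center and reused for every ray that intersects that Gaussian, so each intersection costs only a small matrix-vector product instead of a full network evaluation.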
Reconstructs smooth, high-quality 2D texture maps from multi-view images.
Enables global texture swapping and fine-grained texture editing.
Achieves real-time rendering speed (58 FPS on RTX 2080 Ti) for interactive editing. |
Blurring at edges due to inaccurate Gaussian orientations impacting UV mapping.
Single UV space limits representation for scenes with multiple objects or complex geometries. |
3d gaussian splatting, texture mapping, neural rendering, appearance editing, real-time rendering |
2403.10004
Report |
ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images |
Xiangtian Xue, Jiasong Wu, Youyong Kong, Lotfi Senhadji, Huazhong Shu |
We present a novel image editing scenario termed Text-grounded Object
Generation (TOG), defined as generating a new object in the real image
spatially conditioned by textual descriptions. Existing diffusion models
exhibit limitations of spatial perception in complex real-world scenes, relying
on additional modalities to enforce constraints, and TOG imposes heightened
challenges on scene comprehension under the weak supervision of linguistic
information. We propose a universal framework ST-LDM based on Swin-Transformer,
which can be integrated into any latent diffusion model with training-free
backward guidance. ST-LDM encompasses a global-perceptual autoencoder with
adaptable compression scales and hierarchical visual features, parallel with
deformable multimodal transformer to generate region-wise guidance for the
subsequent denoising process. We transcend the limitation of traditional
attention mechanisms that only focus on existing visual features by introducing
deformable feature alignment to hierarchically refine spatial positioning fused
with multi-scale visual and linguistic information. Extensive experiments
demonstrate that our model enhances the localization of attention mechanisms
while preserving the generative capabilities inherent to diffusion models. |
This paper introduces Text-grounded Object Generation (TOG), a novel image editing task focused on generating new objects in real images based on textual descriptions of visual and spatial attributes, and proposes ST-LDM, a universal framework to address this task. |
Existing diffusion models struggle with spatial understanding in complex scenes and rely on additional modalities for spatial control. TOG addresses this by leveraging the flexibility and naturalness of language for object placement in images. |
ST-LDM uses a Swin-Transformer-based autoencoder for adaptable latent representation and a parallel multimodal transformer to generate spatial guidance. It introduces deformable feature alignment to refine object placement using multi-scale visual and linguistic features and integrates with LDMs via training-free backward guidance. |
ST-LDM demonstrates superior performance compared to existing text-guided editing models, particularly in complex scenes.
Deformable feature alignment is shown to significantly improve object localization accuracy while preserving the generative capabilities of diffusion models.
Quantitative and qualitative evaluations on a newly constructed benchmark dataset showcase the effectiveness and robustness of the proposed approach. |
Current implementation requires separate input of appearance and spatial descriptions, which limits its practical application in real-world scenarios where integrated statements are common.
The editing process can sometimes lead to slight changes in irrelevant regions near the generated object, highlighting the need for further exploration of methods to maintain pixel-level fidelity. |
image editing, text-guided generation, deformable feature alignment, latent diffusion models, swin-transformer |
2403.09981
Report |
Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting |
Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu |
While text-to-3D and image-to-3D generation tasks have received considerable
attention, one important but under-explored field between them is controllable
text-to-3D generation, which we mainly focus on in this work. To address this
task, 1) we introduce Multi-view ControlNet (MVControl), a novel neural network
architecture designed to enhance existing pre-trained multi-view diffusion
models by integrating additional input conditions, such as edge, depth, normal,
and scribble maps. Our innovation lies in the introduction of a conditioning
module that controls the base diffusion model using both local and global
embeddings, which are computed from the input condition images and camera
poses. Once trained, MVControl is able to offer 3D diffusion guidance for
optimization-based 3D generation. And, 2) we propose an efficient multi-stage
3D generation pipeline that leverages the benefits of recent large
reconstruction models and score distillation algorithm. Building upon our
MVControl architecture, we employ a unique hybrid diffusion guidance method to
direct the optimization process. In pursuit of efficiency, we adopt 3D
Gaussians as our representation instead of the commonly used implicit
representations. We also pioneer the use of SuGaR, a hybrid representation that
binds Gaussians to mesh triangle faces. This approach alleviates the issue of
poor geometry in 3D Gaussians and enables the direct sculpting of fine-grained
geometry on the mesh. Extensive experiments demonstrate that our method
achieves robust generalization and enables the controllable generation of
high-quality 3D content. |
This paper introduces MVControl, a novel neural network architecture for controllable text-to-3D generation, and proposes an efficient multi-stage pipeline for generating high-quality textured 3D meshes from 3D Gaussians. |
Controllable text-to-3D generation is an important but under-explored area, and existing methods are either time-consuming or struggle to produce high-quality results. This work addresses these limitations. |
MVControl, a multi-view variant of ControlNet, is trained on a large 3D dataset to enable controllable text-to-multi-view image generation. These images are then used to initialize a set of coarse 3D Gaussians, which are further optimized using a hybrid diffusion guidance approach and SuGaR regularization. Finally, a textured mesh is extracted and refined. |
MVControl effectively controls multi-view image generation, enabling fine-grained control over content and achieving view consistency.
The proposed 3D generation pipeline outperforms existing Gaussian-based mesh generation approaches, producing high-fidelity and detailed textured meshes.
The hybrid diffusion guidance approach combining MVControl and a 2D diffusion model effectively optimizes the geometry and texture of the generated 3D assets. |
The current implementation requires a separate 2D diffusion model for texture refinement.
Further exploration of different 3D Gaussian initialization strategies could improve efficiency. |
controllable 3d generation, gaussian splatting, sugar, multi-view diffusion models, score distillation sampling |
2403.09977
Report |
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba |
Xiaohuan Pei, Tao Huang, Chang Xu |
Prior efforts in light-weight model development mainly centered on CNN and
Transformer-based designs yet faced persistent challenges. CNNs adept at local
feature extraction compromise resolution while Transformers offer global reach
but escalate computational demands to $\mathcal{O}(N^2)$. This ongoing trade-off
between accuracy and efficiency remains a significant hurdle. Recently, state
space models (SSMs), such as Mamba, have shown outstanding performance and
competitiveness in various tasks such as language modeling and computer vision,
while reducing the time complexity of global information extraction to
$\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential
of visual state space models in light-weight model design and introduces a novel
efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba
integrates an atrous-based selective scan approach by efficient skip sampling,
constituting building blocks designed to harness both global and local
representational features. Additionally, we investigate the integration between
SSM blocks and convolutions, and introduce an efficient visual state space
block combined with an additional convolution branch, which further elevates the
model performance. Experimental results show that EfficientVMamba scales down
the computational complexity while yielding competitive results across a variety
of vision tasks. For example, our EfficientVMamba-S with $1.3$G FLOPs improves
Vim-Ti with $1.5$G FLOPs by a large margin of $5.6\%$ accuracy on ImageNet.
Code is available at: \url{https://github.com/TerryPei/EfficientVMamba}. |
This paper introduces EfficientVMamba, a lightweight state-space model for vision tasks that efficiently balances global and local feature extraction by combining an atrous-based selective scan mechanism with convolutional branches. |
Existing lightweight models, based on CNNs or Transformers, struggle to achieve both global representation and computational efficiency. EfficientVMamba addresses this by utilizing the linear complexity of state space models for global context while integrating convolutions for local features. |
The authors propose Efficient 2D Scanning (ES2D) using skip sampling for efficient global representation. They introduce an Efficient Visual State Space (EVSS) block merging ES2D with a convolutional branch enhanced by Squeeze-and-Excitation. An 'inverted' insertion strategy prioritizes EVSS in early stages and convolutions in later stages. |
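The skip-sampling idea behind ES2D can be illustrated with a small sketch. It assumes a stride of 2 and a feature map stored as a plain list of rows; the function names `skip_sample` and `merge` are hypothetical, and the selective scan that would run on each sub-grid is omitted. It only shows how strided sampling splits the map into shorter sequences and how they are scattered back.

```python
def skip_sample(grid, stride=2):
    """Split a 2D grid (list of rows) into stride*stride sub-grids by strided sampling."""
    return {
        (m, n): [row[n::stride] for row in grid[m::stride]]
        for m in range(stride)
        for n in range(stride)
    }


def merge(subgrids, height, width, stride=2):
    """Inverse of skip_sample: scatter each sub-grid back to its strided positions."""
    out = [[None] * width for _ in range(height)]
    for (m, n), sub in subgrids.items():
        for i, row in enumerate(sub):
            for j, value in enumerate(row):
                out[m + i * stride][n + j * stride] = value
    return out


# Toy 4x4 "feature map" whose entries are just their flat indices.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
parts = skip_sample(grid)          # four 2x2 sub-grids, each holding 1/4 of the tokens
restored = merge(parts, 4, 4)      # round trip back to the original layout
```

Each sub-grid still spans the whole image at a coarser spacing, so a scan over it keeps global reach while processing a sequence four times shorter.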
EfficientVMamba achieves state-of-the-art accuracy with reduced FLOPs on ImageNet classification compared to CNN-based and Transformer-based counterparts.
It shows superior performance on COCO object detection using RetinaNet, exceeding models with larger parameter counts.
EfficientVMamba demonstrates competitive results on ADE20K semantic segmentation with UperNet, highlighting its efficient and accurate segmentation capability. |
The computational design of SSMs is more complex than convolutions or self-attention, posing challenges for parallel processing.
Future work can explore further optimization of computational efficiency and scalability for visual state space models. |
light-weight architecture, efficient network, state space model, atrous selective scan, vision transformer |
2403.09939
Report |
Quantization Effects on Neural Networks Perception: How would quantization change the perceptual field of vision models? |
Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Alessandro Bruno |
Neural network quantization is an essential technique for deploying models on
resource-constrained devices. However, its impact on model perceptual fields,
particularly regarding class activation maps (CAMs), remains a significant area
of investigation. In this study, we explore how quantization alters the spatial
recognition ability of the perceptual field of vision models, shedding light on
the alignment between CAMs and visual saliency maps across various
architectures. Leveraging a dataset of 10,000 images from ImageNet, we
rigorously evaluate six diverse foundational CNNs: VGG16, ResNet50,
EfficientNet, MobileNet, SqueezeNet, and DenseNet. We uncover nuanced changes
in CAMs and their alignment with human visual saliency maps through systematic
quantization techniques applied to these models. Our findings reveal the
varying sensitivities of different architectures to quantization and underscore
its implications for real-world applications in terms of model performance and
interpretability. The primary contribution of this work revolves around
deepening our understanding of neural network quantization, providing insights
crucial for deploying efficient and interpretable models in practical settings. |
This paper investigates the impact of quantization on the perceptual fields of neural network vision models by analyzing how quantization affects Class Activation Maps (CAMs) and their alignment with human visual saliency maps. |
Quantization is essential for deploying models on resource-constrained devices, but its impact on model interpretability, particularly regarding CAMs, needs to be understood. |
The study uses a dataset of 10,000 ImageNet images and six foundational CNN architectures (VGG16, ResNet50, EfficientNet, MobileNet, SqueezeNet, DenseNet). The authors apply quantization techniques, generate CAMs and visual saliency maps, and compare them using metrics like Similarity, Kullback-Leibler Divergence, and Pearson Correlation. |
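The three comparison metrics named above can be sketched as follows. This is a minimal illustration on flattened maps (lists of non-negative floats), using the definitions common in saliency benchmarking. The `eps` smoothing and the choice of the second argument as the reference map in the KL term are assumptions, not details taken from the paper.

```python
import math


def normalize(values):
    """Scale a non-negative map so that its entries sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def similarity(p, q):
    """Histogram intersection of two maps after normalizing each to sum to 1."""
    return sum(min(a, b) for a, b in zip(normalize(p), normalize(q)))


def kl_divergence(p, q, eps=1e-12):
    """KL divergence of the reference map q from the predicted map p."""
    p, q = normalize(p), normalize(q)
    return sum(b * math.log(eps + b / (a + eps)) for a, b in zip(p, q))


def pearson(p, q):
    """Pearson correlation coefficient; undefined for a constant map."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (norm_p * norm_q)


cam = [0.1, 0.4, 0.3, 0.2]        # toy flattened class activation map
saliency = [0.2, 0.3, 0.3, 0.2]   # toy flattened human saliency map
scores = (similarity(cam, saliency), kl_divergence(cam, saliency), pearson(cam, saliency))
```

Higher Similarity and Pearson values and a lower KL value indicate closer agreement between the two maps.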
Quantization with int16 precision often yields a better balance between model efficiency and alignment with human perception compared to f32 and int8.
MobileNet and SqueezeNet demonstrate high robustness to quantization, maintaining consistent CAM alignment with visual saliency.
EfficientNet shows higher sensitivity to quantization, exhibiting more significant changes in CAMs and reduced alignment with human perception. |
The study primarily focuses on image classification tasks and a limited set of architectures.
Future work can explore the impact of quantization on other vision tasks and more complex architectures. |
neural network quantization, class activation maps, model interpretability, visual saliency, computer vision |
2403.09746
Report |
PICNIQ: Pairwise Comparisons for Natural Image Quality Assessment |
Nicolas Chahine, Sira Ferradans, Jean Ponce |
Blind image quality assessment (BIQA) approaches, while promising for
automating image quality evaluation, often fall short in real-world scenarios
due to their reliance on a generic quality standard applied uniformly across
diverse images. This one-size-fits-all approach overlooks the crucial
perceptual relationship between image content and quality, leading to a 'domain
shift' challenge where a single quality metric inadequately represents various
content types. Furthermore, BIQA techniques typically overlook the inherent
differences in the human visual system among different observers. In response
to these challenges, this paper introduces PICNIQ, an innovative pairwise
comparison framework designed to bypass the limitations of conventional BIQA by
emphasizing relative, rather than absolute, quality assessment. PICNIQ is
specifically designed to assess the quality differences between image pairs.
The proposed framework implements a carefully crafted deep learning
architecture, a specialized loss function, and a training strategy optimized
for sparse comparison settings. By employing psychometric scaling algorithms
like TrueSkill, PICNIQ transforms pairwise comparisons into
just-objectionable-difference (JOD) quality scores, offering a granular and
interpretable measure of image quality. We conduct our research using
comparison matrices from the PIQ23 dataset, which are published in this paper.
Our extensive experimental analysis showcases PICNIQ's broad applicability and
superior performance over existing models, highlighting its potential to set
new standards in the field of BIQA. |
This appendix presents supplementary information to the main paper "PICNIQ: Pairwise Comparisons for Natural Image Quality Assessment," showing examples of PICNIQ's preference predictions for image quality, with a focus on comparisons to the PIQ23 dataset. |
The appendix provides visual evidence and analysis to support the claims made in the main paper about PICNIQ's performance in predicting human image quality preferences. |
The appendix uses visual examples of image pairs, comparison matrices for different scenes and attributes, and probability distribution plots for PIQ23 dataset to illustrate PICNIQ's prediction capabilities. |
PICNIQ demonstrates more logical and precise image quality comparisons than previous methods, even for challenging cases.
The comparison matrices highlight PICNIQ's ability to differentiate between different scenes and attributes.
The PIQ23 dataset shows imbalances in its distribution, with a bias towards forced-choice pairs (0s and 1s). |
The appendix relies heavily on visual examples, which may be subjective and open to interpretation.
Further investigation is needed to address the distribution imbalances in the PIQ23 dataset. |
image quality assessment, pairwise comparisons, picniq, piq23, preference prediction |
2403.09669
Report |
STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models |
Pum Jun Kim, Seojun Kim, Jaejun Yoo |
Image generative models have made significant progress in generating
realistic and diverse images, supported by comprehensive guidance from various
evaluation metrics. However, current video generative models struggle to
generate even short video clips, with limited tools that provide insights for
improvements. Current video evaluation metrics are simple adaptations of image
metrics by switching the embeddings with video embedding networks, which may
underestimate the unique characteristics of video. Our analysis reveals that
the widely used Frechet Video Distance (FVD) has a stronger emphasis on the
spatial aspect than the temporal naturalness of video and is inherently
constrained by the input size of the embedding networks used, limiting it to 16
frames. Additionally, it demonstrates considerable instability and diverges
from human evaluations. To address the limitations, we propose STREAM, a new
video evaluation metric uniquely designed to independently evaluate spatial and
temporal aspects. This feature allows comprehensive analysis and evaluation of
video generative models from various perspectives, unconstrained by video
length. We provide analytical and experimental evidence demonstrating that
STREAM provides an effective evaluation tool for both visual and temporal
quality of videos, offering insights into areas of improvement for video
generative models. To the best of our knowledge, STREAM is the first evaluation
metric that can separately assess the temporal and spatial aspects of videos.
Our code is available at https://github.com/pro2nit/STREAM. |
This paper proposes STREAM, a novel video evaluation metric designed to independently assess the spatial and temporal aspects of videos generated by generative models. |
Existing video evaluation metrics, often adapted from image metrics, fail to adequately capture the unique characteristics of video data, particularly temporal consistency. This limits the development and analysis of increasingly sophisticated video generative models. |
STREAM leverages an image embedding network to encode individual video frames, enabling separate analysis of spatial and temporal aspects. STREAM-T evaluates temporal flow by analyzing the skewness of the power law distribution of frequency amplitudes over time. STREAM-S evaluates spatial quality through STREAM-F (fidelity) and STREAM-D (diversity) by adapting precision and recall calculations to video data. |
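For readers unfamiliar with the precision and recall calculations that STREAM-F and STREAM-D adapt, the sketch below shows a generic k-nearest-neighbour version on toy 2D embeddings. It is not STREAM's exact formulation: the helper names, the choice of `k`, and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def knn_radius(point, others, k):
    """Distance from point to its k-th nearest neighbour among the other points."""
    dists = sorted(math.dist(point, o) for o in others if o is not point)
    return dists[k - 1]


def coverage(queries, support, k=1):
    """Fraction of queries that fall inside the k-NN ball of at least one support point."""
    radii = [knn_radius(s, support, k) for s in support]
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy embeddings of real samples
fake = [(0.1, 0.1), (5.0, 5.0)]               # toy embeddings of generated samples

precision = coverage(fake, real)   # fidelity: generated samples near the real manifold
recall = coverage(real, fake)      # diversity: real samples covered by generated ones
```

Here one of the two generated points lies far from every real point, so precision is 0.5, while both generated balls are wide enough to cover all real points.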
STREAM effectively captures visual and temporal degradation in both synthetic and real-world video data, showing consistent and interpretable results across various experiments.
Analysis of popular video generative models using STREAM reveals challenges in generating realistic and diverse videos, especially as video length increases.
Unlike FVD, which is limited by its embedding network, STREAM can evaluate videos of arbitrary length, supporting the development of long-form video generation models. |
While STREAM-T effectively evaluates temporal flow, it remains agnostic to the direction of time, potentially leading to limitations in specific scenarios.
Future work could explore incorporating a human judgment study to further validate and calibrate STREAM, particularly in terms of its ability to quantify video diversity. |
video generation, evaluation metric, generative models, computer vision, deep learning |
2403.09638
Report |
SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior |
Huan-ang Gao, Mingju Gao, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao |
Semantic image synthesis (SIS) shows great promise for sensor simulation.
However, current best practices in this field, based on GANs, have not yet
reached the desired level of quality. As latent diffusion models make
significant strides in image generation, we are prompted to evaluate
ControlNet, a notable method for its dense control capabilities. Our
investigation uncovered two primary issues with its results: the presence of
weird sub-structures within large semantic areas and the misalignment of
content with the semantic mask. Through empirical study, we pinpointed the
cause of these problems as a mismatch between the noised training data
distribution and the standard normal prior applied at the inference stage. To
address this challenge, we developed specific noise priors for SIS,
encompassing spatial, categorical, and a novel spatial-categorical joint prior
for inference. This approach, which we have named SCP-Diff, has yielded
exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on
ADE20K. The code and models can be accessed via the project page. |
This paper introduces SCP-Diff, a novel approach for photo-realistic semantic image synthesis using spatial-categorical joint priors with diffusion models. |
Current GAN-based semantic image synthesis methods struggle to achieve photorealism, and diffusion-based methods like ControlNet face challenges with sub-par image quality and misalignment with semantic masks. |
The authors propose pre-computed noise priors (spatial, categorical, and joint) derived from real image latents to guide the inference process of a finetuned ControlNet model, tackling the distribution mismatch between training and inference. |
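A categorical prior of this kind can be sketched in a few lines. The example assumes scalar latents and flattened label maps, which is a simplification: real latents are multi-channel, and the paper derives its statistics from real image latents in a way not detailed here. The names `class_stats` and `sample_prior`, and the standard-normal fallback for unseen classes, are hypothetical.

```python
import math
import random
from collections import defaultdict


def class_stats(latents, masks):
    """Per-class mean and standard deviation of latent values across a dataset."""
    buckets = defaultdict(list)
    for latent, mask in zip(latents, masks):
        for value, label in zip(latent, mask):
            buckets[label].append(value)
    stats = {}
    for label, values in buckets.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[label] = (mean, math.sqrt(var))
    return stats


def sample_prior(mask, stats, rng):
    """Draw initial noise per position from its class prior instead of N(0, 1)."""
    # Assumption: classes never seen in the dataset fall back to the standard normal.
    return [rng.gauss(*stats.get(label, (0.0, 1.0))) for label in mask]


latents = [[1.0, 3.0, 10.0], [2.0, 2.0, 12.0]]   # toy flattened latents
masks = [[0, 0, 1], [0, 0, 1]]                   # class id at each position
stats = class_stats(latents, masks)
noise = sample_prior([0, 1, 7], stats, random.Random(0))
```

A spatial prior would bucket values by position rather than by class, and a joint prior would bucket by both.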
SCP-Diff achieves state-of-the-art FID scores on Cityscapes (10.53) and ADE20K (12.66) datasets, significantly improving upon previous methods.
The joint prior effectively combines the strengths of spatial and categorical priors, resulting in better scene layout and adherence to semantic masks.
Quantitative analysis demonstrates that while improving quality, the introduction of priors has a minimal impact on the diversity of generated images. |
The performance on the COCO-Stuff dataset is on par with leading methods but not significantly better, potentially due to the dataset's diverse spatial resolutions.
Future research could explore incorporating correlations between spatial tokens and classes in the joint prior for potential further improvements. |
semantic image synthesis, diffusion models, noise priors, controlnet, photo-realistic image generation |
2403.09632
Report |
Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image |
Yiqun Mei, Yu Zeng, He Zhang, Zhixin Shu, Xuaner Zhang, Sai Bi, Jianming Zhang, HyunJoon Jung, Vishal M. Patel |
At the core of portrait photography is the search for ideal lighting and
viewpoint. The process often requires advanced knowledge in photography and an
elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric
relighting method that is capable of synthesizing novel viewpoints and novel
lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN
(EG3D) to reconstruct geometry and appearance from an input portrait as a set
of 3D-aware features. We design a relighting module conditioned on a given
lighting to process these features, and predict a relit 3D representation in
the form of a tri-plane, which can render to an arbitrary viewpoint through
volume rendering. Besides viewpoint and lighting control, Holo-Relighting also
takes the head pose as a condition to enable head-pose-dependent lighting
effects. With these novel designs, Holo-Relighting can generate complex
non-Lambertian lighting effects (e.g., specular highlights and cast shadows)
without using any explicit physical lighting priors. We train Holo-Relighting
with data captured with a light stage, and propose two data-rendering
techniques to improve the data quality for training the volumetric relighting
system. Through quantitative and qualitative experiments, we demonstrate
Holo-Relighting can achieve state-of-the-art relighting quality with better
photorealism, 3D consistency and controllability. |
This paper presents Holo-Relighting, a novel volumetric relighting method for headshot portraits that allows for controlling lighting, viewpoint, and head pose from a single image. |
Existing portrait relighting methods often lack view consistency or rely on simplified lighting models, limiting their expressiveness and realism. Holo-Relighting addresses these limitations, enabling more realistic and controllable portrait editing. |
The method leverages a pre-trained 3D GAN (EG3D) to extract 3D information from the input. It then employs a relighting network conditioned on the target lighting, head pose, and camera pose to generate a 3D representation (tri-plane features) with embedded illumination. Finally, it synthesizes novel view images through volume rendering. |
Holo-Relighting achieves state-of-the-art results on both free-view and 2D portrait relighting tasks, outperforming existing methods in perceptual quality, fidelity, and identity preservation.
The method demonstrates strong controllability, allowing for realistic manipulation of lighting direction and intensity, head pose, and viewpoint, as well as achieving effects like shadow diffusion.
The authors introduce novel data rendering techniques, including multi-view GAN inversion and portrait shading transfer, which improve the accuracy of 3D geometry encoding and contribute to the high quality of the relighting results. |
The current method is trained on headshot portraits and may not generalize well to full-body images.
Future work can explore incorporating dynamic details like hair movement and facial expressions to enhance realism further. |
volumetric relighting, portrait editing, 3d gan, gan inversion, view synthesis |
2403.09626
Report |
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding |
Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang |
Understanding videos is one of the fundamental directions in computer vision
research, with extensive efforts dedicated to exploring various architectures
such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state
space model, e.g., Mamba, shows promising traits to extend its success in long
sequence modeling to video modeling. To assess whether Mamba can be a viable
alternative to Transformers in the video understanding domain, in this work, we
conduct a comprehensive set of studies, probing different roles Mamba can play
in modeling videos, while investigating diverse tasks where Mamba could exhibit
superiority. We categorize Mamba into four roles for modeling videos, deriving
a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12
video understanding tasks. Our extensive experiments reveal the strong
potential of Mamba on both video-only and video-language tasks while showing
promising efficiency-performance trade-offs. We hope this work could provide
valuable data points and insights for future research on video understanding.
Code is public: https://github.com/OpenGVLab/video-mamba-suite. |
This paper presents a comprehensive study exploring the potential of State Space Models (SSMs), particularly Mamba, as a viable alternative to Transformers for video understanding tasks. |
SSMs, especially Mamba, offer linear scaling with sequence length, making them potentially more efficient for video modeling compared to Transformers. However, their effectiveness in video understanding remains largely unexplored. |
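The linear scaling follows from the recurrent form of a discrete state space model, which can be sketched with a scalar, time-invariant toy. Mamba itself uses vector-valued states and input-dependent parameters, so the fixed constants `a`, `b`, and `c` below are simplifying assumptions.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t in one pass over the sequence."""
    state, outputs = 0.0, []
    for x in inputs:
        state = a * state + b * x   # constant work per step, so O(N) overall
        outputs.append(c * state)
    return outputs


# An impulse at t=0 decays geometrically through the hidden state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5)
```

Each step touches only the previous state, whereas self-attention compares every frame token with every other one, which is where the quadratic cost comes from.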
The authors introduce "Video Mamba Suite," comprising 14 SSM models/modules, and evaluate their performance on 12 video understanding tasks across 13 datasets. They explore four distinct roles of Mamba in video modeling: temporal models, temporal modules, multi-modal interaction models, and space-time sequence models. |
Mamba-based models demonstrate competitive or superior performance compared to Transformer counterparts across various video understanding tasks, including temporal action localization, temporal action segmentation, dense video captioning, action anticipation, and video temporal grounding.
Mamba exhibits strong capabilities in modeling long video sequences, evidenced by its superior performance in long-form video question answering.
Mamba models offer computational efficiency advantages over Transformers, particularly when processing videos with a large number of frames. |
The study primarily focuses on replacing Transformer blocks with Mamba blocks, leaving the exploration of SSM-based module designs for video understanding as future work.
Further investigation is needed to optimize the integration of SSMs, especially in multi-modal settings, where hyperparameter tuning can impact performance. |
video understanding, state space model, mamba, video modeling, temporal action localization |
2403.09625
Report |
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation |
Fangfu Liu, Hanyang Wang, Weiliang Chen, Haowen Sun, Yueqi Duan |
Recent years have witnessed the strong power of 3D generation models, which
offer a new level of creative flexibility by allowing users to guide the 3D
content generation process through a single image or natural language. However,
it remains challenging for existing 3D generation methods to create
subject-driven 3D content across diverse prompts. In this paper, we introduce a
novel 3D customization method, dubbed Make-Your-3D, that can personalize
high-fidelity and consistent 3D content from only a single image of a subject
with text description within 5 minutes. Our key insight is to harmonize the
distributions of a multi-view diffusion model and an identity-specific 2D
generative model, aligning them with the distribution of the desired 3D
subject. Specifically, we design a co-evolution framework to reduce the
variance of distributions, where each model undergoes a process of learning
from the other through identity-aware optimization and subject-prior
optimization, respectively. Extensive experiments demonstrate that our method
can produce high-quality, consistent, and subject-specific 3D content with
text-driven modifications that are unseen in the subject image. |
Presents Make-Your-3D, a novel co-evolution framework for fast and consistent subject-driven 3D content generation from a single image. |
Addresses the limitations of existing 3D generation methods in creating subject-specific content with text-driven modifications, enabling diverse and personalized 3D asset creation. |
Harmonizes the distributions of a 2D personalized model and a multi-view diffusion model with the target subject's distribution through identity-aware and subject-prior optimization. |
Generates high-fidelity 3D content with strong subject identity preservation and text-driven modifications.
Achieves significantly faster generation speed (5 minutes) compared to previous methods (3 hours).
Demonstrates robustness in open-vocabulary settings and surpasses baselines in qualitative and quantitative evaluations, including user studies. |
Current quality is limited by the backbone model (Stable Diffusion v1.5), which can be improved by using larger diffusion models like SDXL.
Future work will explore 3D scene-level personalization. |
3d generation, personalization, co-evolution, diffusion models, one-shot learning |
2403.09623
Report |
Score-Guided Diffusion for 3D Human Recovery |
Anastasis Stathopoulos, Ligong Han, Dimitris Metaxas |
We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for
solving inverse problems for 3D human pose and shape reconstruction. These
inverse problems involve fitting a human body model to image observations,
traditionally solved through optimization techniques. ScoreHMR mimics model
fitting approaches, but alignment with the image observation is achieved
through score guidance in the latent space of a diffusion model. The diffusion
model is trained to capture the conditional distribution of the human model
parameters given an input image. By guiding its denoising process with a
task-specific score, ScoreHMR effectively solves inverse problems for various
applications without the need for retraining the task-agnostic diffusion model.
We evaluate our approach on three settings/applications. These are: (i)
single-frame model fitting; (ii) reconstruction from multiple uncalibrated
views; (iii) reconstructing humans in video sequences. ScoreHMR consistently
outperforms all optimization baselines on popular benchmarks across all
settings. We make our code and models available at
https://statho.github.io/ScoreHMR. |
This paper presents ScoreHMR, a method that uses diffusion models and score guidance to refine 3D human pose estimations from images and videos. |
Current methods for 3D human pose estimation, based on either regression or optimization, struggle to achieve both accuracy and image-model alignment. This work leverages the power of diffusion models to learn priors over human poses and use score guidance for more accurate and robust refinement. |
ScoreHMR utilizes a diffusion model trained on a dataset of human poses conditioned on images. Given an initial pose estimate from a regression network, it iteratively refines the pose in the latent space of the diffusion model using score guidance derived from image observations like 2D keypoints, multi-view consistency, or temporal smoothness. |
ScoreHMR outperforms existing optimization-based methods for fitting a 3D human body model to 2D keypoint detections on 3DPW and EMDB datasets.
It effectively refines multi-view predictions by enforcing cross-view consistency, achieving superior results compared to single-view reconstruction and optimization-based methods on Human3.6M and Mannequin Challenge datasets.
ScoreHMR significantly improves the temporal consistency of human motion in video sequences, leading to lower acceleration errors and smoother reconstructions on 3DPW and EMDB datasets. |
The reliance on pseudo-ground-truth pose annotations for training the diffusion model might limit the performance, especially for unusual poses not well represented in the training data.
The current implementation primarily focuses on refining the pose parameters of the SMPL model, and future work could explore extending ScoreHMR to jointly model and refine both pose and shape parameters. |
3d human pose estimation, diffusion models, score guidance, human mesh recovery, multi-view refinement |
2403.09622
Report |
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering |
Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan |
Visual text rendering poses a fundamental challenge for contemporary
text-to-image generation models, with the core problem lying in text encoder
deficiencies. To achieve accurate text rendering, we identify two crucial
requirements for text encoders: character awareness and alignment with glyphs.
Our solution involves crafting a series of customized text encoders, Glyph-ByT5,
by fine-tuning the character-aware ByT5 encoder using a meticulously curated
paired glyph-text dataset. We present an effective method for integrating
Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for
design image generation. This significantly enhances text rendering accuracy,
improving it from less than $20\%$ to nearly $90\%$ on our design image
benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph
rendering, achieving high spelling accuracy for tens to hundreds of characters
with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with
a small set of high-quality, photorealistic images featuring visual text, we
showcase a substantial improvement in scene text rendering capabilities in
open-domain real images. These compelling outcomes aim to encourage further
exploration in designing customized text encoders for diverse and challenging
tasks. |
This paper introduces Glyph-ByT5, a customized text encoder designed for generating accurate visual text in diffusion models, leading to the development of Glyph-SDXL for text-rich design images and scene text rendering. |
Accurate text rendering is crucial for various image generation applications, ranging from design materials to real-world scenes, and existing models often struggle with this task. |
The authors create a scalable glyph-text dataset using graphic rendering, pre-train ByT5 on this dataset for glyph-text alignment, and integrate it into SDXL with a region-wise cross-attention mechanism. |
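The character awareness of ByT5 comes from tokenizing text as raw UTF-8 bytes rather than subwords, which the sketch below imitates without any library. The id offset of 3 and the end-of-sequence id of 1 are assumptions based on the convention of the Hugging Face ByT5 tokenizer, and `byte_tokenize` is a hypothetical helper, not part of any package.

```python
def byte_tokenize(text, offset=3, eos_id=1):
    """Map each UTF-8 byte to one token id, then append an end-of-sequence id."""
    # Assumption: ids 0..2 are reserved for special tokens, hence the offset of 3.
    return [byte + offset for byte in text.encode("utf-8")] + [eos_id]


ascii_ids = byte_tokenize("hello")    # one token per letter
accent_ids = byte_tokenize("é")       # a two-byte character becomes two tokens
```

Because every character maps to its own token or tokens, the encoder sees the exact spelling of a word, which a subword vocabulary can hide.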
Glyph-SDXL significantly outperforms commercial products and state-of-the-art models in design-text rendering accuracy.
The model achieves high spelling accuracy for paragraphs with automated multi-line layout.
Fine-tuning Glyph-SDXL on a hybrid design-to-scene dataset improves scene-text generation. |
The layout planning with GPT-4, while promising, still faces challenges in certain scenarios.
Future work includes expanding the dataset and exploring more advanced vision encoders. |
text rendering, diffusion models, text encoder, glyph-byt5, sdxl |
2403.09620
Report |
PosSAM: Panoptic Open-vocabulary Segment Anything |
Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli |
In this paper, we introduce an open-vocabulary panoptic segmentation model
that effectively unifies the strengths of the Segment Anything Model (SAM) with
the vision-language CLIP model in an end-to-end framework. While SAM excels in
generating spatially-aware masks, its decoder falls short in recognizing
object class information and tends to oversegment without additional guidance.
Existing approaches address this limitation by using multi-stage techniques and
employing separate models to generate class-aware prompts, such as bounding
boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model
which leverages SAM's spatially rich features to produce instance-aware masks
and harnesses CLIP's semantically discriminative features for effective
instance classification. Specifically, we address the limitations of SAM and
propose a novel Local Discriminative Pooling (LDP) module leveraging
class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary
classification. Furthermore, we introduce a Mask-Aware Selective Ensembling
(MASE) algorithm that adaptively enhances the quality of generated masks and
boosts the performance of open-vocabulary classification during inference for
each image. We conducted extensive experiments to demonstrate our method's
strong generalization properties across multiple datasets, achieving
state-of-the-art performance with substantial improvements over SOTA
open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and
ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art
methods by a large margin, 2.4 PQ and 4.6 PQ, respectively. Project Website:
https://vibashan.github.io/possam-web/. |
Introduces PosSAM, an open-vocabulary panoptic segmentation model unifying Segment Anything Model (SAM) with CLIP for end-to-end instance-aware mask generation and classification. |
Addresses limitations of SAM, which excels in class-agnostic masks but lacks instance and class awareness, hindering its use in open-vocabulary segmentation tasks. |
Leverages SAM's spatial features for mask generation, CLIP for semantic features, introduces a Local Discriminative Pooling (LDP) module for unbiased classification, and employs Mask-Aware Selective Ensembling (MASE) for robust inference. |
Achieves state-of-the-art performance on COCO to ADE20K and ADE20K to COCO zero-shot open-vocabulary panoptic segmentation, outperforming previous methods by a large margin.
Demonstrates strong generalization to unseen object categories, effectively segmenting novel objects with high accuracy.
Outperforms existing methods in open-vocabulary semantic segmentation tasks, highlighting its adaptability to diverse challenges. |
Reliance on CLIP backbone for semantic features limits potential for single, unified architecture.
Future work could explore integrating spatial and semantic awareness within a single backbone for improved efficiency and performance. |
open-vocabulary segmentation, panoptic segmentation, segment anything model (sam), clip, local discriminative pooling |
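The pool-then-classify idea in the PosSAM entry above can be illustrated with a minimal sketch. This is not the paper's LDP module: it only shows generic mask-weighted average pooling of per-pixel features followed by cosine-similarity matching against class text embeddings. The function names (`mask_pool`, `classify_mask`) and the toy vectors are hypothetical.

```python
import math

def mask_pool(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask.

    features: list of per-pixel feature vectors (lists of floats)
    mask:     list of 0/1 flags, one per pixel
    """
    dim = len(features[0])
    total = [0.0] * dim
    count = 0
    for vec, flag in zip(features, mask):
        if flag:
            count += 1
            for i in range(dim):
                total[i] += vec[i]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_mask(features, mask, text_embeddings):
    """Return the class name whose text embedding best matches the pooled mask feature."""
    pooled = mask_pool(features, mask)
    return max(text_embeddings, key=lambda name: cosine(pooled, text_embeddings[name]))

if __name__ == "__main__":
    # Toy 4-pixel image with 2-D features (assumed values, for illustration only).
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    names = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    print(classify_mask(feats, [1, 1, 0, 0], names))
    print(classify_mask(feats, [0, 0, 1, 1], names))
```

Because the class list is just a dictionary of text embeddings, swapping in a different vocabulary needs no retraining, which is the property open-vocabulary classification relies on.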
2403.09616
Report |
Explore In-Context Segmentation via Latent Diffusion Models |
Chaoyang Wang, Xiangtai Li, Henghui Ding, Lu Qi, Jiangning Zhang, Yunhai Tong, Chen Change Loy, Shuicheng Yan |
In-context segmentation has drawn more attention with the introduction of
vision foundation models. Most existing approaches adopt metric learning or
masked image modeling to build the correlation between visual prompts and input
image queries. In this work, we explore this problem from a new perspective,
using one representative generation model, the latent diffusion model (LDM). We
observe a task gap between generation and segmentation in diffusion models, but
LDM is still an effective minimalist for in-context segmentation. In
particular, we propose two meta-architectures and correspondingly design
several output alignment and optimization strategies. We have conducted
comprehensive ablation studies and empirically found that the segmentation
quality counts on output alignment and in-context instructions. Moreover, we
build a new and fair in-context segmentation benchmark that includes both image
and video datasets. Experiments validate the efficiency of our approach,
demonstrating comparable or even stronger results than previous specialist
models or visual foundation models. Our study shows that LDMs can also achieve
good enough results for challenging in-context segmentation tasks. |
This paper explores the potential of Latent Diffusion Models (LDMs) for in-context segmentation by proposing a minimalist LDM-based framework (Ref LDM-Seg) that uses visual prompts for guidance without relying on additional neural networks. |
This research is significant because it offers a novel perspective on in-context segmentation by leveraging the generative capabilities of LDMs, unlike traditional discriminative models or masked image modeling techniques. |
The authors propose two meta-architectures for Ref LDM-Seg, incorporating instruction extraction from visual prompts, output alignment strategies to bridge the gap between image and mask channels, and optimization methods in both pixel and latent spaces. |
LDMs, despite being designed for generation, can effectively perform in-context segmentation with promising results.
Visual prompts and output alignment are crucial for LDM-based segmentation, determining the success and quality of segmentation, respectively.
Ref LDM-Seg achieves comparable or even better performance than existing specialist models and generalist vision foundation models on a proposed in-context segmentation benchmark. |
The current work is limited by the scale of training data used, which could be addressed by scaling up training data and model parameters in the future.
Future research could explore advanced prompt encoder architectures and prompt engineering methods to further improve performance. |
in-context segmentation, latent diffusion model, visual prompt, few-shot learning, computer vision |
2403.09593
Report |
Renovating Names in Open-Vocabulary Segmentation Benchmarks |
Haiwen Huang, Songyou Peng, Dan Zhang, Andreas Geiger |
Names are essential to both human cognition and vision-language models.
Open-vocabulary models utilize class names as text prompts to generalize to
categories unseen during training. However, name qualities are often overlooked
and lack sufficient precision in existing datasets. In this paper, we address
this underexplored problem by presenting a framework for "renovating" names in
open-vocabulary segmentation benchmarks (RENOVATE). Through human study, we
demonstrate that the names generated by our model are more precise descriptions
of the visual segments and hence enhance the quality of existing datasets by
means of simple renaming. We further demonstrate that using our renovated names
enables training of stronger open-vocabulary segmentation models. Using
open-vocabulary segmentation for name quality evaluation, we show that our
renovated names lead to up to 16% relative improvement from the original names
on various benchmarks across various state-of-the-art models. We provide our
code and relabelings for several popular segmentation datasets (ADE20K,
Cityscapes, PASCAL Context) to the research community. |
This paper presents RENOVATE, a framework for improving the quality of class names in open-vocabulary segmentation benchmarks by leveraging foundation models to generate more precise and contextually relevant names. |
Existing open-vocabulary segmentation models struggle with imprecise names in benchmarks, hindering their ability to generalize to novel categories and leading to inaccurate model evaluation. RENOVATE addresses this issue by providing a scalable, principled approach to renaming. |
RENOVATE first uses an image captioning model and GPT-4 to generate a pool of candidate names enriched with contextual information. It then trains a renaming model to select the best-matching name for each segment based on visual-language alignment. |
Human preference study confirms that RENOVATE names are preferred over original names in 82% of cases.
Using RENOVATE names upgrades existing benchmarks by providing more fine-grained annotations, making them more challenging and realistic.
Training open-vocabulary models with RENOVATE names improves their performance on both source and target datasets, highlighting the importance of precise names for generalization. |
RENOVATE's reliance on foundation models could propagate existing biases into the new names, requiring careful verification in critical applications.
The exploration of design choices is not yet exhaustive, with potential for investigating alternative language models and VLM backbones for further improvement. |
vision-language models, open-vocabulary segmentation, dataset renaming, benchmark upgrading, name quality evaluation |
2403.09439
Report |
3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation |
Frank Zhang, Yibo Zhang, Quan Zheng, Rui Ma, Wei Hua, Hujun Bao, Weiwei Xu, Changqing Zou |
Text-driven 3D scene generation techniques have made rapid progress in recent
years. Their success is mainly attributed to using existing generative models
to iteratively perform image warping and inpainting to generate 3D scenes.
However, these methods heavily rely on the outputs of existing models, leading
to error accumulation in geometry and appearance that prevents the models from
being used in various scenarios (e.g., outdoor and unreal scenarios). To
address this limitation, we generatively refine the newly generated local views
by querying and aggregating global 3D information, and then progressively
generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF
as a unified representation of the 3D scene to constrain global 3D consistency,
and propose a generative refinement network to synthesize new contents with
higher quality by exploiting the natural image prior from a 2D diffusion model as
well as the global 3D information of the current scene. Our extensive
experiments demonstrate that, in comparison to previous methods, our approach
supports a wide variety of scene generation and arbitrary camera trajectories
with improved visual quality and 3D consistency. |
This paper introduces 3D-SceneDreamer, a novel framework for generating 3D scenes from text prompts while ensuring consistency across multiple views. |
Existing text-to-3D methods struggle to maintain consistency, especially in complex outdoor scenes, due to reliance on error-prone depth estimation and lack of global 3D understanding. |
The method uses a tri-planar feature-based NeRF for global 3D representation, progressively optimized through an incremental training strategy. A 3D-aware generative model refines novel views, leveraging pre-trained diffusion models. |
Outperforms state-of-the-art text-to-scene methods in visual quality and 3D consistency.
Successfully generates diverse indoor, outdoor, and unreal scenes with arbitrary camera trajectories.
Reconstructs high-quality 3D meshes and point clouds, demonstrating superior 3D consistency. |
Computationally intensive due to continuous optimization of the 3D representation and new content generation.
Future work could explore incorporating 3D Gaussian Splatting for improved efficiency. |
text-to-3d, scene generation, neural radiance fields, diffusion models, 3d consistency |
2403.09413
Report |
Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting |
Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang, Seonghoon Park, Seungryong Kim |
3D Gaussian splatting (3DGS) has recently demonstrated impressive
capabilities in real-time novel view synthesis and 3D reconstruction. However,
3DGS heavily depends on the accurate initialization derived from
Structure-from-Motion (SfM) methods. When trained with randomly initialized
point clouds, 3DGS fails to maintain its ability to produce high-quality
images, undergoing large performance drops of 4-5 dB in PSNR. Through extensive
analysis of SfM initialization in the frequency domain and analysis of a 1D
regression task with multiple 1D Gaussians, we propose a novel optimization
strategy dubbed RAIN-GS (Relaxing Accurate Initialization Constraint for 3D
Gaussian Splatting), that successfully trains 3D Gaussians from random point
clouds. We show the effectiveness of our strategy through quantitative and
qualitative comparisons on multiple datasets, largely improving the performance
in all settings. Our project page and code can be found at
https://ku-cvlab.github.io/RAIN-GS. |
This paper introduces RAIN-GS, a novel optimization strategy for 3D Gaussian Splatting (3DGS) that eliminates the need for accurate point cloud initialization from SfM, enabling high-quality image rendering from randomly initialized point clouds. |
3DGS heavily relies on accurate point cloud initialization derived from SfM, limiting its applicability in scenarios where SfM struggles, such as scenes with symmetry, specular properties, or limited views. RAIN-GS addresses this limitation, broadening 3DGS's applicability. |
RAIN-GS combines two key components: 1) sparse-large-variance (SLV) initialization, starting with fewer Gaussians with larger initial covariances, and 2) progressive Gaussian low-pass filtering during rendering, guiding the model to learn low-frequency components first and progressively refine with high-frequency details. |
RAIN-GS achieves state-of-the-art results on the Mip-NeRF360, Tanks & Temples, and Deep Blending datasets, outperforming existing methods even without SfM initialization.
The strategy effectively reduces high-frequency artifacts and improves visual quality, as demonstrated in qualitative comparisons.
Ablation studies validate the effectiveness of both SLV initialization and progressive Gaussian low-pass filtering. |
RAIN-GS might not fully capture high-frequency details in areas where the rendering loss cannot distinguish between coarse approximations and high-frequency distributions.
The reliance on L1 rendering loss as the primary supervision signal might limit the method's ability to detect the need for further densification. |
3d gaussian splatting, novel view synthesis, structure-from-motion, point cloud initialization, progressive gaussian low-pass filtering |
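The coarse-to-fine behaviour described for RAIN-GS above can be sketched in one dimension. Convolving a Gaussian of variance v with a Gaussian low-pass filter of variance s gives a Gaussian of variance v + s, so a filter that shrinks over training lets each primitive cover a wide region early and a narrow one later. The linear decay schedule and the constants `s_start` and `s_end` are assumptions for illustration, not the schedule used in the paper.

```python
def filter_variance(step, total_steps, s_start=4.0, s_end=0.3):
    """Hypothetical linear decay of the low-pass filter variance over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return s_start + (s_end - s_start) * frac

def rendered_variance(gaussian_var, step, total_steps):
    """Variance of a 1D Gaussian after low-pass filtering.

    Convolving two Gaussians adds their variances, so the filter
    widens every primitive by the current filter variance.
    """
    return gaussian_var + filter_variance(step, total_steps)

if __name__ == "__main__":
    total = 1000
    for step in (0, 250, 500, 750, 1000):
        # A narrow primitive (variance 0.1) is rendered wide early, narrow late.
        print(step, round(rendered_variance(0.1, step, total), 3))
```

Early in training the filter dominates, so only low-frequency structure can be fitted; as it shrinks, the learned variances take over and finer detail becomes representable.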
2403.09338
Report |
LocalMamba: Visual State Space Model with Windowed Selective Scan |
Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu |
Recent advancements in state space models, notably Mamba, have demonstrated
significant progress in modeling long sequences for tasks like language
understanding. Yet, their application in vision tasks has not markedly
surpassed the performance of traditional Convolutional Neural Networks (CNNs)
and Vision Transformers (ViTs). This paper posits that the key to enhancing
Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling.
Traditional ViM approaches, which flatten spatial tokens, overlook the
preservation of local 2D dependencies, thereby elongating the distance between
adjacent tokens. We introduce a novel local scanning strategy that divides
images into distinct windows, effectively capturing local dependencies while
maintaining a global perspective. Additionally, acknowledging the varying
preferences for scan patterns across different network layers, we propose a
dynamic method to independently search for the optimal scan choices for each
layer, substantially improving performance. Extensive experiments across both
plain and hierarchical models underscore our approach's superiority in
effectively capturing image representations. For example, our model
significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs.
Code is available at: https://github.com/hunto/LocalMamba. |
This paper introduces LocalMamba, a novel approach for vision state space models that leverages windowed selective scanning and scan direction search to enhance the capture of local dependencies within images while maintaining global contextual understanding. |
Existing vision state space models struggle to effectively capture local 2D dependencies in images due to the inherent non-causal nature of 2D spatial data and the causal processing framework of SSMs. This work addresses this limitation to improve the performance of vision SSMs. |
The paper introduces a local scanning strategy that divides images into distinct windows to better capture local dependencies. It also proposes a dynamic method to search for the optimal scan direction for each layer, further boosting performance. |
LocalMamba models significantly outperform previous state-of-the-art methods like Vim and VMamba on ImageNet classification, object detection, and semantic segmentation tasks.
The proposed local scan mechanism effectively captures local dependencies, leading to improved performance even without scan direction search.
The scan direction search method identifies optimal scanning configurations for each layer, further enhancing the model's ability to capture both local and global visual cues. |
The computational framework of SSMs is currently more complex than convolution or self-attention, potentially hindering efficient parallel computation.
Current deep learning frameworks lack the same level of optimization for SSM computations as for more established architectures, limiting their speed. |
state space models, vision mamba, local scan, scan direction search, image recognition |
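The windowed scan described for LocalMamba above can be illustrated by computing the order in which tokens of an H x W grid are visited when the grid is split into non-overlapping windows. This is a minimal sketch of the ordering only, not the authors' implementation; `local_scan_order` is a hypothetical name, and the search over scan directions is not shown.

```python
def local_scan_order(height, width, window):
    """Return row-major token ids in the order a windowed scan visits them.

    The grid is cut into window x window tiles; tiles are visited in
    row-major order, and tokens inside each tile are also row-major.
    """
    order = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            for r in range(top, min(top + window, height)):
                for c in range(left, min(left + window, width)):
                    order.append(r * width + c)
    return order

if __name__ == "__main__":
    order = local_scan_order(4, 4, 2)
    print(order)
    # Token 4 sits directly below token 0 in the image.
    # A plain row-major flatten puts them 4 positions apart;
    # the windowed scan shortens that gap.
    print(order.index(4) - order.index(0))
```

In the 4 x 4 example, vertically adjacent tokens inside a window end up two positions apart in the sequence instead of four, which is the shortened distance between neighbouring tokens that the abstract refers to.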
2403.09334
Report |
Video Editing via Factorized Diffusion Distillation |
Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman |
We introduce Emu Video Edit (EVE), a model that establishes a new
state-of-the-art in video editing without relying on any supervised video
editing data. To develop EVE we separately train an image editing adapter and a
video generation adapter, and attach both to the same text-to-image model.
Then, to align the adapters towards video editing we introduce a new
unsupervised distillation procedure, Factorized Diffusion Distillation. This
procedure distills knowledge from one or more teachers simultaneously, without
any supervised data. We utilize this procedure to teach EVE to edit videos by
jointly distilling knowledge to (i) precisely edit each individual frame from
the image editing adapter, and (ii) ensure temporal consistency among the
edited frames using the video generation adapter. Finally, to demonstrate the
potential of our approach in unlocking other capabilities, we align additional
combinations of adapters |
Introduces Emu Video Edit (EVE), a state-of-the-art video editing model trained without supervised video editing data by aligning a pretrained image editing adapter and a video generation adapter. |
Addresses the challenge of scarce supervised video editing data, which hinders the development of robust and versatile video editing models. |
Trains image editing and video generation adapters separately, then aligns them using a novel unsupervised distillation procedure called Factorized Diffusion Distillation, combining score distillation and adversarial losses. |
EVE achieves state-of-the-art results on the Text Guided Video Editing (TGVE) benchmark.
The proposed method enables zero-shot video editing for tasks learned by the image editing adapter but not explicitly seen during alignment.
Demonstrates generalization by aligning other adapter combinations, showing potential for personalized and stylized image editing. |
Model performance limited by the capabilities of individual teacher models.
Factorized Diffusion Distillation is currently reliant on pre-trained adapters and cannot train them from scratch. |
video editing, diffusion models, adapter alignment, unsupervised learning, distillation |
2403.09326
Report |
HeadEvolver: Text to Head Avatars via Locally Learnable Mesh Deformation |
Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Ying Shan, Xiaohang Zhan, Zeyu Wang |
We present HeadEvolver, a novel framework to generate stylized head avatars
from text guidance. HeadEvolver uses locally learnable mesh deformation from a
template head mesh, producing high-quality digital assets for detail-preserving
editing and animation. To tackle the challenges of lacking fine-grained and
semantic-aware local shape control in global deformation through Jacobians, we
introduce a trainable parameter as a weighting factor for the Jacobian at each
triangle to adaptively change local shapes while maintaining global
correspondences and facial features. Moreover, to ensure the coherence of the
resulting shape and appearance from different viewpoints, we use pretrained
image diffusion models for differentiable rendering with regularization terms
to refine the deformation under text guidance. Extensive experiments
demonstrate that our method can generate diverse head avatars with an
articulated mesh that can be edited seamlessly in 3D graphics software,
facilitating downstream applications such as more efficient animation with
inherited blend shapes and semantic consistency. |
Introduces HeadEvolver, a novel framework for generating stylized 3D head avatars from text prompts using learnable local mesh deformations. |
Addresses limitations in existing text-to-3D avatar methods, particularly in achieving fine-grained semantic control over local shapes and ensuring compatibility with existing 3D graphics workflows. |
Deforms a template mesh by optimizing per-triangle weighted Jacobians guided by text prompts, leveraging stable diffusion models for differentiable rendering and regularization terms for shape fidelity. |
Generates high-quality head avatars with detailed facial features matching text descriptions.
Preserves semantic correspondences and attributes of the template mesh, enabling smooth integration with animation and editing tools.
Outperforms baseline methods in qualitative and quantitative comparisons, demonstrating superior mesh quality and text-alignment. |
Currently requires manifold mesh input and faces challenges in handling non-manifold structures like eyeballs.
Future work includes exploring cage-based representations for broader mesh compatibility and developing methods for automatically adding accessories like hair and glasses. |
text-to-3d, avatar generation, mesh deformation, differentiable rendering, stable diffusion |
2403.09281
Report |
CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification |
Yiming Ma, Victor Sanchez, Tanaya Guha |
The CLIP (Contrastive Language-Image Pretraining) model has exhibited
outstanding performance in recognition problems, such as zero-shot image
classification and object detection. However, its ability to count remains
understudied due to the inherent challenges of transforming counting--a
regression task--into a recognition task. In this paper, we investigate CLIP's
potential in counting, focusing specifically on estimating crowd sizes.
Existing classification-based crowd-counting methods have encountered issues,
including inappropriate discretization strategies, which impede the application
of CLIP and result in suboptimal performance. To address these challenges, we
propose the Enhanced Blockwise Classification (EBC) framework. In contrast to
previous methods, EBC relies on integer-valued bins that facilitate the
learning of robust decision boundaries. Within our model-agnostic EBC
framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting
model capable of generating density maps. Comprehensive evaluations across
diverse crowd-counting datasets demonstrate the state-of-the-art performance of
our methods. Particularly, EBC can improve existing models by up to 76.9%.
Moreover, our CLIP-EBC model surpasses current crowd-counting methods,
achieving mean absolute errors of 55.0 and 6.3 on ShanghaiTech part A and part
B datasets, respectively. The code will be made publicly available. |
This paper introduces CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps by reformulating counting as a blockwise classification problem. |
Existing crowd counting methods either struggle with the long-tail distribution of count values or fail to fully utilize the power of CLIP for density map estimation. |
The paper proposes an Enhanced Blockwise Classification (EBC) framework that leverages integer-valued bins for discretization, corrects noisy annotations in dense areas, and employs a Distance-Aware-Cross-Entropy (DACE) loss. Building on EBC, CLIP-EBC utilizes the CLIP architecture to extract image and text features, computing their similarity to generate probability maps and subsequently density maps. |
CLIP-EBC with ResNet backbone achieves state-of-the-art performance, surpassing existing methods on benchmarks like ShanghaiTech.
EBC framework significantly improves the performance of existing regression-based methods like CSRNet and DMCount, showing up to 76.9% reduction in RMSE.
Experiments confirm the benefits of dynamic bin granularity in EBC, balancing representative count value accuracy with increased sample size per bin. |
The paper primarily focuses on human counting, leaving the exploration of CLIP-EBC's capacity for counting other objects for future work.
Potential ethical concerns regarding privacy and bias in crowd counting applications require further investigation. |
crowd counting, clip, density map estimation, blockwise classification, deep learning |
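The blockwise classification readout described for CLIP-EBC above can be sketched as follows, assuming the density of a block is taken as the probability-weighted average of each bin's representative count and the image-level count is the sum over blocks. The bin layout and representative values in `BIN_VALUES` are hypothetical, chosen only to show integer-valued bins with one open-ended tail bin.

```python
# Hypothetical bins {0}, {1}, {2}, {3}, {4 or more}; the last value is an
# assumed representative count for the open-ended tail bin.
BIN_VALUES = [0.0, 1.0, 2.0, 3.0, 4.5]

def block_count(probs, bin_values=BIN_VALUES):
    """Expected count for one block: sum over bins of p_k * representative value."""
    if len(probs) != len(bin_values):
        raise ValueError("one probability per bin is required")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in zip(probs, bin_values))

def image_count(block_probs, bin_values=BIN_VALUES):
    """Total count for an image: sum of the per-block expected counts."""
    return sum(block_count(p, bin_values) for p in block_probs)

if __name__ == "__main__":
    blocks = [
        [1.0, 0.0, 0.0, 0.0, 0.0],   # confidently empty block
        [0.0, 0.5, 0.5, 0.0, 0.0],   # one or two people, equally likely
        [0.0, 0.0, 0.0, 0.0, 1.0],   # dense block falls in the tail bin
    ]
    print(image_count(blocks))
```

With integer-valued bins, each low-count bin corresponds to exactly one count value, so the classifier is never asked to separate non-integer boundaries that no annotation can fall on.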
2403.09195
Report |
SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration |
Yanfei Song, Bangzheng Pu, Peng Wang, Hongxu Jiang, Dong Dong, Yongxiang Cao, Yiqing Shen |
Segment Anything Model (SAM) has garnered significant attention in
segmentation tasks due to its zero-shot generalization ability. However, a
broader application of SAMs to real-world practice has been restricted by their
low inference speed and high computational memory demands, which mainly stem
from the attention mechanism. Existing work concentrated on optimizing the
encoder, yet has not adequately addressed the inefficiency of the attention
mechanism itself, even when distilled to a smaller model, which thus leaves
space for further improvement. In response, we introduce SAM-Lightening, a
variant of SAM, that features a re-engineered attention mechanism, termed
Dilated Flash Attention. It not only facilitates higher parallelism, enhancing
processing efficiency but also retains compatibility with the existing
FlashAttention. Correspondingly, we propose a progressive distillation to
enable an efficient knowledge transfer from the vanilla SAM without costly
training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening
significantly outperforms the state-of-the-art methods in both run-time
efficiency and segmentation accuracy. Specifically, it can achieve an inference
speed of 7 milliseconds (ms) per image, for images of size 1024*1024 pixels,
which is 30.1 times faster than the vanilla SAM and 2.1 times faster than the
state-of-the-art. Moreover, it takes only 244MB memory, which is 3.5% of the
vanilla SAM. The code and weights are available at
https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/. |
This paper introduces SAM-Lightening, a lightweight version of the Segment Anything Model (SAM) that achieves a 30x speedup in inference while maintaining segmentation accuracy. |
The original SAM, while powerful, suffers from slow inference speeds and high computational demands, limiting its practical application in areas like AR and mobile deployment. |
The authors achieve this by replacing the attention mechanism in SAM's image encoder with a novel Dilated Flash Attention mechanism and employing a dynamic layer-wise distillation technique for efficient knowledge transfer from the original SAM. |
SAM-Lightening achieves an inference speed of 7 milliseconds per image for 1024x1024 resolution, outperforming prior state-of-the-art methods.
It significantly reduces memory consumption, requiring only 3.5% of the memory used by the original SAM.
The model maintains comparable segmentation accuracy to the original SAM, even on complex datasets like LVIS. |
The impact of FlashAttention on inference speed is dependent on hardware and input size, sometimes resulting in slightly slower inference.
Future work could explore integrating pruning and quantization techniques for further optimization. |
segment anything model, knowledge distillation, efficient attention mechanisms, image segmentation, real-time processing |
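The speed and memory ratios quoted in the SAM-Lightening entry above imply baseline figures that the text does not state directly. The short calculation below derives them from the quoted numbers (7 ms, 30.1x, 2.1x, 244 MB, 3.5%). The results are back-of-the-envelope estimates, not values reported by the authors, and the helper names are hypothetical.

```python
def implied_baseline_latency_ms(fast_ms, speedup):
    """Latency of the slower model implied by a quoted speedup factor."""
    return fast_ms * speedup

def implied_baseline_memory_mb(small_mb, fraction):
    """Memory of the larger model implied by a quoted fraction of it."""
    return small_mb / fraction

if __name__ == "__main__":
    # Quoted: 7 ms per image, 30.1x faster than vanilla SAM.
    print(round(implied_baseline_latency_ms(7, 30.1), 1), "ms (vanilla SAM, implied)")
    # Quoted: 2.1x faster than the previous state of the art.
    print(round(implied_baseline_latency_ms(7, 2.1), 1), "ms (prior best, implied)")
    # Quoted: 244 MB is 3.5% of vanilla SAM's memory.
    print(round(implied_baseline_memory_mb(244, 0.035)), "MB (vanilla SAM, implied)")
```

Since the quoted inputs are themselves rounded, the derived values are only accurate to about two significant figures.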
2403.09176
Report |
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts |
Byeongjun Park, Hyojun Go, Jin-Young Kim, Sangmin Woo, Seokil Ham, Changick Kim |
Diffusion models have achieved remarkable success across a range of
generative tasks. Recent efforts to enhance diffusion model architectures have
reimagined them as a form of multi-task learning, where each task corresponds
to a denoising task at a specific noise level. While these efforts have focused
on parameter isolation and task routing, they fall short of capturing detailed
inter-task relationships and risk losing semantic information, respectively. In
response, we introduce Switch Diffusion Transformer (Switch-DiT), which
establishes inter-task relationships between conflicting tasks without
compromising semantic information. To achieve this, we employ a sparse
mixture-of-experts within each transformer block to utilize semantic
information and facilitate handling conflicts in tasks through parameter
isolation. Additionally, we propose a diffusion prior loss, encouraging similar
tasks to share their denoising paths while isolating conflicting ones. Through
these, each transformer block contains a shared expert across all tasks, where
the common and task-specific denoising paths enable the diffusion model to
construct its beneficial way of synergizing denoising tasks. Extensive
experiments validate the effectiveness of our approach in improving both image
quality and convergence rate, and further analysis demonstrates that Switch-DiT
constructs tailored denoising paths across various generation scenarios. |
This paper introduces Switch-DiT, a novel diffusion model architecture that improves image generation quality and training convergence by synergizing denoising tasks through a Sparse Mixture-of-Experts (SMoE) approach. |
Existing diffusion models struggle to efficiently handle conflicting optimization directions among denoising tasks across different noise levels, leading to slow convergence and potentially lower image quality. |
Switch-DiT integrates SMoE layers into each transformer block, using a timestep-based gating network to isolate parameters between conflicting tasks while sharing information through common denoising paths. It also introduces a diffusion prior loss to stabilize training and enforce inter-task relationships. |
Switch-DiT consistently outperforms baseline DiT and DTR models in terms of FID, IS, Precision, and Recall across different model sizes on FFHQ and ImageNet datasets.
It achieves faster convergence rates compared to baselines, indicating more efficient diffusion training.
Analysis reveals that Switch-DiT constructs tailored denoising paths based on model size and dataset, demonstrating its adaptability to different generation scenarios. |
The current implementation employs a fixed routing policy inherited from DTR, potentially limiting its ability to fully capture nuanced inter-task relationships.
Future work includes exploring scalable SMoE configurations and adaptive routing policies tailored to specific generation scenarios to further enhance performance. |
diffusion models, mixture-of-experts, multi-task learning, image generation, transformer |
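The timestep-based sparse routing described for Switch-DiT above can be sketched with scalar toy experts. A gate that depends only on the timestep keeps the top-k experts, renormalizes their weights with a softmax, and mixes the selected expert outputs. The gate table, expert functions, and names (`top_k_gate`, `smoe_layer`) are all hypothetical; the paper's diffusion prior loss and its actual gating network are not shown.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return [exps[i] / norm if i in exps else 0.0 for i in range(len(logits))]

def smoe_layer(x, timestep, gate_table, experts, k=2):
    """Mix the outputs of the k experts selected for this timestep."""
    weights = top_k_gate(gate_table[timestep], k)
    return sum(w * expert(x) for w, expert in zip(weights, experts) if w > 0.0)

if __name__ == "__main__":
    # Toy experts acting on a scalar (assumed, for illustration only).
    experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
    # Gate logits per timestep: expert 1 is selected at both timesteps,
    # while experts 0 and 2 are isolated to one timestep each.
    gate_table = {0: [2.0, 2.0, -5.0], 1: [-5.0, 2.0, 2.0]}
    print(smoe_layer(2.0, 0, gate_table, experts))
    print(smoe_layer(2.0, 1, gate_table, experts))
```

In this toy setup, expert 1 plays the role of a path shared by both denoising tasks, while experts 0 and 2 hold parameters used by only one of them.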
2403.09140
Report |
Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior |
Cheng Chen, Xiaofeng Yang, Fan Yang, Chengzeng Feng, Zhoujie Fu, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu |
Recent works on text-to-3D generation show that using only 2D diffusion
supervision for 3D generation tends to produce results with inconsistent
appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals
with extra legs). Existing methods mainly address this issue by retraining
diffusion models with images rendered from 3D data to ensure multi-view
consistency while struggling to balance 2D generation quality with 3D
consistency. In this paper, we present a new framework Sculpt3D that equips the
current pipeline with explicit injection of 3D priors from retrieved reference
objects without re-training the 2D diffusion model. Specifically, we
demonstrate that high-quality and diverse 3D geometry can be guaranteed by
keypoints supervision through a sparse ray sampling approach. Moreover, to
ensure accurate appearances of different views, we further modulate the output
of the 2D diffusion model to the correct patterns of the template views without
altering the generated object's style. These two decoupled designs effectively
harness 3D information from reference objects to generate 3D objects while
preserving the generation quality of the 2D diffusion model. Extensive
experiments show our method can largely improve the multi-view consistency
while retaining fidelity and diversity. Our project page is available at:
https://stellarcheng.github.io/Sculpt3D/. |
Sculpt3D, a novel text-to-3D generation framework, explicitly integrates 3D shape and appearance priors from retrieved reference objects to enhance multi-view consistency without retraining the 2D diffusion model. |
Existing text-to-3D methods often produce inconsistent appearances and inaccurate shapes due to relying solely on 2D diffusion supervision. Sculpt3D addresses this by effectively leveraging 3D priors while preserving the high quality of 2D diffusion models. |
Sculpt3D retrieves semantically matching 3D templates and utilizes them in two ways: 1) Sparse keypoint supervision from the template guides 3D shape generation, allowing creative point growth and pruning during optimization. 2) An image adapter aligns the template's appearance with the generated object's style, then modulates the 2D diffusion output to correct appearance inconsistencies across views. |
Sculpt3D generates high-fidelity 3D objects with superior multi-view consistency compared to previous state-of-the-art methods.
The sparse keypoint supervision enables Sculpt3D to produce diverse shapes that adapt to the template while retaining the 2D diffusion model's creative freedom.
The appearance modulation effectively corrects view-specific inconsistencies without altering the overall style or geometry of the generated object. |
Sculpt3D's reliance on 3D priors can be limiting if the initial retrieved shape falls outside the scope of the dataset.
Generating accurate initial shapes for retrieval remains challenging and presents an area for future improvement. |
text-to-3d generation, multi-view consistency, 3d prior, retrieval augmentation, diffusion models |
2403.09093
Report |
Desigen: A Pipeline for Controllable Design Template Generation |
Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen |
Templates serve as a good starting point to implement a design (e.g., banner,
slide), but it takes great effort from designers to create them manually. In this
paper, we present Desigen, an automatic template creation pipeline which
generates background images as well as harmonious layout elements over the
background. Different from natural images, a background image should preserve
enough non-salient space for the overlaying layout elements. To equip existing
advanced diffusion-based models with stronger spatial control, we propose two
simple but effective techniques to constrain the saliency distribution and
reduce the attention weight in desired regions during the background generation
process. Then conditioned on the background, we synthesize the layout with a
Transformer-based autoregressive generator. To achieve a more harmonious
composition, we propose an iterative inference strategy to adjust the
synthesized background and layout in multiple rounds. We constructed a design
dataset with more than 40k advertisement banners to verify our approach.
Extensive experiments demonstrate that the proposed pipeline generates
high-quality templates comparable to those of human designers. Beyond a single-page
design, we further show an application of presentation generation that outputs
a set of theme-consistent slides. The data and code are available at
https://whaohan.github.io/desigen. |
Presents "Desigen", an automatic design template creation pipeline that generates both background images and harmonious layout elements using text descriptions and layout specifications. |
Automates the laborious process of manual design template creation, enabling efficient generation of visually appealing and accessible designs. |
Utilizes a diffusion-based background generator with spatial control mechanisms (salient attention constraint and attention reduction), followed by a Transformer-based layout generator. An iterative inference strategy refines both background and layout for harmonious composition. |
Generates backgrounds with significantly lower salient ratios compared to baseline T2I models, indicating more space for layout elements.
Synthesizes layouts that achieve superior visual accessibility (lower occlusion with backgrounds) while maintaining good alignment and minimal overlap.
Demonstrates the capability to generate theme-consistent slide decks by varying layout masks while maintaining relevant background content. |
Current implementation primarily focuses on simple layouts with a limited number of elements.
Future work could incorporate graphic design principles to improve aesthetics and usability. |
design template generation, text-to-image synthesis, layout generation, spatial control, diffusion models |
2403.09065
Report |
When Semantic Segmentation Meets Frequency Aliasing |
Linwei Chen, Lin Gu, Ying Fu |
Despite recent advancements in semantic segmentation, where and what pixels
are hard to segment remains largely unexplored. Existing research only
separates an image into easy and hard regions and empirically observes the
latter are associated with object boundaries. In this paper, we conduct a
comprehensive analysis of hard pixel errors, categorizing them into three
types: false responses, merging mistakes, and displacements. Our findings
reveal a quantitative association between hard pixels and aliasing, which is
distortion caused by the overlapping of frequency components in the Fourier
domain during downsampling. To identify the frequencies responsible for
aliasing, we propose using the equivalent sampling rate to calculate the
Nyquist frequency, which marks the threshold for aliasing. Then, we introduce
the aliasing score as a metric to quantify the extent of aliasing. While
positively correlated with the proposed aliasing score, three types of hard
pixels exhibit different patterns. Here, we propose two novel modules, the de-aliasing
filter (DAF) and frequency mixing (FreqMix), to alleviate aliasing
degradation by accurately removing or adjusting frequencies higher than the
Nyquist frequency. The DAF precisely removes the frequencies responsible for
aliasing before downsampling, while the FreqMix dynamically selects
high-frequency components within the encoder block. Experimental results
demonstrate consistent improvements in semantic segmentation and low-light
instance segmentation tasks. The code is available at:
https://github.com/Linwei-Chen/Seg-Aliasing. |
This paper analyzes the phenomenon of aliasing in semantic segmentation and proposes two novel modules to address it: the de-aliasing filter (DAF) and the frequency mixing module (FreqMix). |
Aliasing, a signal distortion arising from undersampling, poses significant challenges in semantic segmentation by hindering accurate boundary prediction. This paper aims to understand and mitigate this issue. |
The paper introduces the concept of equivalent sampling rate (ESR) to accurately calculate the Nyquist frequency and quantifies aliasing levels using an 'aliasing score'. It proposes DAF to remove aliasing frequencies and FreqMix to dynamically balance low and high-frequency components during feature extraction. |
The study reveals a strong positive correlation between hard-to-segment pixels and the proposed aliasing score.
DAF, by accurately removing aliasing frequencies, consistently improves segmentation accuracy compared to traditional blur filters.
FreqMix further enhances performance by dynamically balancing frequency components within the encoder block. |
The equivalent sampling rate is a heuristic estimation and lacks rigorous theoretical guarantees.
The research focuses on semantic segmentation, leaving its application to instance and panoptic segmentation unexplored. |
semantic segmentation, aliasing, de-aliasing filter, frequency mixing, hard pixels |
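To make the Nyquist threshold and aliasing score from the report above concrete, here is a minimal sketch. It assumes the score is the fraction of spectral power above the Nyquist frequency, and it derives that frequency from a plain stride (`0.5 / stride` cycles per pixel); the paper's equivalent sampling rate is not reproduced here, and the function name and test signals are hypothetical.

```python
import numpy as np

def aliasing_score(feature, stride=2):
    """Fraction of spectral power above the post-downsampling Nyquist frequency."""
    h, w = feature.shape
    power = np.abs(np.fft.fft2(feature)) ** 2
    fy = np.abs(np.fft.fftfreq(h))[:, None]   # cycles per pixel, in [0, 0.5]
    fx = np.abs(np.fft.fftfreq(w))[None, :]
    nyquist = 0.5 / stride                    # sampling rate drops to 1/stride
    aliased = (fy > nyquist) | (fx > nyquist)
    total = power.sum()
    return float(power[aliased].sum() / total) if total > 0 else 0.0

n = 32
yy, xx = np.mgrid[0:n, 0:n]
smooth = np.cos(2 * np.pi * 2 * xx / n)   # 0.0625 cycles/pixel, below Nyquist
checker = np.cos(np.pi * (xx + yy))       # 0.5 cycles/pixel on both axes

score_smooth = aliasing_score(smooth)     # close to 0
score_checker = aliasing_score(checker)   # close to 1
```

A low-pass filter applied before downsampling, which is the role the report assigns to DAF, would zero out the frequencies selected by the `aliased` mask.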
2403.09055
Report |
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control |
Jaerin Lee, Daniel Sungho Jung, Kanggeon Lee, Kyoung Mu Lee |
The enormous success of diffusion models in text-to-image synthesis has made
them promising candidates for the next generation of end-user applications for
image generation and editing. Previous works have focused on improving the
usability of diffusion models by reducing the inference time or increasing user
interactivity by allowing new, fine-grained controls such as region-based text
prompts. However, we empirically find that integrating both lines of work
is nontrivial, limiting the potential of diffusion models. To solve this
incompatibility, we present StreamMultiDiffusion, the first real-time
region-based text-to-image generation framework. By stabilizing fast inference
techniques and restructuring the model into a newly proposed multi-prompt
stream batch architecture, we achieve $10\times$ faster panorama generation
than existing solutions, and a generation speed of 1.57 FPS in region-based
text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a
new paradigm for interactive image generation named semantic palette, where
high-quality images are generated in real-time from given multiple hand-drawn
regions, encoding prescribed semantic meanings (e.g., eagle, girl). Our code
and demo application are available at
https://github.com/ironjr/StreamMultiDiffusion. |
This paper introduces StreamMultiDiffusion, the first real-time region-based text-to-image generation framework achieving a generation speed of 1.57 FPS on a single RTX 2080 Ti GPU. |
Existing diffusion models struggle to simultaneously achieve fast inference and fine-grained control, limiting their real-world applicability. This framework aims to overcome this limitation by enabling real-time interactive image generation with region-based text prompts. |
The paper stabilizes fast inference techniques like Latent Consistency Models (LCM) and restructures MultiDiffusion into a novel multi-prompt stream batch architecture. This pipeline processes multiple image regions with different text prompts concurrently, hiding latency and maximizing throughput. |
StreamMultiDiffusion achieves 10x faster panorama generation compared to existing solutions.
The framework stabilizes fast sampling in region-based generation, improving compatibility between LCM and MultiDiffusion.
It introduces "semantic palette," a novel interactive image generation paradigm where users "paint" images in real-time using text prompts as brushes. |
The current implementation still relies on a small number (4-6) of denoising steps.
Achieving perfect mask-tight image synthesis remains a challenge despite improved fidelity with one-step white background bootstrapping. |
diffusion models, real-time image generation, region-based image synthesis, interactive image editing, semantic palette |
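The region-based generation in the report above builds on MultiDiffusion, whose core aggregation step is a mask-weighted average of per-prompt denoising results. The sketch below shows only that averaging step, under the assumption that each prompt yields a full-size latent and a soft or binary mask; the stream batch pipeline and the LCM stabilization are not modeled, and `blend_regions` is a hypothetical name.

```python
import numpy as np

def blend_regions(region_latents, masks, eps=1e-8):
    """Mask-weighted average of per-prompt latents at every spatial location."""
    region_latents = np.asarray(region_latents, dtype=float)  # (P, C, H, W)
    masks = np.asarray(masks, dtype=float)[:, None]           # (P, 1, H, W)
    weighted = (region_latents * masks).sum(axis=0)
    total = masks.sum(axis=0)
    return weighted / np.maximum(total, eps)                  # (C, H, W)

# Two prompts on a 1-channel 4x4 latent; the masks overlap in column 1.
latents = [np.full((1, 4, 4), 1.0), np.full((1, 4, 4), 3.0)]
mask_a = np.zeros((4, 4)); mask_a[:, 0:2] = 1.0
mask_b = np.zeros((4, 4)); mask_b[:, 1:4] = 1.0
blended = blend_regions(latents, [mask_a, mask_b])
```

In the overlap column both prompts contribute equally, so the blended value sits halfway between them; elsewhere each location takes the value of the single prompt that covers it.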
2403.08933
Report |
Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images |
Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Rita Cucchiara |
Creating high-quality and realistic images is now possible thanks to the
impressive advancements in image generation. A description in natural language
of your desired output is all you need to obtain breathtaking results. However,
as the use of generative models grows, so do concerns about the propagation of
malicious content and misinformation. Consequently, the research community is
actively working on the development of novel fake detection techniques,
primarily focusing on low-level features and possible fingerprints left by
generative models during the image generation process. In a different vein, in
our work, we investigate whether human semantic knowledge can be incorporated
into fake image detection frameworks. To achieve this, we
collect a novel dataset of partially manipulated images using diffusion models
and conduct an eye-tracking experiment to record the eye movements of different
observers while viewing real and fake stimuli. A preliminary statistical
analysis is conducted to explore the distinctive patterns in how humans
perceive genuine and altered images. Statistical findings reveal that, when
perceiving counterfeit samples, humans tend to focus on more confined regions
of the image, in contrast to the more dispersed viewing pattern observed
for genuine images. Our dataset is publicly available at:
https://github.com/aimagelab/unveiling-the-truth. |
This paper explores the differences in human gaze patterns when viewing real and partially manipulated images (created using diffusion models) to investigate whether human visual perception can contribute to fake image detection. |
With the rise of advanced image generation techniques, detecting fake images, especially those subtly manipulated, is crucial to combat misinformation. This study explores the potential of leveraging human semantic knowledge for this task. |
The authors collected a dataset of real images and generated three types of fake counterparts using different diffusion-based editing techniques. They conducted an eye-tracking experiment to record participants' gaze patterns while viewing real and fake images and statistically analyzed the collected data, focusing on saliency map entropy. |
Human observers tend to focus on more confined regions when viewing fake images compared to more dispersed patterns observed with real images.
Statistical analysis, including Kolmogorov-Smirnov, Cramér-von Mises, and Mann-Whitney U tests, reveals significant differences in saliency map entropy distributions between real and fake images, supporting the observed gaze pattern differences.
The study suggests that human gaze information can potentially be integrated into automatic fake image detection systems to improve their accuracy. |
The study primarily focuses on partially manipulated images, and future work should investigate if similar gaze patterns exist for entirely generated images.
Further research is needed to develop concrete methods for incorporating human gaze information into existing fake detection models. |
deepfakes, gaze tracking, visual perception, human in the loop, fake image detection |
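The saliency map entropy used in the analysis above can be sketched as the Shannon entropy of a saliency map normalized to a probability distribution. How the paper builds its saliency maps from fixations is not stated in this report, so the input here is simply any non-negative 2D array, and the function name is hypothetical.

```python
import numpy as np

def saliency_entropy(saliency):
    """Shannon entropy (in bits) of a non-negative saliency map."""
    p = np.asarray(saliency, dtype=float).ravel()
    total = p.sum()
    if total <= 0:
        raise ValueError("saliency map must contain positive mass")
    p = p / total
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())

dispersed = np.ones((8, 8))           # gaze spread evenly over 64 cells
focused = np.zeros((8, 8))
focused[3, 4] = 1.0                   # all gaze mass on a single cell

h_dispersed = saliency_entropy(dispersed)   # log2(64) = 6 bits
h_focused = saliency_entropy(focused)       # 0 bits
```

Lower entropy corresponds to gaze confined to fewer regions, which is the pattern the report associates with fake images.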
2403.08902
Report |
Envision3D: One Image to 3D with Anchor Views Interpolation |
Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E. H. Tay, Li Yuan |
We present Envision3D, a novel method for efficiently generating high-quality
3D content from a single image. Recent methods that extract 3D content from
multi-view images generated by diffusion models show great potential. However,
it is still challenging for diffusion models to generate dense multi-view
consistent images, which is crucial for the quality of 3D content extraction.
To address this issue, we propose a novel cascade diffusion framework, which
decomposes the challenging dense views generation task into two tractable
stages, namely anchor views generation and anchor views interpolation. In the
first stage, we train the image diffusion model to generate globally consistent
anchor views conditioned on image-normal pairs. Subsequently, leveraging our
video diffusion model fine-tuned on consecutive multi-view images, we conduct
interpolation on the previous anchor views to generate extra dense views. This
framework yields dense, multi-view consistent images, providing comprehensive
3D information. To further enhance the overall generation quality, we introduce
a coarse-to-fine sampling strategy for the reconstruction algorithm to robustly
extract textured meshes from the generated dense images. Extensive experiments
demonstrate that our method is capable of generating high-quality 3D content in
terms of texture and geometry, surpassing previous image-to-3D baseline
methods. |
Envision3D is a novel method for generating high-quality 3D content from a single image by generating and leveraging dense, multi-view consistent images. |
Generating 3D content from a single image is essential for various applications, and existing methods struggle to generate sufficiently dense and consistent multi-view images for high-quality 3D extraction. |
The paper proposes a cascade diffusion framework. Stage 1 generates consistent anchor views and their normal maps using a multi-view attention mechanism, cross-domain attention, and an Instruction Representation Injection (IRI) module. Stage 2 interpolates between anchor views using a fine-tuned video diffusion model. Finally, a coarse-to-fine sampling strategy refines 3D content extraction using an SDF-based reconstruction method. |
Envision3D generates denser and higher-quality multi-view consistent images compared to baselines.
The method produces superior 3D content with higher fidelity texture and geometry compared to existing image-to-3D methods.
Ablation studies confirm the effectiveness of increasing view count and using the proposed coarse-to-fine sampling strategy. |
The reliance on a pre-trained normal prediction model in Stage 1 could limit generalization ability.
Future work can explore alternative reconstruction methods or combine Envision3D with other 3D generation techniques to further enhance results. |
3d generation, diffusion models, multi-view consistency, textured mesh, single image to 3d |
2403.08857
Report |
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation |
Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu |
Text-to-image (T2I) generation models have significantly advanced in recent
years. However, effective interaction with these models is challenging for
average users due to the need for specialized prompt engineering knowledge and
the inability to perform multi-turn image generation, hindering a dynamic and
iterative creation process. Recent attempts have tried to equip Multi-modal
Large Language Models (MLLMs) with T2I models to bring the user's natural
language instructions into reality. Hence, the output modality of MLLMs is
extended, and the multi-turn generation quality of T2I models is enhanced
thanks to the strong multi-modal comprehension ability of MLLMs. However, many
of these works face challenges in identifying correct output modalities and
generating coherent images accordingly as the number of output modalities
increases and the conversations go deeper. Therefore, we propose DialogGen, an
effective pipeline to align off-the-shelf MLLMs and T2I models to build a
Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image
generation. It is composed of drawing prompt alignment, careful training data
curation, and error correction. Moreover, as the field of MIDS flourishes,
comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms
of output modality correctness and multi-modal output coherence. To address
this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a
comprehensive bilingual benchmark designed to assess the ability of MLLMs to
generate accurate and coherent multi-modal content that supports image editing.
It contains two evaluation metrics to measure the model's ability to switch
modalities and the coherence of the output images. Our extensive experiments on
DialogBen and a user study demonstrate the effectiveness of DialogGen compared
with other state-of-the-art models. |
This paper introduces DialogGen, a pipeline to align Multi-modal Large Language Models (MLLMs) and Text-to-Image (T2I) models for multi-turn text-to-image generation in Multi-modal Interactive Dialogue Systems (MIDS), and DialogBen, a bilingual benchmark to evaluate such systems. |
Effective interaction with T2I models is challenging for average users due to the need for specialized prompt engineering knowledge. Existing MLLMs face difficulties in identifying correct output modalities and generating coherent images in multi-turn settings. |
DialogGen leverages drawing prompt alignment, curated bilingual training data, and error correction. DialogBen includes 9957 multi-modal conversations and evaluates modality switching accuracy and generation coherence using Visual Question Answering (VQA). |
DialogGen achieves high modality switching accuracy, outperforming baselines like NExT-GPT and SEED-LLaMA.
DialogGen with error correction significantly improves performance, especially with limited training data diversity.
Bilingual training further enhances DialogGen's modality switching accuracy. |
Re-captioning T2I training data can be resource-intensive.
Future work can explore aligning training data with human preferences. |
text-to-image generation, multi-modal interactive dialogue systems, multi-modal large language models, benchmarking, error correction |
2403.08551
Report |
GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting |
Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He, Yan Wang, Hongwei Qin, Guo Lu, Jing Geng, Jun Zhang |
Implicit neural representations (INRs) recently achieved great success in
image representation and compression, offering high visual quality and fast
rendering speeds with 10-1000 FPS, assuming sufficient GPU resources are
available. However, this requirement often hinders their use on low-end devices
with limited memory. In response, we propose a groundbreaking paradigm of image
representation and compression by 2D Gaussian Splatting, named GaussianImage.
We first introduce 2D Gaussians to represent the image, where each Gaussian has
8 parameters including position, covariance and color. Subsequently, we unveil
a novel rendering algorithm based on accumulated summation. Remarkably, our
method with a minimum of 3$\times$ lower GPU memory usage and 5$\times$ faster
fitting time not only rivals INRs (e.g., WIRE, I-NGP) in representation
performance, but also delivers a faster rendering speed of 1500-2000 FPS
regardless of parameter size. Furthermore, we integrate an existing vector
quantization technique to build an image codec. Experimental results
demonstrate that our codec attains rate-distortion performance comparable to
compression-based INRs such as COIN and COIN++, while facilitating decoding
speeds of approximately 1000 FPS. Additionally, preliminary proof of concept
shows that our codec surpasses COIN and COIN++ in performance when using
partial bits-back coding. Code will be available at
https://github.com/Xinjie-Q/GaussianImage. |
Presents GaussianImage, a novel image representation and compression paradigm using 2D Gaussian Splatting, achieving faster rendering and less memory usage than INR methods. |
Addresses limitations of Implicit Neural Representations (INRs) such as high GPU memory consumption, slow decoding speed, and long training times, hindering deployment on low-end devices. |
Represents images using 2D Gaussians, each with 8 learnable parameters. Introduces an accumulated summation-based rasterization, replacing depth-based sorting and alpha-blending. Develops an image codec by integrating vector quantization for Gaussian attribute compression. |
Achieves 1500-2000 FPS rendering speed regardless of parameter size, outperforming INR methods like WIRE and I-NGP.
Requires 3x lower GPU memory than competitive INR methods while maintaining comparable image representation quality.
Attains rate-distortion performance comparable to INR-based codecs like COIN and COIN++ with a significantly faster decoding speed around 1000 FPS. |
Encoding speed is slower than VAE-based codecs, leaving room for improvement in image fitting and Gaussian compression.
Current compression performance lags behind traditional/VAE-based codecs, necessitating development of specialized Gaussian-based compression algorithms. |
2d gaussian splatting, image representation, image compression, neural image codec, fast rendering |
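The accumulated-summation rendering in the report above can be sketched as follows. Each Gaussian carries 8 numbers: position (2), covariance (3), and color (3). Storing the covariance as a lower-triangular Cholesky factor `(l11, l21, l22)` is an assumption of this sketch, as are the function name and the sample values; the codec and vector quantization are not shown.

```python
import numpy as np

def render(gaussians, height, width):
    """Sum each Gaussian's color contribution per pixel (no sorting, no alpha blending)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    image = np.zeros((height, width, 3))
    for mx, my, l11, l21, l22, r, g, b in gaussians:
        chol = np.array([[l11, 0.0], [l21, l22]])
        inv_cov = np.linalg.inv(chol @ chol.T)     # inverse 2x2 covariance
        dx, dy = xs - mx, ys - my
        maha = (inv_cov[0, 0] * dx * dx
                + 2.0 * inv_cov[0, 1] * dx * dy
                + inv_cov[1, 1] * dy * dy)         # squared Mahalanobis distance
        image += np.exp(-0.5 * maha)[..., None] * np.array([r, g, b])
    return image

# Two hypothetical Gaussians: (x, y, l11, l21, l22, r, g, b)
gaussians = [
    (4.0, 4.0, 1.5, 0.0, 1.5, 1.0, 0.0, 0.0),
    (11.0, 11.0, 2.0, 0.5, 1.0, 0.0, 0.0, 1.0),
]
image = render(gaussians, 16, 16)
image_reversed = render(gaussians[::-1], 16, 16)
```

Because the contributions are simply added, the result does not depend on the order of the Gaussians, which is why no depth sorting is needed in this formulation.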
2403.08498
Report |
Gaussian Splatting in Style |
Abhishek Saroha, Mariia Gladkova, Cecilia Curreli, Tarun Yenamandra, Daniel Cremers |
Scene stylization extends the work of neural style transfer to three spatial
dimensions. A vital challenge in this problem is to maintain the uniformity of
the stylized appearance across a multi-view setting. A vast majority of the
previous works achieve this by optimizing the scene with a specific style
image. In contrast, we propose a novel architecture trained on a collection of
style images, that at test time produces high quality stylized novel views. Our
work builds upon the framework of 3D Gaussian splatting. For a given scene, we
take the pretrained Gaussians and process them using a multi-resolution hash
grid and a tiny MLP to obtain the conditional stylized views. The explicit
nature of 3D Gaussians gives us inherent advantages over NeRF-based methods
including geometric consistency, along with having a fast training and
rendering regime. This enables our method to be useful for vast practical use
cases such as in augmented or virtual reality applications. Through our
experiments, we show that our method achieves state-of-the-art performance with
superior visual quality on various indoor and outdoor real-world data. |
Introduces Gaussian Splatting in Style (GSS), a novel method for real-time neural scene stylization based on 3D Gaussian splatting. |
Real-time scene stylization is crucial for applications like AR/VR, and existing methods often lack speed or consistency. GSS addresses this by leveraging the efficiency and explicit nature of 3D Gaussian representations. |
GSS uses pre-trained 3D Gaussians and a 2D stylization module (AdaIN). It learns a mapping from Gaussian positions and style image latents to stylized RGB colors using a multi-resolution hash grid and a tiny MLP. This allows for view-dependent color prediction without sacrificing rendering speed. |
GSS achieves state-of-the-art performance in short-term and long-term view consistency, outperforming NeRF-based methods.
Qualitative results show GSS excels in preserving content details and faithfully transferring style features, surpassing baselines in accuracy and visual quality.
GSS renders stylized novel views at approximately 157 FPS, significantly faster than other methods due to its efficient 3DGS backbone. |
The current method relies on pre-trained 3D Gaussians, limiting its application to scenes with available 3DGS models.
Further exploration of alternative 2D stylization techniques or incorporating semantic information could enhance stylization quality. |
scene stylization, gaussian splatting, 3dgs, real-time rendering, novel view synthesis |
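The report above names AdaIN as the 2D stylization module. As a reference point, here is a minimal sketch of standard adaptive instance normalization on feature maps: content features are normalized per channel and then rescaled to the style features' per-channel statistics. How GSS wires AdaIN into its hash grid and MLP is not specified in this report, so the sketch shows only the operation itself, on hypothetical random features.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """content, style: (C, H, W) feature maps; returns content with style statistics."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(4, 16, 16))
style = rng.normal(loc=2.0, scale=3.0, size=(4, 16, 16))
stylized = adain(content, style)
```

After the call, each channel of `stylized` has approximately the mean and standard deviation of the corresponding style channel, while the spatial layout still comes from the content features.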
2403.08436
Report |
PFStorer: Personalized Face Restoration and Super-Resolution |
Tuomas Varanka, Tapani Toivonen, Soumya Tripathy, Guoying Zhao, Erman Acar |
Recent developments in face restoration have achieved remarkable results in
producing high-quality and lifelike outputs. The stunning results, however, often
fail to be faithful to the identity of the person, as the models
lack the necessary context. In this paper, we explore the potential of personalized
face restoration with diffusion models. In our approach, a restoration model is
personalized using a few images of the identity, leading to tailored
restoration with respect to the identity while retaining fine-grained details.
By using independent trainable blocks for personalization, the rich prior of a
base restoration model can be exploited to its fullest. To avoid the model
relying on parts of identity left in the conditioning low-quality images, a
generative regularizer is employed. With a learnable parameter, the model
learns to balance between the details generated based on the input image and
the degree of personalization. Moreover, we improve the training pipeline of
face restoration models to enable an alignment-free approach. We showcase the
robust capabilities of our approach in several real-world scenarios with
multiple identities, demonstrating our method's ability to generate
fine-grained details with faithful restoration. In the user study we evaluate
the perceptual quality and faithfulness of the generated details, with our
method being voted best 61% of the time compared to the second best with 25% of
the votes. |
This paper proposes PFStorer, a personalized face restoration method using diffusion models that leverages a few high-quality reference images to restore low-quality face images while preserving identity. |
Face restoration is ill-posed, with multiple plausible solutions. Existing methods often fail to retain the identity or generate fine-grained details, especially in challenging real-world scenarios. |
PFStorer personalizes a pre-trained face restoration diffusion model by fine-tuning it with reference images. It utilizes independent trainable blocks for personalization, preserving the base model's priors. A generative regularizer forces the model to learn a robust identity representation solely from reference images. Additionally, the base model training pipeline is improved with an alignment-free approach and robust noise generation. |
PFStorer outperforms state-of-the-art methods in preserving identity features, evidenced by quantitative metrics and a user study.
The method demonstrates robustness in handling real-world degradations and variations in pose and illumination.
Learnable adapters and the generative regularizer are crucial for balancing personalization and retaining restoration quality. |
The output is limited by the quality and appearance variations present in the provided reference images.
PFStorer inherits limitations of diffusion models, such as slow sampling speed and occasional artifacts. |
face restoration, diffusion models, personalization, generative regularization, alignment-free |
2403.08381
Report |
Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models |
Pengze Zhang, Hubery Yin, Chen Li, Xiaohua Xie |
Most diffusion models assume that the reverse process adheres to a Gaussian
distribution. However, this approximation has not been rigorously validated,
especially at singularities, where t=0 and t=1. Improperly dealing with such
singularities leads to an average brightness issue in applications, and limits
the generation of images with extreme brightness or darkness. We primarily
focus on tackling singularities from both theoretical and practical
perspectives. Initially, we establish the error bounds for the reverse process
approximation, and showcase its Gaussian characteristics at singularity time
steps. Based on this theoretical insight, we confirm the singularity at t=1 is
conditionally removable while the one at t=0 is an inherent property. Building on these
conclusions, we propose a novel plug-and-play method SingDiffusion
to address the initial singular time step sampling, which not only effectively
resolves the average brightness issue for a wide range of diffusion models
without extra training efforts, but also enhances their generation capability
by achieving notably lower FID scores. |
The paper proposes SingDiffusion, a plug-and-play method to address the singularity issue at the initial time step in diffusion models, which leads to an average brightness issue in generated images. |
Most diffusion models ignore singularities at t=0 and t=1, resulting in an inability to generate images with extreme brightness and darkness. Existing solutions require model-specific retraining, limiting their practicality. |
The authors prove the approximate Gaussian characteristics of the reverse diffusion process at all time steps. They analyze the singularities and devise SingDiffusion, which trains a separate model for the initial sampling step (t=1) using x-prediction and seamlessly integrates with existing pre-trained diffusion models for subsequent steps. |
SingDiffusion effectively resolves the average brightness issue, allowing for the generation of both bright and dark images.
SingDiffusion improves the FID scores of existing diffusion models, demonstrating enhanced image quality.
SingDiffusion is a once-trained, plug-and-play module compatible with a wide range of pre-trained models and plugins like ControlNet. |
The current training data only includes image-prompt pairs, limiting its application to other domains like audio generation.
The normalization operation for classifier-free guidance at the initial time step could be further improved. |
diffusion models, singularity, average brightness, image generation, plug-and-play |
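To make the SingDiffusion entry above more concrete, here is a minimal sketch of why the initial step is singular and how x-prediction sidesteps it. It assumes the common forward parametrization x_t = alpha_t * x_0 + sigma_t * eps with alpha_1 = 0 and sigma_1 = 1, which the summary does not state. The function names are hypothetical, and the deterministic DDIM-style update is a stand-in, not necessarily the authors' sampler.

```python
def x0_from_eps_prediction(x_t, eps_pred, alpha_t, sigma_t):
    """Recover x0 from a noise prediction; undefined when alpha_t == 0."""
    if alpha_t == 0.0:
        raise ZeroDivisionError("alpha_t is 0 at t=1: eps-prediction cannot recover x0")
    return [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps_pred)]


def first_step_with_x_prediction(x_1, x0_pred, alpha_next, sigma_next):
    """Deterministic (DDIM-style) move from t=1 to the next time step.

    Assumption: at t=1 the sample is pure noise, so x_1 is reused as the noise term.
    """
    return [alpha_next * x0 + sigma_next * n for x0, n in zip(x0_pred, x_1)]


x_1 = [1.0, -1.0]     # pure-noise sample at t=1
x0_pred = [0.5, 0.5]  # stand-in for the output of a separately trained x-prediction model
x_next = first_step_with_x_prediction(x_1, x0_pred, alpha_next=0.1, sigma_next=0.9)
print(x_next)
```

After this first step, a pre-trained model would take over for the remaining time steps, which is the plug-and-play behaviour the summary describes.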
2403.08277
Report |
VIGFace: Virtual Identity Generation Model for Face Image Synthesis |
Minsoo Kim, Min-Cheol Sagong, Gi Pyo Nam, Junghyun Cho, Ig-Jae Kim |
Deep learning-based face recognition continues to face challenges due to its
reliance on huge datasets obtained from web crawling, which can be costly to
gather and raise significant real-world privacy concerns. To address this
issue, we propose VIGFace, a novel framework capable of generating synthetic
facial images. Initially, we train the face recognition model using a real face
dataset and create a feature space for both real and virtual IDs where virtual
prototypes are orthogonal to other prototypes. Subsequently, we generate
synthetic images by using the diffusion model based on the feature space. Our
proposed framework provides two significant benefits. Firstly, it allows for
creating virtual facial images without concerns about portrait rights,
guaranteeing that the generated virtual face images are clearly differentiated
from existing individuals. Secondly, it serves as an effective augmentation
method by incorporating real existing images. Further experiments demonstrate
the efficacy of our framework, achieving state-of-the-art results from both
perspectives without any external data. |
Presents VIGFace, a novel framework for generating synthetic facial images of virtual identities for face recognition, addressing privacy concerns and data scarcity. |
Real face datasets for face recognition raise privacy concerns, are costly to obtain, and can contain label inaccuracies and biases. |
Trains a face recognition model with real data, incorporates virtual identity prototypes, and utilizes a diffusion model to generate synthetic images based on the feature space of the trained model. |
Generated virtual face images demonstrate high intra-class variance and inter-class diversity.
FR model trained solely on VIGFace virtual images achieves state-of-the-art performance, comparable to models trained on real datasets.
Using VIGFace for data augmentation, by combining its generated images with real data, further improves FR model performance. |
The paper focuses on frontal face images, and extending the approach to handle pose variations could be explored.
Investigating the generalization capability of FR models trained on VIGFace to other downstream tasks or datasets is a potential future direction. |
face recognition, diffusion model, image generation, data augmentation, synthetic data |
2403.08268
Report |
Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts |
Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, Qifeng Chen |
Despite recent advances in image-to-video generation, better controllability
and local animation are less explored. Most existing image-to-video methods are
not locally aware and tend to move the entire scene. However, human artists may
need to control the movement of different objects or regions. Additionally,
current I2V methods require users not only to describe the target motion but
also to provide redundant detailed descriptions of frame contents. These two
issues hinder the practical utilization of current I2V tools. In this paper, we
propose a practical framework, named Follow-Your-Click, to achieve image
animation with a simple user click (for specifying what to move) and a short
motion prompt (for specifying how to move). Technically, we propose the
first-frame masking strategy, which significantly improves the video generation
quality, and a motion-augmented module equipped with a short motion prompt
dataset to improve the short prompt following abilities of our model. To
further control the motion speed, we propose flow-based motion magnitude
control to control the speed of target movement more precisely. Our framework
has simpler yet precise user control and better generation performance than
previous methods. Extensive experiments compared with 7 baselines, including
both commercial tools and research methods on 8 metrics, suggest the
superiority of our approach. Project Page: https://follow-your-click.github.io/ |
This paper introduces Follow-Your-Click, a novel framework for open-domain regional image animation controlled by a user click (specifying the region to animate) and a short motion prompt (describing the desired motion). |
Current image-to-video generation methods lack local animation control, requiring detailed scene descriptions and struggling to follow short motion prompts. This limits their practical use for animators who need precise control over object motion. |
The framework leverages a pre-trained image LDM and incorporates several key components: a user click converted to a mask using SAM, first-frame masking training for improved temporal consistency, a motion-augmented module trained on a short prompt dataset (WebVid-Motion) for enhanced prompt following, and flow-based motion magnitude control for accurate motion speed adjustment. |
Follow-Your-Click demonstrates superior regional animation capabilities compared to existing open-sourced and commercial baselines, as shown in qualitative comparisons and quantitative evaluations using metrics like FVD, temporal consistency, and text alignment.
Ablation studies validate the effectiveness of each proposed component, such as first-frame masking for enhanced temporal coherence and the motion-augmented module for improved short prompt following.
The framework shows potential for applications like multi-region animation and integration with ControlNet for precise motion control using pose conditioning. |
The approach still faces limitations in generating complex and large human motions, potentially due to dataset bias and the complexity of such movements.
Future work could explore incorporating more sophisticated motion control mechanisms and expanding the diversity of motion in the training dataset. |
image animation, text-to-video generation, diffusion models, regional control, short prompt following |
2403.08255
Report |
Make Me Happier: Evoking Emotions Through Image Diffusion Models |
Qing Lin, Jingfeng Zhang, Yew Soon Ong, Mengmi Zhang |
Despite the rapid progress in image generation, emotional image editing
remains under-explored. The semantics, context, and structure of an image can
evoke emotional responses, making emotional image editing techniques valuable
for various real-world applications, including treatment of psychological
disorders, commercialization of products, and artistic design. For the first
time, we present a novel challenge of emotion-evoked image generation, aiming
to synthesize images that evoke target emotions while retaining the semantics
and structures of the original scenes. To address this challenge, we propose a
diffusion model capable of effectively understanding and editing source images
to convey desired emotions and sentiments. Moreover, due to the lack of emotion
editing datasets, we provide a unique dataset consisting of 340,000 pairs of
images and their emotion annotations. Furthermore, we conduct human
psychophysics experiments and introduce four new evaluation metrics to
systematically benchmark all the methods. Experimental results demonstrate that
our method surpasses all competitive baselines. Our diffusion model is capable
of identifying emotional cues from original images, editing images that elicit
desired emotions, and meanwhile, preserving the semantic structure of the
original images. All code, model, and data will be made public. |
This paper introduces the novel problem of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while preserving the semantics and structures of original scenes. |
Emotion-evoked image editing has applications in various fields, including treatment of psychological disorders, product commercialization, and artistic design. |
This paper proposes EmoEditor, a novel diffusion model with a dual-branch architecture that integrates emotion-conditioned global context and local emotional cues from source images. It also introduces a new dataset EmoPair, consisting of 340,000 image pairs with emotion annotations. |
EmoEditor outperforms existing image editing methods in human psychophysics experiments, successfully evoking desired emotions in viewers.
The proposed method preserves structural coherence and semantic consistency with source images while effectively manipulating emotional content.
EmoEditor generalizes to challenging scenarios, including within-valence emotion editing and transforming emotionally neutral images. |
The model faces limitations in accurately handling fine-grained details on small faces within crowded scenes.
Generating emotion-evoked images without exacerbating semantic and structural disparities between source and target images remains a challenge. |
image generation, emotion ai, diffusion models, image editing, computer vision |
2403.08108
Report |
TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection |
Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Fei Wen, Hugo Latapie, Mohsen Imani |
Task-oriented object detection aims to find objects suitable for
accomplishing specific tasks. As a challenging task, it requires simultaneous
visual data processing and reasoning under ambiguous semantics. Recent
solutions are mainly all-in-one models. However, the object detection backbones
are pre-trained without text supervision. Thus, to incorporate task
requirements, their intricate models undergo extensive learning on a highly
imbalanced and scarce dataset, resulting in capped performance, laborious
training, and poor generalizability. In contrast, we propose TaskCLIP, a more
natural two-stage design composed of general object detection and task-guided
object selection. Particularly for the latter, we resort to the recently
successful large Vision-Language Models (VLMs) as our backbone, which provides
rich semantic knowledge and a uniform embedding space for images and texts.
Nevertheless, the naive application of VLMs leads to sub-optimal quality, due
to the misalignment between embeddings of object images and their visual
attributes, which are mainly adjective phrases. To this end, we design a
transformer-based aligner after the pre-trained VLMs to re-calibrate both
embeddings. Finally, we employ a trainable score function to post-process the
VLM matching results for object selection. Experimental results demonstrate
that our TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by
3.5% and only requires a single NVIDIA RTX 4090 for both training and
inference. |
TaskCLIP, a novel two-stage framework for task-oriented object detection that leverages pre-trained Vision-Language Models (VLMs) for efficient and effective object selection. |
Existing all-in-one models for task-oriented object detection suffer from data scarcity and imbalance, leading to capped performance, laborious training, and poor generalizability. |
TaskCLIP first performs general object detection. Then, it leverages pre-trained VLMs like CLIP to match image patches with task-relevant visual attributes, generated by an LLM. A transformer-based aligner recalibrates the embedding space, and a score function guides object selection. |
TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% mAP@0.5 on the COCO-Tasks dataset.
It requires only a single NVIDIA RTX 4090 GPU for both training and inference, demonstrating higher training efficiency.
A select-by-grouping mechanism effectively mitigates the high false negative rate caused by imbalanced training samples. |
TaskCLIP can be sensitive to the quality of bounding boxes generated by the object detection network.
The model might misidentify objects with misleading appearances even after embedding recalibration. |
task-oriented object detection, vision-language models, clip, transformer, coco-tasks |
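The second stage of TaskCLIP described above (matching detected objects against task-relevant attributes) can be sketched roughly as follows. This is a simplified illustration only: plain cosine similarity, a mean over attributes, and a fixed threshold stand in for the paper's transformer-based aligner and trainable score function, and all names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_objects(object_embs, attribute_embs, threshold=0.5):
    """Score each object by its mean similarity to the attribute embeddings.

    Returns the indices of objects whose score reaches the threshold, plus all scores.
    """
    scores = [
        sum(cosine(obj, attr) for attr in attribute_embs) / len(attribute_embs)
        for obj in object_embs
    ]
    selected = [i for i, s in enumerate(scores) if s >= threshold]
    return selected, scores


# Toy 2-D embeddings standing in for VLM image and text features.
objects = [[1.0, 0.0], [0.0, 1.0]]
attributes = [[1.0, 0.0], [1.0, 1.0]]
selected, scores = select_objects(objects, attributes)
print(selected, scores)
```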
2403.07874
Report |
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension |
Lei Zhu, Fangyun Wei, Yanye Lu |
In this work, we investigate the potential of a large language model (LLM) to
directly comprehend visual signals without the necessity of fine-tuning on
multi-modal datasets. The foundational concept of our method views an image as
a linguistic entity, and translates it to a set of discrete words derived from
the LLM's vocabulary. To achieve this, we present the Vision-to-Language
Tokenizer, abbreviated as V2L Tokenizer, which transforms an image into a
"foreign language" with the combined aid of an encoder-decoder, the LLM
vocabulary, and a CLIP model. With this innovative image encoding, the LLM
gains the ability not only for visual comprehension but also for image
denoising and restoration in an auto-regressive fashion, crucially without any
fine-tuning. We undertake rigorous experiments to validate our method,
encompassing understanding tasks like image recognition, image captioning, and
visual question answering, as well as image denoising tasks like inpainting,
outpainting, deblurring, and shift restoration. Code and models are available
at https://github.com/zh460045050/V2L-Tokenizer. |
This paper introduces the Vision-to-Language Tokenizer (V2L Tokenizer), a novel approach that enables a frozen large language model (LLM) to comprehend and process visual signals directly without requiring fine-tuning on multi-modal datasets. |
This method is crucial for expanding the capabilities of LLMs to encompass visual comprehension and generation without the need for resource-intensive fine-tuning. |
The V2L Tokenizer translates images into a set of discrete tokens drawn from the LLM's vocabulary, viewing images as a "foreign language." It employs an encoder-decoder structure with two quantizers and leverages the LLM's vocabulary and CLIP for semantic mapping. |
The V2L Tokenizer outperforms previous methods in few-shot image classification tasks, demonstrating its ability to enable LLMs to understand visual concepts.
It excels in image denoising tasks like inpainting and deblurring, showcasing its capacity to generate high-quality visual content.
The approach effectively bridges the gap between visual and language modalities, allowing LLMs to perform tasks like image captioning and visual question answering. |
The performance of image generation tasks, while promising, can be further enhanced, potentially through LLM fine-tuning or alternative optimization strategies.
The reliance on a pre-trained CLIP model introduces a dependency on external resources, and exploring CLIP-free alternatives could be a future direction. |
large language models, vision-to-language, image understanding, image denoising, tokenization |
2403.07860
Report |
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation |
Shihao Zhao, Shaozhe Hao, Bojia Zi, Huaizhe Xu, Kwan-Yee K. Wong |
Text-to-image generation has made significant advancements with the
introduction of text-to-image diffusion models. These models typically consist
of a language model that interprets user prompts and a vision model that
generates corresponding images. As language and vision models continue to
progress in their respective domains, there is a great potential in exploring
the replacement of components in text-to-image diffusion models with more
advanced counterparts. A broader research objective would therefore be to
investigate the integration of any two unrelated language and generative vision
models for text-to-image generation. In this paper, we explore this objective
and propose LaVi-Bridge, a pipeline that enables the integration of diverse
pre-trained language models and generative vision models for text-to-image
generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and
plug-and-play approach without requiring modifications to the original weights
of the language and vision models. Our pipeline is compatible with various
language models and generative vision models, accommodating different
structures. Within this framework, we demonstrate that incorporating superior
modules, such as more advanced language models or generative vision models,
results in notable improvements in capabilities like text alignment or image
quality. Extensive evaluations have been conducted to verify the effectiveness
of LaVi-Bridge. Code is available at
https://github.com/ShihaoZhaoZSH/LaVi-Bridge. |
This paper introduces LaVi-Bridge, a flexible pipeline for text-to-image generation that allows seamless integration of diverse pre-trained language and generative vision models. |
The rapid progress in deep language and vision models poses a challenge for text-to-image generation in terms of integrating advanced models into existing text-to-image diffusion models. This paper bridges this gap by providing a framework for integrating any two unrelated language and vision models. |
LaVi-Bridge leverages LoRA and adapters to establish connections between pre-trained language and vision models without modifying their original weights. This allows for a plug-and-play approach where different models can be easily swapped and tested. |
Integrating superior models under LaVi-Bridge leads to improved performance, such as enhanced semantic understanding with advanced language models (e.g., Llama-2) or improved image quality with more powerful generative vision models (e.g., PixArt's transformer).
The study demonstrated that LaVi-Bridge is compatible with various language model structures (encoder-only, encoder-decoder, decoder-only) and generative vision model structures (U-Net-based and Transformer-based).
LaVi-Bridge requires only a relatively small dataset for fine-tuning the LoRA and adapter components, making it efficient in terms of training data and computational resources. |
Training with LaVi-Bridge on the same models and weights as an existing text-to-image diffusion model may not lead to significant improvements and might even slightly decrease performance.
The paper primarily focuses on combining existing models and does not delve into the exploration of novel language or vision models specifically designed for text-to-image generation. |
text-to-image generation, diffusion models, language models, generative vision models, lora |
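The LaVi-Bridge entry above relies on LoRA to adapt models without touching their original weights. The sketch below shows the generic LoRA idea (a frozen weight plus a trainable low-rank update) in plain Python. It is not the LaVi-Bridge code; the helper names, shapes, and values are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B, where only A and B would be trained."""
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, lora_a), lora_b)
    return [
        [b + scale * d for b, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


x = [[1.0, 2.0]]                   # one input row with two features
w = [[1.0, 0.0], [0.0, 1.0]]       # frozen pre-trained weight (identity here)
a = [[1.0], [1.0]]                 # rank-1 down-projection
b_zero = [[0.0, 0.0]]              # zero-initialised up-projection: no change at start
b_trained = [[0.5, -0.5]]          # stand-in for values after fine-tuning

out_initial = lora_forward(x, w, a, b_zero)
out_adapted = lora_forward(x, w, a, b_trained)
print(out_initial, out_adapted)
```

Because `w` is never modified, removing the low-rank terms recovers the original model exactly, which matches the plug-and-play property the summary emphasizes.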
2403.07764
Report |
Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model |
Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming Liu, Huaxia Li, Xu Tang, Yao Hu, Haibo Zhao |
Current makeup transfer methods are limited to simple makeup styles, making
them difficult to apply in real-world scenarios. In this paper, we introduce
Stable-Makeup, a novel diffusion-based makeup transfer method capable of
robustly transferring a wide range of real-world makeup onto user-provided
faces. Stable-Makeup is based on a pre-trained diffusion model and utilizes a
Detail-Preserving (D-P) makeup encoder to encode makeup details. It also
employs content and structural control modules to preserve the content and
structural information of the source image. With the aid of our newly added
makeup cross-attention layers in U-Net, we can accurately transfer the detailed
makeup to the corresponding position in the source image. After
content-structure decoupling training, Stable-Makeup can maintain content and
the facial structure of the source image. Moreover, our method has demonstrated
strong robustness and generalizability, making it applicable to various tasks
such as cross-domain makeup transfer and makeup-guided text-to-image
generation. Extensive experiments have demonstrated that our approach delivers
state-of-the-art (SOTA) results among existing makeup transfer methods and
shows promising potential for broad applications in various
related fields. |
Stable-Makeup, a novel diffusion-based makeup transfer method that robustly transfers diverse real-world makeup styles onto user-provided faces, addressing limitations of existing GAN-based methods in handling high-detail and creative cosmetics. |
Existing makeup transfer methods struggle with complex real-world makeup styles, limiting their practicality for diverse and intricate designs. This work aims to overcome this limitation and enable high-quality makeup transfer for a broader range of styles. |
Stable-Makeup leverages a pre-trained diffusion model and introduces: 1) Detail-Preserving Makeup Encoder to capture intricate makeup details, 2) Makeup Cross-attention Layers to align makeup features with facial regions, 3) Content and Structural Control Modules to maintain source image fidelity. The method is trained on a newly created dataset of 20k paired images with diverse makeup styles. |
Stable-Makeup demonstrates state-of-the-art performance, outperforming existing methods in transferring both light and heavy makeup with superior detail preservation.
Quantitative evaluations using CLIP-I, DINO-I, SSIM, and L2-M metrics confirm superior makeup transfer capability and content-structure preservation.
User studies validate the perceptual quality of Stable-Makeup, highlighting its ability to generate realistic and aesthetically pleasing makeup transfer results. |
Potential inconsistencies in facial structure within the training dataset, arising from limitations of text-based editing methods, might impact the model's performance.
Future work includes refining data selection and exploring 3D makeup transfer to further enhance the method's capabilities. |
makeup transfer, diffusion models, detail preservation, content-structure control, real-world makeup |
2403.07711
Report |
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces |
Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, Yutaka Matsuo |
Given the remarkable achievements in image generation through diffusion
models, the research community has shown increasing interest in extending these
models to video generation. Recent diffusion models for video generation have
predominantly utilized attention layers to extract temporal features. However,
attention layers are limited by their memory consumption, which increases
quadratically with the length of the sequence. This limitation presents
significant challenges when attempting to generate longer video sequences using
diffusion models. To overcome this challenge, we propose leveraging state-space
models (SSMs). SSMs have recently gained attention as viable alternatives due
to their linear memory consumption relative to sequence length. In the
experiments, we first evaluate our SSM-based model with UCF101, a standard
benchmark of video generation. In addition, to investigate the potential of
SSMs for longer video generation, we perform an experiment using the MineRL
Navigate dataset, varying the number of frames to 64, 200, and 400. In these
settings, our SSM-based model considerably reduces memory consumption for
longer sequences, while maintaining FVD scores competitive with the
attention-based models. Our code is available at
https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models. |
This paper introduces a novel temporal state-space model (SSM) layer to replace the memory-intensive attention mechanism in video diffusion models (VDMs) for efficient video generation. |
Existing VDMs heavily rely on attention layers for capturing temporal features, leading to quadratic memory consumption with sequence length, hindering longer video generation. |
The proposed temporal SSM layer leverages bidirectional SSMs to capture comprehensive temporal dynamics, augmented by a multi-layer perceptron (MLP) to enhance information integration across dimensions. |
The SSM-based VDM achieves competitive or superior video generation quality (FVD score) compared to attention-based models on UCF101.
The SSM-based model demonstrates superior memory efficiency, enabling training on 400-frame MineRL Navigate videos, while attention-based methods fail due to memory limitations.
Ablation studies highlight the critical role of bidirectional SSMs and MLPs in achieving high-quality video generation. |
The study primarily focuses on unconditional video generation, leaving extensions to conditional generation for future work.
Exploring alternative SSM architectures and their impact on long-term video generation is a promising research direction. |
video generation, diffusion models, state-space models, attention mechanism, memory efficiency |
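The quadratic-versus-linear memory argument in the entry above can be checked with simple arithmetic. The sketch below counts the entries of a single temporal attention score matrix against the per-frame hidden states of an SSM. The `state_dim` value is an arbitrary illustrative choice, and real memory use depends on many other factors (batch size, heads, channels, implementation).

```python
def attention_score_entries(num_frames: int) -> int:
    """Entries in one temporal attention score matrix (frames x frames)."""
    return num_frames * num_frames


def ssm_state_entries(num_frames: int, state_dim: int = 64) -> int:
    """Entries needed to keep one hidden state per frame (state_dim is illustrative)."""
    return num_frames * state_dim


# Frame counts taken from the MineRL Navigate experiment described above.
for frames in (64, 200, 400):
    print(frames, attention_score_entries(frames), ssm_state_entries(frames))
```

Going from 64 to 400 frames multiplies the attention count by about 39, while the SSM count grows by 6.25, the same factor as the sequence length.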
2403.07605
Report |
Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation |
Michael Ogezi, Ning Shi |
In text-to-image generation, using negative prompts, which describe
undesirable image characteristics, can significantly boost image quality.
However, producing good negative prompts is manual and tedious. To address
this, we propose NegOpt, a novel method for optimizing negative prompt
generation toward enhanced image generation, using supervised fine-tuning and
reinforcement learning. Our combined approach results in a substantial increase
of 25% in Inception Score compared to other approaches and surpasses
ground-truth negative prompts from the test set. Furthermore, with NegOpt we
can preferentially optimize the metrics most important to us. Finally, we
construct Negative Prompts DB, a dataset of negative prompts. |
NegOpt, a novel method for optimizing negative prompts in text-to-image generation, aiming to improve image quality by guiding the model away from undesirable characteristics. |
Generating high-quality negative prompts manually is tedious and challenging. This method automates the process and leads to significant improvements in image aesthetics and fidelity. |
A two-step approach: 1) Fine-tuning a sequence-to-sequence model on a dataset of normal and corresponding negative prompts, 2) Employing reinforcement learning to further optimize the model based on a reward function that considers aesthetics, prompt alignment, and image fidelity. |
Achieves a 25% increase in Inception Score compared to other methods, indicating improved image quality and diversity.
Outperforms ground-truth negative prompts from the test set, demonstrating the model's ability to learn effective patterns.
Allows for preferential optimization of specific image qualities, such as aesthetics, by adjusting the weights in the reward function. |
The dataset used to train the model may contain inherent biases, potentially leading to biased image generation.
There is a risk of misuse, where the method could be exploited to generate harmful or misleading content. |
text-to-image generation, negative prompts, prompt optimization, reinforcement learning, image quality |
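The preferential optimization mentioned in the NegOpt entry above comes from adjusting weights in the reward. A minimal sketch of such a weighted reward follows. The normalized weighted-sum form, the function name, and the example scores are assumptions for illustration, since the summary does not give the exact formula.

```python
def combined_reward(aesthetics, alignment, fidelity, weights=(1.0, 1.0, 1.0)):
    """Weighted average of three per-image scores, each assumed to lie in [0, 1]."""
    w_aes, w_align, w_fid = weights
    total = w_aes + w_align + w_fid
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_aes * aesthetics + w_align * alignment + w_fid * fidelity) / total


balanced = combined_reward(0.8, 0.4, 0.6)
aesthetics_first = combined_reward(0.8, 0.4, 0.6, weights=(2.0, 1.0, 1.0))
print(balanced, aesthetics_first)
```

Raising the aesthetics weight increases the reward for this example because its aesthetics score is the highest of the three, so a policy trained on that reward would be pushed toward aesthetics.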
2403.07589
Report |
PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution |
Honghao Chen, Xiangxiang Chu, Yongjian Ren, Xin Zhao, Kaiqi Huang |
Recently, some large kernel convnets strike back with appealing performance
and efficiency. However, given the quadratic complexity of convolution, scaling up
kernels can bring about an enormous number of parameters, and the proliferated
parameters can induce severe optimization problems. Due to these issues, current
CNNs compromise by scaling up to 51x51 in the form of stripe convolution (i.e.,
51x5 + 5x51) and start to saturate as the kernel size continues growing. In
this paper, we delve into addressing these vital issues and explore whether we
can continue scaling up kernels for more performance gains. Inspired by human
vision, we propose a human-like peripheral convolution that efficiently reduces
over 90% parameter count of dense grid convolution through parameter sharing,
and manages to scale up the kernel size to extremely large values. Our peripheral
convolution behaves much like human vision, reducing the complexity of
convolution from O(K^2) to O(log K) without degrading performance. Built on
this, we propose Parameter-efficient Large Kernel Network (PeLK). Our PeLK
outperforms modern vision Transformers and ConvNet architectures like Swin,
ConvNeXt, RepLKNet and SLaK on various vision tasks including ImageNet
classification, semantic segmentation on ADE20K and object detection on MS
COCO. For the first time, we successfully scale up the kernel size of CNNs to
an unprecedented 101x101 and demonstrate consistent improvements. |
Proposes Peripheral Convolution, a new convolution form inspired by human peripheral vision, to reduce parameter complexity in large kernel CNNs, enabling extremely large kernels (e.g., 101x101). |
Large kernel CNNs are effective but suffer from quadratic parameter complexity, limiting their scalability. Peripheral convolution addresses this by efficiently reducing parameters while maintaining performance. |
Peripheral convolution uses parameter sharing in the kernel's peripheral regions with exponentially increasing granularity, mimicking human vision. It also incorporates kernel-wise positional embedding to compensate for detail blurring caused by sharing. |
Dense grid convolution consistently outperforms stripe convolution across different kernel sizes.
PeLK, built on peripheral convolution, achieves state-of-the-art performance on ADE20K, MS COCO, and ImageNet, surpassing Swin Transformer and ConvNeXt.
Peripheral convolution enables scaling kernel size to 101x101 with consistent performance gains, demonstrating its effectiveness. |
Exploring even larger kernel sizes and input resolutions might be computationally expensive.
The optimal kernel size configuration might need adjustments based on specific tasks and datasets. |
convolutional neural networks, large kernel convolution, peripheral vision, parameter efficiency, effective receptive field |
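The parameter counts behind the PeLK entry above are easy to reproduce, and a toy sharing scheme shows how exponentially growing peripheral bins lead to logarithmic growth. The dense and stripe counts follow directly from the kernel shapes in the abstract. The binning rule in `shared_index_1d` (a fine-grained center of radius 2, then bins that double in width) is a hypothetical illustration, not the paper's exact sharing grid.

```python
def dense_params(k: int) -> int:
    """Weights in one dense k x k kernel."""
    return k * k


def stripe_params(k: int, width: int = 5) -> int:
    """Weights in a stripe pair: one k x width kernel plus one width x k kernel."""
    return 2 * k * width


def shared_index_1d(offset: int, center_radius: int = 2) -> int:
    """Map an offset from the kernel center to a shared-parameter bin (toy rule).

    Offsets within center_radius keep their own bin; farther out, bin widths
    double (1, 2, 4, 8, ...), so the number of bins grows with log2 of the distance.
    """
    sign = -1 if offset < 0 else 1
    dist = abs(offset)
    if dist <= center_radius:
        return sign * dist
    extra = dist - center_radius
    return sign * (center_radius + extra.bit_length())


def shared_params_2d(k: int) -> int:
    """Unique weights when the 1-D binning is applied along both axes."""
    half = k // 2
    bins = {shared_index_1d(o) for o in range(-half, half + 1)}
    return len(bins) ** 2


print(dense_params(51), stripe_params(51))        # dense vs stripe at 51x51
print(dense_params(101), shared_params_2d(101))   # dense vs toy sharing at 101x101
```

Under this toy rule, a 101x101 kernel needs 289 unique weights instead of 10201, a reduction of roughly 97%, which is in line with the "over 90%" figure quoted in the abstract but should not be read as the paper's actual count.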
2403.07547
Report |
SMURF: Continuous Dynamics for Motion-Deblurring Radiance Fields |
Jungho Lee, Dogyoon Lee, Minhyeok Lee, Donghyung Kim, Sangyoun Lee |
Neural radiance fields (NeRF) have attracted considerable attention for their
exceptional ability in synthesizing novel views with high fidelity. However,
the presence of motion blur, resulting from slight camera movements during
extended shutter exposures, poses a significant challenge, potentially
compromising the quality of the reconstructed 3D scenes. While recent studies
have addressed this issue, they do not consider the continuous dynamics of
camera movements during image acquisition, leading to inaccurate scene
reconstruction. Additionally, these methods are plagued by slow training and
rendering speed. To effectively handle these issues, we propose sequential
motion understanding radiance fields (SMURF), a novel approach that employs
neural ordinary differential equation (Neural-ODE) to model continuous camera
motion and leverages the explicit volumetric representation method for faster
training and robustness to motion-blurred input images. The core idea of the
SMURF is the continuous motion blurring kernel (CMBK), a unique module designed to
model continuous camera movements for processing blurry inputs. Our model,
rigorously evaluated against benchmark datasets, demonstrates state-of-the-art
performance both quantitatively and qualitatively. |
This paper introduces SMURF, a novel method leveraging continuous dynamics for reconstructing sharp 3D scenes from motion-blurred images using neural radiance fields. |
Existing methods for handling motion blur in NeRF either neglect the continuous nature of camera motion or suffer from slow training and rendering speeds. SMURF addresses both limitations. |
SMURF employs a continuous motion blur kernel (CMBK) based on Neural-ODEs to model camera motion as a continuous function. It utilizes a tensor factorization-based representation (TensoRF) for faster training and robustness to blur. Two regularization techniques ensure accurate ray warping. |
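The idea of treating camera motion as a continuous function and averaging renders along it can be sketched as follows. This is a minimal illustration, not the authors' implementation: `velocity` is a hypothetical hand-written stand-in for the learned Neural-ODE dynamics, `sharp_render` is a toy replacement for the TensoRF render, and plain Euler integration is assumed in place of a proper ODE solver.

```python
import math

def velocity(offset, t):
    """Hypothetical stand-in for learned dynamics: d(offset)/dt at time t."""
    return (math.cos(t), math.sin(t))

def integrate_offsets(steps, dt, f=velocity):
    """Euler-integrate the camera offset over the exposure, starting at rest."""
    offset = (0.0, 0.0)
    offsets = [offset]
    for i in range(steps):
        vx, vy = f(offset, i * dt)
        offset = (offset[0] + dt * vx, offset[1] + dt * vy)
        offsets.append(offset)
    return offsets

def sharp_render(x, y):
    """Toy scene: a smooth intensity pattern standing in for a radiance-field render."""
    return 0.5 + 0.5 * math.sin(3.0 * x) * math.cos(2.0 * y)

def blurred_pixel(x, y, steps=8, dt=0.01, f=velocity):
    """Average sharp renders along the integrated trajectory to get a blurry pixel."""
    offsets = integrate_offsets(steps, dt, f)
    samples = [sharp_render(x + ox, y + oy) for ox, oy in offsets]
    return sum(samples) / len(samples)
```

With a static camera (zero velocity) every sample coincides, so the blurred pixel reduces to the sharp render.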
SMURF achieves state-of-the-art quantitative results on synthetic and real-world datasets, outperforming previous methods in PSNR, SSIM, and LPIPS.
The method significantly reduces training and rendering time compared to existing techniques.
Qualitative evaluation through novel view rendering demonstrates SMURF's ability to reconstruct detailed 3D scenes and accurately restore sharp features. |
The current backbone, while faster than some, could be further sped up using newer rasterization-based methods like 3D Gaussian Splatting.
Future work could explore extending the continuous dynamics approach to handle object motion blur in addition to camera motion blur. |
neural radiance fields, motion deblurring, continuous dynamics, neural odes, view synthesis |
2403.07508
Report |
MoAI: Mixture of All Intelligence for Large Language and Vision Models |
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro |
The rise of large language models (LLMs) and instruction tuning has led to
the current trend of instruction-tuned large language and vision models
(LLVMs). This trend involves either meticulously curating numerous instruction
tuning datasets tailored to specific objectives or enlarging LLVMs to manage
vast amounts of vision language (VL) data. However, current LLVMs have
disregarded the detailed and comprehensive real-world scene understanding
available from specialized computer vision (CV) models in visual perception
tasks such as segmentation, detection, scene graph generation (SGG), and
optical character recognition (OCR). Instead, the existing LLVMs rely mainly on
the large capacity and emergent capabilities of their LLM backbones. Therefore,
we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages
auxiliary visual information obtained from the outputs of external
segmentation, detection, SGG, and OCR models. MoAI operates through two newly
introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the
outputs of the external CV models, the MoAI-Compressor aligns and condenses
them to efficiently use relevant auxiliary visual information for VL tasks.
MoAI-Mixer then blends three types of intelligence (1) visual features, (2)
auxiliary features from the external CV models, and (3) language features by
utilizing the concept of Mixture of Experts. Through this integration, MoAI
significantly outperforms both open-source and closed-source LLVMs in numerous
zero-shot VL tasks, particularly those related to real-world scene
understanding such as object existence, positions, relations, and OCR without
enlarging the model size or curating extra visual instruction tuning datasets. |
Introduces MoAI, a new large language and vision model that leverages auxiliary visual information from external CV models and blends three types of intelligence: visual features, auxiliary features, and language features. |
Current LLVMs overlook the detailed real-world scene understanding offered by specialized CV models. MoAI aims to address this by incorporating these models to enhance visual perception capabilities in VL tasks. |
MoAI utilizes a MoAI-Compressor to process and condense verbalized outputs from external CV models (segmentation, detection, SGG, OCR). A MoAI-Mixer, inspired by MoE, then blends these auxiliary features with visual and language features from the backbone multimodal language model (MLM). |
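The Mixture-of-Experts style blending of the three feature types can be illustrated with a gating-weighted sum. This is only a sketch under stated assumptions: the function names are hypothetical, the gate logits are passed in by hand rather than produced by a learned gating network, and the real MoAI-Mixer is more elaborate than a single weighted average.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_features(visual, auxiliary, language, gate_logits):
    """Blend three equal-length feature vectors with softmax gate weights."""
    weights = softmax(gate_logits)
    sources = (visual, auxiliary, language)
    return [
        sum(w * src[i] for w, src in zip(weights, sources))
        for i in range(len(visual))
    ]
```

Equal gate logits give a plain average of the three sources, while a dominant logit makes the output follow that source almost exclusively.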
MoAI significantly outperforms open-source and closed-source LLVMs in zero-shot VL tasks, particularly those requiring real-world scene understanding.
It achieves this without increasing model size or curating additional visual instruction tuning datasets.
Ablation studies confirm the importance of each external CV model and the effectiveness of the MoAI-Compressor and MoAI-Mixer. |
MoAI is currently tailored for real-world scene understanding and could be extended to incorporate more CV models for broader capabilities.
Future work includes incorporating robust, unbiased, and explainable CV models for more precise and reliable outputs. |
large language and vision models, mixture of experts, computer vision, real-world scene understanding, visual perception |
2403.07500
Report |
Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image Generation |
Likun Li, Haoqi Zeng, Changpeng Yang, Haozhe Jia, Di Xu |
The objective of personalization and stylization in text-to-image is to
instruct a pre-trained diffusion model to analyze new concepts introduced by
users and incorporate them into expected styles. Recently, parameter-efficient
fine-tuning (PEFT) approaches have been widely adopted to address this task and
have greatly propelled the development of this field. Despite their popularity,
existing efficient fine-tuning methods still struggle to achieve effective
personalization and stylization in T2I generation. To address this issue, we
propose block-wise Low-Rank Adaptation (LoRA) to perform fine-grained
fine-tuning for different blocks of SD, which can generate images faithful to
the input prompts and target identity while exhibiting the desired style. Extensive
experiments demonstrate the effectiveness of the proposed method. |
This paper proposes block-wise Low-Rank Adaptation (LoRA) for Stable Diffusion, which selectively fine-tunes specific blocks of the model for improved personalization and stylization in text-to-image generation. |
Existing efficient fine-tuning methods, particularly LoRA, struggle to effectively combine personalization (e.g., a specific person's face) and stylization (e.g., a cartoon style) in generated images. |
The authors divide the Stable Diffusion U-Net into blocks (in-blocks, mid-block, out-blocks) and selectively apply LoRA fine-tuning to different blocks, exploring which block combinations yield the best results for combining character identity and artistic style. |
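Selective application of LoRA to chosen blocks can be sketched with the standard low-rank update W' = W + (alpha / r) * B A, merged only into the blocks listed as active. The block names (`in_0`, `mid`, `out_0`), the dictionary layout, and the helper names are hypothetical and chosen for illustration; they are not taken from the paper or from any library.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A for one weight matrix."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

def apply_blockwise(weights, adapters, active_blocks, alpha=1.0):
    """Merge LoRA adapters only into blocks named in active_blocks.

    weights:  {block_name: matrix}
    adapters: {block_name: (B, A)} with B of shape (d_out x r), A of shape (r x d_in)
    """
    merged = {}
    for name, w in weights.items():
        if name in active_blocks and name in adapters:
            b, a = adapters[name]
            merged[name] = lora_weight(w, b, a, alpha, rank=len(a))
        else:
            merged[name] = [row[:] for row in w]  # untouched copy
    return merged
```

Changing `active_blocks` is then enough to compare block combinations without retraining the adapters.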
Block-wise LoRA outperforms standard LoRA and LoCon in generating images with consistent personalized identities and stylized appearances.
Fine-tuning the top input and output blocks of the U-Net with style LoRA, while using full-block LoRA for character identity, achieved the best balance between personalization and stylization.
The study provides insights into the roles of different U-Net blocks in the image generation process, showing that bottom blocks are less important for preserving target information. |
The work primarily focuses on LoRA and could explore applying block-wise fine-tuning to other PEFT methods.
The impact of different block combinations on generation quality needs further investigation to develop a more principled approach for block selection. |
text-to-image generation, stable diffusion, personalization, stylization, low-rank adaptation (lora) |
2403.07494
Report |
SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM |
Siting Zhu, Renjie Qin, Guangming Wang, Jiuming Liu, Hesheng Wang |
We propose SemGauss-SLAM, the first semantic SLAM system utilizing 3D
Gaussian representation, that enables accurate 3D semantic mapping, robust
camera tracking, and high-quality rendering in real-time. In this system, we
incorporate semantic feature embedding into 3D Gaussian representation, which
effectively encodes semantic information within the spatial layout of the
environment for precise semantic scene representation. Furthermore, we propose
feature-level loss for updating 3D Gaussian representation, enabling
higher-level guidance for 3D Gaussian optimization. In addition, to reduce
cumulative drift and improve reconstruction accuracy, we introduce
semantic-informed bundle adjustment leveraging semantic associations for joint
optimization of 3D Gaussian representation and camera poses, leading to more
robust tracking and consistent mapping. Our SemGauss-SLAM method demonstrates
superior performance over existing dense semantic SLAM methods in terms of
mapping and tracking accuracy on Replica and ScanNet datasets, while also
showing excellent capabilities in novel-view semantic synthesis and 3D semantic
mapping. |
This supplementary material provides further details and experimental results for SemGauss-SLAM, a dense semantic SLAM system using Gaussian Splatting. |
The work addresses the limitations of existing dense SLAM methods by introducing semantic information to improve accuracy and efficiency in 3D reconstruction and semantic mapping. |
SemGauss-SLAM leverages a 3D Gaussian scene representation and incorporates semantic information into the bundle adjustment process for joint optimization of camera poses and scene representation. |
The method achieves state-of-the-art performance on Replica and ScanNet datasets, demonstrating significant improvement in tracking accuracy and semantic segmentation compared to existing methods.
It maintains a fast runtime, outperforming other radiance field-based SLAM methods while providing semantic mapping capabilities.
The proposed approach achieves high-quality reconstruction, capturing fine details and exhibiting smoother surfaces compared to baselines. |
The authors acknowledge that the reliance on a limited number of semantic categories poses a constraint on the system's applicability to more diverse environments.
Future work will focus on incorporating object-level semantic understanding and exploring dynamic scene reconstruction. |
slam, semantic slam, gaussian splatting, 3d reconstruction, semantic mapping |
2403.07487
Report |
Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM |
Zeyu Zhang, Akide Liu, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang |
Human motion generation stands as a significant pursuit in generative
computer vision, while achieving long-sequence and efficient motion generation
remains challenging. Recent advancements in state space models (SSMs), notably
Mamba, have showcased considerable promise in long sequence modeling with an
efficient hardware-aware design, which appears to be a promising foundation on
which to build motion generation models. Nevertheless, adapting SSMs to motion
generation faces hurdles due to the lack of a specialized architecture for
modeling motion sequences. To address these challenges, we propose Motion Mamba, a
simple and efficient approach that presents the pioneering motion generation
model utilizing SSMs. Specifically, we design a Hierarchical Temporal Mamba
(HTM) block to process temporal data by ensembling varying numbers of isolated
SSM modules across a symmetric U-Net architecture aimed at preserving motion
consistency between frames. We also design a Bidirectional Spatial Mamba (BSM)
block to bidirectionally process latent poses, to enhance accurate motion
generation within a temporal frame. Our proposed method achieves up to 50% FID
improvement and up to 4 times faster on the HumanML3D and KIT-ML datasets
compared to the previous best diffusion-based method, which demonstrates strong
capabilities of high-quality long sequence motion modeling and real-time human
motion generation. See project website
https://steve-zeyu-zhang.github.io/MotionMamba/ |
This paper presents Motion Mamba, a novel framework for efficient and long-sequence human motion generation using selective state space models (SSMs). |
Existing motion generation models, particularly diffusion-based ones, struggle with long-range sequence generation and suffer from slow inference speeds. Motion Mamba addresses these limitations. |
The model utilizes a U-Net architecture with novel Hierarchical Temporal Mamba (HTM) blocks for temporal modeling and Bidirectional Spatial Mamba (BSM) blocks for enhanced spatial representation learning. It leverages the efficiency of SSMs for long-sequence modeling and fast inference. |
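The bidirectional processing idea can be shown with a scalar linear state-space recurrence scanned in both directions. This is a simplified sketch: the parameters `a`, `b`, and `c` are fixed constants assumed for illustration, whereas a selective SSM such as Mamba makes its parameters input-dependent and uses a multi-dimensional state.

```python
def ssm_scan(xs, a, b, c):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t with readout y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def bidirectional_scan(xs, a=0.5, b=1.0, c=1.0):
    """Sum a forward scan and a backward scan so each step sees both directions."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]
```

An impulse at the first position decays forward through the sequence, while an impulse at the last position decays backward, so every position receives context from both ends.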
Motion Mamba achieves up to 50% improvement in FID scores compared to previous state-of-the-art methods.
It demonstrates significantly faster inference speeds, being up to 4 times faster than prior approaches.
The effectiveness of the proposed framework is validated through comprehensive experiments and user studies on benchmark datasets like HumanML3D and KIT-ML. |
The model's performance could be further investigated under more complex and diverse motion generation scenarios.
Exploring the integration of additional modalities, such as audio or visual cues, could enhance the model's generative capabilities. |
human motion generation, selective state space models, latent diffusion models, long-sequence modeling, efficient inference |
2403.07392
Report |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions |
Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, Yifeng Shi |
Although Vision Transformer (ViT) has achieved significant success in
computer vision, it does not perform well in dense prediction tasks due to the
lack of inner-patch information interaction and the limited diversity of
feature scale. Most existing studies are devoted to designing vision-specific
transformers to solve the above problems, which introduce additional
pre-training costs. Therefore, we present a plain, pre-training-free, and
feature-enhanced ViT backbone with Convolutional Multi-scale feature
interaction, named ViT-CoMer, which facilitates bidirectional interaction
between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has
the following advantages: (1) We inject spatial pyramid multi-receptive field
convolutional features into the ViT architecture, which effectively alleviates
the problems of limited local information interaction and single-feature
representation in ViT. (2) We propose a simple and efficient CNN-Transformer
bidirectional fusion interaction module that performs multi-scale fusion across
hierarchical features, which is beneficial for handling dense prediction tasks.
(3) We evaluate the performance of ViT-CoMer across various dense prediction
tasks, different frameworks, and multiple advanced pre-training strategies. Notably, our
ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and
62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art
methods. We hope ViT-CoMer can serve as a new backbone for dense prediction
tasks to facilitate future research. The code will be released at
https://github.com/Traffic-X/ViT-CoMer. |
This paper presents ViT-CoMer, a plain, pre-training-free, feature-enhanced ViT backbone for dense prediction tasks by facilitating bidirectional interaction between CNN and transformer. |
ViT doesn't perform well on dense prediction tasks due to the lack of inner-patch information interaction and limited feature scale diversity. Existing solutions introduce extra pre-training costs. |
ViT-CoMer integrates a multi-scale convolutional feature interaction module, including MRFP to provide multi-scale spatial information and CTI for bidirectional multi-scale feature fusion between CNN and Transformer. |
ViT-CoMer outperforms existing ViT-based methods and achieves comparable results to state-of-the-art methods on object detection, instance segmentation, and semantic segmentation.
It effectively leverages various open-source pre-trained ViT weights for improved performance.
ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data and 62.1% mIoU on ADE20K val, comparable to SOTA methods. |
The improvement from integrating the approach with hierarchical vision transformers like Swin is less significant compared to plain ViT.
Future work could explore more efficient interaction mechanisms between CNN and Transformer. |
vision transformer, dense prediction, object detection, instance segmentation, semantic segmentation |
2403.07371
Report |
Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models |
Phuong Dam, Jihoon Jeong, Anh Tran, Daeyoung Kim |
This study discusses the critical issues of Virtual Try-On in contemporary
e-commerce and the prospective metaverse, emphasizing the challenges of
preserving intricate texture details and distinctive features of the target
person and the clothes in various scenarios, such as clothing texture and
identity characteristics like tattoos or accessories. In addition to the
fidelity of the synthesized images, the efficiency of the synthesis process
presents a significant hurdle. Various existing approaches are explored,
highlighting the limitations and unresolved aspects, e.g., identity information
omission, uncontrollable artifacts, and low synthesis speed. It then proposes a
novel diffusion-based solution that addresses garment texture preservation and
user identity retention during virtual try-on. The proposed network comprises
two primary modules - a warping module aligning clothing with individual
features and a try-on module refining the attire and generating missing parts
integrated with a mask-aware post-processing technique ensuring the integrity
of the individual's identity. It demonstrates impressive results, surpassing
the state-of-the-art in speed by nearly 20 times during inference, with
superior fidelity in qualitative assessments. Quantitative evaluations confirm
comparable performance with the recent SOTA method on the VITON-HD and
DressCode datasets. |
This paper introduces a novel diffusion-based virtual try-on method that excels in preserving both garment texture and user identity while being significantly faster than previous state-of-the-art methods. |
Virtual Try-On is crucial for e-commerce and the metaverse, but existing solutions struggle to balance garment detail, user identity preservation, and synthesis speed. |
The proposed method utilizes a two-module architecture with a warping module for aligning garments and a try-on module for refinement and missing part generation. A mask-aware post-processing technique ensures identity preservation and artifact reduction. |
The method achieves state-of-the-art results in qualitative evaluations, demonstrating superior detail and identity preservation compared to previous methods.
Quantitative results show comparable or better performance than existing methods on standard benchmarks (VITON-HD and DressCode).
The proposed method is significantly faster than the current state-of-the-art, achieving an inference speed over 17 times faster. |
The method relies on a relatively complex post-processing step, which could be streamlined in future work.
Future research could focus on generalizing the approach to a wider range of clothing styles and body types. |
virtual try-on, diffusion models, identity preservation, time efficiency, mask-aware post-processing |
2403.07304
Report |
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models |
Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang |
Large Multimodal Model (LMM) is a hot research topic in the computer vision
area and has also demonstrated remarkable potential across multiple
disciplinary fields. A recent trend is to further extend and enhance the
perception capabilities of LMMs. The current methods follow the paradigm of
adapting the visual task outputs to the format of the language model, which is
the main component of an LMM. This adaptation leads to convenient development of
such LMMs with minimal modifications; however, it overlooks the intrinsic
characteristics of diverse visual tasks and hinders the learning of perception
capabilities. To address this issue, we propose a novel LMM architecture named
Lumen, a Large multimodal model with versatile vision-centric capability
enhancement. We decouple the LMM's learning of perception capabilities into
task-agnostic and task-specific stages. Lumen first promotes fine-grained
vision-language concept alignment, which is the fundamental capability for
various visual tasks. Thus the output of the task-agnostic stage is a shared
representation for all the tasks we address in this paper. Then the
task-specific decoding is carried out by flexibly routing the shared
representation to lightweight task decoders with negligible training efforts.
Benefiting from such a decoupled design, our Lumen surpasses existing LMM-based
approaches on the COCO detection benchmark with a clear margin and exhibits
seamless scalability to additional visual tasks. Furthermore, we also conduct
comprehensive ablation studies and generalization evaluations for deeper
insights. The code will be released at https://github.com/SxJyJay/Lumen. |
This paper presents Lumen, a Large multimodal model that enhances the vision-centric capabilities of LMMs by decoupling task-agnostic and task-specific learning. |
Existing LMMs are limited in their ability to perform diverse vision-centric tasks due to their reliance on language-oriented output formats and lack of focus on intrinsic visual task characteristics. |
Lumen first performs task-agnostic vision-language dense alignment by matching instructions with image regions, generating a heatmap. Then, lightweight, task-specific decoders use this heatmap to generate final outputs for tasks like object detection, segmentation, and pose estimation. |
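Routing one shared heatmap to different lightweight decoders can be illustrated with two hand-written functions. This is a sketch under strong assumptions: Lumen's decoders are learned modules, whereas the thresholding and peak-picking rules below (and the names `heatmap_to_box` and `heatmap_to_point`) are hypothetical stand-ins that only show how a single representation can serve several tasks.

```python
def heatmap_to_box(heatmap, threshold=0.5):
    """Return (row_min, col_min, row_max, col_max) covering cells >= threshold, or None."""
    hits = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, v in enumerate(row)
        if v >= threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))

def heatmap_to_point(heatmap):
    """Return the (row, col) of the peak response, a stand-in for a keypoint decoder."""
    best = max((v, r, c) for r, row in enumerate(heatmap) for c, v in enumerate(row))
    return (best[1], best[2])
```

Both functions consume the same heatmap, which mirrors the decoupling of the task-agnostic stage from the task-specific decoding.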
Lumen significantly outperforms existing LMM-based methods on object detection and achieves comparable results to specialist models on other tasks.
It demonstrates strong generalization ability, performing well on unseen datasets and tasks like object counting.
Ablation studies validate the importance of the multi-task training, dense alignment architecture, and input size choices. |
The convergence speed may be limited by the optimization difficulty of using a single special token for querying image regions.
Future work could explore vision encoders that can handle high-resolution inputs while maintaining semantic coherence with language modalities. |
large multimodal models, vision-centric capabilities, object detection, instance segmentation, pose estimation |
2403.07234
Report |
It's All About Your Sketch: Democratising Sketch Control in Diffusion Models |
Subhadeep Koley, Ayan Kumar Bhunia, Deeptanshu Sekhri, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song |
This paper unravels the potential of sketches for diffusion models,
addressing the deceptive promise of direct sketch control in generative AI. We
importantly democratise the process, enabling amateur sketches to generate
precise images, living up to the commitment of "what you sketch is what you
get". A pilot study underscores the necessity, revealing that deformities in
existing models stem from spatial-conditioning. To rectify this, we propose an
abstraction-aware framework, utilising a sketch adapter, adaptive time-step
sampling, and discriminative guidance from a pre-trained fine-grained
sketch-based image retrieval model, working synergistically to reinforce
fine-grained sketch-photo association. Our approach operates seamlessly during
inference without the need for textual prompts; a simple, rough sketch akin to
what you and I can create suffices! We welcome everyone to examine results
presented in the paper and its supplementary. Contributions include
democratising sketch control, introducing an abstraction-aware framework, and
leveraging discriminative guidance, validated through extensive experiments. |
This paper introduces an abstraction-aware framework for sketch-conditioned image generation using diffusion models. It enables accurate image generation from amateur sketches, moving beyond the limitations of existing methods that rely on precise edgemaps or textual prompts. |
Existing sketch-to-image diffusion models often produce deformed outputs from freehand sketches due to their reliance on spatial conditioning. They also heavily depend on textual prompts, which can be limiting and lead to trade-offs between text coherence and sketch fidelity. |
The proposed framework utilizes a sketch adapter to convert input sketches into equivalent textual embeddings, guiding the denoising process through cross-attention. An adaptive time-step sampling strategy caters to different sketch abstraction levels, and a discriminative guidance mechanism leverages a pre-trained fine-grained sketch-based image retrieval model to enhance sketch-photo association. |
The method successfully generates photorealistic images from amateur sketches without relying on textual prompts during inference.
It outperforms existing sketch-to-image generation methods in terms of FID-C, FGM, and MOS, demonstrating superior generation quality and sketch fidelity.
The framework shows strong generalization ability, successfully handling sketches from unseen datasets, diverse stroke styles, and partially complete sketches. |
The model may struggle with categorical ambiguity when similar-looking objects have abstract or deformed sketches.
Future work could explore incorporating class labels or additional conditioning signals to mitigate this limitation. |
sketch-to-image generation, diffusion models, abstraction-aware, discriminative guidance, generative ai |
2403.07071
Report |
LISO: Lidar-only Self-Supervised 3D Object Detection |
Stefan Baur, Frank Moosmann, Andreas Geiger |
3D object detection is one of the most important components in any
Self-Driving stack, but current state-of-the-art (SOTA) lidar object detectors
require costly and slow manual annotation of 3D bounding boxes to perform well.
Recently, several methods have emerged to generate pseudo ground truth without human
supervision; however, all of these methods have various drawbacks: some methods
require sensor rigs with full camera coverage and accurate calibration, partly
supplemented by an auxiliary optical flow engine. Others require expensive
high-precision localization to find objects that disappeared over multiple
drives. We introduce a novel self-supervised method to train SOTA lidar object
detection networks which works on unlabeled sequences of lidar point clouds
only, which we call trajectory-regularized self-training. It utilizes a SOTA
self-supervised lidar scene flow network under the hood to generate, track, and
iteratively refine pseudo ground truth. We demonstrate the effectiveness of our
approach for multiple SOTA object detection networks across multiple real-world
datasets. Code will be released. |
This paper introduces LISO, a novel self-supervised learning method for 3D object detection using only LiDAR point cloud sequences. |
Current state-of-the-art LiDAR object detectors heavily rely on expensive and time-consuming manual annotations of 3D bounding boxes. LISO aims to overcome this limitation by providing a self-supervised training approach. |
LISO leverages a self-supervised LiDAR scene flow network to generate initial pseudo ground truth (pgt) of moving objects. This pgt is iteratively refined through a trajectory-regularized self-training process which trains a single-frame object detector. |
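The flow-to-pseudo-label step and a simple form of track regularization can be sketched in 2D. Everything here is an assumption made for illustration: the speed threshold, the axis-aligned box, and the 3-tap moving average are hypothetical simplifications, not the clustering, tracking, or refinement procedure used by LISO.

```python
import math

def moving_points(points, flows, min_speed=0.5):
    """Keep points whose scene-flow magnitude is at least min_speed."""
    return [p for p, f in zip(points, flows) if math.hypot(f[0], f[1]) >= min_speed]

def pseudo_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the given points, or None."""
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def smooth_track(centers):
    """Regularize a track of box centers with a 3-tap moving average; endpoints are kept."""
    if len(centers) < 3:
        return list(centers)
    smoothed = [centers[0]]
    for prev, cur, nxt in zip(centers, centers[1:], centers[2:]):
        smoothed.append((
            (prev[0] + cur[0] + nxt[0]) / 3.0,
            (prev[1] + cur[1] + nxt[1]) / 3.0,
        ))
    smoothed.append(centers[-1])
    return smoothed
```

In a self-training loop, boxes produced this way would serve as initial labels for a detector, whose predictions are then tracked, smoothed, and used as the next round of labels.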
LISO outperforms existing self-supervised methods on four different autonomous driving datasets (Waymo Open Dataset, KITTI, Argoverse 2, and nuScenes).
The method demonstrates its ability to generalize from detecting moving objects to detecting movable objects.
Ablation studies confirm the importance of motion cues from scene flow and trajectory-regularized self-training for achieving good performance. |
LISO currently lacks the ability to distinguish between different object classes.
Future work could focus on generating class labels for the detected objects, potentially by incorporating motion or size characteristics. |
self-supervised learning, lidar, object detection, 3d object detection, autonomous driving |
2403.06977
Report |
VideoMamba: State Space Model for Efficient Video Understanding |
Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao |
Addressing the dual challenges of local redundancy and global dependencies in
video understanding, this work innovatively adapts Mamba to the video
domain. The proposed VideoMamba overcomes the limitations of existing 3D
convolution neural networks and video transformers. Its linear-complexity
operator enables efficient long-term modeling, which is crucial for
high-resolution long video understanding. Extensive evaluations reveal
VideoMamba's four core abilities: (1) Scalability in the visual domain without
extensive dataset pretraining, thanks to a novel self-distillation technique;
(2) Sensitivity for recognizing short-term actions even with fine-grained
motion differences; (3) Superiority in long-term video understanding,
showcasing significant advancements over traditional feature-based models; and
(4) Compatibility with other modalities, demonstrating robustness in
multi-modal contexts. Through these distinct advantages, VideoMamba sets a new
benchmark for video understanding, offering a scalable and efficient solution
for comprehensive video understanding. All the code and models are available at
https://github.com/OpenGVLab/VideoMamba. |
This paper proposes VideoMamba, a purely State Space Model (SSM)-based video understanding model inspired by Mamba for NLP, offering linear complexity for efficient long-term video modeling. |
Existing methods like 3D CNNs and video transformers struggle to address both local redundancy and global dependencies in video understanding, particularly for long, high-resolution videos. VideoMamba offers a more efficient and scalable solution. |
VideoMamba adapts the bidirectional Mamba block to process 3D video sequences, introducing a novel self-distillation technique to enhance scalability and exploring various spatiotemporal scan methods. |
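Because a state space model consumes a 1D sequence, a T x H x W grid of video tokens has to be flattened in some order before scanning. The sketch below contrasts two possible orders; the function names are hypothetical, and the specific scan variants evaluated in the paper are not reproduced here.

```python
def spatial_first_order(t, h, w):
    """Visit all H*W positions of frame 0, then frame 1, and so on."""
    return [(ti, hi, wi) for ti in range(t) for hi in range(h) for wi in range(w)]

def temporal_first_order(t, h, w):
    """Visit one spatial position across all frames before moving to the next position."""
    return [(ti, hi, wi) for hi in range(h) for wi in range(w) for ti in range(t)]
```

Both orders cover the same tokens; only their sequence positions differ, which changes what neighbors the recurrent state has seen most recently. A bidirectional block would additionally scan each order in reverse.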
VideoMamba achieves state-of-the-art results on ImageNet-1K with 84.0% top-1 accuracy, outperforming isotropic architectures by significant margins.
It outperforms attention-based methods on Kinetics-400 and Something-Something V2, demonstrating effectiveness in both scene-related and temporal-related action recognition.
VideoMamba shows significant superiority over feature-based methods on long-term video understanding benchmarks (Breakfast, COIN, LVU), achieving state-of-the-art performance with end-to-end training. |
Scalability of VideoMamba has not been fully explored, such as extending to larger model sizes and integrating with other modalities or large language models.
Further validation is needed for hour-level video understanding tasks. |
video understanding, state space model, mamba, long-term video modeling, self-distillation |
2403.06976
Report |
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion |
Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu |
Image inpainting, the process of restoring corrupted images, has seen
significant advancements with the advent of diffusion models (DMs). Despite
these advancements, current DM adaptations for inpainting, which involve
modifications to the sampling strategy or the development of
inpainting-specific DMs, frequently suffer from semantic inconsistencies and
reduced image quality. Addressing these challenges, our work introduces a novel
paradigm: the division of masked image features and noisy latent into separate
branches. This division dramatically diminishes the model's learning load,
facilitating a nuanced incorporation of essential masked image information in a
hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play
dual-branch model engineered to embed pixel-level masked image features into
any pre-trained DM, guaranteeing coherent and enhanced image inpainting
outcomes. Additionally, we introduce BrushData and BrushBench to facilitate
segmentation-based inpainting training and performance assessment. Our
extensive experimental analysis demonstrates BrushNet's superior performance
over existing models across seven key metrics, including image quality, mask
region preservation, and textual coherence. |
This paper introduces BrushNet, a plug-and-play image inpainting model that leverages a dual-branch diffusion approach to enhance semantic consistency and image quality. |
Existing diffusion-based image inpainting methods often struggle with semantic mismatches and reduced image quality due to limitations in mask processing and information integration. |
BrushNet employs a dual-branch architecture: one branch processes noisy latent features, while the other extracts masked image features using a VAE encoder and a frozen pre-trained diffusion model without text cross-attention. These features are then hierarchically integrated into the main diffusion model for coherent inpainting. A blurred blending strategy is also introduced to improve the preservation of unmasked regions. |
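The blurred blending step mentioned above can be illustrated with a small sketch. This is a toy under stated assumptions, not BrushNet's implementation: the box filter, the `radius` parameter, and the function names `box_blur` and `blurred_blend` are hypothetical, and images are plain nested lists of grayscale values.

```python
def box_blur(mask, radius=1):
    """Blur a 2D mask (list of rows) with a box filter clipped at the borders."""
    height, width = len(mask), len(mask[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += mask[yy][xx]
                    count += 1
            blurred[y][x] = total / count
    return blurred


def blurred_blend(original, generated, mask, radius=1):
    """Paste generated pixels into the original through a softened mask.

    mask is 1 where content should be inpainted and 0 where the original
    image must be preserved (assumed convention).
    """
    soft = box_blur(mask, radius)
    return [
        [s * g + (1.0 - s) * o for o, g, s in zip(o_row, g_row, s_row)]
        for o_row, g_row, s_row in zip(original, generated, soft)
    ]


# Toy example: a 1x7 image whose middle three pixels are inpainted.
original = [[0.0] * 7]
generated = [[1.0] * 7]
mask = [[0, 0, 1, 1, 1, 0, 0]]
blended = blurred_blend(original, generated, mask)
```

Softening the mask trades a small amount of exact preservation next to the mask boundary for a smoother transition, which is the kind of preservation control the summary refers to.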
BrushNet outperforms previous state-of-the-art methods on both random and segmentation-based inpainting tasks, as demonstrated by quantitative evaluations using Image Reward, HPS v2, Aesthetic Score, PSNR, LPIPS, MSE, and CLIP Similarity metrics.
The dual-branch design allows for flexible control over the inpainting process, including the choice of base diffusion model and the level of unmasked region preservation.
BrushNet demonstrates strong generalization across various image domains, including natural images, paintings, anime, and illustrations. |
The quality of inpainted images is dependent on the base diffusion model used.
Unusually shaped masks or misaligned text prompts can still pose challenges for the model. |
image inpainting, diffusion models, image generation, plug-and-play, dual-branch diffusion |
2403.06973
Report |
Bayesian Diffusion Models for 3D Shape Reconstruction |
Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, Zhuowen Tu |
We present Bayesian Diffusion Models (BDM), a prediction algorithm that
performs effective Bayesian inference by tightly coupling the top-down (prior)
information with the bottom-up (data-driven) procedure via joint diffusion
processes. We show the effectiveness of BDM on the 3D shape reconstruction
task. Compared to prototypical deep learning data-driven approaches trained on
paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM
brings in rich prior information from standalone labels (e.g. point clouds) to
improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian
frameworks where explicit prior and likelihood are required for the inference,
BDM performs seamless information fusion via coupled diffusion processes with
learned gradient computation networks. The specialty of our BDM lies in its
capability to engage the active and effective information exchange and fusion
of the top-down and bottom-up processes where each itself is a diffusion
process. We demonstrate state-of-the-art results on both synthetic and
real-world benchmarks for 3D shape reconstruction. |
Presents Bayesian Diffusion Models (BDM), a novel statistical inference algorithm that couples diffusion-based bottom-up (data-driven) and top-down (prior) processes for improved 3D shape reconstruction. |
Addresses the limitations of traditional Bayesian inference methods in leveraging large-scale datasets and complex deep learning models, particularly in scenarios with limited paired data-labels. |
Introduces two fusion strategies: BDM-M (Merging), a learnable paradigm that implicitly merges knowledge from prior and reconstruction models, and BDM-B (Blending), a training-free method that explicitly combines point clouds from both processes. |
Demonstrates state-of-the-art results on synthetic (ShapeNet-R2N2) and real-world (Pix3D) 3D shape reconstruction benchmarks.
Shows significant improvement over baseline methods, particularly when training data for reconstruction is scarce.
Ablation studies confirm the effectiveness of prior integration timing, duration, and ratio in enhancing reconstruction quality. |
BDM currently requires both prior and data-driven processes to be diffusion-based.
The explicit point cloud representation used in BDM-B may limit its applicability to implicit representations. |
bayesian inference, diffusion models, 3d shape reconstruction, prior integration, deep learning |
2403.06952
Report |
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data |
Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal |
Recent text-to-image (T2I) generation models have demonstrated impressive
capabilities in creating images from text descriptions. However, these T2I
generation models often fall short of generating images that precisely match
the details of the text inputs, producing errors such as incorrect spatial
relationships or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert
Learning and Merging with Auto-Generated Data, a novel paradigm to improve the
faithfulness of T2I models by fine-tuning models on automatically generated,
multi-skill image-text datasets, with skill-specific expert learning and
merging. First, SELMA leverages an LLM's in-context learning capability to
generate multiple datasets of text prompts that can teach different skills, and
then generates the images with a T2I model based on the prompts. Next, SELMA
adapts the T2I model to the new skills by learning multiple single-skill LoRA
(low-rank adaptation) experts followed by expert merging. Our independent
expert fine-tuning specializes multiple models for different skills, and expert
merging helps build a joint multi-skill T2I model that can generate faithful
images given diverse text prompts, while mitigating the knowledge conflict from
different datasets. We empirically demonstrate that SELMA significantly
improves the semantic alignment and text faithfulness of state-of-the-art T2I
diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human
preference metrics (PickScore, ImageReward, and HPS), as well as human
evaluation. Moreover, fine-tuning with image-text pairs auto-collected via
SELMA shows comparable performance to fine-tuning with ground truth data.
Lastly, we show that fine-tuning with images from a weaker T2I model can help
improve the generation quality of a stronger T2I model, suggesting promising
weak-to-strong generalization in T2I models. |
This paper introduces SELMA, a novel paradigm that leverages automatically generated, multi-skill image-text datasets to improve the faithfulness of text-to-image (T2I) generation models. |
Existing T2I models often struggle to generate images that precisely match the details of text inputs. SELMA addresses this by fine-tuning models with skill-specific expert learning and merging, enabling more accurate image generation. |
SELMA uses a four-stage pipeline: (1) Skill-specific prompt generation using an LLM, (2) Image generation from these prompts using a T2I model, (3) Fine-tuning the T2I model with skill-specific LoRA experts on these image-text pairs, and (4) Merging the LoRA experts to obtain a multi-skill T2I model. |
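Stage (4) of the pipeline above, expert merging, can be sketched as follows. The text does not state the merging rule, so uniform averaging of the low-rank updates `B @ A` is an assumption made here for illustration; `merge_lora_experts` and `apply_update` are hypothetical helper names, and matrices are plain nested lists.

```python
def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def merge_lora_experts(experts):
    """Average the low-rank updates B @ A of several LoRA experts.

    experts is a list of (B, A) pairs, with B of shape d_out x r and
    A of shape r x d_in. Uniform averaging is an assumed merging rule.
    """
    deltas = [matmul(b, a) for b, a in experts]
    n = len(deltas)
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [
        [sum(d[i][j] for d in deltas) / n for j in range(cols)]
        for i in range(rows)
    ]


def apply_update(weight, delta):
    """Add a merged update to a frozen base weight matrix."""
    return [
        [w + d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Two rank-1 experts, each standing in for one skill.
expert_a = ([[1.0], [0.0]], [[2.0, 0.0]])
expert_b = ([[0.0], [1.0]], [[0.0, 4.0]])
merged_delta = merge_lora_experts([expert_a, expert_b])
merged_weight = apply_update([[1.0, 1.0], [1.0, 1.0]], merged_delta)
```

Averaging the products rather than the factors keeps the result well defined even when experts use different ranks; averaging `A` and `B` separately is a different rule and generally gives a different matrix.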
SELMA significantly improves the faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks and human preference metrics.
Fine-tuning with SELMA's auto-collected image-text pairs shows comparable performance to fine-tuning with ground truth data.
Fine-tuning with images from a weaker T2I model can enhance a stronger T2I model's generation quality, indicating weak-to-strong generalization potential. |
SELMA relies on a strong image generator and an instruction-following LLM.
While SELMA enhances text-image alignment, it doesn't guarantee that the resulting model will follow every detail of the text prompts. |
text-to-image generation, faithfulness, lora, expert merging, synthetic data |
2403.06951
Report |
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations |
Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang |
The diffusion-based text-to-image model harbors immense potential in
transferring reference style. However, current encoder-based approaches
significantly impair the text controllability of text-to-image models while
transferring styles. In this paper, we introduce DEADiff to address this issue
using the following two strategies: 1) a mechanism to decouple the style and
semantics of reference images. The decoupled feature representations are first
extracted by Q-Formers which are instructed by different text descriptions.
Then they are injected into mutually exclusive subsets of cross-attention
layers for better disentanglement. 2) A non-reconstructive learning method. The
Q-Formers are trained using paired images rather than the identical target, in
which the reference image and the ground-truth image share the same style or
semantics. We show that DEADiff attains the best visual stylization results and
optimal balance between the text controllability inherent in the text-to-image
model and style similarity to the reference image, as demonstrated both
quantitatively and qualitatively. Our project page is
https://tianhao-qi.github.io/DEADiff/. |
This paper introduces DEADiff, an encoder-based diffusion model for stylized image generation that maintains text controllability through style and semantic decoupling. |
Existing encoder-based methods for style transfer in diffusion models often compromise the model's ability to accurately follow text prompts due to semantic interference from the style image. |
DEADiff uses two Q-Formers with a non-reconstructive learning paradigm to extract disentangled style and content representations, injecting them into separate cross-attention layers of the diffusion U-Net. |
DEADiff successfully generates stylized images while remaining faithful to text prompts, surpassing previous methods in balancing style accuracy and text controllability.
Quantitative and qualitative comparisons, including a user study, demonstrate DEADiff's superior performance in generating high-quality stylized images that adhere to text prompts.
Ablation studies confirm the contribution of each component in DEADiff, highlighting the importance of style and semantic decoupling for effective stylized image generation with text control. |
Future work could focus on further improving style similarity to match the reference image more closely.
Exploring the decoupling of more granular, instance-level semantic information is another promising direction. |
stylized image generation, text-to-image synthesis, diffusion models, style and semantic decoupling, text controllability |
2403.06912
Report |
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization |
Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu |
Radiance fields have demonstrated impressive performance in synthesizing
novel views from sparse input views, yet prevailing methods suffer from high
training costs and slow inference speed. This paper introduces DNGaussian, a
depth-regularized framework based on 3D Gaussian radiance fields, offering
real-time and high-quality few-shot novel view synthesis at low costs. Our
motivation stems from the highly efficient representation and surprising
quality of the recent 3D Gaussian Splatting, although it suffers from
geometry degradation when input views decrease. In Gaussian radiance
fields, we find that this degradation in scene geometry is primarily linked to the
positioning of Gaussian primitives and can be mitigated by depth constraints.
Consequently, we propose a Hard and Soft Depth Regularization to restore
accurate scene geometry under coarse monocular depth supervision while
maintaining a fine-grained color appearance. To further refine detailed
geometry reshaping, we introduce Global-Local Depth Normalization, enhancing
the focus on small local depth changes. Extensive experiments on LLFF, DTU, and
Blender datasets demonstrate that DNGaussian outperforms state-of-the-art
methods, achieving comparable or better results with significantly reduced
memory cost, a $25 \times$ reduction in training time, and over $3000 \times$
faster rendering speed. |
This paper introduces DNGaussian, a novel view synthesis method using depth-regularized 3D Gaussian radiance fields for real-time, high-quality results with low training costs. |
Existing radiance field methods for novel view synthesis are computationally expensive and slow, while recent 3D Gaussian Splatting, though efficient, suffers geometry degradation with sparse input views. |
DNGaussian leverages monocular depth estimates to regularize the 3D Gaussian field using: (1) Hard and Soft Depth Regularization to refine Gaussian positions and opacities and (2) Global-Local Depth Normalization to prioritize small, local depth variations. |
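The idea behind Global-Local Depth Normalization can be sketched on flat depth patches. The exact formulas are not given in this summary, so the versions below are assumptions: the local variant divides by the patch's own standard deviation, and the global variant divides by a standard deviation computed over the whole image. The function names and the `eps` parameter are hypothetical.

```python
import statistics


def local_normalize(patch, eps=1e-6):
    """Center a depth patch and scale it by its own standard deviation."""
    mean = statistics.fmean(patch)
    std = statistics.pstdev(patch)
    return [(d - mean) / (std + eps) for d in patch]


def global_normalize(patch, image_std, eps=1e-6):
    """Center a depth patch but scale it by the whole image's standard deviation."""
    mean = statistics.fmean(patch)
    return [(d - mean) / (image_std + eps) for d in patch]


# A nearly flat patch and a steep patch with the same relative shape.
flat_patch = [10.0, 10.01, 10.02]
steep_patch = [0.0, 1.0, 2.0]
flat_local = local_normalize(flat_patch)
steep_local = local_normalize(steep_patch)

# Global scaling keeps the two patches distinguishable by magnitude.
image_std = statistics.pstdev(flat_patch + steep_patch)
flat_global = global_normalize(flat_patch, image_std)
steep_global = global_normalize(steep_patch, image_std)
```

After local normalization the two patches are almost identical, so a loss computed on them weighs a centimetre-scale ripple as heavily as a large depth ramp. The global variant retains absolute scale, which is why the two are described as complementary.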
DNGaussian achieves comparable or better novel view synthesis quality than state-of-the-art methods on LLFF, DTU, and Blender datasets.
It significantly reduces memory cost and training time (25x faster) compared to existing techniques.
DNGaussian achieves real-time rendering speeds exceeding 300 FPS. |
Performance degrades with increasing input views due to monocular depth errors.
Challenges remain in representing solid color planes and specular regions. |
novel view synthesis, 3d gaussian radiance fields, depth regularization, few-shot learning, real-time rendering |
2403.06908
Report |
FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization |
Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, Eric Xing |
3D Gaussian splatting has achieved very impressive performance in real-time
novel view synthesis. However, it often suffers from over-reconstruction during
Gaussian densification where high-variance image regions are covered by a few
large Gaussians only, leading to blur and artifacts in the rendered images. We
design a progressive frequency regularization (FreGS) technique to tackle the
over-reconstruction issue within the frequency space. Specifically, FreGS
performs coarse-to-fine Gaussian densification by exploiting low-to-high
frequency components that can be easily extracted with low-pass and high-pass
filters in the Fourier space. By minimizing the discrepancy between the
frequency spectrum of the rendered image and the corresponding ground truth, it
achieves high-quality Gaussian densification and alleviates the
over-reconstruction of Gaussian splatting effectively. Experiments over
multiple widely adopted benchmarks (e.g., Mip-NeRF360, Tanks-and-Temples and
Deep Blending) show that FreGS achieves superior novel view synthesis and
outperforms the state-of-the-art consistently. |
Presents FreGS, an innovative 3D Gaussian splatting technique that uses progressive frequency regularization to mitigate over-reconstruction during Gaussian densification, enhancing novel view synthesis. |
3D Gaussian splatting, while offering real-time rendering for novel view synthesis, often suffers from over-reconstruction artifacts. This paper addresses this limitation for higher quality rendering. |
FreGS employs progressive frequency regularization using a frequency annealing technique. It extracts low and high-frequency components with filters in the Fourier space and minimizes discrepancies between the rendered and ground truth image spectra. This process progressively refines Gaussian densification. |
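A one-dimensional toy can illustrate the frequency-band discrepancy and the annealing schedule described above. It is a sketch under assumptions, not the FreGS loss: it compares amplitudes only, uses a naive DFT on a 1D signal instead of a 2D image spectrum, and the linear schedule in `annealed_cutoff` is hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued 1D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def band_amplitude_loss(rendered, target, max_freq):
    """Mean absolute amplitude gap over frequencies up to max_freq.

    Frequency index k and n - k are treated as the same distance from DC,
    so max_freq = 0 keeps only the mean and larger values add finer detail.
    """
    spec_r, spec_t = dft(rendered), dft(target)
    n = len(rendered)
    gaps = [
        abs(abs(spec_r[k]) - abs(spec_t[k]))
        for k in range(n)
        if min(k, n - k) <= max_freq
    ]
    return sum(gaps) / len(gaps)


def annealed_cutoff(step, total_steps, max_freq):
    """Widen the supervised band linearly from 0 to max_freq (assumed schedule)."""
    return int(max_freq * min(1.0, step / total_steps))


# The rendering matches the target's mean but misses its fine oscillation.
rendered = [1.0, 1.0, 1.0, 1.0]
target = [2.0, 0.0, 2.0, 0.0]
low_band_loss = band_amplitude_loss(rendered, target, max_freq=0)
full_band_loss = band_amplitude_loss(rendered, target, max_freq=2)
```

With only the low band supervised the loss is near zero, and the missing detail is penalized once the cutoff grows. That mirrors the coarse-to-fine behaviour attributed to frequency annealing.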
FreGS consistently outperforms state-of-the-art methods like 3D-GS and Mip-NeRF360 in quantitative metrics like PSNR, SSIM, and LPIPS.
The method generates higher quality novel view synthesis with fewer artifacts and finer details compared to existing techniques.
Ablation studies confirm the individual contribution of frequency regularization and frequency annealing to the overall performance gain. |
The current implementation of FreGS is focused on static scenes; handling dynamic scenes remains a challenge.
Further investigation is needed to optimize the computational cost of frequency transformations for even faster rendering. |
novel view synthesis, 3d gaussian splatting, frequency regularization, frequency annealing, gaussian densification |
2403.06866
Report |
QUASAR: QUality and Aesthetics Scoring with Advanced Representations |
Sergey Kastryulin, Denis Prokopenko, Artem Babenko, Dmitry V. Dylov |
This paper introduces a new data-driven, non-parametric method for image
quality and aesthetics assessment, surpassing existing approaches and requiring
no prompt engineering or fine-tuning. We eliminate the need for expressive
textual embeddings by proposing efficient image anchors in the data. Through
extensive evaluations of 7 state-of-the-art self-supervised models, our method
demonstrates superior performance and robustness across various datasets and
benchmarks. Notably, it achieves high agreement with human assessments even
with limited data and shows high robustness to the nature of data and their
pre-processing pipeline. Our contributions offer a streamlined solution for
assessment of images while providing insights into the perception of visual
information. |
Introduces QUASAR, a data-driven, non-parametric method for unified image quality and aesthetics assessment using image anchors and pre-trained self-supervised models, eliminating the need for prompt engineering or fine-tuning. |
Addresses the limitations of existing IQA and IAA methods, especially prompt-based approaches, by providing a more robust and generalizable solution that leverages the power of foundation models. |
1. Employs image embeddings as anchors representing high and low quality/aesthetics. 2. Uses a pre-trained Image Encoder (explores various self-supervised models) to extract embeddings. 3. Applies an Aggregation Function to compute representative centroids from anchor embeddings. 4. Calculates a score based on cosine similarity between input image embedding and the centroids. |
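The four steps above can be sketched with plain vectors standing in for encoder embeddings. Two choices here are assumptions, not confirmed details of QUASAR: the aggregation function is taken to be the mean, and the two cosine similarities are combined with a two-way softmax. The `temperature` parameter and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Aggregate anchor embeddings by taking their mean (assumed aggregation)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anchor_score(embedding, good_anchors, bad_anchors, temperature=1.0):
    """Score in (0, 1): closer to the high-quality centroid gives a higher value."""
    sim_good = cosine(embedding, centroid(good_anchors)) / temperature
    sim_bad = cosine(embedding, centroid(bad_anchors)) / temperature
    # Two-way softmax, shifted by the maximum for numerical stability.
    shift = max(sim_good, sim_bad)
    exp_good = math.exp(sim_good - shift)
    exp_bad = math.exp(sim_bad - shift)
    return exp_good / (exp_good + exp_bad)


# Toy 2D "embeddings"; a real setup would use outputs of a pre-trained encoder.
good_anchors = [[1.0, 0.0], [1.0, 0.0]]
bad_anchors = [[0.0, 1.0]]
score_good = anchor_score([1.0, 0.0], good_anchors, bad_anchors)
score_bad = anchor_score([0.0, 1.0], good_anchors, bad_anchors)
score_neutral = anchor_score([1.0, 1.0], good_anchors, bad_anchors)
```

Because the centroids are computed once from the anchor set, scoring a new image needs only one encoder pass and two dot products, with no training or text prompts.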
QUASAR outperforms existing non-parametric IQA methods and achieves comparable performance to learning-based IAA methods.
Demonstrates robustness to the choice of anchor data and pre-processing pipeline, unlike CLIP-IQA.
Achieves high agreement with human assessments even with a limited number of anchor samples. |
Computational cost associated with anchor embedding generation for large datasets.
Potential bias introduced by the choice of anchor data, necessitating careful selection and potential for future work in adaptive anchor selection. |
image quality assessment, image aesthetics assessment, foundation models, self-supervised learning, non-parametric methods |
2403.06793
Report |
Boosting Image Restoration via Priors from Pre-trained Models |
Xiaogang Xu, Shu Kong, Tao Hu, Zhe Liu, Hujun Bao |
Pre-trained models with large-scale training data, such as CLIP and Stable
Diffusion, have demonstrated remarkable performance in various high-level
computer vision tasks such as image understanding and generation from language
descriptions. Yet, their potential for low-level tasks such as image
restoration remains relatively unexplored. In this paper, we explore such
models to enhance image restoration. As off-the-shelf features (OSF) from
pre-trained models do not directly serve image restoration, we propose to learn
an additional lightweight module called Pre-Train-Guided Refinement Module
(PTG-RM) to refine restoration results of a target restoration network with
OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying
Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention
(PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations,
while PTG-CSA enhances spatial-channel attention for restoration-related
learning. Extensive experiments demonstrate that PTG-RM, with its compact size
($<$1M parameters), effectively enhances restoration performance of various
models across different tasks, including low-light enhancement, deraining,
deblurring, and denoising. |
This paper proposes a novel Pre-Train-Guided Refinement Module (PTG-RM) that leverages off-the-shelf features (OSF) from pre-trained models like CLIP and Stable Diffusion to enhance image restoration networks. |
Existing image restoration networks struggle to achieve significant performance improvements by simply modifying network structures or increasing model parameters. This work explores a new approach of leveraging rich information contained within pre-trained models to enhance restoration quality. |
PTG-RM is a lightweight plugin module that refines the output of a target restoration network using OSF. It consists of two components: PTG-SVE (Spatial Varying Enhancement) which determines optimal short- and long-range operations based on OSF, and PTG-CSA (Channel-Spatial Attention) which enhances spatial-channel attention using OSF guidance. |
PTG-RM significantly improves the performance of various state-of-the-art restoration networks across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
The method demonstrates robust generalization ability, enhancing performance even when the refinement module is trained on a different dataset than the target restoration network.
User studies confirm that PTG-RM leads to subjectively better restoration results compared to baseline methods. |
The extent of improvement provided by PTG-RM varies across different experiments and seems to depend on the target network's capacity and task complexity.
Future work aims to explore more effective distillation frameworks for extracting refined restoration feature priors from pre-trained models to further improve performance. |
image restoration, pre-trained models, clip, stable diffusion, refinement module |
2403.06775
Report |
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation |
Pengchong Qiao, Lei Shang, Chang Liu, Baigui Sun, Xiangyang Ji, Jie Chen |
Subject-driven generation has garnered significant interest recently due to
its ability to personalize text-to-image generation. Typical works focus on
learning the new subject's private attributes. However, one important fact has
been overlooked: a subject is not an isolated new concept but
should be a specialization of a certain category in the pre-trained model. This
results in the subject failing to comprehensively inherit the attributes in its
category, causing poor attribute-related generations. In this paper, motivated
by object-oriented programming, we model the subject as a derived class whose
base class is its semantic category. This modeling enables the subject to
inherit public attributes from its category while learning its private
attributes from the user-provided example. Specifically, we propose a
plug-and-play method, Subject-Derived regularization (SuDe). It constructs the
base-derived class modeling by constraining the subject-driven generated images
to semantically belong to the subject's category. Extensive experiments under
three baselines and two backbones on various subjects show that our SuDe
enables imaginative attribute-related generations while maintaining subject
fidelity. Codes will be open sourced soon at FaceChain
(https://github.com/modelscope/facechain). |
This paper presents a novel perspective for subject-driven generation by modeling a subject as a derived class of its semantic category, allowing it to inherit public attributes while learning private attributes from user-provided examples. |
One-shot subject-driven generation struggles to create imaginative images, especially for attribute-related prompts, due to the limited information available in a single example image. This paper addresses this challenge by leveraging the pre-trained model's knowledge of the subject's category. |
The paper proposes Subject-Derived regularization (SuDe), a plug-and-play method that constrains subject-driven generated images to semantically belong to the subject's category using the implicit classifier within the diffusion model. |
SuDe significantly improves attribute-related generations, enabling the generation of images that better align with attribute-related prompts.
The method maintains subject fidelity, ensuring that the generated images still resemble the user-provided subject example.
SuDe is effective when combined with different baselines and backbones, demonstrating its versatility and generalizability. |
The method inherits limitations of the pre-trained diffusion model, such as struggling with text characters on subjects.
SuDe's performance may be limited for prompts that describe attributes indirectly related to the subject or its category. |
subject-driven generation, text-to-image synthesis, diffusion models, one-shot learning, attribute editing |
2403.06764
Report |
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models |
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang |
In this study, we identify the inefficient attention phenomena in Large
Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5,
QwenVL-Chat and Video-LLaVA. We find that the attention computation over
visual tokens is extremely inefficient in the deep layers of popular LVLMs,
suggesting a need for a sparser approach compared to textual data handling. To
this end, we introduce FastV, a versatile plug-and-play method designed to
optimize computational efficiency by learning adaptive attention patterns in
early layers and pruning visual tokens in subsequent ones. Our evaluations
demonstrate FastV's ability to dramatically reduce computational costs (e.g., a
45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a
wide range of image and video understanding tasks. The computational efficiency
and performance trade-off of FastV are highly customizable and
pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve
a lower budget than that of a 7B-parameter model, while still maintaining
superior performance. We believe FastV has practical value for deployment of
LVLMs in edge devices and commercial models. Code is released at
https://github.com/pkunlp-icler/FastV. |
This paper identifies inefficient visual attention in Large Vision-Language Models (LVLMs) and proposes FastV, a plug-and-play method to reduce inference budget without sacrificing performance. |
LVLMs are computationally expensive, and understanding how they process visual information is crucial for optimizing their efficiency. |
The paper analyzes attention patterns in LVLMs and finds that image tokens receive disproportionately low attention in deep layers. FastV leverages this by dynamically pruning less important image tokens based on attention scores. |
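The pruning step described above can be sketched as ranking image tokens by the attention they receive and keeping only the top fraction. This is a minimal illustration under assumptions: `attn` is a head-averaged attention matrix from a single layer, the ranking criterion is the mean attention received from all query positions, and `prune_image_tokens` and `keep_ratio` are hypothetical names.

```python
def prune_image_tokens(attn, image_idx, keep_ratio):
    """Return indices of tokens kept after pruning low-attention image tokens.

    attn[q][k] is the attention weight from query token q to key token k.
    Non-image tokens (for example text) are always kept.
    """
    n = len(attn)
    # Mean attention each image token receives across all query positions.
    received = {k: sum(attn[q][k] for q in range(n)) / n for k in image_idx}
    n_keep = max(1, round(len(image_idx) * keep_ratio))
    ranked = sorted(image_idx, key=lambda k: received[k], reverse=True)
    kept_images = set(ranked[:n_keep])
    image_set = set(image_idx)
    return [t for t in range(n) if t not in image_set or t in kept_images]


# Four tokens: 0 and 3 are text, 1 and 2 are image tokens.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.1, 0.4, 0.0],
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.2, 0.4, 0.2],
]
kept = prune_image_tokens(attn, image_idx=[1, 2], keep_ratio=0.5)
```

In this toy case image token 2 receives more attention than token 1, so token 1 is dropped and later layers would operate on the shorter sequence.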
FastV significantly reduces computational cost (e.g., 45% reduction in FLOPs for LLaVA-1.5-13B) without performance loss on various vision-language tasks.
FastV enables LVLMs to process higher resolution images with the same token budget, improving performance.
FastV demonstrates superior performance-efficiency trade-off compared to training with fewer visual tokens. |
The theoretical FLOPs reduction may differ from actual inference budget due to factors like hardware and framework optimization.
Further investigation is needed to understand the differences in how image and text tokens contribute to LLM processing. |
large vision-language models, inference optimization, attention mechanism, token pruning, computational efficiency |
2403.06738
Report |
V3D: Video Diffusion Models are Effective 3D Generators |
Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, Huaping Liu |
Automatic 3D generation has recently attracted widespread attention. Recent
methods have greatly accelerated the generation speed, but usually produce
less-detailed objects due to limited model capacity or 3D data. Motivated by
recent advancements in video diffusion models, we introduce V3D, which
leverages the world simulation capacity of pre-trained video diffusion models
to facilitate 3D generation. To fully unleash the potential of video diffusion
to perceive the 3D world, we further introduce geometrical consistency prior
and extend the video diffusion model to a multi-view consistent 3D generator.
Benefiting from this, the state-of-the-art video diffusion model could be
fine-tuned to generate 360-degree orbit frames surrounding an object given a
single image. With our tailored reconstruction pipelines, we can generate
high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method
can be extended to scene-level novel view synthesis, achieving precise control
over the camera path with sparse input views. Extensive experiments demonstrate
the superior performance of the proposed approach, especially in terms of
generation quality and multi-view consistency. Our code is available at
https://github.com/heheyas/V3D |
V3D is a novel 3D generation framework leveraging the world simulation capacity of pre-trained video diffusion models for high-quality object and scene generation. |
Existing 3D generation methods suffer from limitations like slow optimization, limited model capacity, or reliance on 3D datasets. This work leverages pre-trained video diffusion models' ability to perceive the 3D world and generate consistent multi-view images, leading to high-quality 3D content creation. |
The method involves fine-tuning video diffusion models on 3D datasets with geometrical consistency priors. For object generation, it fine-tunes on 360° orbit videos. For scene-level synthesis, it integrates a PixelNeRF encoder to accommodate multiple input images and control camera poses. Reconstruction is done using tailored pipelines with space-carving initialization for 3D Gaussians or mesh extraction refined with image-level losses. |
V3D achieves state-of-the-art performance in both object-centric and scene-level 3D generation.
It generates high-quality 3D objects within 3 minutes, outperforming existing methods in terms of fidelity and alignment.
For novel view synthesis, V3D demonstrates superior multi-view consistency and reconstruction quality compared to previous methods. |
The method may struggle with complex objects or scenes, leading to inconsistencies or unreasonable geometries.
Future work includes addressing failure cases and further improving the multi-view consistency of generated content. |
video diffusion models, single image to 3d, novel view synthesis, 3d generation, multi-view consistency |
2403.06702
Report |
Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization |
Jinlu Zhang, Yiyi Zhou, Qiancheng Zheng, Xiaoxiong Du, Gen Luo, Jun Peng, Xiaoshuai Sun, Rongrong Ji |
Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging
research hot spot in machine learning, which still suffers from low efficiency
and poor quality. In this paper, we propose an End-to-End Efficient and
Effective network for fast and accurate T3D face generation and manipulation,
termed $E^3$-FaceNet. Different from existing complex generation paradigms,
$E^3$-FaceNet resorts to a direct mapping from text instructions to 3D-aware
visual space. We introduce a novel Style Code Enhancer to enhance cross-modal
semantic alignment, alongside an innovative Geometric Regularization objective
to maintain consistency across multi-view generations. Extensive experiments on
three benchmark datasets demonstrate that $E^3$-FaceNet can not only achieve
picture-like 3D face generation and manipulation, but also improve inference
speed by orders of magnitude. For instance, compared with Latent3D,
$E^3$-FaceNet speeds up five-view generation by almost 470 times, while
still exceeding it in generation quality. Our code is released at
https://github.com/Aria-Zhangjl/E3-FaceNet. |
Proposes $E^3$-FaceNet, an end-to-end efficient and effective network for fast and accurate text-to-3D-aware face generation and manipulation. |
Existing methods suffer from low efficiency and poor quality, often relying on complex multi-stage pipelines and test-time tuning. |
Directly maps text instructions to 3D-aware visual space using a StyleNeRF-based architecture. Introduces a Style Code Enhancer for semantic alignment and a Geometric Regularization objective for multi-view consistency. |
Achieves state-of-the-art generation quality on three benchmark datasets, surpassing existing T3D face methods.
Delivers significantly faster inference than other T3D methods, almost 470 times faster than Latent3D for five-view generation.
Enables accurate and efficient text-driven 3D face manipulation. |
Relies on a pre-trained StyleNeRF model, limiting its generalizability to unseen domains.
The diversity of generated 3D faces can be further improved. |
generative model, cross-modal mapping, text-to-3d face generation, 3d face manipulation, geometric regularization |
2403.06517
Report |
Active Generation for Image Classification |
Tao Huang, Jiaqi Liu, Shan You, Chang Xu |
Recently, the growing capabilities of deep generative models have underscored
their potential in enhancing image classification accuracy. However, existing
methods often demand the generation of a disproportionately large number of
images compared to the original dataset, while having only marginal
improvements in accuracy. This computationally expensive and time-consuming
process hampers the practicality of such approaches. In this paper, we propose
to address the efficiency of image generation by focusing on the specific needs
and characteristics of the model. With a central tenet of active learning, our
method, named ActGen, takes a training-aware approach to image generation. It
aims to create images akin to the challenging or misclassified samples
encountered by the current model and incorporates these generated images into
the training set to augment model performance. ActGen introduces an attentive
image guidance technique, using real images as guides during the denoising
process of a diffusion model. The model's attention on the class prompt is
leveraged to preserve a similar foreground object while
diversifying the background. Furthermore, we introduce a gradient-based
generation guidance method, which employs two losses to generate more
challenging samples and prevent the generated images from being too similar to
previously generated ones. Experimental results on the CIFAR and ImageNet
datasets demonstrate that our method achieves better performance with a
significantly reduced number of generated images. |
This paper presents ActGen, a training-aware approach for enhancing image classification accuracy by actively generating images mimicking challenging or misclassified samples using diffusion models. |
Existing methods for augmenting image classification with synthetic data lack efficiency, often generating large amounts of redundant data for marginal improvements. |
ActGen identifies misclassified images as prototypes for hard samples and utilizes attentive image guidance and gradient-based guidance within the diffusion model to generate diverse, challenging augmentations. |
ActGen significantly improves classification accuracy on ImageNet and CIFAR datasets with a reduced number of generated images compared to previous methods.
The attentive image guidance method, incorporating real image guidance and selective guidance with attention masks, ensures fidelity and background diversity in generated images.
Gradient-based guidance, utilizing contrastive and adversarial losses, further enhances the diversity and classification difficulty of synthetic images. |
The computational cost of ActGen, while significantly lower than previous methods, remains higher than traditional training.
Future research can explore extending ActGen to other domains beyond image classification. |
data augmentation, image classification, image generation, diffusion models, active learning |
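The first step described in the ActGen entry above, picking misclassified images as prototypes for hard samples, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the plain-list logits format are hypothetical and are not taken from the ActGen code, and the diffusion-based generation that would follow is not shown.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def select_hard_samples(logits: List[Sequence[float]], labels: List[int]) -> List[int]:
    """Return indices of samples the current classifier gets wrong.

    Assumption: `logits[i]` holds per-class scores for sample i and
    `labels[i]` is its ground-truth class index.
    """
    return [
        i
        for i, (scores, label) in enumerate(zip(logits, labels))
        if argmax(scores) != label
    ]
```

In a training-aware loop, the returned indices would point to the real images used as guides for generating new samples; how often this selection is refreshed is a design choice the entry does not specify.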
2403.06505
Report |
Vosh: Voxel-Mesh Hybrid Representation for Real-Time View Synthesis |
Chenhao Zhang, Yongyang Zhou, Lei Zhang |
The neural radiance field (NeRF) has emerged as a prominent methodology for
synthesizing realistic images of novel views. While neural radiance
representations based on voxels or mesh individually offer distinct advantages,
excelling in either rendering quality or speed, each has limitations in the
other aspect. In response, we propose a pioneering hybrid representation named
Vosh, seamlessly combining both voxel and mesh components in hybrid rendering
for view synthesis. Vosh is built by optimizing the voxel grid
of NeRF and strategically replacing selected voxels with mesh. Therefore, it
excels at fast rendering of scenes with simple geometry and textures through its
mesh component, while simultaneously enabling high-quality rendering in
intricate regions by leveraging its voxel component. The flexibility of Vosh is
showcased through the ability to adjust hybrid ratios, letting users
control the balance between rendering quality and speed to suit their
use case. Experimental results demonstrate that our method achieves a
commendable trade-off between rendering quality and speed, and notably has
real-time performance on mobile devices. |
Presents Vosh, a novel hybrid representation combining voxels and meshes, for real-time view synthesis with Neural Radiance Fields (NeRF). |
Addresses limitations in existing NeRF methods that struggle to balance high-quality rendering with real-time performance on mobile devices. |
Constructs a hybrid representation by: 1) Training an initial high-resolution voxel grid. 2) Converting suitable voxels into a mesh using differentiable surface rendering. 3) Optimizing both voxel and mesh components via hybrid rendering and voxel adjustment. |
Achieves real-time rendering on mobile devices, including laptops and smartphones.
Demonstrates superior rendering quality compared to mesh-based methods, particularly in representing complex scenes.
Offers a controllable balance between rendering speed and quality through voxel adjustment and hybrid ratios. |
Shares limitations with SNeRG and MERF, such as challenges in modeling view-dependent colors for translucent objects.
Potential degradation in mesh optimization quality can impact overall rendering quality. |
neural radiance field, view synthesis, real-time rendering, hybrid representation, mobile devices |
2403.06403
Report |
PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models |
Qingdong He, Jinlong Peng, Zhengkai Jiang, Xiaobin Hu, Jiangning Zhang, Qiang Nie, Yabiao Wang, Chengjie Wang |
The recent success of vision foundation models has shown promising performance
on 2D perception tasks. However, it is difficult to train a 3D foundation
network directly due to limited data, and it remains underexplored
whether existing foundation models can be lifted to 3D space seamlessly. In
this paper, we present PointSeg, a novel training-free paradigm that leverages
off-the-shelf vision foundation models to address 3D scene perception tasks.
PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts to
align their corresponding pixels across frames. Concretely, we design a
two-branch prompt-learning structure to construct 3D point-box prompt
pairs, combined with a bidirectional matching strategy for accurate point
and proposal prompt generation. Then, we perform iterative post-refinement
adaptively when working with different vision foundation models. Moreover,
we design an affinity-aware merging algorithm to improve the final ensemble
masks. PointSeg demonstrates impressive segmentation performance across various
datasets, all without training. Specifically, our approach significantly
surpasses the state-of-the-art specialist model by 13.4%, 11.3%, and
12% mAP on the ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top
of that, PointSeg can be combined with various segmentation models and even
surpasses supervised methods. |
Presents PointSeg, a novel training-free paradigm that leverages off-the-shelf vision foundation models for 3D scene segmentation. |
Addresses the limitations of training 3D foundation models due to limited datasets and explores the potential of applying existing VFMs to 3D tasks. |
Utilizes a two-branch prompt-learning structure to generate 3D point-box prompt pairs, refined by bidirectional matching. Employs iterative post-refinement on 2D masks and affinity-aware merging for accurate 3D segmentation. |
Significantly outperforms state-of-the-art specialist models on ScanNet, ScanNet++, and KITTI-360 datasets (11.3%-13.4% mAP improvement).
Demonstrates robust generalization ability across diverse indoor and outdoor 3D scenarios.
Effectively incorporates and benefits from various segmentation foundation models, showing improvement transfer from 2D to 3D. |
Performance can be affected by the accuracy of the underlying 2D foundation models.
Future work includes exploring more 3D tasks using foundation models. |
3d scene segmentation, foundation models, zero-shot learning, vision foundation models (vfms), point cloud segmentation |
2403.06400
Report |
DivCon: Divide and Conquer for Progressive Text-to-Image Generation |
Yuhao Jia, Wenhan Tan |
Diffusion-driven text-to-image (T2I) generation has achieved remarkable
advancements. To further improve T2I models' capability in numerical and
spatial reasoning, the layout is employed as an intermediary to bridge large
language models and layout-based diffusion models. However, these methods still
struggle with generating images from textual prompts with multiple objects and
complicated spatial relationships. To tackle this challenge, we introduce a
divide-and-conquer approach which decouples the T2I generation task into simple
subtasks. Our approach divides the layout prediction stage into numerical and
spatial reasoning and bounding box prediction. Then, the layout-to-image
generation stage is conducted in an iterative manner to reconstruct objects
from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K
benchmarks and our approach outperforms previous state-of-the-art models with
notable margins. In addition, visual results demonstrate that our approach
significantly improves the controllability and consistency in generating
multiple objects from complex textual prompts. |
This paper proposes DivCon, a novel divide-and-conquer approach for text-to-image generation that enhances numerical and spatial reasoning capabilities by dividing the task into simpler subtasks. |
Current text-to-image generation models struggle to accurately generate images from text prompts with multiple objects and complex spatial relationships. DivCon addresses this challenge by decomposing the task, leading to improved accuracy and fidelity in image generation. |
DivCon divides layout prediction into two steps: (1) numerical and spatial reasoning using LLMs and (2) bounding box prediction. Layout-to-image generation is also a two-step iterative process: (1) initial image synthesis and consistency evaluation and (2) refinement focusing on low-fidelity objects. |
DivCon significantly outperforms previous state-of-the-art models in numerical and spatial accuracy on HRS and NSR-1K benchmarks.
DivCon generates more accurate layouts with less object overlap compared to baselines.
Qualitative results showcase DivCon's superior performance in handling complex prompts with multiple objects and intricate spatial arrangements. |
DivCon still faces challenges in generating objects from certain pattern layouts, particularly those involving significant object overlap.
Future work could focus on developing more sophisticated layout-conditioned image generation models to better handle overlapping bounding boxes. |
text-to-image generation, large language models, diffusion models, divide and conquer, layout-based generation |
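The object overlap mentioned in the DivCon results and limitations above can be quantified with intersection-over-union (IoU) between the bounding boxes of a predicted layout. The sketch below is an illustration only: the `(x_min, y_min, x_max, y_max)` box format and the helper names are assumptions, and the entry does not state which overlap measure DivCon itself uses.

```python
from itertools import combinations
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in any consistent unit.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def max_pairwise_iou(boxes: List[Box]) -> float:
    """Largest IoU over all pairs of boxes; 0.0 for fewer than two boxes."""
    return max((iou(a, b) for a, b in combinations(boxes, 2)), default=0.0)
```

A layout whose `max_pairwise_iou` is high contains heavily overlapping boxes, which is the case the limitations above identify as difficult.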
2403.06381
Report |
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models |
Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi |
Recent advancements in diffusion models have notably improved the perceptual
quality of generated images in text-to-image synthesis tasks. However,
diffusion models often struggle to produce images that accurately reflect the
intended semantics of the associated text prompts. We examine cross-attention
layers in diffusion models and observe a propensity for these layers to
disproportionately focus on certain tokens during the generation process,
thereby undermining semantic fidelity. To address the issue of dominant
attention, we introduce attention regulation, a computation-efficient
on-the-fly optimization approach at inference time to align attention maps with
the input text prompt. Notably, our method requires no additional training or
fine-tuning and serves as a plug-in module on a model. Hence, the generation
capacity of the original model is fully preserved. We compare our approach with
alternative approaches across various datasets, evaluation metrics, and
diffusion models. Experimental results show that our method consistently
outperforms other baselines, yielding images that more faithfully reflect the
desired concepts with reduced computation overhead. Code is available at
https://github.com/YaNgZhAnG-V5/attention_regulation. |
The paper introduces 'attention regulation,' a method to improve the semantic fidelity of text-to-image synthesis in diffusion models by adjusting attention maps during inference. |
Diffusion models, while good at generating high-quality images, often struggle to accurately represent the semantics of the input text prompt, leading to missing or misrepresented objects. |
The method formulates attention map editing as a constrained optimization problem, minimizing the difference between edited and original maps while promoting attention to target tokens. |
Attention regulation improves semantic alignment, as evidenced by higher CLIP scores and object detection success rates compared to baseline methods.
The method is computationally efficient, adding only a 48% overhead to inference time, significantly less than other approaches.
Attention regulation maintains its effectiveness across various diffusion models (Stable Diffusion 1.4, 1.5, 2, and 2.1) and datasets. |
The method may generate images that deviate from human knowledge or fuse concepts in undesired ways due to limitations in the diffusion model's learned features.
Future work could explore methods to align the model's understanding of features with human knowledge to further improve semantic fidelity. |
diffusion models, text-to-image synthesis, semantic fidelity, attention mechanism, constrained optimization |
2403.06356
Report |
Video Generation with Consistency Tuning |
Chaoyi Wang, Yaozhe Song, Yafeng Zhang, Jun Pei, Lijie Xia, Jianpo Liu |
Currently, various studies have been exploring the generation of long videos.
However, the generated frames in these videos often exhibit jitter and noise.
Therefore, to generate videos without these artifacts, we propose a
novel framework composed of four modules: separate tuning module, average
fusion module, combined tuning module, and inter-frame consistency module. By
applying our newly proposed modules successively, the consistency of the
background and foreground in each video frame is optimized. Besides, the
experimental results demonstrate that videos generated by our method exhibit
high quality in comparison with state-of-the-art methods. |
This paper introduces a novel framework for generating long videos with enhanced consistency, addressing the jitter and noise produced by existing video generation methods. |
Generating high-quality long videos remains challenging because frames produced by existing methods often exhibit jitter and noise. This work aims to improve the consistency and quality of generated video frames. |
The framework consists of four key modules: 1) Separate Tuning Module for extracting foreground and background, 2) Average Fusion Module for optimizing consistency, 3) Combined Tuning Module for fine-tuning with focus on foreground and background, and 4) Inter-frame Consistency Module for ensuring temporal smoothness. |
Initial experiments utilizing the first two modules demonstrate promising results in generating videos with improved consistency.
Visual comparisons with state-of-the-art methods highlight the effectiveness of the proposed approach.
Further experiments incorporating the remaining modules are underway to showcase the full potential of the framework. |
Currently, only the first two modules have been experimentally validated.
Quantitative evaluation metrics for video quality are not yet provided. |
video generation, diffusion models, consistency tuning, long videos, deep learning |
2403.06269
Report |
FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing |
Youyuan Zhang, Xuan Ju, James J. Clark |
Diffusion models have demonstrated remarkable capabilities in text-to-image
and text-to-video generation, opening up possibilities for video editing based
on textual input. However, the computational cost associated with sequential
sampling in diffusion models poses challenges for efficient video editing.
Existing approaches relying on image generation models for video editing suffer
from time-consuming one-shot fine-tuning, additional condition extraction, or
DDIM inversion, making real-time applications impractical. In this work, we
propose FastVideoEdit, an efficient zero-shot video editing approach inspired
by Consistency Models (CMs). By leveraging the self-consistency property of
CMs, we eliminate the need for time-consuming inversion or additional condition
extraction, reducing editing time. Our method enables direct mapping from
source video to target video with strong preservation ability utilizing a
special variance schedule. This results in improved speed advantages, as fewer
sampling steps can be used while maintaining comparable generation quality.
Experimental results validate the state-of-the-art performance and speed
advantages of FastVideoEdit across evaluation metrics encompassing editing
speed, temporal consistency, and text-video alignment. |
This paper introduces FastVideoEdit, an efficient zero-shot approach, based on consistency models, for high-quality text-driven video editing. |
Existing text-driven video editing methods relying on diffusion models often suffer from high computational costs due to sequential sampling or additional condition extraction steps, making them impractical for real-time applications. FastVideoEdit tackles this challenge by leveraging the efficiency and content-preserving nature of consistency models. |
FastVideoEdit utilizes the self-consistency property of consistency models to allow direct mapping between source and target videos without DDIM inversion. It introduces a special variance schedule and incorporates techniques like Batch Attention Control, background preservation via latent replacement, and TokenFlow for enhanced temporal consistency and background preservation. |
FastVideoEdit achieves state-of-the-art performance on the TGVE 2023 dataset across metrics including temporal consistency, text-video alignment, and editing speed.
The method significantly reduces editing time compared to previous approaches by eliminating the need for DDIM inversion and additional condition extraction.
FastVideoEdit demonstrates superior background preservation capabilities compared to existing methods, particularly when editing foreground object attributes. |
FastVideoEdit may require hyperparameter tuning for each specific video editing task to perform well.
While generally effective, there is no guarantee of successful editing for every case, as performance can be influenced by factors like input data quality and the complexity of the edit. |
video editing, diffusion models, consistency models, text-to-video editing, zero-shot learning |
2403.06243
Report |
BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering |
Xinmin Qiu, Congying Han, Zicheng Zhang, Bonan Li, Tiande Guo, Pingyu Wang, Xuecheng Nie |
Developing blind video deflickering (BVD) algorithms to enhance video
temporal consistency is gaining importance amid the flourishing of image
processing and video generation. However, the intricate nature of video data
complicates the training of deep learning methods, leading to high resource
consumption and instability, notably under severe lighting flicker. This
underscores the critical need for a compact representation beyond pixel values
to advance BVD research and applications. Inspired by the classic scale-time
equalization (STE), our work introduces the histogram-assisted solution, called
BlazeBVD, for high-fidelity and rapid BVD. Compared with STE, which directly
corrects pixel values by temporally smoothing color histograms, BlazeBVD
leverages smoothed illumination histograms within STE filtering to ease the
challenge of learning temporal data using neural networks. Technically,
BlazeBVD begins by condensing pixel values into illumination histograms that
precisely capture flickering and local exposure variations. These histograms
are then smoothed to produce a singular frames set, filtered illumination maps,
and exposure maps. Using these deflickering priors, BlazeBVD employs a
2D network to restore faithful and consistent texture impacted by lighting
changes or localized exposure issues. BlazeBVD also incorporates a lightweight
3D network to amend slight temporal inconsistencies, avoiding the resource
consumption issue. Comprehensive experiments on synthetic, real-world, and
generated videos showcase the superior qualitative and quantitative results of
BlazeBVD, achieving inference speeds up to 10x faster than state-of-the-art methods. |
Introduces BlazeBVD, a histogram-assisted blind video deflickering method that uses deflickering priors from Scale-Time Equalization (STE) to reduce the complexity and resource demands of deflickering. |
Existing deep learning methods for blind video deflickering (BVD) are computationally expensive and struggle with severe lighting flicker, demanding a more compact representation than pixel values. |
BlazeBVD prepares deflickering priors (filtered illumination map, singular frames set, exposure maps) from STE. It then uses a Global Flicker Removal Module (GFRM) guided by the filtered illumination map and a Local Flicker Removal Module (LFRM) based on optical flow warping and exposure maps. Finally, a lightweight spatio-temporal network enhances temporal consistency. |
BlazeBVD achieves superior qualitative and quantitative results on synthetic, real-world, and generated videos, outperforming state-of-the-art methods.
It effectively tackles both illumination fluctuations and over-/under-exposure challenges, preserving texture details.
BlazeBVD achieves inference speeds up to 10x faster than previous methods due to its efficient histogram-based representation and modular design. |
Inaccurate optical flow estimation in LFRM can lead to minor artifacts.
Balancing faithfulness and coherence in generated videos requires further investigation. |
video deflickering, histogram, temporal consistency, scale-time equalization, exposure correction |
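The idea in the BlazeBVD entry above of condensing frames into illumination histograms and smoothing them over time can be illustrated with a small sketch. It is a simplified stand-in and not STE itself nor BlazeBVD's implementation: the luminance range of [0, 1], the bin count, the Gaussian weights, and all function names are assumptions made for illustration.

```python
import math
from typing import List


def luminance_histogram(frame: List[float], bins: int = 8) -> List[float]:
    """Normalized histogram of a non-empty frame's luminance values in [0, 1]."""
    counts = [0] * bins
    for value in frame:
        # Clamp so that out-of-range values fall into the first or last bin.
        index = max(0, min(int(value * bins), bins - 1))
        counts[index] += 1
    return [count / len(frame) for count in counts]


def smooth_over_time(
    hists: List[List[float]], sigma: float = 1.0, radius: int = 2
) -> List[List[float]]:
    """Smooth each histogram bin along the time axis with Gaussian weights.

    Weights are renormalized at the sequence boundaries, so every output
    histogram still sums to 1.
    """
    num_frames = len(hists)
    num_bins = len(hists[0])
    smoothed = []
    for t in range(num_frames):
        acc = [0.0] * num_bins
        weight_sum = 0.0
        for s in range(max(0, t - radius), min(num_frames, t + radius + 1)):
            weight = math.exp(-((s - t) ** 2) / (2.0 * sigma ** 2))
            weight_sum += weight
            for b in range(num_bins):
                acc[b] += weight * hists[s][b]
        smoothed.append([value / weight_sum for value in acc])
    return smoothed
```

A frame whose histogram differs sharply from its smoothed version is a likely flicker frame; how BlazeBVD turns the smoothed histograms into illumination and exposure maps is not shown here.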
2403.06213
Report |
$V_kD:$ Improving Knowledge Distillation using Orthogonal Projections |
Roy Miles, Ismail Elezi, Jiankang Deng |
Knowledge distillation is an effective method for training small and
efficient deep learning models. However, the efficacy of a single method can
degenerate when transferring to other tasks, modalities, or even other
architectures. To address this limitation, we propose a novel constrained
feature distillation method. This method is derived from a small set of core
principles, which results in two emerging components: an orthogonal projection
and a task-specific normalisation. Equipped with both of these components, our
transformer models can outperform all previous methods on ImageNet and reach up
to a 4.4% relative improvement over the previous state-of-the-art methods. To
further demonstrate the generality of our method, we apply it to object
detection and image generation, whereby we obtain consistent and substantial
performance improvements over state-of-the-art. Code and models are publicly
available: https://github.com/roymiles/vkd |
The paper presents $V_kD$, a novel knowledge distillation method using orthogonal projections to maximize knowledge transfer by preserving intra-batch feature similarity. |
Existing knowledge distillation methods often rely on heuristics, lack adaptability to diverse tasks, and introduce significant computational overhead. |
The method utilizes an orthogonal projection layer, derived from the principle of preserving feature similarity, and efficiently implemented via projection onto the Stiefel manifold. It also introduces task-specific normalization to improve performance in both discriminative and generative tasks. |
$V_kD$ achieves state-of-the-art performance on ImageNet, with up to a 4.4% relative improvement over previous methods.
It demonstrates consistent improvements in object detection tasks using ViDT architecture.
For data-limited image generation, $V_kD$ with feature whitening outperforms KD-DLGAN without needing auxiliary diversity losses. |
The paper mainly evaluates the method on visual tasks; further exploration in other domains is needed.
Investigating the impact of different kernel choices for the similarity preservation constraint could be beneficial. |
knowledge distillation, orthogonal projection, feature similarity, task-specific normalization, vision transformers |
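The claim in the $V_kD$ entry above that an orthogonal projection preserves intra-batch feature similarity follows from a short identity: if $P$ has orthonormal rows ($P P^\top = I$), then $(ZP)(ZP)^\top = Z Z^\top$ for any batch of features $Z$. The sketch below checks this numerically. It is an illustration under assumptions: the student dimension is taken to be no larger than the teacher dimension, Gram-Schmidt is swapped in as a simple way to build $P$ (the entry describes a projection onto the Stiefel manifold, which is not implemented here), and all names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormal_rows(k: int, d: int, seed: int = 0) -> Matrix:
    """Build k orthonormal row vectors in R^d (requires k <= d) by Gram-Schmidt."""
    rng = random.Random(seed)
    rows: Matrix = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            coeff = dot(v, r)
            v = [a - coeff * b for a, b in zip(v, r)]  # remove component along r
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip (rare) nearly dependent draws
            rows.append([a / norm for a in v])
    return rows


def project(features: Matrix, p: Matrix) -> Matrix:
    """Map each k-dim feature z to the d-dim vector z @ P."""
    d = len(p[0])
    return [[sum(z[i] * p[i][j] for i in range(len(p))) for j in range(d)] for z in features]


def gram(x: Matrix) -> Matrix:
    """Pairwise inner products between all rows of x."""
    return [[dot(a, b) for b in x] for a in x]
```

Comparing `gram(z)` with `gram(project(z, p))` for any batch `z` shows the two matrices agree up to floating-point error, which is the similarity-preservation property the entry refers to.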
2403.06168
Report |
DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation |
Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jinlong Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin, Chengjie Wang, Rongrong Ji |
Due to the difficulty and labor-intensive nature of obtaining highly accurate,
matting-level annotations, only a limited amount of highly accurate
labels is available to the public. To tackle this challenge, we propose
DiffuMatting, which inherits the strong "generate anything" ability of
diffusion models and adds the power of "matting anything". DiffuMatting can (1)
act as an anything-matting factory with highly accurate annotations and (2) be
compatible with community LoRAs and various conditional control approaches
to achieve community-friendly art design and controllable generation.
Specifically, inspired by green-screen matting, we aim to teach the diffusion
model to paint on a fixed green-screen canvas. To this end, a large-scale
green-screen dataset (Green100K) is collected as a training dataset for
DiffuMatting. In addition, a green background control loss is proposed to keep the
drawing board a pure green color so that the foreground and
background can be distinguished. To ensure the synthesized object has more edge details, a
detail-enhancement loss on the transition boundary is proposed to guide the
generation of objects with more complicated edge structures. To
simultaneously generate the object and its matting annotation, we build a
matting head that removes the green color in the latent space of the VAE
decoder. DiffuMatting shows several potential applications (e.g.,
matting-data generator, community-friendly art design, and controllable
generation). As a matting-data generator, DiffuMatting synthesizes general
object and portrait matting sets, effectively reducing the relative MSE
by 15.4% in General Object Matting and 11.4% in Portrait Matting tasks. |
This paper introduces DiffuMatting, a novel diffusion-based model that generates arbitrary objects with accompanying high-quality matting-level annotations. |
Creating matting-level annotations is labor-intensive and existing datasets are limited. DiffuMatting addresses this by acting as a data factory for high-quality synthetic matting data, benefiting downstream tasks like image composition and matting algorithm training. |
The model is trained on a newly created Green100K dataset, containing images with green-screen backgrounds and accurate matting annotations. It leverages a green-background control loss for background consistency and a detailed-enhancement loss for fine edge details. A dedicated matting head in the VAE decoder extracts matting masks, further refined by a GreenPost process. |
DiffuMatting outperforms existing methods in generating clean green-screen objects.
Synthetic data generated by DiffuMatting improves the performance of general object and portrait matting tasks, reducing relative MSE by 15.4% and 11.4%, respectively.
DiffuMatting is versatile and compatible with LoRA models and ControlNet for customized styles and controllable image editing. |
Currently limited to generating matting annotations for green-screen images, requiring further exploration for general backgrounds.
Potential for misuse in illicit industries, necessitating explicit markings on generated content. |
matting generation, diffusion models, synthetic data, controllable generation, image composition |
2403.06135
Report |
MACE: Mass Concept Erasure in Diffusion Models |
Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, Adams Wai-Kin Kong |
The rapid expansion of large-scale text-to-image diffusion models has raised
growing concerns regarding their potential misuse in creating harmful or
misleading content. In this paper, we introduce MACE, a finetuning framework
for the task of mass concept erasure. This task aims to prevent models from
generating images that embody unwanted concepts when prompted. Existing concept
erasure methods are typically restricted to handling fewer than five concepts
simultaneously and struggle to find a balance between erasing concept synonyms
(generality) and maintaining unrelated concepts (specificity). In contrast,
MACE differs by successfully scaling the erasure scope up to 100 concepts and
by achieving an effective balance between generality and specificity. This is
achieved by leveraging closed-form cross-attention refinement along with LoRA
finetuning, collectively eliminating the information of undesirable concepts.
Furthermore, MACE integrates multiple LoRAs without mutual interference. We
conduct extensive evaluations of MACE against prior methods across four
different tasks: object erasure, celebrity erasure, explicit content erasure,
and artistic style erasure. Our results reveal that MACE surpasses prior
methods in all evaluated tasks. Code is available at
https://github.com/Shilin-LU/MACE. |
MACE is a finetuning framework for Mass Concept Erasure in text-to-image diffusion models, capable of removing a large number of concepts (up to 100) while maintaining a balance between generality and specificity. |
Concept erasure is crucial for mitigating risks associated with large-scale T2I models, such as generating harmful, copyrighted, or offensive content, which current methods struggle to handle effectively. |
MACE leverages closed-form cross-attention refinement to remove residual information of target concepts and employs LoRA finetuning with concept-focal importance sampling to erase intrinsic concept information. It also integrates multiple LoRAs to prevent interference and catastrophic forgetting. |
MACE outperforms SOTA methods in erasing objects, celebrities, explicit content, and artistic styles while preserving unrelated concepts.
It effectively removes concepts even when prompted with synonyms, demonstrating strong generality.
MACE scales well to erasing a large number of concepts (100) with minimal impact on the generation of unrelated concepts. |
Performance slightly declines when scaling from 10 to 100 erased concepts.
Further research is needed to enhance scalability for erasing thousands of concepts in future models. |
concept erasure, text-to-image synthesis, diffusion models, ethical ai, lora |
2403.06098
Report |
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models |
Wenhao Wang, Yi Yang |
The arrival of Sora marks a new era for text-to-video diffusion models,
bringing significant advancements in video generation and potential
applications. However, Sora, along with other text-to-video diffusion models,
is highly reliant on prompts, and there is no publicly available dataset that
features a study of text-to-video prompts. In this paper, we introduce VidProM,
the first large-scale dataset comprising 1.67 Million unique text-to-Video
Prompts from real users. Additionally, this dataset includes 6.69 million
videos generated by four state-of-the-art diffusion models, alongside some
related data. We initially discuss the curation of this large-scale dataset, a
process that is both time-consuming and costly. Subsequently, we underscore the
need for a new prompt dataset specifically designed for text-to-video
generation by illustrating how VidProM differs from DiffusionDB, a large-scale
prompt-gallery dataset for image generation. Our extensive and diverse dataset
also opens up many exciting new research areas. For instance, we suggest
exploring text-to-video prompt engineering, efficient video generation, and
video copy detection for diffusion models to develop better, more efficient,
and safer models. The project (including the collected dataset VidProM and
related code) is publicly available at https://vidprom.github.io under the
CC-BY-NC 4.0 License. |
Introduces VidProM, the first large-scale dataset of 1.67 million unique text-to-video prompts and 6.69 million corresponding videos generated using four state-of-the-art diffusion models. |
Addresses the lack of publicly available datasets for studying text-to-video prompts, crucial for advancing text-to-video generation models like Sora. |
Collects prompts from Pika Discord channels, generates videos using Pika, Text2Video-Zero, VideoCrafter2, and ModelScope. Embeds prompts using OpenAI's text-embedding-3-large and assigns NSFW probabilities using Detoxify. |
VidProM contains 1.67M unique prompts and 6.69M videos, significantly more diverse than existing text-to-image prompt datasets.
Analysis reveals text-to-video prompts are more dynamic, complex, and longer than text-to-image prompts, highlighting the need for a dedicated dataset.
Benchmarks show existing fake image detection methods generalize poorly to fake videos, demonstrating the dataset's value for developing specialized detectors. |
Current generated videos are short and may not reflect the highest quality possible.
Dataset currently lacks videos generated by advanced models like Sora, planned for future updates. |
text-to-video generation, diffusion models, prompt engineering, dataset, fake video detection |
2403.06092
Report |
Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis? |
Hanxin Zhu, Tianyu He, Xin Li, Bingchen Li, Zhibo Chen |
Neural Radiance Field (NeRF) has achieved superior performance for novel view
synthesis by modeling the scene with a Multi-Layer Perceptron (MLP) and a
volume rendering procedure. However, when fewer known views are given (i.e.,
few-shot view synthesis), the model is prone to overfit the given views. To
handle this issue, previous efforts have been made towards leveraging learned
priors or introducing additional regularizations. In contrast, in this paper,
we for the first time provide an orthogonal method from the perspective of
network structure. Given the observation that trivially reducing the number of
model parameters alleviates the overfitting issue, but at the cost of missing
details, we propose the multi-input MLP (mi-MLP) that incorporates the inputs
(i.e., location and viewing direction) of the vanilla MLP into each layer to
prevent the overfitting issue without harming detailed synthesis. To further
reduce the artifacts, we propose to model colors and volume density separately
and present two regularization terms. Extensive experiments on multiple
datasets demonstrate that: 1) although the proposed mi-MLP is easy to
implement, it is surprisingly effective as it boosts the PSNR of the baseline
from $14.73$ to $24.23$. 2) the overall framework achieves state-of-the-art
results on a wide range of benchmarks. We will release the code upon
publication. |
This paper introduces mi-MLP, a multi-input MLP designed to address overfitting in few-shot novel view synthesis by incorporating location and viewing direction inputs into each layer, enhancing flexibility without sacrificing model capacity. |
Few-shot novel view synthesis with NeRF suffers from overfitting due to limited training views, resulting in poor generalization and artifacts. This work explores network structure modification as an alternative solution. |
The paper proposes mi-MLP, incorporating inputs into every MLP layer. Additionally, it proposes separate modeling of color and volume density with different positional encoding frequencies. Two regularization techniques are introduced: background regularization for object-centric scenes and sampling annealing for near-field artifacts. |
mi-MLP significantly improves PSNR compared to the baseline (e.g., 14.73 to 24.23 on Blender).
The proposed method achieves state-of-the-art results on Blender, LLFF, and Shiny datasets.
Ablation studies confirm the effectiveness of mi-MLP, separate modeling, and regularization techniques. |
Consistency for complex textures or thin structures is limited due to no constraints on unknown views.
Future work includes exploring additional regularizations and priors for improved novel view synthesis. |
novel view synthesis, neural radiance fields (nerf), few-shot learning, multi-layer perceptron (mlp), overfitting |
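The multi-input idea in the mi-MLP summary above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the raw input (location and viewing direction, here a 6-value vector) is re-injected by simple concatenation before every layer after the first, and all function names (`build_mi_mlp`, `mi_mlp_forward`) are hypothetical.

```python
import random

def linear(weights, bias, x):
    """Affine map: weights is a list of rows, x is a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_mi_mlp(input_dim, hidden_dim, depth, out_dim, seed=0):
    """Hypothetical builder: every layer after the first also receives the raw input."""
    rng = random.Random(seed)
    layers = [init_layer(input_dim, hidden_dim, rng)]
    for _ in range(depth - 1):
        # Later hidden layers take [hidden features, raw input] (assumed concatenation).
        layers.append(init_layer(hidden_dim + input_dim, hidden_dim, rng))
    head = init_layer(hidden_dim + input_dim, out_dim, rng)
    return layers, head

def mi_mlp_forward(model, x):
    layers, head = model
    h = relu(linear(*layers[0], x))
    for weights, bias in layers[1:]:
        h = relu(linear(weights, bias, h + x))  # list concatenation re-injects the input
    return linear(*head, h + x)

# Usage: 3 location values + 3 viewing-direction values, 4 outputs (e.g. RGB + density).
model = build_mi_mlp(input_dim=6, hidden_dim=8, depth=3, out_dim=4)
output = mi_mlp_forward(model, [0.1, 0.2, 0.3, 0.0, 0.0, 1.0])
```

A vanilla MLP would drop the `+ x` terms, so the input would only reach later layers through the hidden features.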
2403.05907
Report |
Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving |
Junyi Cao, Zhichao Li, Naiyan Wang, Chao Ma |
Recent studies have highlighted the promising application of NeRF in
autonomous driving contexts. However, the complexity of outdoor environments,
combined with the restricted viewpoints in driving scenarios, complicates the
task of precisely reconstructing scene geometry. Such challenges often lead to
diminished quality in reconstructions and extended durations for both training
and rendering. To tackle these challenges, we present Lightning NeRF. It uses
an efficient hybrid scene representation that effectively utilizes the geometry
prior from LiDAR in autonomous driving scenarios. Lightning NeRF significantly
improves the novel view synthesis performance of NeRF and reduces computational
overheads. Through evaluations on real-world datasets, such as KITTI-360,
Argoverse2, and our private dataset, we demonstrate that our approach not only
exceeds the current state-of-the-art in novel view synthesis quality but also
achieves a five-fold increase in training speed and a ten-fold improvement in
rendering speed. Codes are available at
https://github.com/VISION-SJTU/Lightning-NeRF . |
This paper introduces Lightning-NeRF, an efficient novel view synthesis framework for large-scale outdoor scenes that leverages point clouds and images in autonomous driving scenarios. |
Existing NeRF methods struggle to balance high-fidelity reconstruction with computational efficiency, especially in outdoor driving scenarios where scenes are vast and computationally expensive to process. |
The proposed method employs a hybrid scene representation that explicitly models density with a voxel grid initialized by LiDAR point clouds and implicitly models color with shallow MLPs. It also incorporates efficient background modeling and color decomposition to enhance rendering quality and extrapolation ability. |
Lightning-NeRF outperforms state-of-the-art methods in novel view synthesis quality on KITTI-360 and Argoverse2 datasets.
It achieves a five-fold improvement in training speed and a ten-fold improvement in rendering speed compared to previous methods.
The method demonstrates superior extrapolation capabilities, vital for simulating novel views in autonomous driving scenarios. |
The method assumes the availability of LiDAR data, which might not be universally applicable.
Future work could explore dynamically adjusting the resolution of the hybrid representation for better efficiency. |
neural radiance fields, novel view synthesis, autonomous driving, lidar, hybrid scene representation |
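As a rough illustration of initializing an explicit density grid from LiDAR, the sketch below marks every voxel that contains at least one point. It assumes an axis-aligned cubic grid with uniform voxel size; the function name and parameters are hypothetical, and the paper's actual initialization may differ.

```python
def init_occupancy_from_points(points, grid_min, voxel_size, resolution):
    """Return the set of (i, j, k) voxel indices that contain at least one point."""
    occupied = set()
    for point in points:
        # Floor-divide each coordinate's offset from the grid origin by the voxel size.
        idx = tuple(int((c - lo) // voxel_size) for c, lo in zip(point, grid_min))
        if all(0 <= i < resolution for i in idx):  # discard points outside the grid
            occupied.add(idx)
    return occupied

# Usage: a 4 x 4 x 4 grid covering [0, 2) on each axis.
lidar_points = [(0.1, 0.1, 0.1), (0.15, 0.12, 0.05), (1.9, 1.9, 1.9), (5.0, 5.0, 5.0)]
occupied = init_occupancy_from_points(lidar_points, (0.0, 0.0, 0.0), 0.5, 4)
```

Occupied voxels could then start with a non-zero density while empty ones start at zero, which is one plausible way a geometry prior shortens training.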
2403.05846
Report |
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines |
Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov |
Text-to-image diffusion models (T2I) use a latent representation of a text
prompt to guide the image generation process. However, the process by which the
encoder produces the text representation is unknown. We propose the Diffusion
Lens, a method for analyzing the text encoder of T2I models by generating
images from its intermediate representations. Using the Diffusion Lens, we
perform an extensive analysis of two recent T2I models. Exploring compound
prompts, we find that complex scenes describing multiple objects are composed
progressively and more slowly compared to simple scenes; Exploring knowledge
retrieval, we find that representation of uncommon concepts requires further
computation compared to common concepts, and that knowledge retrieval is
gradual across layers. Overall, our findings provide valuable insights into the
text encoder component in T2I pipelines. |
The paper introduces Diffusion Lens, a novel method for analyzing the internal workings of text encoders in text-to-image diffusion models by generating images from intermediate layers of the encoder. |
The text encoder is a key component of text-to-image generation, yet its internal mechanisms are poorly understood. This work provides a new tool to analyze how these encoders represent and process language. |
The method extracts the hidden state representations from different layers of the text encoder, passes them through the final layer norm, and feeds them to the diffusion model to generate images. These images provide a visual representation of how the text is encoded at each layer. |
Complex concepts are composed gradually, with simpler concepts emerging in earlier layers and relationships between concepts solidifying in later layers.
Common concepts are retrieved earlier in the network compared to uncommon concepts, suggesting gradual knowledge retrieval.
Different text encoders (T5 vs. CLIP) exhibit different representation building patterns, potentially influenced by training data and objectives. |
The study primarily relies on automatically generated prompts, which might not fully represent the complexity of human language.
The method requires manual analysis of generated images to derive insights, limiting the scale of analysis. |
text-to-image generation, diffusion models, text encoder, interpretability, conceptual combination |
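The extraction step described in the Diffusion Lens summary can be sketched with a toy encoder. This is an assumption-laden illustration: the layers are stand-in callables, the layer norm has no learned scale or shift, and `lens_representations` is a hypothetical name. In a real pipeline each returned representation would be passed to the diffusion model as its text conditioning.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Parameter-free layer norm over one token vector (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def lens_representations(token_embeddings, layers, final_norm=layer_norm):
    """Collect (layer_index, normalized hidden states) after every encoder layer."""
    hidden = token_embeddings
    outputs = []
    for index, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        # Apply the encoder's final norm to the intermediate states, as in the summary.
        outputs.append((index, [final_norm(tok) for tok in hidden]))
    return outputs

# Usage with a toy "layer" that scales and shifts every token vector.
toy_layer = lambda h: [[2.0 * v + 1.0 for v in tok] for tok in h]
representations = lens_representations([[1.0, 2.0, 3.0]], [toy_layer, toy_layer])
```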
2403.05726
Report |
Augmentations vs Algorithms: What Works in Self-Supervised Learning |
Warren Morningstar, Alex Bijamov, Chris Duvarney, Luke Friedman, Neha Kalibhat, Luyang Liu, Philip Mansfield, Renan Rojas-Gomez, Karan Singhal, Bradley Green, Sushant Prakash |
We study the relative effects of data augmentations, pretraining algorithms,
and model architectures in Self-Supervised Learning (SSL). While the recent
literature in this space leaves the impression that the pretraining algorithm
is of critical importance to performance, understanding its effect is
complicated by the difficulty in making objective and direct comparisons
between methods. We propose a new framework which unifies many seemingly
disparate SSL methods into a single shared template. Using this framework, we
identify aspects in which methods differ and observe that in addition to
changing the pretraining algorithm, many works also use new data augmentations
or more powerful model architectures. We compare several popular SSL methods
using our framework and find that many algorithmic additions, such as
prediction networks or new losses, have a minor impact on downstream task
performance (often less than $1\%$), while enhanced augmentation techniques
offer more significant performance improvements ($2-4\%$). Our findings
challenge the premise that SSL is being driven primarily by algorithmic
improvements, and suggest instead a bitter lesson for SSL: that augmentation
diversity and data / model scale are more critical contributors to recent
advances in self-supervised learning. |
This paper investigates the relative contributions of data augmentations, pretraining algorithms, and model architectures to the performance of self-supervised learning (SSL), demonstrating that data augmentation diversity and model scale are more impactful than algorithmic innovations. |
The importance of this study lies in clarifying the key drivers of SSL performance, which has been often attributed to algorithmic improvements, and providing insights for future research directions. |
The authors propose a unified framework encompassing popular SSL methods and conduct experiments comparing SimCLR, BYOL, SwAV, MoCo v2, DINO, and MoCo v3 with varying augmentations, algorithms, and architectures. |
Increasing augmentation diversity significantly improves downstream task performance across all methods, contributing to a substantial portion of performance gains in recent SSL advances.
Algorithmic enhancements, such as momentum encoders and prediction networks, show a smaller performance impact than augmentations, with their effects varying across different methods.
Increasing model size, specifically switching from ResNet-50 to ViT-B, leads to a notable performance improvement, supporting the significance of model scale in SSL. |
The study primarily focuses on instance-based joint embedding methods, excluding other SSL paradigms such as generative models.
While the paper demonstrates the importance of augmentations, further investigation is needed to understand the interplay between specific augmentations and SSL algorithms, especially in the context of increasingly diverse augmentations. |
self-supervised learning, data augmentation, pretraining algorithms, model architectures, representation learning |
2403.05438
Report |
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models |
Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo |
Text-to-image diffusion models (T2I) have demonstrated unprecedented
capabilities in creating realistic and aesthetic images. In contrast,
text-to-video diffusion models (T2V) still lag far behind in frame quality and
text alignment, owing to insufficient quality and quantity of training videos.
In this paper, we introduce VideoElevator, a training-free and plug-and-play
method, which elevates the performance of T2V using superior capabilities of
T2I. Different from conventional T2V sampling (i.e., temporal and spatial
modeling), VideoElevator explicitly decomposes each sampling step into temporal
motion refining and spatial quality elevating. Specifically, temporal motion
refining uses encapsulated T2V to enhance temporal consistency, followed by
inverting to the noise distribution required by T2I. Then, spatial quality
elevating harnesses inflated T2I to directly predict less noisy latent, adding
more photo-realistic details. We have conducted experiments on extensive
prompts under combinations of various T2V and T2I. The results show that
VideoElevator not only improves the performance of T2V baselines with
foundational T2I, but also facilitates stylistic video synthesis with
personalized T2I. Our code is available at
https://github.com/YBYBZhang/VideoElevator. |
VideoElevator is a training-free and plug-and-play method that enhances the quality of text-to-video diffusion models (T2V) by integrating them with various text-to-image diffusion models (T2I). |
Existing T2V models often produce videos with lower quality and fidelity than T2I models due to the limitations of training video datasets. VideoElevator leverages the superior capabilities of T2I models to improve the quality of T2V generated videos. |
VideoElevator decomposes each sampling step into temporal motion refining and spatial quality elevating. Temporal motion refining enhances motion consistency using a low-pass filter and T2V-based SDEdit. Spatial quality elevating employs an inflated T2I to add high-quality details. To ensure interaction between models, VideoElevator projects noise latents to clean latents using DDIM inversion. |
VideoElevator significantly improves the performance of T2V baselines in terms of frame quality, text alignment, and aesthetic style when integrated with either foundational or personalized T2I.
Human evaluation shows a strong preference for videos generated by VideoElevator-enhanced T2V models.
VideoElevator is compatible with personalized Stable Diffusion XL (SDXL) models, including those fine-tuned with LoRA and Diffusion-DPO. |
The paper focuses on improving quality and doesn't explicitly address aspects like video length or computational efficiency.
Further exploration is needed to optimize the trade-off between quality improvement and computational cost. |
video generation, text-to-video synthesis, diffusion models, text-to-image diffusion, video quality enhancement |
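The summary above mentions a low-pass filter in temporal motion refining. The sketch below shows a generic temporal box filter over per-frame latent vectors. It is only an assumed stand-in: the source does not specify the filter's form, and `temporal_low_pass` and `radius` are hypothetical names.

```python
def temporal_low_pass(frames, radius=1):
    """Average each frame's latent with its temporal neighbours (box filter)."""
    n = len(frames)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        window = frames[lo:hi]  # window is truncated at the clip boundaries
        smoothed.append([sum(vals) / len(window) for vals in zip(*window)])
    return smoothed

# Usage: three frames with one latent value each; the spike in frame 1 is damped.
smoothed = temporal_low_pass([[0.0], [3.0], [0.0]], radius=1)
```

Averaging along the frame axis suppresses frame-to-frame flicker while leaving slow motion largely intact, which is the usual motivation for a temporal low-pass step.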
2403.05239
Report |
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation |
Junyan Wang, Zhenhong Sun, Zhiyu Tan, Xuanbai Chen, Weihua Chen, Hao Li, Cheng Zhang, Yang Song |
Vanilla text-to-image diffusion models struggle with generating accurate
human images, commonly resulting in imperfect anatomies such as unnatural
postures or disproportionate limbs. Existing methods address this issue mostly
by fine-tuning the model with extra images or adding additional controls --
human-centric priors such as pose or depth maps -- during the image generation
phase. This paper explores the integration of these human-centric priors
directly into the model fine-tuning stage, essentially eliminating the need for
extra conditions at the inference stage. We realize this idea by proposing a
human-centric alignment loss to strengthen human-related information from the
textual prompts within the cross-attention maps. To ensure semantic detail
richness and human structural accuracy during fine-tuning, we introduce
scale-aware and step-wise constraints within the diffusion process, according
to an in-depth analysis of the cross-attention layer. Extensive experiments
show that our method largely improves over state-of-the-art text-to-image
models to synthesize high-quality human images based on user-written prompts.
Project page: https://hcplayercvpr2024.github.io. |
This paper proposes a novel Human-centric Prior (HcP) layer to enhance the accuracy of human image generation in text-to-image diffusion models without requiring additional conditions during inference. |
Generating accurate human images from text descriptions is crucial for various applications, but vanilla diffusion models often struggle with this task, resulting in anatomical imperfections. |
The HcP layer is trained with a human-centric alignment loss to better align cross-attention maps with human-centric textual information. This approach incorporates human-centric prior knowledge, such as pose images, directly into the model fine-tuning stage. |
The HcP layer significantly improves the structural accuracy of generated human images, particularly in depicting complex poses and proportions.
The proposed method preserves the original generative capabilities and style of the pre-trained diffusion model, unlike methods like LoRA that might alter the model's expressiveness.
The HcP layer is a plug-and-play module compatible with other controllable text-to-image diffusion models like ControlNet, further enhancing their capabilities. |
The model currently relies on a single type of human-centric prior information (e.g., pose).
There is room for improvement in handling highly complex scenes with multiple interacting individuals. |
text-to-image generation, diffusion models, human image synthesis, cross-attention, human-centric priors |
2403.05231
Report |
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance |
Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, Haibin Ling |
Motivated by the Parameter-Efficient Fine-Tuning (PEFT) in large language
models, we propose LoRAT, a method that unveils the power of larger Vision
Transformers (ViT) for tracking within laboratory-level resources. The essence
of our work lies in adapting LoRA, a technique that fine-tunes a small subset
of model parameters without adding inference latency, to the domain of visual
tracking. However, unique challenges and potential domain gaps make this
transfer less straightforward than it first appears. Firstly, a transformer-based
tracker constructs unshared position embedding for template and search image.
This poses a challenge for the transfer of LoRA, usually requiring consistency
in the design when applied to the pre-trained backbone, to downstream tasks.
Secondly, the inductive bias inherent in convolutional heads diminishes the
effectiveness of parameter-efficient fine-tuning in tracking models. To
overcome these limitations, we first decouple the position embeddings in
transformer-based trackers into shared spatial ones and independent type ones.
The shared embeddings, which describe the absolute coordinates of
multi-resolution images (namely, the template and search images), are inherited
from the pre-trained backbones. In contrast, the independent embeddings
indicate the sources of each token and are learned from scratch. Furthermore,
we design an anchor-free head solely based on a multilayer perceptron (MLP) to
adapt PEFT, enabling better performance with less computational overhead. With
our design, 1) it becomes practical to train trackers with the ViT-g backbone
on GPUs with only 25.8GB of memory (batch size of 16); 2) we reduce the
training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve
the LaSOT SUC score from 0.703 to 0.743 with the L-224 variant; 4) we raise the
inference speed of the L-224 variant from 52 to 119 FPS. Code and models will
be released. |
Proposes LoRAT, a novel visual tracking method leveraging Low-Rank Adaptation (LoRA) within a one-stream tracking framework for efficient fine-tuning of large Vision Transformers, making them more accessible for resource-constrained researchers. |
Large Vision Transformers, while powerful for visual tracking, demand significant computational resources, making their training impractical for most researchers. |
Adapts LoRA to a one-stream tracking architecture with two key designs: 1) a decoupled input embedding with shared spatial and independent type embeddings for preserving the pre-trained ViT structure; 2) an MLP-only head network to mitigate inductive biases from convolutional heads. |
Achieves state-of-the-art performance on multiple benchmarks, setting a new record on LaSOT with 0.762 SUC score using ViT-g backbone.
Significantly reduces training time and memory requirements compared to full fine-tuning, enabling training of large models with limited resources.
Demonstrates the feasibility of training advanced tracking models with manageable resources, making cutting-edge research accessible to a wider community. |
Limited exploration of LoRA rank variation's impact on different ViT backbones.
Future work could explore combining LoRAT with other PEFT techniques for further efficiency. |
visual object tracking, lora, parameter-efficient fine-tuning, vision transformer, one-stream tracking |
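The claim that LoRA adds no inference latency follows from the low-rank update being mergeable into the frozen weight. A minimal sketch, assuming the standard formulation W' = W + scale * B A; the helper names are hypothetical and this is not LoRAT's code.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def lora_forward(weight, lora_b, lora_a, x, scale=1.0):
    """Training-time path: frozen weight plus a separate low-rank branch."""
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]

def merge_lora(weight, lora_b, lora_a, scale=1.0):
    """Inference-time path: fold B A into the weight once, then use a single matrix."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(weight, delta)]

# Usage: a 2 x 2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, 1)
lora_a = [[3.0, 4.0]]     # shape (1, 2)
merged = merge_lora(weight, lora_b, lora_a)
```

After merging, `matvec(merged, x)` gives the same result as `lora_forward(weight, lora_b, lora_a, x)` with one matrix-vector product instead of three.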
2403.05154
Report |
GSEdit: Efficient Text-Guided Editing of 3D Objects via Gaussian Splatting |
Francesco Palandra, Andrea Sanchietti, Daniele Baieri, Emanuele Rodolà |
We present GSEdit, a pipeline for text-guided 3D object editing based on
Gaussian Splatting models. Our method enables the editing of the style and
appearance of 3D objects without altering their main details, all in a matter
of minutes on consumer hardware. We tackle the problem by leveraging Gaussian
splatting to represent 3D scenes, and we optimize the model while progressively
varying the image supervision by means of a pretrained image-based diffusion
model. The input object may be given as a 3D triangular mesh, or directly
provided as Gaussians from a generative model such as DreamGaussian. GSEdit
ensures consistency across different viewpoints, maintaining the integrity of
the original object's information. Compared to previously proposed methods
relying on NeRF-like MLP models, GSEdit stands out for its efficiency, making
3D editing tasks much faster. Our editing process is refined via the
application of the SDS loss, ensuring that our edits are both precise and
accurate. Our comprehensive evaluation demonstrates that GSEdit effectively
alters object shape and appearance following the given textual instructions
while preserving their coherence and detail. |
Introduces GSEdit, a pipeline for efficient text-guided 3D object editing using Gaussian Splatting models and image diffusion models. |
Empowers 3D artists with fast and automated editing capabilities, enhancing workflow in creative and industrial pipelines. |
Leverages Gaussian Splatting for scene representation and optimizes it by progressively modifying image supervision via a pretrained image-based diffusion model (Instruct-Pix2Pix). Employs SDS loss for accurate editing and supports both mesh and point cloud inputs. |
Achieves significant object shape and appearance modifications based on textual prompts.
Preserves object coherence and detail during editing.
Demonstrates superior efficiency compared to NeRF-based methods, enabling editing within minutes on consumer hardware. |
Editing scope limited by Instruct-Pix2Pix capabilities, hindering significant spatial transformations (e.g., pose alteration).
Perspective bias in Instruct-Pix2Pix can introduce artifacts, impacting the consistency and quality of edits. |
gaussian splatting, radiance fields, 3d object editing, text-guided editing, diffusion models |
2403.05139
Report |
Improving Diffusion Models for Virtual Try-on |
Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, Jinwoo Shin |
This paper considers image-based virtual try-on, which renders an image of a
person wearing a curated garment, given a pair of images depicting the person
and the garment, respectively. Previous works adapt existing exemplar-based
inpainting diffusion models for virtual try-on to improve the naturalness of
the generated visuals compared to other methods (e.g., GAN-based), but they
fail to preserve the identity of the garments. To overcome this limitation, we
propose a novel diffusion model that improves garment fidelity and generates
authentic virtual try-on images. Our method, coined IDM-VTON, uses two
different modules to encode the semantics of garment image; given the base UNet
of the diffusion model, 1) the high-level semantics extracted from a visual
encoder are fused to the cross-attention layer, and then 2) the low-level
features extracted from parallel UNet are fused to the self-attention layer. In
addition, we provide detailed textual prompts for both garment and person
images to enhance the authenticity of the generated visuals. Finally, we
present a customization method using a pair of person-garment images, which
significantly improves fidelity and authenticity. Our experimental results show
that our method outperforms previous approaches (both diffusion-based and
GAN-based) in preserving garment details and generating authentic virtual
try-on images, both qualitatively and quantitatively. Furthermore, the proposed
customization method demonstrates its effectiveness in a real-world scenario.
More visualizations are available on our project page:
https://idm-vton.github.io |
This paper introduces IDM-VTON, a novel diffusion model for authentic virtual try-on that improves garment fidelity by using two modules to encode garment semantics: an image prompt adapter for high-level semantics and a UNet encoder (GarmentNet) for low-level features. |
Existing diffusion-based virtual try-on methods struggle to preserve fine-grained details of garments, hindering their real-world applicability. This method aims to address this limitation and generate more realistic and detailed try-on images. |
The model consists of a base UNet (TryonNet) for the person image, an image prompt adapter for garment semantics, and GarmentNet for detailed garment features. They leverage Stable Diffusion XL and incorporate detailed garment captions to enhance the model's understanding. Additionally, they propose a customization method using a single pair of garment and person images for better adaptation to real-world scenarios. |
IDM-VTON outperforms previous diffusion-based and GAN-based methods in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively.
The use of GarmentNet significantly improves the preservation of fine-grained garment details compared to using only the image prompt adapter.
The proposed customization method significantly enhances the visual quality and garment fidelity, especially in challenging, real-world scenarios. |
The model may not perfectly preserve human attributes on masked regions like tattoos or skin moles.
Future work includes exploring broader applications like controlling garment generation through textual prompts. |
virtual try-on, diffusion models, image generation, garment fidelity, customization |
2403.05135
Report |
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment |
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu |
Diffusion models have demonstrated remarkable performance in the domain of
text-to-image generation. However, most widely used models still employ CLIP as
their text encoder, which constrains their ability to comprehend dense prompts,
encompassing multiple objects, detailed attributes, complex relationships,
long-text alignment, etc. In this paper, we introduce an Efficient Large
Language Model Adapter, termed ELLA, which equips text-to-image diffusion
models with powerful Large Language Models (LLM) to enhance text alignment
without training of either U-Net or LLM. To seamlessly bridge two pre-trained
models, we investigate a range of semantic alignment connector designs and
propose a novel module, the Timestep-Aware Semantic Connector (TSC), which
dynamically extracts timestep-dependent conditions from LLM. Our approach
adapts semantic features at different stages of the denoising process,
assisting diffusion models in interpreting lengthy and intricate prompts over
sampling timesteps. Additionally, ELLA can be readily incorporated with
community models and tools to improve their prompt-following capabilities. To
assess text-to-image models in dense prompt following, we introduce Dense
Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K
dense prompts. Extensive experiments demonstrate the superiority of ELLA in
dense prompt following compared to state-of-the-art methods, particularly in
multiple object compositions involving diverse attributes and relationships. |
This paper introduces ELLA, a lightweight approach that equips existing CLIP-based text-to-image diffusion models with Large Language Models (LLMs) to enhance their ability to understand and generate images from dense prompts, without requiring training of the LLM or the diffusion model's U-Net. |
Existing text-to-image models often struggle with dense prompts that describe multiple objects, detailed attributes, and complex relationships. ELLA addresses this limitation by incorporating the superior language understanding of LLMs. |
ELLA uses a pre-trained LLM as the text encoder and introduces a novel Timestep-Aware Semantic Connector (TSC). TSC dynamically extracts timestep-dependent semantic features from the LLM, effectively guiding the frozen U-Net at different stages of the image generation process. |
ELLA significantly outperforms CLIP-based diffusion models on dense prompt following benchmarks.
ELLA demonstrates strong compatibility with community models and downstream tools like LoRA and ControlNet, enhancing their prompt-following capabilities.
User studies confirm that ELLA leads to improved text-image alignment while maintaining competitive aesthetic quality. |
The training captions, synthesized by MLLM, might not be entirely reliable in terms of shape and spatial relationship understanding.
The aesthetic quality of generated images might be limited by the use of a frozen U-Net. |
text-to-image generation, diffusion models, large language models, semantic alignment, dense prompts |
2403.05131
Report |
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation |
Joseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng, Jingyao Zheng, Lik-Hang Lee, Tae-Ho Kim, Choong Seon Hong, Chaoning Zhang |
Text-to-video generation marks a significant frontier in the rapidly evolving
domain of generative AI, integrating advancements in text-to-image synthesis,
video captioning, and text-guided editing. This survey critically examines the
progression of text-to-video technologies, focusing on the shift from
traditional generative models to the cutting-edge Sora model, highlighting
developments in scalability and generalizability. Distinguishing our analysis
from prior works, we offer an in-depth exploration of the technological
frameworks and evolutionary pathways of these models. Additionally, we delve
into practical applications and address ethical and technological challenges
such as the inability to perform multiple entity handling, comprehend
causal-effect learning, understand physical interaction, perceive object
scaling and proportioning, and combat object hallucination which is also a
long-standing problem in generative models. Our comprehensive discussion covers
the use of text-to-video generation models as human-assistive tools and world
models, identifies the models' shortcomings, and summarizes future improvement
directions that mainly center around training datasets and evaluation metrics
(both automatic and human-centered). Aimed at
both newcomers and seasoned researchers, this survey seeks to catalyze further
innovation and discussion in the growing field of text-to-video generation,
paving the way for more reliable and practical generative artificial
intelligence technologies. |
This paper presents a comprehensive survey of text-to-video generation models, focusing on their evolution from traditional methods to the advanced Sora model by OpenAI. |
Text-to-video generation is a significant frontier in generative AI, with potential to revolutionize content creation across various fields like entertainment, education, and marketing. |
The authors chronologically review key technologies, model architectures (GAN, autoregressive, diffusion), evaluation metrics, applications, and limitations of these models. They delve into Sora's capabilities as a potential 'world model' and discuss its human-centered design. |
Diffusion-based models, including Sora, have become the dominant approach in text-to-video generation due to their ability to generate high-quality and coherent videos.
Despite advancements, challenges remain in areas like handling multiple entities, causal-effect learning, physical interaction simulation, and object scaling.
There's a need for larger, more diverse text-video datasets and more sophisticated evaluation metrics that go beyond just visual quality. |
The paper focuses heavily on Sora, which, despite its significance, limits the depth of analysis on other models.
The ethical considerations, while mentioned, could be explored in more detail, especially regarding potential misuse and bias. |
text-to-video generation, generative ai, sora model, world models, ai ethics |
2403.05125
Report |
Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis |
Muxi Chen, Yi Liu, Jian Yi, Changran Xu, Qiuxia Lai, Hongliang Wang, Tsung-Yi Ho, Qiang Xu |
In this paper, we present an empirical study introducing a nuanced evaluation
framework for text-to-image (T2I) generative models, applied to human image
synthesis. Our framework categorizes evaluations into two distinct groups:
first, focusing on image qualities such as aesthetics and realism, and second,
examining text conditions through concept coverage and fairness. We introduce
an innovative aesthetic score prediction model that assesses the visual appeal
of generated images and unveils the first dataset marked with low-quality
regions in generated human images to facilitate automatic defect detection. Our
exploration into concept coverage probes the model's effectiveness in
interpreting and rendering text-based concepts accurately, while our analysis
of fairness reveals biases in model outputs, with an emphasis on gender, race,
and age. While our study is grounded in human imagery, this dual-faceted
approach is designed with the flexibility to be applicable to other forms of
image generation, enhancing our understanding of generative models and paving
the way to the next generation of more sophisticated, contextually aware, and
ethically attuned generative models. We will release our code, the data used
for evaluating generative models and the dataset annotated with defective areas
soon. |
This paper presents an empirical study with a new evaluation framework for text-to-image (T2I) generative models, specifically for human image synthesis. |
Existing evaluation metrics are insufficient to fully capture model performance, especially in terms of realism, adherence to text prompts, and potential biases. |
The framework uses two approaches: 1) Image Quality: A new aesthetic score prediction model (CAN) and a dataset with annotated defects in generated human images are introduced. 2) Text Condition: Concept coverage is assessed using VQA-based metrics, and fairness is analyzed by identifying potential biases in gender, race, and age. |
Midjourney generates images with higher aesthetic scores and lower defect rates compared to Stable Diffusion models.
Stable Diffusion models have shown improvements in aesthetics and realism with each update (SD1.5 to SDXL).
All evaluated models exhibit significant fairness issues, often generating biased images based on gender, race, and age despite no explicit prompt specification. |
The current defect identification model requires further improvement.
The concept coverage evaluation currently focuses on single concepts and needs to be expanded to address multiple concepts in a single prompt. |
text-to-image synthesis, generative models, human image generation, evaluation framework, bias detection |
2403.05121
Report |
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion |
Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang |
Recent advancements in text-to-image generative systems have been largely
driven by diffusion models. However, single-stage text-to-image diffusion
models still face challenges in terms of computational efficiency and the
refinement of image details. To tackle the issue, we propose CogView3, an
innovative cascaded framework that enhances the performance of text-to-image
diffusion. CogView3 is the first model implementing relay diffusion in the
realm of text-to-image generation, executing the task by first creating
low-resolution images and subsequently applying relay-based super-resolution.
This methodology not only results in competitive text-to-image outputs but also
greatly reduces both training and inference costs. Our experimental results
demonstrate that CogView3 outperforms SDXL, the current state-of-the-art
open-source text-to-image diffusion model, by 77.0% in human evaluations, all
while requiring only about 1/2 of the inference time. The distilled variant of
CogView3 achieves comparable performance while only utilizing 1/10 of the
inference time by SDXL. |
This paper introduces CogView3, a novel text-to-image generation system leveraging relay diffusion to enhance efficiency and detail refinement. |
Existing single-stage text-to-image diffusion models are computationally expensive and struggle with detailed refinement, prompting the need for more efficient and effective approaches. |
CogView3 employs a cascaded framework, generating low-resolution images before applying relay-based super-resolution for refinement. This approach, implemented in the latent image space with a linear blurring schedule, reduces training and inference costs while maintaining output quality. Notably, it uses a pretrained T5-XXL text encoder and a variational KL-regularized autoencoder for latent representation. |
CogView3 outperforms SDXL in human evaluations by 77.0% while halving inference time.
The distilled CogView3 achieves comparable performance to SDXL using only 1/10 of the inference time.
Prompt expansion techniques significantly improve CogView3's instruction following capabilities. |
Exploring the generation of even higher resolution images (e.g., 4096x4096) using tiled diffusion methods is a potential future direction.
Further investigation into optimizing the trade-off between generation quality and inference speed during distillation is warranted. |
text-to-image generation, diffusion models, relay diffusion, cascaded framework, super-resolution |
2403.05094
Report |
Face2Diffusion for Fast and Editable Face Personalization |
Kaede Shiohara, Toshihiko Yamasaki |
Face personalization aims to insert specific faces, taken from images, into
pretrained text-to-image diffusion models. However, it is still challenging for
previous methods to preserve both the identity similarity and editability due
to overfitting to training samples. In this paper, we propose Face2Diffusion
(F2D) for high-editability face personalization. The core idea behind F2D is
that removing identity-irrelevant information from the training pipeline
prevents the overfitting problem and improves editability of encoded faces. F2D
consists of the following three novel components: 1) Multi-scale identity
encoder provides well-disentangled identity features while keeping the benefits
of multi-scale information, which improves the diversity of camera poses. 2)
Expression guidance disentangles face expressions from identities and improves
the controllability of face expressions. 3) Class-guided denoising
regularization encourages models to learn how faces should be denoised, which
boosts the text-alignment of backgrounds. Extensive experiments on the
FaceForensics++ dataset and diverse prompts demonstrate our method greatly
improves the trade-off between the identity- and text-fidelity compared to
previous state-of-the-art methods. |
This paper introduces Face2Diffusion (F2D), a novel method for face personalization in text-to-image diffusion models that enhances editability while preserving identity similarity. |
Existing face personalization techniques often lead to overfitting on training samples, compromising the model's ability to generate diverse images that adhere to different text prompts while maintaining the subject's identity. |
F2D tackles the overfitting problem through three key innovations: 1) a multi-scale identity encoder for disentangling camera poses, 2) expression guidance for separating expressions from identity features, and 3) class-guided denoising regularization to enhance text-alignment in backgrounds. |
F2D outperforms nine state-of-the-art methods in balancing identity preservation and text alignment, evidenced by achieving the best scores in combined identity-text metrics.
The multi-scale identity encoder successfully disentangles camera poses, leading to improved editability compared to using only the deepest layer features.
Class-guided denoising regularization effectively reduces overfitting to background information without compromising identity similarity, unlike techniques like DSC. |
The reliance on the class word "a person" in CGDR makes the model susceptible to biases inherent in the base T2I model's representation of that concept.
Future work can focus on mitigating the potential misuse of face personalization for creating misleading content, such as by contributing generated images to image forensic research. |
face personalization, text-to-image synthesis, diffusion models, overfitting, disentanglement |
2403.05087
Report |
SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting |
Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, Zeyu Wang |
We present SplattingAvatar, a hybrid 3D representation of photorealistic
human avatars with Gaussian Splatting embedded on a triangle mesh, which
renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We
disentangle the motion and appearance of a virtual human with explicit mesh
geometry and implicit appearance modeling with Gaussian Splatting. The
Gaussians are defined by barycentric coordinates and displacement on a triangle
mesh as Phong surfaces. We extend lifted optimization to simultaneously
optimize the parameters of the Gaussians while walking on the triangle mesh.
SplattingAvatar is a hybrid representation of virtual humans where the mesh
represents low-frequency motion and surface deformation, while the Gaussians
take over the high-frequency geometry and detailed appearance. Unlike existing
deformation methods that rely on an MLP-based linear blend skinning (LBS) field
for motion, we control the rotation and translation of the Gaussians directly
by mesh, which empowers its compatibility with various animation techniques,
e.g., skeletal animation, blend shapes, and mesh editing. Trainable from
monocular videos for both full-body and head avatars, SplattingAvatar shows
state-of-the-art rendering quality across multiple datasets. |
SplattingAvatar: a hybrid 3D representation for photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh for real-time rendering. |
Addresses limitations of NeRF and MLP-based motion control in capturing high-frequency details and surface deformations for real-time realistic avatar rendering. |
Combines Gaussian Splatting for high-frequency details with mesh representation for low-frequency motion and deformation. Uses lifted optimization for joint optimization of Gaussian parameters and mesh embeddings, enabling explicit motion control of Gaussians by the mesh. |
Achieves state-of-the-art rendering quality for both head and full-body avatars from monocular videos.
Demonstrates efficient real-time rendering capabilities in Unity, achieving over 300 FPS on an NVIDIA RTX 3090 GPU and 30 FPS on an iPhone 13.
Outperforms existing methods in terms of photometric quality with improved details and handling of thin structures, as evidenced by quantitative metrics like PSNR, SSIM, and LPIPS. |
Performance depends on the motion representation capability of the driving mesh, limited by current FLAME and SMPL-X models.
Lacks separate motion representation for clothes and hair. |
human avatar, gaussian splatting, real-time rendering, mesh embedding, lifted optimization |
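The SplattingAvatar abstract says each Gaussian is defined by barycentric coordinates and a displacement on a triangle mesh treated as a Phong surface. A minimal sketch of that embedding follows, assuming the displacement is applied along the barycentrically interpolated vertex normal. The function name `embed_point` and the data layout are hypothetical, and the paper's exact formulation may differ.

```python
import math

def embed_point(verts, normals, bary, displacement):
    """Position of a mesh-embedded point (hypothetical helper).

    verts, normals: three (x, y, z) tuples for one triangle.
    bary: (u, v, w) barycentric coordinates summing to 1.
    displacement: signed offset along the interpolated normal (assumption).
    """
    u, v, w = bary
    assert abs(u + v + w - 1.0) < 1e-9, "barycentric coordinates must sum to 1"
    # Interpolate position and normal with the same barycentric weights.
    base = [u * a + v * b + w * c for a, b, c in zip(*verts)]
    n = [u * a + v * b + w * c for a, b, c in zip(*normals)]
    length = math.sqrt(sum(x * x for x in n))
    n = [x / length for x in n]
    return [p + displacement * x for p, x in zip(base, n)]

# A flat triangle in the z = 0 plane with upward-facing vertex normals.
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
up = [(0.0, 0.0, 1.0)] * 3
rest = embed_point(tri, up, (0.25, 0.25, 0.5), 0.1)

# Moving the mesh moves the embedded point with it; no per-point motion model.
moved_tri = [(x + 1.0, y, z) for x, y, z in tri]
moved = embed_point(moved_tri, up, (0.25, 0.25, 0.5), 0.1)
```

Because the point is stored relative to the triangle, any animation technique that deforms the mesh (skeletal animation, blend shapes, mesh editing) also drives the point, which matches the compatibility claim in the abstract.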
2403.05056
Report |
Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation |
Yifan Mao, Jian Liu, Xianming Liu |
Monocular depth estimation is a crucial task in computer vision. While
existing methods have shown impressive results under standard conditions, they
often face challenges in reliably performing in scenarios such as low-light or
rainy conditions due to the absence of diverse training data. This paper
introduces a novel approach named Stealing Stable Diffusion (SSD) prior for
robust monocular depth estimation. The approach addresses this limitation by
utilizing stable diffusion to generate synthetic images that mimic challenging
conditions. Additionally, a self-training mechanism is introduced to enhance
the model's depth estimation capability in such challenging environments. To
enhance the utilization of the stable diffusion prior further, the DINOv2
encoder is integrated into the depth model architecture, enabling the model to
leverage rich semantic priors and improve its scene understanding. Furthermore,
a teacher loss is introduced to guide the student models in acquiring
meaningful knowledge independently, thus reducing their dependency on the
teacher models. The effectiveness of the approach is evaluated on nuScenes and
Oxford RobotCar, two challenging public datasets, with the results showing the
efficacy of the method. Source code and weights are available at:
https://github.com/hitcslj/SSD. |
This paper introduces Stealing Stable Diffusion (SSD), a novel approach that leverages stable diffusion priors for robust monocular depth estimation in challenging conditions like low-light and rain. |
Existing monocular depth estimation methods struggle in challenging conditions due to the lack of diverse training data and the limitations of existing data augmentation techniques. |
SSD utilizes a generative diffusion model-based translation (GDT) model to generate synthetic images mimicking challenging conditions, employs a self-training mechanism with a teacher-student network architecture, and incorporates a novel teacher loss and semantic loss for improved knowledge distillation. |
SSD outperforms existing methods on nuScenes and RobotCar datasets, achieving state-of-the-art performance in challenging conditions.
The GDT model effectively generates diverse and realistic images of challenging conditions, surpassing GAN-based methods.
The proposed teacher loss and semantic loss contribute to improved depth estimation accuracy by facilitating effective knowledge transfer and semantic feature alignment. |
The performance of SSD relies on the quality and diversity of the generated synthetic images, which can be further improved with advancements in generative diffusion models.
The computational cost of SSD is higher than some existing methods due to the use of multiple large pre-trained models. |
monocular depth estimation, robustness, stable diffusion, self-training, generative diffusion models |
2403.05053
Report |
PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering |
Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin |
Image composition involves seamlessly integrating given objects into a
specific visual context. The current training-free methods rely on composing
attention weights from several samplers to guide the generator. However, since
these weights are derived from disparate contexts, their combination leads to
coherence confusion in synthesis and loss of appearance information. These
issues worsen with their excessive focus on background generation, even when
unnecessary in this task. This not only slows down inference but also
compromises foreground generation quality. Moreover, these methods introduce
unwanted artifacts in the transition area. In this paper, we formulate image
composition as a subject-based local editing task, solely focusing on
foreground generation. At each step, the edited foreground is combined with the
noisy background to maintain scene consistency. To address the remaining
issues, we propose PrimeComposer, a faster training-free diffuser that
composites the images by well-designed attention steering across different
noise levels. This steering is predominantly achieved by our Correlation
Diffuser, utilizing its self-attention layers at each step. Within these
layers, the synthesized subject interacts with both the referenced object and
background, capturing intricate details and coherent relationships. This prior
information is encoded into the attention weights, which are then integrated
into the self-attention layers of the generator to guide the synthesis process.
Besides, we introduce a Region-constrained Cross-Attention to confine the
impact of specific subject-related words to desired regions, addressing the
unwanted artifacts shown in the prior method, thereby further improving the
coherence in the transition area. Our method exhibits the fastest inference
efficiency and extensive experiments demonstrate our superiority both
qualitatively and quantitatively. |
Proposes PrimeComposer, a faster training-free diffusion model for image composition that leverages attention steering to seamlessly integrate objects while preserving their appearance and ensuring natural coherence. |
Current training-free image composition methods struggle to maintain object appearance and coherent integration, especially across different visual domains. They also suffer from slow inference due to unnecessary background generation. |
Formulates composition as a local editing task focused on the foreground. Employs a Correlation Diffuser to generate attention weights capturing object appearance and coherence information, which are then used to guide the main diffusion model (LDM). Introduces Region-constrained Cross-Attention (RCA) to restrict the impact of object-specific words to desired regions, further enhancing coherence. Extends classifier-free guidance to reinforce the steering effect. |
Outperforms state-of-the-art methods qualitatively and quantitatively in cross-domain image composition.
Exhibits significantly faster inference speed compared to the previous best training-free method.
Receives favorable feedback in user studies across various domains, demonstrating its effectiveness in preserving object appearance, background consistency, and seamless composition. |
Limited control over object viewpoint.
Current methodology cannot seamlessly integrate multiple objects simultaneously. |
image composition, diffusion models, attention steering, local image editing, training-free |
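The PrimeComposer abstract states that, at each step, the edited foreground is combined with the noisy background to maintain scene consistency. One way to picture that combination is a per-element mask blend, sketched below. The function name `blend_step`, the binary mask, and the nested-list latents are assumptions for illustration and are not taken from the paper.

```python
def blend_step(foreground, noisy_background, mask):
    """Keep the edited foreground where mask == 1 and the noisy background
    elsewhere (hypothetical helper; all inputs are H x W nested lists)."""
    return [
        [m * f + (1 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, noisy_background, mask)
    ]

fg = [[5.0, 5.0], [5.0, 5.0]]    # edited foreground latent at this step
bg = [[1.0, 2.0], [3.0, 4.0]]    # background noised to the same step
mask = [[1, 0], [0, 1]]          # 1 marks the region occupied by the subject
blended = blend_step(fg, bg, mask)
```

In a full sampler this blend would run once per denoising step, with the background re-noised to the current noise level each time.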
2403.05018
Report |
InstructGIE: Towards Generalizable Image Editing |
Zichong Meng, Changdi Yang, Jun Liu, Hao Tang, Pu Zhao, Yanzhi Wang |
Recent advances in image editing have been driven by the development of
denoising diffusion models, marking a significant leap forward in this field.
Despite these advances, the generalization capabilities of recent image editing
approaches remain constrained. In response to this challenge, our study
introduces a novel image editing framework with enhanced generalization
robustness by boosting in-context learning capability and unifying language
instruction. This framework incorporates a module specifically optimized for
image editing tasks, leveraging the VMamba Block and an editing-shift matching
strategy to augment in-context learning. Furthermore, we unveil a selective
area-matching technique specifically engineered to address and rectify
corrupted details in generated images, such as human facial features, to
further improve the quality. Another key innovation of our approach is the
integration of a language unification technique, which aligns language
embeddings with editing semantics to elevate the quality of image editing.
Moreover, we compile the first dataset for image editing with visual prompts
and editing instructions that could be used to enhance in-context capability.
Trained on this dataset, our methodology not only achieves superior synthesis
quality for trained tasks, but also demonstrates robust generalization
capability across unseen vision tasks through tailored prompts. |
This paper introduces InstructGIE, a novel image editing framework that improves generalization robustness in image editing by enhancing in-context learning and unifying language instructions. |
Existing image editing methods struggle to generalize to unseen editing tasks due to limitations in understanding complex visual and textual instructions. |
The proposed InstructGIE framework utilizes a VMamba-based module and an editing-shift matching strategy to enhance in-context learning. It also employs a language unification technique to align language embeddings with editing semantics. Additionally, a selective area-matching method refines details in generated images. |
InstructGIE demonstrates superior synthesis quality for trained image editing tasks.
The framework exhibits robust generalization capabilities across unseen vision tasks through tailored prompts.
Quantitative and qualitative evaluations demonstrate significant improvements in FID and CLIP directional similarity scores compared to existing methods. |
The dependence on pre-trained models like CLIP and Mask2Former introduces potential biases.
Further exploration of more complex and nuanced editing instructions is an area for future research. |
image editing, in-context learning, diffusion model, generalization, visual prompting |
2403.04993
Report |
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts |
Zewen Chen, Haina Qin, Juan Wang, Chunfeng Yuan, Bing Li, Weiming Hu, Liang Wang |
Due to the diversity of assessment requirements in various application
scenarios for the IQA task, existing IQA methods struggle to directly adapt to
these varied requirements after training. Thus, when facing new requirements, a
typical approach is fine-tuning these models on datasets specifically created
for those requirements. However, it is time-consuming to establish IQA
datasets. In this work, we propose a Prompt-based IQA (PromptIQA) that can
directly adapt to new requirements without fine-tuning after training. On one
hand, it utilizes a short sequence of Image-Score Pairs (ISP) as prompts for
targeted predictions, which significantly reduces the dependency on the data
requirements. On the other hand, PromptIQA is trained on a mixed dataset with
two proposed data augmentation strategies to learn diverse requirements, thus
enabling it to effectively adapt to new requirements. Experiments indicate that
PromptIQA outperforms SOTA methods with higher performance and better
generalization. The code will be available. |
This paper introduces PromptIQA, a novel No-Reference Image Quality Assessment (NR-IQA) framework that adapts to new assessment requirements using a small set of image-score pairs as prompts, eliminating the need for fine-tuning. |
Existing IQA methods struggle to adapt to diverse assessment requirements across different applications. Fine-tuning on new datasets is a common approach but is time-consuming and impractical for every new requirement. |
PromptIQA leverages Image-Score Pair Prompts (ISPPs) to represent specific assessment requirements. It's trained on a mixed dataset using data augmentation (random scaling and flipping) to learn diverse requirements, enabling adaptation to new ones without fine-tuning. |
PromptIQA outperforms state-of-the-art IQA methods, especially on authentic distortion, face, AI-generated, and underwater IQA tasks.
It demonstrates superior generalization ability on new assessment requirements simulated by FR-IQA models compared to models trained on specific datasets or with fine-tuning.
Ablation studies confirm the effectiveness of the proposed components, including mixed training, prompts, and data augmentation strategies. |
Performance on synthetic distortion datasets (LIVE, CSIQ) needs improvement, potentially due to differences in label distribution compared to other datasets.
Future work can explore alternative prompt selection strategies and investigate the impact of prompt size more comprehensively. |
nr-iqa, image quality assessment, prompts, generalization, data augmentation |
2403.04965
Report |
StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models |
Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, Siavash Arjomand Bigdeli |
The demand for stereo images increases as manufacturers launch more XR
devices. To meet this demand, we introduce StereoDiffusion, a method that,
unlike traditional inpainting pipelines, is training-free, remarkably
straightforward to use, and integrates seamlessly into the original Stable
Diffusion model. Our method modifies the latent variable to provide an
end-to-end, lightweight capability for fast generation of stereo image pairs,
without the need for fine-tuning model weights or any post-processing of
images. Using the original input to generate a left image and estimate a
disparity map for it, we generate the latent vector for the right image through
Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking
Denoise and Self-Attention Layers Modification methods to align the right-side
image with the left-side image. Moreover, our proposed method maintains a high
standard of image quality throughout the stereo generation process, achieving
state-of-the-art scores in various quantitative evaluations. |
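This paper introduces StereoDiffusion, a training-free method that generates stereo image pairs by modifying the latent variable of the original Stable Diffusion model, without fine-tuning model weights or post-processing images. |
Demand for stereo images is growing as manufacturers launch more XR devices. StereoDiffusion offers a training-free, lightweight alternative to traditional inpainting pipelines. |
The original input is used to generate a left image and to estimate a disparity map for it. The latent vector for the right image is then produced through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification to align the right image with the left. |
StereoDiffusion generates stereo image pairs end-to-end without fine-tuning model weights or post-processing.
It maintains high image quality throughout stereo generation and reports state-of-the-art scores in various quantitative evaluations. |
|
stereo image generation, latent diffusion models, stable diffusion, training-free, disparity map |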
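The StereoDiffusion abstract describes producing the right-view latent from the left view and a disparity map through Stereo Pixel Shift operations. A minimal sketch of a disparity-driven shift on a single row follows. It assumes integer, non-negative disparities, the common convention that a pixel at column `x` moves to `x - d` in the right view, and that larger disparity (nearer content) wins when two pixels collide. The function name `shift_row` is hypothetical, and the paper's operation may differ in detail.

```python
def shift_row(row, disparity, fill=None):
    """Forward-warp one row: the pixel at x moves to x - disparity[x]."""
    out = [fill] * len(row)
    best = [-1] * len(row)  # largest disparity written so far per target
    for x, (value, d) in enumerate(zip(row, disparity)):
        target = x - d
        # Nearer content (larger disparity) occludes farther content.
        if 0 <= target < len(row) and d > best[target]:
            out[target] = value
            best[target] = d
    return out

left = ["a", "b", "c", "d"]
disp = [0, 0, 1, 1]                   # the two rightmost pixels are nearer
right = shift_row(left, disp)         # "c" covers "b"; last column is empty
holes = [v is None for v in right]    # positions that still need filling
```

The `holes` list marks disoccluded positions with no source pixel; in the abstract's pipeline, aligning and completing the right image is the role of the masking-denoise and self-attention modifications.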
2403.04926
Report |
BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling |
Cheng Peng, Yutao Tang, Yifan Zhou, Nengyu Wang, Xijun Liu, Deming Li, Rama Chellappa |
Recent efforts in using 3D Gaussians for scene reconstruction and novel view
synthesis can achieve impressive results on curated benchmarks; however, images
captured in real life are often blurry. In this work, we analyze the robustness
of Gaussian-Splatting-based methods against various image blur, such as motion
blur, defocus blur, downscaling blur, etc. Under these degradations,
Gaussian-Splatting-based methods tend to overfit and produce worse results than
Neural-Radiance-Field-based methods. To address this issue, we propose Blur
Agnostic Gaussian Splatting (BAGS). BAGS introduces additional 2D modeling
capacities such that a 3D-consistent and high quality scene can be
reconstructed despite image-wise blur. Specifically, we model blur by
estimating per-pixel convolution kernels from a Blur Proposal Network (BPN).
BPN is designed to consider spatial, color, and depth variations of the scene
to maximize modeling capacity. BPN also proposes a
quality-assessing mask, which indicates regions where blur occurs. Finally, we
introduce a coarse-to-fine kernel optimization scheme; this optimization scheme
is fast and avoids sub-optimal solutions due to a sparse point cloud
initialization, which often occurs when we apply Structure-from-Motion on
blurry images. We demonstrate that BAGS achieves photorealistic renderings
under various challenging blur conditions and imaging geometry, while
significantly improving upon existing approaches. |
This paper introduces Blur Agnostic Gaussian Splatting (BAGS), a novel method addressing the sensitivity of Gaussian Splatting-based scene reconstruction to blurry images. |
Gaussian Splatting, while efficient, struggles with real-world blurry images. BAGS improves robustness by separating multi-view consistent scenes from inconsistent degradations. |
BAGS employs a Blur Proposal Network (BPN) to estimate per-pixel convolution kernels and masks, considering spatial, color, and depth variations. A coarse-to-fine optimization scheme gradually increases image resolution and kernel size for stability. |
BAGS achieves state-of-the-art performance on scenes with camera motion blur, outperforming NeRF-based deblurring methods.
It demonstrates significant visual improvements on defocus blur and handles mixed-resolution inputs effectively.
The generated masks and kernels provide interpretable insights into degradation types and regions. |
The added computational complexity of BPN requires further optimization.
Future work includes exploring dynamic kernel capacity adjustment based on degradation levels. |
scene reconstruction, gaussian splatting, deblurring, novel view synthesis, multi-scale optimization |
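The per-pixel convolution kernels described in the BAGS entry above can be illustrated with a minimal sketch. It assumes a grayscale image stored as nested lists, clamped borders, and kernels supplied directly rather than predicted by a Blur Proposal Network; the function name `apply_per_pixel_kernels` is hypothetical and is not taken from the BAGS code.

```python
def apply_per_pixel_kernels(image, kernels):
    """Blur `image` with a different kernel at every pixel.

    image:   H x W nested list of floats (grayscale, an assumption).
    kernels: kernels[y][x] is the K x K weight grid (K odd) for pixel (y, x).
    Borders are handled by clamping coordinates (also an assumption).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            kern = kernels[y][x]
            r = len(kern) // 2  # kernel radius
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    acc += kern[dy + r][dx + r] * image[yy][xx]
            out[y][x] = acc
    return out
```

A kernel with all of its weight at the centre leaves a pixel unchanged, so sharp regions need no special case; a wider kernel models a locally blurrier observation.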
2403.04880
Report |
An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control |
Aosong Feng, Weikang Qiu, Jinbin Bai, Kaicheng Zhou, Zhen Dong, Xiao Zhang, Rex Ying, Leandros Tassiulas |
Building on the success of text-to-image diffusion models (DPMs), image
editing is an important application to enable human interaction with
AI-generated content. Among various editing methods, editing within the prompt
space gains more attention due to its capacity and simplicity of controlling
semantics. However, since diffusion models are commonly pretrained on
descriptive text captions, direct editing of words in text prompts usually
leads to completely different generated images, violating the requirements for
image editing. On the other hand, existing editing methods usually consider
introducing spatial masks to preserve the identity of unedited regions, which
are usually ignored by DPMs and therefore lead to inharmonic editing results.
Targeting these two challenges, in this work, we propose to disentangle the
comprehensive image-prompt interaction into several item-prompt interactions,
with each item linked to a special learned prompt. The resulting framework,
named D-Edit, is based on pretrained diffusion models with cross-attention
layers disentangled and adopts a two-step optimization to build item-prompt
associations. Versatile image editing can then be applied to specific items by
manipulating the corresponding prompts. We demonstrate state-of-the-art results
in four types of editing operations including image-based, text-based,
mask-based editing, and item removal, covering most types of editing
applications, all within a single unified framework. Notably, D-Edit is the
first framework that can (1) achieve item editing through mask editing and (2)
combine image and text-based editing. We demonstrate the quality and
versatility of the editing results for a diverse collection of images through
both qualitative and quantitative evaluations. |
This paper introduces D-Edit, a versatile image editing framework for diffusion models that disentangles image-prompt interactions into item-prompt associations for item-level editing. |
Existing diffusion model editing methods struggle to preserve original image information and maintain consistency with editing guidance. D-Edit addresses these challenges by disentangling control and leveraging unique item prompts. |
D-Edit utilizes a two-step finetuning process: first, optimizing text encoder embeddings for item prompts (special tokens), then fine-tuning UNet weights with grouped cross-attention to disentangle item-prompt interactions. This allows editing by manipulating prompts, masks, and item-prompt associations. |
D-Edit enables item-level text-guided editing, surpassing null-text inversion with better detail preservation and natural transitions.
It supports image-guided editing, outperforming baselines by seamlessly composing objects while retaining their identities.
D-Edit allows mask-based editing (moving, reshaping, resizing, refining) and item removal, leading to natural and reasonable results. |
The quality of editing relies on the accuracy of the segmentation model.
Further exploration of different segmentation methods and their impact on editing is needed. |
image editing, diffusion models, text-to-image, disentangled representation, item-prompt association |
2403.04692
Report |
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation |
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li |
In this paper, we introduce PixArt-Σ, a Diffusion Transformer
model (DiT) capable of directly generating images at 4K resolution.
PixArt-Σ represents a significant advancement over its predecessor,
PixArt-α, offering images of markedly higher fidelity and improved
alignment with text prompts. A key feature of PixArt-Σ is its training
efficiency. Leveraging the foundational pre-training of PixArt-α, it
evolves from the 'weaker' baseline to a 'stronger' model via incorporating
higher quality data, a process we term "weak-to-strong training". The
advancements in PixArt-Σ are twofold: (1) High-Quality Training Data:
PixArt-Σ incorporates superior-quality image data, paired with more
precise and detailed image captions. (2) Efficient Token Compression: we
propose a novel attention module within the DiT framework that compresses both
keys and values, significantly improving efficiency and facilitating
ultra-high-resolution image generation. Thanks to these improvements,
PixArt-Σ achieves superior image quality and user prompt adherence
capabilities with significantly smaller model size (0.6B parameters) than
existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD
Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K
images supports the creation of high-resolution posters and wallpapers,
efficiently bolstering the production of high-quality visual content in
industries such as film and gaming. |
This paper introduces PixArt-sigma, a Diffusion Transformer model capable of directly generating 4K resolution images with high fidelity and improved text-prompt alignment. |
Existing high-quality text-to-image models require substantial resources for training, hindering innovation. This paper explores efficient methods to integrate new datasets and algorithms into pre-trained models, enabling the development of more powerful models with limited resources. |
The paper leverages the pre-trained PixArt-alpha model and introduces two key advancements: (1) Training with a higher-quality dataset containing high-resolution images and detailed captions. (2) Implementing an efficient token compression method within the DiT framework to reduce computational demands for high-resolution generation. |
PixArt-sigma achieves superior image quality and text-prompt alignment compared to its predecessor, PixArt-alpha, with minimal additional training cost.
The model produces high-fidelity 4K images with a smaller model size (0.6B parameters) compared to existing models like SDXL (2.6B) and SD Cascade (5.1B).
Human and AI preference studies demonstrate that PixArt-sigma generates high-quality images that closely adhere to user instructions, outperforming or rivaling other open-source and commercial T2I models. |
The model still exhibits limitations in generating specific scenes, objects (text and hands), and perfectly aligning complex prompts.
Future research should focus on data quality, model scaling, mitigating potential biases, and addressing ethical concerns. |
text-to-image synthesis, diffusion models, diffusion transformer, high-resolution image generation, efficient ai |
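The key/value compression mentioned in the PixArt-Σ entry above can be sketched as follows. This is only an illustration under stated assumptions: tokens are compressed by averaging runs of consecutive tokens in a 1-D sequence, whereas the entry does not specify the compression operator; all function names are hypothetical.

```python
import math

def avg_pool_tokens(tokens, ratio):
    """Average each run of `ratio` consecutive token vectors into one."""
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

def attention(q, k, v):
    """Plain scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def kv_compressed_attention(q, k, v, ratio=2):
    """Queries keep full length; keys and values are pooled by `ratio`."""
    return attention(q, avg_pool_tokens(k, ratio), avg_pool_tokens(v, ratio))
```

With `n` queries and a compression ratio `r`, the score matrix shrinks from `n x n` to roughly `n x n/r`, which is where the savings at high resolution would come from in this sketch.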
2403.04690
Report |
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level |
Ali Hassani, Wen-Mei Hwu, Humphrey Shi |
Neighborhood attention reduces the cost of self attention by restricting each
token's attention span to its nearest neighbors. This restriction,
parameterized by a window size and dilation factor, draws a spectrum of
possible attention patterns between linear projection and self attention.
Neighborhood attention, and more generally sliding window attention patterns,
have long been bounded by infrastructure, particularly in higher-rank spaces
(2-D and 3-D), calling for the development of custom kernels, which have been
limited in either functionality, or performance, if not both. In this work, we
first show that neighborhood attention can be represented as a batched GEMM
problem, similar to standard attention, and implement it for 1-D and 2-D
neighborhood attention. These kernels on average provide 895% and 272%
improvement in full precision latency compared to existing naive kernels for
1-D and 2-D neighborhood attention respectively. We find certain inherent
inefficiencies in all unfused neighborhood attention kernels that bound their
performance and lower-precision scalability. We also develop fused
neighborhood attention, an adaptation of fused dot-product attention kernels
that allows fine-grained control over attention across different spatial axes.
Known for reducing the quadratic time complexity of self attention to a linear
complexity, neighborhood attention can now enjoy a reduced and constant memory
footprint, and record-breaking half precision latency. We observe that our
fused kernels successfully circumvent some of the unavoidable inefficiencies in
unfused implementations. While our unfused GEMM-based kernels only improve half
precision performance compared to naive kernels by an average of 496% and 113%
in 1-D and 2-D problems respectively, our fused kernels improve naive kernels
by an average of 1607% and 581% in 1-D and 2-D problems respectively. |
This paper introduces two new classes of CUDA kernels for neighborhood attention, significantly improving performance over existing implementations. |
Neighborhood attention reduces the quadratic complexity of self-attention to linear complexity, but efficient implementations have been challenging, limiting its practicality. |
The authors formulate neighborhood attention as a batched GEMM problem with space-aware tiling and gather/scatter fusion, enabling efficient implementation using optimized GEMM kernels. They further propose fused neighborhood attention, extending the logic to fused attention kernels for further latency and memory footprint reduction. |
GEMM-based kernels achieve up to 9x speedup over naive implementations in full precision and outperform them in most benchmarks.
Fused kernels consistently outperform both naive and GEMM-based kernels, with up to 16x speedup in half precision while reducing memory footprint.
Model-level benchmarks demonstrate significant throughput improvements in image classification and image generation tasks using the proposed kernels. |
Current implementation lacks support for backpropagation in fused kernels.
Higher-rank implementations (2-D, 3-D) in fused kernels introduce some unavoidable overhead compared to single-rank (1-D) and self-attention. |
neighborhood attention, self attention, cuda, gemm, fused kernel |
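As a reference point for the neighborhood attention entry above, here is a minimal, unoptimized 1-D sketch of the attention pattern itself (not of the GEMM-based or fused CUDA kernels). It assumes that neighbors falling outside the sequence are simply dropped, which may differ from how real implementations treat borders; the function name is hypothetical.

```python
import math

def neighborhood_attention_1d(q, k, v, window=3, dilation=1):
    """Each token attends only to `window` neighbors spaced `dilation` apart.

    q, k, v are lists of n vectors. Out-of-range neighbors are dropped,
    a simplifying assumption made for this sketch.
    """
    n, d = len(q), len(q[0])
    half = window // 2
    out = []
    for i in range(n):
        idx = [i + (j - half) * dilation for j in range(window)]
        idx = [j for j in idx if 0 <= j < n]  # keep in-range neighbors only
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in idx]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[j][c] for wt, j in zip(weights, idx)) / total
                    for c in range(len(v[0]))])
    return out
```

Each token does work proportional to `window` rather than to `n`, which is the linear-cost behaviour the entry refers to. A window of 1 reduces to a per-token projection, and a window covering the whole sequence recovers full self attention.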
2403.04634
Report |
Pix2Gif: Motion-Guided Diffusion for GIF Generation |
Hitesh Kandala, Jianfeng Gao, Jianwei Yang |
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video)
generation. We tackle this problem differently by formulating the task as an
image translation problem steered by text and motion magnitude prompts, as
shown in the teaser figure. To ensure that the model adheres to motion guidance, we
propose a new motion-guided warping module to spatially transform the features
of the source image conditioned on the two types of prompts. Furthermore, we
introduce a perceptual loss to ensure the transformed feature map remains
within the same space as the target image, ensuring content consistency and
coherence. In preparation for the model training, we meticulously curated data
by extracting coherent image frames from the TGIF video-caption dataset, which
provides rich information about the temporal changes of subjects. After
pretraining, we apply our model in a zero-shot manner to a number of video
datasets. Extensive qualitative and quantitative experiments demonstrate the
effectiveness of our model -- it not only captures the semantic prompt from
text but also the spatial ones from motion guidance. We train all our models
using a single node of 16xV100 GPUs. Code, dataset and models are made public
at: https://hiteshk03.github.io/Pix2Gif/. |
Presents Pix2Gif, a motion-guided diffusion model for generating GIFs from a single image using text and motion magnitude prompts, framing the task as an image translation problem. |
Addresses limitations in existing video generation models that compromise resolution and fine-grained temporal control by enabling high-resolution GIF generation with precise motion guidance. |
Leverages latent diffusion models (LDMs) and introduces a motion-guided warping module to transform source image features based on motion prompts, ensuring consistency with perceptual loss and training on a curated TGIF dataset. |
Pix2Gif generates GIFs with superior temporal coherence compared to state-of-the-art methods.
The model demonstrates enhanced controllability over motion dynamics in generated GIFs.
Pix2Gif exhibits emergent capabilities for combining different actions based on complex text prompts. |
Limited resolution (256x256 pixels) of generated frames.
Training dataset size is limited due to computational constraints, potentially affecting model performance. |
gif generation, motion-guided diffusion, image-to-image translation, temporal coherence, video generation |
2403.04493
Report |
What makes an image realistic? |
Lucas Theis |
The last decade has seen tremendous progress in our ability to generate
realistic-looking data, be it images, text, audio, or video. Here, we discuss
the closely related problem of quantifying realism, that is, designing
functions that can reliably tell realistic data from unrealistic data. This
problem turns out to be significantly harder to solve and remains poorly
understood, despite its prevalence in machine learning and recent breakthroughs
in generative AI. Drawing on insights from algorithmic information theory, we
discuss why this problem is challenging, why a good generative model alone is
insufficient to solve it, and what a good solution would look like. In
particular, we introduce the notion of a universal critic, which unlike
adversarial critics does not require adversarial training. While universal
critics are not immediately practical, they can serve both as a North Star for
guiding practical implementations and as a tool for analyzing existing attempts
to capture realism. |
This paper argues that quantifying the realism of data, such as images, can be understood as determining its randomness deficiency, drawing parallels with algorithmic information theory. |
Defining and measuring realism is crucial for various machine learning applications, including anomaly detection, deepfake detection, and generative model evaluation, yet it remains a challenging and poorly understood problem. |
The paper leverages the concept of randomness deficiency from algorithmic information theory, proposing "universal critics" based on Kolmogorov complexity and Solomonoff's probability to quantify realism. |
Randomness deficiency, defined as the difference between negative log-probability and Kolmogorov complexity, effectively captures realism.
Batched universal critics, processing multiple independent samples, provide tighter bounds for realism evaluation and generalize both no-reference metrics and divergences.
The concept of universal critics sheds light on the success of existing techniques like score distillation sampling, suggesting new avenues for improvement. |
Kolmogorov complexity and Solomonoff's probability are uncomputable, necessitating practical approximations for real-world applications.
Further research is needed to explore efficient and robust approximations to universal critics for optimization tasks. |
perceptual quality, realism, neural compression, generative adversarial networks, algorithmic information theory |
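Written out, the randomness deficiency described in the entry above takes the following form. The symbol \delta and the base-2 logarithm are notational assumptions made here; P is the reference distribution and K is Kolmogorov complexity, which the paper may additionally condition on P.

```latex
\delta(x \mid P) \;=\; -\log_2 P(x) \;-\; K(x)
```

Read this way, a value near zero means x is about as compressible as P predicts, while a large positive value means x has structure that P does not account for, which marks it as unrealistic under P.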
2403.04437
Report |
StableDrag: Stable Dragging for Point-based Image Editing |
Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, Limin Wang |
Point-based image editing has attracted remarkable attention since the
emergence of DragGAN. Recently, DragDiffusion further pushes forward the
generative quality via adapting this dragging technique to diffusion models.
Despite this great success, the dragging scheme exhibits two major drawbacks,
namely inaccurate point tracking and incomplete motion supervision, which may
result in unsatisfactory dragging outcomes. To tackle these issues, we build a
stable and precise drag-based editing framework, coined StableDrag, by
designing a discriminative point tracking method and a confidence-based latent
enhancement strategy for motion supervision. The former allows us to precisely
locate the updated handle points, thereby boosting the stability of long-range
manipulation, while the latter keeps the optimized latent as high-quality as
possible across all manipulation steps. Thanks to these designs, we instantiate
two types of image editing models, StableDrag-GAN and StableDrag-Diff, which
attain more stable dragging performance, as shown by extensive qualitative
experiments and quantitative assessment on DragBench. |
This paper presents StableDrag, a stable dragging framework for point-based image editing, improving upon previous methods like DragGAN and DragDiffusion. |
Existing dragging techniques suffer from inaccurate point tracking and incomplete motion supervision, leading to unsatisfactory editing outcomes. |
StableDrag introduces a discriminative point tracking method using a learned convolutional filter to better locate updated handle points. It also employs a confidence-based latent enhancement strategy during motion supervision to ensure high-quality optimization at each step. |
StableDrag accurately moves handle points to target points, even for long-range manipulations.
It generates higher-quality editing results, preserving image fidelity and content consistency.
Quantitative evaluation on DragBench shows StableDrag outperforms DragDiffusion in both accuracy and image quality. |
The current implementation relies on a local search strategy during point tracking, limiting its ability to differentiate between very similar objects.
Future work includes exploring global tracking strategies and adapting StableDrag to other generative models. |
image editing, generative models, stable diffusion, draggan, point tracking |
2403.04321
Report |
Discriminative Probing and Tuning for Text-to-Image Generation |
Leigang Qu, Wenjie Wang, Yongqi Li, Hanwang Zhang, Liqiang Nie, Tat-Seng Chua |
Despite advancements in text-to-image generation (T2I), prior methods often
face text-image misalignment problems such as relation confusion in generated
images. Existing solutions involve cross-attention manipulation for better
compositional understanding or integrating large language models for improved
layout planning. However, the inherent alignment capabilities of T2I models are
still inadequate. By reviewing the link between generative and discriminative
modeling, we posit that T2I models' discriminative abilities may reflect their
text-image alignment proficiency during generation. In this light, we advocate
bolstering the discriminative abilities of T2I models to achieve more precise
text-to-image alignment for generation. We present a discriminative adapter
built on T2I models to probe their discriminative abilities on two
representative tasks and leverage discriminative fine-tuning to improve their
text-image alignment. As a bonus of the discriminative adapter, a
self-correction mechanism can leverage discriminative gradients to better align
generated images to text prompts during inference. Comprehensive evaluations
across three benchmark datasets, including both in-distribution and
out-of-distribution scenarios, demonstrate our method's superior generation
performance. Meanwhile, it achieves state-of-the-art discriminative performance
on the two discriminative tasks compared to other generative models. |
This paper proposes DPT, a novel paradigm to enhance text-image alignment in text-to-image generation models by improving their discriminative abilities. |
Existing text-to-image generation models often struggle with accurately aligning generated images with text prompts, especially in complex scenes. This misalignment issue hinders the generation of high-quality images that faithfully reflect the input text. |
DPT is a two-stage process. Stage 1 (Discriminative Probing) assesses the model's discriminative abilities on Image-Text Matching (ITM) and Referring Expression Comprehension (REC) tasks using a lightweight Discriminative Adapter. Stage 2 (Discriminative Tuning) improves these abilities through parameter-efficient fine-tuning using LoRA, focusing on enhancing both generative and discriminative performance. Additionally, a self-correction mechanism guides image generation towards better alignment during inference. |
DPT significantly improves text-image alignment across five diverse T2I benchmarks, outperforming existing state-of-the-art methods in terms of alignment accuracy.
The study reveals that text-to-image generation models possess inherent discriminative abilities (global matching and local grounding), which can be effectively enhanced through discriminative tuning.
The proposed self-correction mechanism effectively guides the generation process towards images better aligned with the text prompts. |
The study primarily focuses on two specific discriminative tasks (ITM and REC). Exploring the impact of other discriminative tasks on text-to-image generation could be beneficial.
Balancing multi-task learning objectives, especially those related to generation and discrimination, requires further investigation to prevent potential conflicts during optimization. |
text-to-image generation, text-image alignment, discriminative probing, discriminative tuning, self-correction |
2403.04306
Report |
Effectiveness Assessment of Recent Large Vision-Language Models |
Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan |
The advent of large vision-language models (LVLMs) represents a noteworthy
advancement towards the pursuit of artificial general intelligence. However,
the model efficacy across both specialized and general tasks warrants further
investigation. This paper endeavors to evaluate the competency of popular LVLMs
in specialized and general tasks, respectively, aiming to offer a comprehensive
understanding of these novel models. To gauge their efficacy in specialized
tasks, we employ six challenging tasks across three distinct application
scenarios, namely natural, healthcare, and industrial ones. These six tasks
include salient/camouflaged/transparent object detection, as well as polyp
detection, skin lesion detection, and industrial anomaly detection. We examine
the performance of three recent open-source LVLMs, including MiniGPT-v2,
LLaVA-1.5, and Shikra, on both visual recognition and localization under these
tasks. Moreover, we conduct empirical investigations utilizing the
aforementioned LVLMs together with GPT-4V, assessing their multi-modal
understanding capabilities in general tasks including object counting, absurd
question answering, affordance reasoning, attribute recognition, and spatial
relation reasoning. Our investigations reveal that these LVLMs demonstrate
limited proficiency not only in specialized tasks but also in general tasks. We
delve deep into this inadequacy and uncover several potential factors,
including limited cognition in specialized tasks, object hallucination,
text-to-image interference, and decreased robustness in complex problems. We
hope this study could provide useful insights for the future development of
LVLMs, helping researchers improve LVLMs to cope with both general and
specialized applications. |
This paper presents a comprehensive evaluation of popular large vision-language models (LVLMs) on both specialized and general vision-language tasks. |
Understanding the strengths and limitations of LVLMs in handling specialized and general tasks is crucial for guiding future research and development towards artificial general intelligence. |
The authors evaluate three open-source LVLMs (MiniGPT-v2, LLaVA-1.5, and Shikra) on six specialized tasks and five general tasks. They quantitatively analyze model performance using established metrics and qualitatively examine failure cases to identify potential reasons for inadequacy. |
LVLMs show promising but insufficient performance on specialized tasks due to limited domain knowledge and common weaknesses like object hallucination.
Shikra and MiniGPT-v2 exhibit better localization capabilities than LLaVA-1.5, particularly in natural scenarios.
In general tasks, all evaluated LVLMs exhibit significant room for improvement, particularly in object counting, spatial reasoning, and absurd question answering. |
The evaluation primarily focuses on a limited number of open-source LVLMs.
Future work should explore effective strategies like prompt engineering and model optimization to improve LVLMs' performance on specialized tasks. |
large vision-language models, multi-modal understanding, specialized vision tasks, object hallucination, prompt engineering |
2403.04303
Report |
LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking |
Jialin Li, Qiang Nie, Weifu Fu, Yuhuan Lin, Guangpin Tao, Yong Liu, Chengjie Wang |
Deep learning models, particularly those based on transformers, often employ
numerous stacked structures, which possess identical architectures and perform
similar functions. While effective, this stacking paradigm leads to a
substantial increase in the number of parameters, posing challenges for
practical applications. In today's landscape of increasingly large models,
stacking depth can even reach dozens, further exacerbating this issue. To
mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS
allows stacked modules to share the majority of parameters, requiring a much
smaller number of unique ones per module to match or even surpass the
performance of using entirely distinct ones, thereby significantly reducing
parameter usage. We validate our method by applying it to the stacked decoders
of a query-based object detector, and conduct extensive experiments on the
widely used MS COCO dataset. Experimental results demonstrate the effectiveness
of our method, as even with a 70% reduction in the parameters of the decoder,
our method still enables the model to achieve comparable or |
This paper proposes a novel Low-rank Residual Structure (LORS) for parameter reduction in deep learning models with stacked structures. LORS decomposes parameters into shared and private components, significantly reducing overall parameter usage without compromising performance. |
Stacking structures, while effective, significantly increase parameter count, posing challenges for training, inference, and deployment. LORS addresses this issue by promoting parameter sharing. |
LORS is formulated mathematically for both adaptive and static parameters, utilizing low-rank decomposition and residual connections. The approach is validated by applying it to the stacked decoders of AdaMixer, a query-based object detector. |
LORS reduced AdaMixer's decoder parameters by up to 70% while maintaining or even improving performance on the MS COCO dataset.
Both adaptive and static LORS effectively reduced parameters without compromising performance.
Experiments showed that shared and private weights are crucial, and the optimal configuration for LORS depends on the specific task and model. |
While effective, LORS requires a relatively long training process to fully realize its potential.
The current LORS implementation slightly increases inference time due to serial and redundant computations, necessitating further optimization. |
parameter reduction, deep learning, stacked structures, object detection, low-rank decomposition |
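The shared-plus-private decomposition in the LORS entry above can be sketched as below. This is an illustration under assumptions rather than the authors' formulation: each stacked layer's weight is taken to be one shared matrix plus a sum of low-rank products `B @ A` that are private to that layer, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lors_weight(shared, private_factors):
    """Build one layer's weight: shared + sum of low-rank products B @ A.

    shared:          d x h matrix used by every layer in the stack.
    private_factors: list of (B, A) pairs owned by this layer only,
                     with B of shape d x r and A of shape r x h.
    """
    weight = [row[:] for row in shared]  # copy so `shared` is not mutated
    for b, a in private_factors:
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, val in enumerate(row):
                weight[i][j] += val
    return weight

def param_counts(num_layers, d, h, rank, groups=1):
    """Return (parameters with distinct weights, parameters with sharing)."""
    full = num_layers * d * h
    lors = d * h + num_layers * groups * rank * (d + h)
    return full, lors
```

For example, six stacked `256 x 256` layers hold 393,216 weights when kept fully distinct, against 90,112 with one shared matrix and a single rank-8 private pair per layer in this sketch. These numbers illustrate the arithmetic only and are not results from the paper.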
2403.04279
Report |
Controllable Generation with Text-to-Image Diffusion Models: A Survey |
Pu Cao, Feng Zhou, Qing Song, Lu Yang |
In the rapidly advancing realm of visual generation, diffusion models have
revolutionized the landscape, marking a significant shift in capabilities with
their impressive text-guided generative functions. However, relying solely on
text for conditioning these models does not fully cater to the varied and
complex requirements of different applications and scenarios. Acknowledging
this shortfall, a variety of studies aim to control pre-trained text-to-image
(T2I) models to support novel conditions. In this survey, we undertake a
thorough review of the literature on controllable generation with T2I diffusion
models, covering both the theoretical foundations and practical advancements in
this domain. Our review begins with a brief introduction to the basics of
denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion
models. We then reveal the controlling mechanisms of diffusion models,
theoretically analyzing how novel conditions are introduced into the denoising
process for conditional generation. Additionally, we offer a detailed overview
of research in this area, organizing it into distinct categories from the
condition perspective: generation with specific conditions, generation with
multiple conditions, and universal controllable generation. For an exhaustive
list of the controllable generation literature surveyed, please refer to our
curated repository at
https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models. |
This paper presents a survey of controllable generation techniques using text-to-image diffusion models, focusing on how novel conditions beyond text prompts can steer the image generation process. |
Achieving fine-grained control over image generation is crucial for various applications. This survey provides a comprehensive overview of the rapidly developing field of controllable generation with T2I diffusion models. |
The paper categorizes existing methods based on condition types and analyzes two core controlling mechanisms: conditional score prediction and condition-guided score estimation. |
The survey introduces a structured taxonomy for classifying controllable generation methods based on condition types.
It provides an in-depth analysis of how conditional score prediction and condition-guided score estimation methods incorporate novel conditions into T2I models.
The paper highlights the diverse applications of conditional generation, demonstrating its impact on various tasks such as image manipulation, completion, and 3D generation. |
The paper primarily focuses on analyzing existing methods and their applications, leaving the exploration of potential future directions for controllable generation with T2I diffusion models as an open question.
The survey primarily focuses on image generation, leaving the exploration of conditional generation in other domains like video and 3D as a potential area for future investigation. |
text-to-image generation, diffusion models, controllable generation, conditional image synthesis, generative ai |
2403.04200
Report |
ACC-ViT : Atrous Convolution's Comeback in Vision Transformers |
Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara |
Transformers have risen to become state-of-the-art vision architectures
through innovations in attention mechanisms inspired by visual perception. At
present, two classes of attention prevail in vision transformers: regional and
sparse attention. The former bounds pixel interactions within a region; the
latter spreads them across sparse grids. Their opposing natures have
resulted in a dilemma between preserving hierarchical relations and
attaining a global context. In this work, taking inspiration from atrous
convolution, we introduce Atrous Attention, a fusion of regional and sparse
attention, which can adaptively consolidate both local and global information,
while maintaining hierarchical relations. As a further tribute to atrous
convolution, we redesign the ubiquitous inverted residual convolution blocks
with atrous convolution. Finally, we propose a generalized, hybrid vision
transformer backbone, named ACC-ViT, following conventional practices for
standard vision tasks. Our tiny version model achieves $\sim 84 \%$ accuracy on
ImageNet-1K, with less than $28.5$ million parameters, which is $0.42\%$
improvement over state-of-the-art MaxViT while having $8.4\%$ fewer parameters.
In addition, we have investigated the efficacy of ACC-ViT backbone under
different evaluation settings, such as finetuning, linear probing, and
zero-shot learning on tasks involving medical image analysis, object detection,
and language-image contrastive learning. ACC-ViT is therefore a strong vision
backbone, which is also competitive in mobile-scale versions, ideal for niche
applications with small datasets. |
This paper introduces Atrous Attention, a novel attention mechanism for vision transformers, and proposes ACC-ViT, a hybrid vision transformer architecture based on this mechanism, inspired by atrous convolution. |
The proposed ACC-ViT aims to address the limitations of regional and sparse attention mechanisms in vision transformers by consolidating both local and global information while preserving hierarchical relations, thereby enhancing visual representation. |
The methodology involves designing Atrous Attention by emulating atrous convolution for sparse regional attention, implementing a gating operation for adaptive fusion of hierarchical features, using a shared MLP layer across parallel attentions for efficiency, and proposing Parallel Atrous Inverted Residual Convolution blocks. |
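The sparse sampling and gated fusion described above can be illustrated with a minimal sketch. It assumes a small feature map stored as nested lists; `atrous_partition` and `gated_fusion` are hypothetical names, and the gates are plain numbers rather than learned parameters, so this is not the authors' implementation.

```python
import math

def atrous_partition(grid, rate):
    """Split an H x W grid into rate*rate sub-grids by strided (dilated) sampling.

    Sub-grid (i, j) holds the elements at rows i, i+rate, ... and columns
    j, j+rate, ...; windowed attention would then run inside each sub-grid.
    """
    return [
        [row[j::rate] for row in grid[i::rate]]
        for i in range(rate)
        for j in range(rate)
    ]

def gated_fusion(branches, logits):
    """Fuse per-branch scalar features using softmax-normalized gate weights."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return sum(b * e / total for b, e in zip(branches, exps))

# Toy 4 x 4 feature map with values 0..15, partitioned at dilation rate 2.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
parts = atrous_partition(grid, 2)
fused = gated_fusion([1.0, 3.0], [0.0, 0.0])  # equal gates average the branches
```

Each sub-grid covers the whole map at a coarser stride, which is how a dilated window can see distant positions while keeping the number of attended tokens small.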
ACC-ViT achieves state-of-the-art performance, outperforming MaxViT and MOAT on ImageNet-1K with a tiny version achieving 83.97% accuracy.
The model exhibits strong transfer learning capabilities, surpassing baselines on medical image datasets (HAM10000, EyePACS, BUSI).
ACC-ViT demonstrates competence as a frozen backbone for object detection and shows promising zero-shot performance on the ELEVATER benchmark. |
Limitations include computational constraints preventing pretraining on larger datasets (ImageNet-21K) and developing larger models.
Future work involves exploring the full potential of ACC-ViT by scaling up the model and evaluating it on higher-resolution images. |
vision transformer, atrous attention, atrous convolution, hybrid architecture, transfer learning |
2403.03485
Report |
NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging |
Takahiro Shirakawa, Seiichi Uchida |
Layout-aware text-to-image generation is a task to generate multi-object
images that reflect layout conditions in addition to text conditions. The
current layout-aware text-to-image diffusion models still have several issues,
including mismatches between the text and layout conditions and quality
degradation of generated images. This paper proposes a novel layout-aware
text-to-image diffusion model called NoiseCollage to tackle these issues.
During the denoising process, NoiseCollage independently estimates noises for
individual objects and then crops and merges them into a single noise. This
operation helps avoid condition mismatches; in other words, it can put the
right objects in the right places. Qualitative and quantitative evaluations
show that NoiseCollage outperforms several state-of-the-art models. These
successful results indicate that the crop-and-merge operation of noises is a
reasonable strategy to control image generation. We also show that NoiseCollage
can be integrated with ControlNet to use edges, sketches, and pose skeletons as
additional conditions. Experimental results show that this integration boosts
the layout accuracy of ControlNet. The code is available at
https://github.com/univ-esuty/noisecollage. |
This paper proposes NoiseCollage, a novel training-free layout-aware text-to-image diffusion model for generating multi-object images by independently estimating and then cropping and merging noises for individual objects. |
Current layout-aware text-to-image diffusion models suffer from mismatches between text and layout conditions and image quality degradation. NoiseCollage tackles these issues by its unique noise manipulation strategy. |
NoiseCollage leverages a pre-trained diffusion model like Stable Diffusion. It estimates noises for individual objects independently, then crops and merges them into a single noise for final image generation. It uses masked cross-attention to localize visual information of each object and allows overlapping layout conditions with a weighted merging operation. |
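The crop-and-merge step can be sketched as follows, under stated assumptions: noise maps are nested lists, layout masks are 0/1, overlapping regions are averaged (a simple stand-in for the weighted merge), and a `global_noise` fallback fills uncovered pixels. The function name and the fallback are illustrative, not taken from the released code.

```python
def merge_noises(noises, masks, global_noise):
    """Merge per-object noise maps into a single noise map.

    noises: list of H x W noise maps, one per object.
    masks:  list of H x W 0/1 layout masks, in the same order as noises.
    global_noise: H x W map used wherever no object mask is active.
    """
    h, w = len(global_noise), len(global_noise[0])
    merged = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            coverage = sum(m[y][x] for m in masks)
            if coverage == 0:
                merged[y][x] = global_noise[y][x]
            else:
                # Crop each object's noise with its mask, then average overlaps.
                cropped = sum(n[y][x] * m[y][x] for n, m in zip(noises, masks))
                merged[y][x] = cropped / coverage
    return merged

# Toy 1 x 3 example: object A covers pixels 0-1, object B covers pixel 1.
noises = [[[1.0, 1.0, 1.0]], [[3.0, 3.0, 3.0]]]
masks = [[[1, 1, 0]], [[0, 1, 0]]]
merged = merge_noises(noises, masks, [[9.0, 9.0, 9.0]])
```

In a real sampler this merge would run once per denoising step, with each object's noise estimated from its own text condition.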
NoiseCollage generates high-quality multi-object images with accurate alignment between objects and their text/layout conditions.
Qualitative and quantitative evaluations demonstrate NoiseCollage's superior performance over state-of-the-art methods, showing fewer condition mismatches and better image quality.
Integrating ControlNet into NoiseCollage enables finer control over object appearances through edge maps, sketches, and pose skeletons while maintaining layout accuracy. |
NoiseCollage occasionally struggles to accurately generate small objects.
Future work includes enabling automatic layout inference from text conditions, support for point annotations, and exploring further noise manipulation techniques for tasks like video generation. |
text-to-image generation, diffusion models, layout-aware generation, noise manipulation, controlnet |
2403.03431
Report |
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing |
Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, Jun Huang |
Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have
recently gained significant popularity for creative Text-to-image generation.
Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE)
is of greater importance for application developers, which modifies objects or
object properties in images by manipulating feature components in attention
layers during the generation process. However, little is known about what
semantic meanings these attention layers have learned and which parts of the
attention maps contribute to the success of image editing. In this paper, we
conduct an in-depth probing analysis and demonstrate that cross-attention maps
in Stable Diffusion often contain object attribution information that can
result in editing failures. In contrast, self-attention maps play a crucial
role in preserving the geometric and shape details of the source image during
the transformation to the target image. Our analysis offers valuable insights
into understanding cross and self-attention maps in diffusion models. Moreover,
based on our findings, we simplify popular image editing methods and propose a
more straightforward yet more stable and efficient tuning-free procedure that
only modifies self-attention maps of the specified attention layers during the
denoising process. Experimental results show that our simplified method
consistently surpasses the performance of popular approaches on multiple
datasets. |
This paper presents Free-Prompt-Editing (FPE), a simplified and efficient method for tuning-free text-guided image editing in diffusion models by modifying self-attention maps during the denoising process. |
Domain-specific image editing often requires modifying objects or properties in images, making tuning-free methods crucial for developers. However, existing approaches have limitations, such as unstable results due to cross-attention manipulation and high computational costs. |
The authors conduct a probing analysis of cross- and self-attention maps in Stable Diffusion, revealing that cross-attention maps contain object attribution information leading to editing failures, while self-attention maps preserve geometric and shape details. Based on these findings, FPE replaces specific self-attention maps during denoising, leveraging cross-attention for image-prompt alignment and self-attention for preserving source image structure. |
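A toy sketch of the self-attention replacement idea follows. It assumes the attention maps have already been computed for a source branch and a target branch; `select_self_attention`, `apply_attention`, and the layer indices are hypothetical and only illustrate the swap, not the authors' code.

```python
def select_self_attention(src_map, tgt_map, layer, layers_to_edit):
    """Use the source image's self-attention map in the chosen layers only."""
    return src_map if layer in layers_to_edit else tgt_map

def apply_attention(attn_map, values):
    """Mix value vectors with the rows of an attention map."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in attn_map
    ]

src_map = [[1.0, 0.0], [0.0, 1.0]]   # toy source self-attention (carries layout)
tgt_map = [[0.5, 0.5], [0.5, 0.5]]   # toy target self-attention
tgt_values = [[2.0], [4.0]]          # target-branch value vectors

# Layer 3 is in the edited set, layer 0 is not.
edited = apply_attention(select_self_attention(src_map, tgt_map, 3, {3, 4}), tgt_values)
untouched = apply_attention(select_self_attention(src_map, tgt_map, 0, {3, 4}), tgt_values)
```

Cross-attention is left alone in this sketch, which matches the summary's point that only self-attention maps are modified.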
Cross-attention maps in Stable Diffusion contain object attribution information, making their manipulation prone to editing failures.
Self-attention maps are crucial for preserving the original image's structure during editing.
FPE consistently outperforms existing tuning-free methods on multiple datasets while being more efficient. |
FPE is limited by the generative capabilities of the underlying TIS model.
Reconstruction of real images can result in loss of detail due to limitations of the VQ autoencoder. |
image editing, text-guided image editing, diffusion models, stable diffusion, attention mechanisms |
2403.02981
Report |
Doubly Abductive Counterfactual Inference for Text-based Image Editing |
Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, Yu-Gang Jiang |
We study text-based image editing (TBIE) of a single image by counterfactual
inference because it is an elegant formulation to precisely address the
requirement: the edited image should retain the fidelity of the original one.
Through the lens of the formulation, we find that the crux of TBIE is that
existing techniques hardly achieve a good trade-off between editability and
fidelity, mainly due to the overfitting of the single-image fine-tuning. To
this end, we propose a Doubly Abductive Counterfactual inference framework
(DAC). We first parameterize an exogenous variable as a UNet LoRA, whose
abduction can encode all the image details. Second, we abduct another exogenous
variable parameterized by a text encoder LoRA, which recovers the lost
editability caused by the overfitted first abduction. Thanks to the second
abduction, which exclusively encodes the visual transition from post-edit to
pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit
back to post-edit, thereby accomplishing the edit. Through extensive
experiments, our DAC achieves a good trade-off between editability and
fidelity. Thus, we can support a wide spectrum of user editing intents,
including addition, removal, manipulation, replacement, style transfer, and
facial change, which are extensively validated in both qualitative and
quantitative evaluations. Codes are in https://github.com/xuesong39/DAC. |
This paper introduces Doubly Abductive Counterfactual (DAC), a novel framework for text-based image editing that leverages counterfactual inference to achieve a better trade-off between editability and fidelity compared to existing methods. |
Text-based image editing (TBIE) is challenging because existing techniques struggle to balance preserving the original image's fidelity while effectively incorporating textual edits. This paper provides a theoretical framework, counterfactual inference, to formally define TBIE and address this challenge. |
DAC uses a two-step abduction process. First, it parameterizes an exogenous variable as a UNet LoRA to encode image details (fidelity). Second, it introduces another exogenous variable, a text encoder LoRA, to recover editing capabilities lost due to overfitting in the first abduction. The method then inverts the second abduction to apply the semantic change, achieving the desired edit. |
DAC achieves a good balance between editability and fidelity, outperforming existing methods in qualitative and quantitative evaluations.
The method supports a wide range of editing intents, including addition, removal, manipulation, replacement, style transfer, and facial changes.
Ablation studies confirm the importance of the two-step abduction process, annealing strategy, and specific LoRA parameterization for optimal performance. |
The method's reliance on stable diffusion as the generative model introduces limitations related to random seed sensitivity, comprehension of referring expressions, and lack of common sense.
Multi-turn editing leads to gradual degradation in image quality due to information loss during abduction. |
text-based image editing, counterfactual inference, stable diffusion, lora, image manipulation |
2403.02827
Report |
Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation |
Weijie Li, Litong Gong, Yiran Zhu, Fanda Fan, Biao Wang, Tiezheng Ge, Bo Zheng |
Image-to-video (I2V) generation often struggles to maintain high fidelity
in open domains. Traditional image animation techniques primarily
focus on specific domains such as faces or human poses, making them difficult
to generalize to open domains. Several recent I2V frameworks based on diffusion
models can generate dynamic content for open domain images but fail to maintain
fidelity. We found that two main factors of low fidelity are the loss of image
details and the noise prediction biases during the denoising process. To this
end, we propose an effective method that can be applied to mainstream video
diffusion models. This method achieves high fidelity based on supplementing
more precise image information and noise rectification. Specifically, given a
specified image, our method first adds noise to the input image latent to keep
more details, then denoises the noisy latent with proper rectification to
alleviate the noise prediction biases. Our method is tuning-free and
plug-and-play. The experimental results demonstrate the effectiveness of our
approach in improving the fidelity of generated videos. For more image-to-video
generated results, please refer to the project website:
https://noise-rectification.github.io. |
This paper proposes a noise rectification method for high-fidelity image-to-video generation, addressing the limitations of existing approaches in maintaining detail and mitigating noise prediction biases. |
Generating high-fidelity videos from still images is challenging, with existing methods struggling to maintain detail and suffering from noise accumulation during the denoising process. |
The method utilizes a "noising and rectified denoising" approach. It first adds noise to the input image latent. Then, it rectifies the predicted noise during denoising by leveraging the known initial noise, striking a balance between fidelity and motion. |
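The two steps can be sketched on flat lists of latent values. The forward-noising formula is the standard diffusion one; the rectification is written here as a linear blend between predicted and known noise, which is an assumption about its form, and `rectify_noise` and `weight` are illustrative names.

```python
import math

def add_noise(latent, noise, alpha_bar):
    """Forward-diffuse a clean latent: sqrt(a) * z0 + sqrt(1 - a) * eps."""
    return [
        math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
        for z, e in zip(latent, noise)
    ]

def rectify_noise(predicted, initial, weight):
    """Pull the predicted noise toward the known initial noise.

    weight = 0 keeps the model prediction; weight = 1 uses the known noise.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [p + weight * (e - p) for p, e in zip(predicted, initial)]

rectified = rectify_noise([0.0, 2.0], [1.0, 0.0], 0.5)
clean = add_noise([2.0], [1.0], 1.0)  # alpha_bar = 1 means no noise is added
```

A larger `weight` favors fidelity to the input image, while a smaller one leaves more room for motion, which is the trade-off the summary describes.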
The method outperforms existing image-to-video generation techniques in preserving fine-grained details and achieving higher fidelity.
Ablation studies demonstrate the impact of rectification weight and timestep on fidelity and motion.
The method is shown to be plug-and-play, effectively extending various text-to-video frameworks for high-fidelity image-to-video generation. |
The method, while excelling in fidelity, may lead to a slight reduction in motion intensity.
Future work will focus on enhancing motion intensity while preserving the achieved high fidelity. |
image-to-video generation, diffusion models, noise rectification, fidelity enhancement, open-domain video synthesis |
2403.02799
Report |
DPPA: Pruning Method for Large Language Model to Model Merging |
Yaochen Zhu, Rui Xia, Jiajun Zhang |
Model merging is to combine fine-tuned models derived from multiple domains,
with the intent of enhancing the model's proficiency across various domains.
The principal concern is the resolution of parameter conflicts. A substantial
body of existing research remedies this issue during the merging stage, with
the latest study focusing on resolving this issue throughout the pruning stage.
The DARE approach has exhibited promising outcomes when applied to a simplistic
fine-tuned model. However, the efficacy of this method tends to wane when
employed on complex fine-tuned models that show a significant parameter bias
relative to the baseline model. In this paper, we introduce a dual-stage method
termed Dynamic Pruning Partition Amplification (DPPA), devised to tackle the
challenge of merging complex fine-tuned models. Initially, we introduce
Dynamically Pruning (DP), an improved approach based on magnitude pruning,
whose aim is to enhance performance at higher pruning rates. Subsequently, we
propose Dynamically Partition Amplification (DPA), a rescaling strategy
designed to dynamically amplify parameter partitions in relation to their
significance levels. The experimental results show that our method maintains a
mere 20% of domain-specific parameters and yet delivers a performance
comparable to other methodologies that preserve up to 90% of parameters.
Furthermore, our method displays outstanding performance post-pruning, leading
to a significant performance improvement of nearly 20% in model merging. We
make our code available on GitHub. |
This paper presents DPPA, a dual-stage method for merging large language models fine-tuned on different domains by addressing parameter conflicts through a novel pruning and rescaling strategy. |
Model merging aims to combine domain-specific models into a single model with multi-domain capabilities. However, parameter conflicts between models often lead to performance degradation, which DPPA aims to mitigate. |
DPPA first employs Dynamic Pruning (DP) to prune less significant parameters based on their magnitudes at layer and linear layer levels. Then, it uses Dynamic Partition Amplification (DPA) to dynamically rescale the remaining parameters based on their importance derived from pruning rates. |
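A simplified sketch of the prune-then-amplify idea is given below. It uses plain global magnitude pruning of the fine-tuning deltas and a single amplification factor, whereas the summary describes layer-level pruning and per-partition rescaling; `prune_and_rescale`, `keep_ratio`, and `scale` are hypothetical names.

```python
def prune_and_rescale(base, finetuned, keep_ratio, scale):
    """Keep the largest-magnitude deltas, amplify them, and add them to the base."""
    deltas = [f - b for f, b in zip(finetuned, base)]
    keep = max(1, round(len(deltas) * keep_ratio))
    # Indices of the `keep` deltas with the largest absolute value.
    ranked = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)
    kept = set(ranked[:keep])
    return [
        b + (scale * d if i in kept else 0.0)
        for i, (b, d) in enumerate(zip(base, deltas))
    ]

base = [0.0] * 5
finetuned = [0.1, -2.0, 0.3, 0.05, 1.0]
merged_ready = prune_and_rescale(base, finetuned, keep_ratio=0.4, scale=1.5)
```

Pruning the deltas rather than the full weights is what lets several domain models be merged onto one base with fewer conflicting parameters.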
DPPA retains only 20% of domain-specific parameters while achieving comparable performance to other methods retaining 90% of parameters.
DPPA outperforms the state-of-the-art merging method DARE by nearly 20% in performance.
Analysis suggests DPPA implicitly partitions parameters by dimensions, allowing it to restore domain-specific capabilities by amplifying important dimensions. |
DPPA's performance is suboptimal for models with minor differences compared to the base model.
DPA requires significant time to find the optimal rescaling ratio. |
model merging, large language models, pruning, rescaling, parameter conflicts |
2403.02775
Report |
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs |
Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang |
Large language models (LLMs) have proven far superior to conventional
methods in various tasks. However, their expensive computations and high memory
requirements are prohibitive for deployment. Model quantization is an effective
method for reducing this overhead. The problem is that in most previous works,
the quantized model was calibrated using few samples from the training data,
which might affect the generalization of the quantized LLMs to unknown cases
and tasks. Hence in this work, we explore an important question: Can we design
a data-independent quantization method for LLMs to guarantee its generalization
performance? In this work, we propose EasyQuant, a training-free and
data-independent weight-only quantization algorithm for LLMs. Our observation
indicates that two factors: outliers in the weight and quantization ranges, are
essential for reducing the quantization error. Therefore, in EasyQuant, we
leave the outliers (less than 1%) unchanged and optimize the quantization range
to reduce the reconstruction error. With these methods, we surprisingly find
that EasyQuant achieves comparable performance to the original model. Since
EasyQuant does not depend on any training data, the generalization performance
of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented
in parallel so that the quantized model could be attained in a few minutes even
for LLMs over 100B. To the best of our knowledge, ours is the first work to achieve
almost lossless quantization performance for LLMs under a data-independent
setting and our algorithm runs over 10 times faster than the data-dependent
methods. |
This paper proposes "EasyQuant", a training-free and data-free weight quantization algorithm for Large Language Models (LLMs) that isolates outliers in weight from quantization and optimizes quantization ranges to improve performance. |
LLMs are computationally and memory intensive. Quantization reduces these overheads, but existing methods suffer from generalization issues due to data-dependent calibration. This work aims for a data-free approach to guarantee generalization performance. |
EasyQuant identifies outliers in the weight matrices using a sigma-based criterion and keeps them unquantized. It then optimizes the quantization ranges for the remaining weights by minimizing reconstruction error using gradient descent. |
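The outlier isolation and range optimization can be sketched with the standard library alone. Note the swap: the summary says the range is optimized by gradient descent, while this sketch uses a small grid search over candidate scales for brevity. The function names, the 3-sigma default, and the 4-bit symmetric quantizer are assumptions.

```python
import statistics

def quantize(weights, scale, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def easyquant_sketch(weights, n_sigma=3.0, bits=4, candidates=50):
    # 1. Flag outliers with a sigma-based rule; they stay in full precision.
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)
    is_outlier = [abs(w - mean) > n_sigma * std for w in weights]
    inliers = [w for w, o in zip(weights, is_outlier) if not o]

    # 2. Search for the scale that minimizes reconstruction error on inliers.
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in inliers) or 1.0
    scales = [max_abs * k / candidates / qmax for k in range(1, candidates + 1)]
    best = min(scales, key=lambda s: squared_error(inliers, quantize(inliers, s, bits)))

    # 3. Reassemble: outliers unchanged, inliers quantized with the best scale.
    dequantized = iter(quantize(inliers, best, bits))
    return [w if o else next(dequantized) for w, o in zip(weights, is_outlier)]

weights = [0.1 * i for i in range(-10, 11)] + [50.0]  # one large outlier
restored = easyquant_sketch(weights)
```

Without step 1, the single value 50.0 would stretch the quantization range and crush every other weight toward zero, which is the motivation for treating outliers separately.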
EasyQuant achieves comparable performance to the original full-precision LLMs after quantization.
It significantly outperforms naive Round-to-Nearest (RTN) quantization in a data-free setting.
EasyQuant shows better performance than data-dependent algorithms like GPTQ on several benchmarks. |
The outlier recovery in EasyQuant requires additional CUDA kernels.
It focuses on weight-only quantization and does not address computational cost reduction, leaving latency minimization for future work. |
model quantization, large language models, data-free quantization, outlier isolation, quantization range optimization |
2403.02677
Report |
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters |
Weizhi Wang, Khalil Mrini, Linjie Yang, Sateesh Kumar, Yu Tian, Xifeng Yan, Heng Wang |
We propose a novel framework for filtering image-text data by leveraging
fine-tuned Multimodal Language Models (MLMs). Our approach outperforms
predominant filtering methods (e.g., CLIPScore) via integrating the recent
advances in MLMs. We design four distinct yet complementary metrics to
holistically measure the quality of image-text data. A new pipeline is
established to construct high-quality instruction data for fine-tuning MLMs as
data filters. Comparing with CLIPScore, our MLM filters produce more precise
and comprehensive scores that directly improve the quality of filtered data and
boost the performance of pre-trained models. We achieve significant
improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2)
and various downstream tasks. Our MLM filter can generalize to different models
and tasks, and be used as a drop-in replacement for CLIPScore. An additional
ablation study is provided to verify our design choices for the MLM filter. |
The paper proposes using fine-tuned Multimodal Language Models (MLMs) as data filters to improve the quality of image-text datasets for training Vision-Language Models (VLMs). |
Existing methods like CLIPScore rely on holistic image-text alignment and struggle to capture fine-grained details, limiting the quality of filtered data and downstream VLM performance. |
The authors fine-tune open-source MLMs on a dataset constructed using proprietary LLMs (GPT-4, GPT-4V) to score image-text pairs across four metrics: Image-Text Matching, Object Detail Fulfillment, Caption Text Quality, and Semantic Understanding. Different design choices for data construction and filtering metrics are evaluated on the DataComp benchmark. |
MLM filters significantly outperform CLIPScore on DataComp, achieving 1.7% higher average accuracy over 38 datasets.
Combining multiple MLM-based metrics (ITM and ODF) further improves filtering performance.
MLM filter scores demonstrate stronger correlation with human judgment compared to CLIPScore. |
Limited effectiveness of certain metrics (CTQ, SU) on classification-focused benchmarks.
Computational cost of MLM filtering despite acceleration efforts. |
data filtering, multimodal language models, vision-language models, image-text alignment, data quality |
2403.02580
Report |
What do we learn from inverting CLIP models? |
Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein |
We employ an inversion-based approach to examine CLIP models. Our examination
reveals that inverting CLIP models results in the generation of images that
exhibit semantic alignment with the specified target prompts. We leverage these
inverted images to gain insights into various aspects of CLIP models, such as
their ability to blend concepts and inclusion of gender biases. We notably
observe instances of NSFW (Not Safe For Work) images during model inversion.
This phenomenon occurs even for semantically innocuous prompts, like "a
beautiful landscape," as well as for prompts involving the names of
celebrities. |
This paper investigates the capabilities and biases of CLIP models through model inversion, revealing insights into their ability to blend concepts, presence of NSFW content, and gender biases. |
Understanding the capabilities and biases of CLIP models is crucial due to their widespread use in various AI applications, including text-to-image generation. |
The study employs an inversion-based approach, optimizing input images to align with given textual prompts. It utilizes techniques like augmentations, ensembling, and regularization terms to generate meaningful inversions. |
CLIP models demonstrate a capacity to blend concepts, generating images that accurately combine multiple ideas from a given prompt.
Model inversion reveals the presence of NSFW content within CLIP models, even for seemingly innocuous prompts, suggesting limitations in training data curation.
CLIP models exhibit gender bias, particularly in associating professions and social statuses with specific genders. |
The study acknowledges limitations in using generative strategies to analyze a model not primarily intended for generative tasks.
Future work could explore addressing NSFW content generation stemming from CLIP embeddings in text-to-image generation models. |
clip, model inversion, nsfw content, gender bias, text-to-image generation |
2403.02473
Report |
When do Convolutional Neural Networks Stop Learning? |
Sahan Ahmad, Gabriel Trahan, Aminul Islam |
Convolutional Neural Networks (CNNs) have demonstrated outstanding
performance in computer vision tasks such as image classification, detection,
segmentation, and medical image analysis. In general, an arbitrary number of
epochs is used to train such neural networks. In a single epoch, the entire
training data -- divided by batch size -- are fed to the network. In practice,
validation error with training loss is used to estimate the neural network's
generalization, which indicates the optimal learning capacity of the network.
Current practice is to stop training when the training loss decreases and the
gap between training and validation error increases (i.e., the generalization
gap) to avoid overfitting. However, this is a trial-and-error-based approach
which raises a critical question: Is it possible to estimate when neural
networks stop learning based on training data? This research work introduces a
hypothesis that analyzes the data variation across all the layers of a CNN
variant to anticipate its near-optimal learning capacity. In the training
phase, we use our hypothesis to anticipate the near-optimal learning capacity
of a CNN variant without using any validation data. Our hypothesis can be
deployed as a plug-and-play to any existing CNN variant without introducing
additional trainable parameters to the network. We test our hypothesis on six
different CNN variants and three different general image datasets (CIFAR10,
CIFAR100, and SVHN). The result based on these CNN variants and datasets shows
that our hypothesis saves 58.49\% of computational time (on average) in
training. We further test our hypothesis on ten medical image datasets and
compare it with the MedMNIST-V2 benchmark. Based on our experimental results, we
save $\approx$ 44.1\% of computational time without losing accuracy against the
MedMNIST-V2 benchmark. |
This paper introduces a hypothesis and method to anticipate the near-optimal learning capacity of a Convolutional Neural Network (CNN) during training, potentially saving computational time by stopping training earlier. |
Selecting the number of training epochs for CNNs is currently a trial-and-error process that relies on monitoring validation error, which may not be reliable and incurs extra computational cost. This method aims to address this by predicting when the model stops learning significantly from the training data. |
The method analyzes data variation after the convolution operation in each layer of the CNN across epochs. It introduces the concept of a "stability vector" for each layer, which tracks the standard deviation of data after convolution for each iteration in an epoch. By comparing the mean stability vectors of consecutive epochs, the method determines when the data variation stabilizes, implying the model has reached its near-optimal learning capacity. |
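The stopping rule can be sketched as below, assuming each layer's post-convolution output is available as a flat list per iteration. The use of the population standard deviation, the default of two decimal places, and the function names are assumptions made for illustration.

```python
import statistics

def stability_vector(outputs_per_iteration):
    """Standard deviation of a layer's post-convolution data, per iteration."""
    return [statistics.pstdev(out) for out in outputs_per_iteration]

def mean_stability(outputs_per_iteration):
    """Mean of the stability vector over one epoch."""
    return statistics.fmean(stability_vector(outputs_per_iteration))

def should_stop(prev_epoch, curr_epoch, decimals=2):
    """Stop when every layer's rounded mean stability matches the previous epoch.

    prev_epoch / curr_epoch: one entry per layer, each a list of per-iteration
    flattened outputs.
    """
    return all(
        round(mean_stability(p), decimals) == round(mean_stability(c), decimals)
        for p, c in zip(prev_epoch, curr_epoch)
    )

# One layer, two iterations per epoch.
epoch_a = [[[0.0, 2.0], [0.0, 2.0]]]
epoch_b = [[[0.0, 2.004], [0.0, 2.004]]]   # nearly identical variation
epoch_c = [[[0.0, 4.0], [0.0, 4.0]]]       # variation still changing
```

The `decimals` argument is the heuristic the limitations mention: a coarser rounding stops training earlier, a finer one later.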
The proposed hypothesis, when applied to six different CNN architectures and three image datasets, saves 32% to 79% of the computational time compared to using a fixed 200 epochs.
The method achieves comparable testing accuracy to traditional training with validation data.
Analysis of data variation patterns across layers provides insights into the learning dynamics of CNNs and supports the hypothesis that stability indicates near-optimal learning capacity. |
The method relies on a heuristic choice of rounding decimal places when comparing mean stability vectors, potentially limiting its generalizability.
Further investigation is needed to apply the method to other deep neural networks beyond CNNs. |
optimization, cnn, deep learning, image classification, early stopping |
2403.02460
Report |
MagicClay: Sculpting Meshes With Generative Neural Fields |
Amir Barda, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, Thibault Groueix |
The recent developments in neural fields have brought phenomenal capabilities
to the field of shape generation, but they lack crucial properties, such as
incremental control - a fundamental requirement for artistic work. Triangular
meshes, on the other hand, are the representation of choice for most geometry
related tasks, offering efficiency and intuitive control, but do not lend
themselves to neural optimization. To support downstream tasks, previous art
typically proposes a two-step approach, where first a shape is generated using
neural fields, and then a mesh is extracted for further processing. Instead, in
this paper we introduce a hybrid approach that maintains both a mesh and a
Signed Distance Field (SDF) representation consistently. Using this
representation, we introduce MagicClay - an artist friendly tool for sculpting
regions of a mesh according to textual prompts while keeping other regions
untouched. Our framework carefully and efficiently balances consistency between
the representations and regularizations in every step of the shape
optimization. Relying on the mesh representation, we show how to render the SDF
at higher resolutions and faster. In addition, we employ recent work in
differentiable mesh reconstruction to adaptively allocate triangles in the mesh
where required, as indicated by the SDF. Using an implemented prototype, we
demonstrate superior generated geometry compared to the state-of-the-art, and
novel consistent control, allowing sequential prompt-based edits to the same
mesh for the first time. |
Introduces MagicClay, a tool for sculpting regions of a mesh based on text prompts, using a hybrid mesh-SDF representation. |
Combines the advantages of neural fields (robust generation) and meshes (efficiency, control), enabling localized and sequential prompt-based mesh editing. |
Jointly optimizes a mesh and SDF, using score distillation sampling from text prompts. Employs differentiable rendering, consistency losses, and dynamic topology updates via ROAR. |
Generates smoother and higher-quality geometry than existing text-to-3D methods.
Enables localized mesh edits according to textual prompts, preserving unedited regions.
Outperforms text-driven mesh deformation baselines in terms of expressiveness and control. |
Limited by the quality and noise of SDS gradients.
Computationally expensive, taking around 1 hour per prompt on an A100 GPU. |
3d shape generation, text-guided editing, hybrid representations, mesh sculpting, score distillation sampling |
2403.02332
Report |
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control |
Xuweiyi Chen, Tian Xia, Sihan Xu |
Video Diffusion Models have been developed for video generation, usually
integrating text and image conditioning to enhance control over the generated
content. Despite the progress, ensuring consistency across frames remains a
challenge, particularly when using text prompts as control conditions. To
address this problem, we introduce UniCtrl, a novel, plug-and-play method that
is universally applicable to improve the spatiotemporal consistency and motion
diversity of videos generated by text-to-video models without additional
training. UniCtrl ensures semantic consistency across different frames through
cross-frame self-attention control, and meanwhile, enhances the motion quality
and spatiotemporal consistency through motion injection and spatiotemporal
synchronization. Our experimental results demonstrate UniCtrl's efficacy in
enhancing various text-to-video models, confirming its effectiveness and
universality. |
The paper introduces UniCtrl, a training-free, plug-and-play method to enhance the spatiotemporal consistency and motion diversity of videos generated by text-to-video diffusion models. |
Existing text-to-video diffusion models struggle to maintain consistency across frames, especially when guided by text prompts, leading to discrepancies in generated content over time. |
UniCtrl leverages a three-pronged approach: 1) Cross-Frame Self-Attention Control ensures semantic consistency by applying keys and values from the first frame to subsequent frames, 2) Motion Injection preserves motion dynamics by using original queries for spatial information, and 3) Spatiotemporal Synchronization enhances coherence by synchronizing latent representations between frames. |
UniCtrl significantly improves spatiotemporal consistency across different text-to-video models, as evidenced by quantitative metrics such as DINO feature similarity.
The method effectively preserves motion diversity within generated videos, surpassing baseline models and alternative approaches on motion measures such as RAFT optical flow.
UniCtrl demonstrates strong compatibility with existing enhancement techniques, as shown by its successful integration with FreeInit for further improvements. |
UniCtrl's reliance on the attention mechanism limits its applicability to non-attention-based models.
The method's constraint of using the same values for each frame restricts its ability to generate videos with varying colors across frames. |
video diffusion, spatiotemporal consistency, attention control, text-to-video generation, motion preservation |
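The cross-frame self-attention control described for UniCtrl above can be illustrated with a minimal sketch. It covers only the first of the three components (reusing the first frame's keys and values); motion injection and spatiotemporal synchronization are not shown. The function names, tensor layout (nested lists of shape frames x tokens x channels), and toy numbers are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention for one frame.
    # q: [n][d], k: [m][d], v: [m][dv]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def cross_frame_attention(queries, keys, values):
    # Hypothetical helper: each frame keeps its own queries but attends
    # to the keys and values of frame 0, so all frames share one source
    # of appearance information.
    k0, v0 = keys[0], values[0]
    return [attention(q, k0, v0) for q in queries]

# Toy example: 2 frames, 1 query token each, 2 key/value tokens per frame.
queries = [[[1.0, 0.0]], [[0.0, 1.0]]]
keys = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [5.0, 5.0]]]
values = [[[1.0], [2.0]], [[100.0], [200.0]]]
frames_out = cross_frame_attention(queries, keys, values)
```

In this toy setup the second frame's own values (100 and 200) never reach its output; the result is a weighted average of the first frame's values, which is the mechanism the summary credits for semantic consistency.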
2403.02325
Report |
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training |
David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal |
Highlighting particularly relevant regions of an image can improve the
performance of vision-language models (VLMs) on various vision-language (VL)
tasks by guiding the model to attend more closely to these regions of interest.
For example, VLMs can be given a "visual prompt", where visual markers such as
bounding boxes delineate key image regions. However, current VLMs that can
incorporate visual guidance are either proprietary and expensive or require
costly training on curated data that includes visual prompts. We introduce
Contrastive Region Guidance (CRG), a training-free guidance method that enables
open-source VLMs to respond to visual prompts. CRG contrasts model outputs
produced with and without visual prompts, factoring out biases revealed by the
model when answering without the information required to produce a correct
answer (i.e., the model's prior). CRG achieves substantial improvements in a
wide variety of VL tasks: When region annotations are provided, CRG increases
absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse
region-based tasks such as recognition, math, and object relationship
reasoning. We also show CRG's applicability to spatial reasoning, with 10%
improvement on What'sUp, as well as to compositional generalization --
improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe
-- and to image-text alignment for generated images, where we improve by up to
8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG
allows us to re-rank proposed regions in referring expression comprehension and
phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an
average gain of 3.2% in accuracy. Our analysis explores alternative masking
strategies for CRG, quantifies CRG's probability shift, and evaluates the role
of region guidance strength, empirically validating CRG's design choices. |
This paper introduces Contrastive Region Guidance (CRG), a training-free method to improve visual grounding in vision-language models (VLMs) by leveraging classifier-free guidance (CFG) to focus on specific image regions. |
Current methods for incorporating visual prompts into VLMs either rely on proprietary, expensive models like GPT-4V or require costly finetuning on datasets with visual prompts. CRG addresses these limitations by offering a training-free approach compatible with various existing models. |
CRG contrasts the VLM's output distribution on the original image with its output on a masked version where specific regions are blacked out. This contrast highlights the importance of the masked region for the model's prediction. |
CRG significantly improves visual prompt following, matching the performance of fine-tuned models on ViP-Bench and even outperforming them in some categories.
CRG enhances spatial reasoning on the challenging 'Set of 4' setting of the What'sUp benchmark, leading to substantial accuracy gains over baseline models.
CRG leads to improvements in compositional generalization, boosting performance on the challenging SugarCrepe benchmark and demonstrating the method's ability to enhance models' understanding of language compositionality. |
CRG requires running the VLM twice (on original and masked images), leading to increased computational cost compared to inference without CRG.
The current implementation of CRG relies on object detection models to propose bounding boxes when visual markers are absent. Integrating better visual encoders could further improve efficiency and accuracy. |
visual grounding, vision-language models, visual prompting, classifier-free guidance, compositional generalization |
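The contrast that CRG draws between outputs on the original image and on a region-masked image can be sketched as follows. The weighting used here, (1 + alpha) times the full-image logits minus alpha times the masked-image logits, is the usual classifier-free-guidance form and is an assumption; the summary above only states that the two outputs are contrasted. The function names and the two-word vocabulary are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_region_guidance(logits_full, logits_masked, alpha=1.0):
    # Assumed CFG-style contrast: boost what the visible region adds
    # relative to the model's prior (its answer with the region masked).
    guided = [(1 + alpha) * f - alpha * m
              for f, m in zip(logits_full, logits_masked)]
    return log_softmax(guided)

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Toy vocabulary: index 0 = "cat", index 1 = "dog".
logits_full = [1.5, 2.0]    # region visible: "dog" still slightly ahead
logits_masked = [0.0, 2.0]  # region blacked out: strong prior for "dog"

plain = log_softmax(logits_full)
guided = contrastive_region_guidance(logits_full, logits_masked, alpha=1.0)
```

Here the masked run exposes a prior towards "dog"; subtracting it lets the evidence contributed by the region ("cat") decide the prediction. Setting `alpha=0` recovers the unguided distribution.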
2403.02234
Report |
3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors |
Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu |
We present a two-stage text-to-3D generation system, namely 3DTopia, which
generates high-quality general 3D assets within 5 minutes using hybrid
diffusion priors. The first stage samples from a 3D diffusion prior directly
learned from 3D data. Specifically, it is powered by a text-conditioned
tri-plane latent diffusion model, which quickly generates coarse 3D samples for
fast prototyping. The second stage utilizes 2D diffusion priors to further
refine the texture of coarse 3D models from the first stage. The refinement
consists of both latent and pixel space optimization for high-quality texture
generation. To facilitate the training of the proposed system, we clean and
caption the largest open-source 3D dataset, Objaverse, by combining the power
of vision language models and large language models. Experiment results are
reported qualitatively and quantitatively to show the performance of the
proposed system. Our codes and models are available at
https://github.com/3DTopia/3DTopia |
3DTopia, a two-stage text-to-3D generation system using hybrid diffusion priors, enabling fast prototyping and high-quality 3D generation. |
Generating 3D assets from text is crucial for various applications but challenging due to limited data and computational demands. Existing methods compromise either speed or quality. |
The first stage employs a tri-plane latent diffusion model trained on a captioned and cleaned Objaverse dataset for fast coarse 3D generation. The second stage refines texture using Score Distillation Sampling with latent-space and pixel-space 2D diffusion priors. |
3DTopia outperforms Point-E and Shap-E in text-to-3D generation quality, even with less training data.
The proposed 3D captioning pipeline, leveraging LLaVA and GPT-3.5, produces more detailed and accurate captions compared to existing methods.
Hybrid refinement using both latent-space and pixel-space diffusion priors achieves a balance between texture diversity and quality. |
Limited ability to handle complex, concept-mixing text prompts due to the lack of strong 2D priors in the first stage.
Dependence on the quality of the first stage mesh for refinement. |
text-to-3d generation, diffusion models, 3d captioning, score distillation sampling, tri-plane representation |
2403.02151
Report |
TripoSR: Fast 3D Object Reconstruction from a Single Image |
Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, Yan-Pei Cao |
This technical report introduces TripoSR, a 3D reconstruction model
leveraging transformer architecture for fast feed-forward 3D generation,
producing 3D mesh from a single image in under 0.5 seconds. Building upon the
LRM network architecture, TripoSR integrates substantial improvements in data
processing, model design, and training techniques. Evaluations on public
datasets show that TripoSR exhibits superior performance, both quantitatively
and qualitatively, compared to other open-source alternatives. Released under
the MIT license, TripoSR is intended to empower researchers, developers, and
creatives with the latest advancements in 3D generative AI. |
TripoSR, a fast feed-forward 3D reconstruction model leveraging transformer architecture for high-quality 3D mesh generation from single images in under 0.5 seconds. |
Addresses limitations of slow generation speed and control challenges in existing 3D generation methods, enabling efficient and scalable 3D model creation. |
Builds upon the LRM architecture with improvements in data curation, rendering, model design (triplane channel optimization), and training techniques (mask loss, local rendering supervision). |
Outperforms state-of-the-art methods on the GSO and OmniObject3D datasets in terms of Chamfer Distance (CD) and F-score.
Achieves superior reconstruction quality for both shape and texture details compared to baselines.
Maintains fast inference speed, producing a 3D mesh in under 0.5 seconds on an NVIDIA A100 GPU. |
Reliance on high-resolution rendering for supervision may pose computational challenges.
Future work could explore extending the model for multi-view 3D reconstruction or text-to-3D generation. |
3d reconstruction, transformer, nerf, single image, generative ai |
2403.02118
Report |
Position: Towards Implicit Prompt For Text-To-Image Models |
Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo |
Recent text-to-image (T2I) models have had great success, and many benchmarks
have been proposed to evaluate their performance and safety. However, they only
consider explicit prompts while neglecting implicit prompts (prompts that hint at a
target without explicitly mentioning it). These prompts may circumvent safety
constraints and pose potential threats to the applications of these models.
This position paper highlights the current state of T2I models toward implicit
prompts. We present a benchmark named ImplicitBench and conduct an
investigation on the performance and impacts of implicit prompts with popular
T2I models. Specifically, we design and collect more than 2,000 implicit
prompts of three aspects: General Symbols, Celebrity Privacy, and
Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models'
capabilities under these implicit prompts. Experiment results show that (1) T2I
models are able to accurately create various target symbols indicated by
implicit prompts; (2) implicit prompts bring potential risks of privacy leakage
for T2I models; and (3) NSFW constraints in most of the evaluated T2I models can
be bypassed with implicit prompts. We call for increased attention to the
potential and risks of implicit prompts in the T2I community and further
investigation into the capabilities and impacts of implicit prompts, advocating
for a balanced approach that harnesses their benefits while mitigating their
risks. |
This paper introduces the concept of "implicit prompts" in text-to-image generation, which describe targets without directly naming them. It presents ImplicitBench, a benchmark to evaluate the capabilities and risks of T2I models in handling such prompts. |
Existing T2I benchmarks primarily focus on explicit prompts, neglecting the potential of implicit prompts to enhance creativity and the risks they pose to safety constraints. This work aims to bridge this gap and advocate for responsible development in the field. |
The authors curated ImplicitBench spanning three aspects: General Symbols, Celebrity Privacy, and NSFW Issues. They evaluated six popular T2I models on this benchmark using tailored evaluation methods, combining MLLMs, face recognition, and safety checkers. |
T2I models demonstrate a promising ability to interpret and generate images from implicit prompts, particularly for general symbols.
Implicit prompts can bypass safety filters, enabling the generation of content that infringes on celebrity privacy or falls under NSFW categories.
The risk of generating unsafe content through implicit prompts is amplified by the use of specific terminologies, detailed descriptions, and ambiguous language. |
The definition and scope of "implicit prompts" are still under exploration and require further refinement.
Developing robust safety mechanisms and policy constraints tailored for implicit prompts is crucial to mitigate potential risks. |
text-to-image generation, implicit prompts, benchmarking, safety constraints, ethical considerations |
2403.02084
Report |
ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models |
Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Min Zheng, Lean Fu |
Recent advancement in text-to-image models (e.g., Stable Diffusion) and
corresponding personalized technologies (e.g., DreamBooth and LoRA) enables
individuals to generate high-quality and imaginative images. However, they
often suffer from limitations when generating images with resolutions outside
of their trained domain. To overcome this limitation, we present the Resolution
Adapter (ResAdapter), a domain-consistent adapter designed for diffusion models
to generate images with unrestricted resolutions and aspect ratios. Unlike
other multi-resolution generation methods that process images of static
resolution with complex post-processing operations, ResAdapter directly generates
images with dynamic resolution. In particular, after learning pure resolution
priors, ResAdapter, trained on a general dataset, generates
resolution-free images with personalized diffusion models
while preserving their original style domain. Comprehensive experiments
demonstrate that ResAdapter with only 0.5M can process images with flexible
resolutions for arbitrary diffusion models. More extended experiments
demonstrate that ResAdapter is compatible with other modules (e.g., ControlNet,
IP-Adapter and LCM-LoRA) for image generation across a broad range of
resolutions, and can be integrated into other multi-resolution models (e.g.,
ElasticDiffusion) for efficiently generating higher-resolution images. Project
link is https://res-adapter.github.io |
This paper proposes ResAdapter, a plug-and-play adapter for diffusion models that enables generation of images with unrestricted resolutions and aspect ratios while preserving the original style domain. |
Current text-to-image models struggle to generate consistent images outside their trained resolution, impacting fidelity and composition. Existing methods are either computationally expensive or disrupt the original style domain of personalized models. |
ResAdapter utilizes ResCLoRA for resolution interpolation, dynamically matching the receptive field of convolution layers to feature map size. ResENorm addresses resolution extrapolation by adapting normalization layers to handle statistical distribution in higher-resolution images. It is trained on a mixed-resolution dataset with a sampling strategy favoring lower and higher resolutions. |
ResAdapter generates higher quality multi-resolution images compared to MultiDiffusion and ElasticDiffusion.
It significantly improves fidelity and composition in lower and higher resolution images compared to personalized models, without style domain transfer.
ResAdapter is compatible with other modules like ControlNet, IP-Adapter, and LCM-LoRA, and can optimize the generation efficiency of multi-resolution models like ElasticDiffusion. |
Failure cases are prominent with generic prompts on personalized models, potentially needing prompt correction using a large language model.
Future work could explore integrating super-resolution models for faster high-resolution image generation. |
diffusion models, resolution extrapolation, resolution interpolation, style domain consistency, text-to-image generation |
2403.01852
Report |
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis |
Zhengyao Lv, Yuxiang Wei, Wangmeng Zuo, Kwan-Yee K. Wong |
Recent advancements in large-scale pre-trained text-to-image models have led
to remarkable progress in semantic image synthesis. Nevertheless, synthesizing
high-quality images with consistent semantics and layout remains a challenge.
In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE)
that harnesses pre-trained models to alleviate the aforementioned issues.
Specifically, we first employ the layout control map to faithfully represent
layouts in the feature space. Subsequently, we combine the layout and semantic
features in a timestep-adaptive manner to synthesize images with realistic
details. During fine-tuning, we propose the Semantic Alignment (SA) loss to
further enhance layout alignment. Additionally, we introduce the Layout-Free
Prior Preservation (LFP) loss, which leverages unlabeled data to maintain the
priors of pre-trained models, thereby improving the visual quality and semantic
consistency of synthesized images. Extensive experiments demonstrate that our
approach performs favorably in terms of visual quality, semantic consistency,
and layout alignment. The source code and model are available at
https://github.com/cszy98/PLACE/tree/main. |
This paper proposes PLACE, an adaptive layout-semantic fusion module, to enhance the quality and layout consistency of images synthesized from semantic maps using pre-trained text-to-image diffusion models. |
Synthesizing high-quality images with consistent semantics and layout from semantic maps remains challenging for existing text-to-image synthesis models. |
PLACE leverages a layout control map for accurate layout representation and employs an adaptive fusion module to integrate layout and semantic features during image synthesis. It also introduces a semantic alignment loss and a layout-free prior preservation loss during fine-tuning. |
PLACE achieves state-of-the-art visual quality and semantic consistency scores on ADE20K and COCO-Stuff datasets.
It demonstrates superior performance in synthesizing out-of-distribution images with new objects, styles, and attributes.
The proposed layout control map, adaptive fusion module, and loss functions are shown to contribute to the performance improvements through ablation studies. |
The inference speed of PLACE is still slower than GAN-based methods, limited by the diffusion process.
Synthesizing images from long or uncommon prompts might result in inconsistency due to limitations of the pre-trained Stable Diffusion model. |
semantic image synthesis, layout control, text-to-image synthesis, diffusion models, adaptive fusion |
2403.01807
Report |
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models |
Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner |
3D asset generation is getting massive amounts of attention, inspired by the
recent success of text-guided 2D content creation. Existing text-to-3D methods
use pretrained text-to-image diffusion models in an optimization problem or
fine-tune them on synthetic data, which often results in non-photorealistic 3D
objects without backgrounds. In this paper, we present a method that leverages
pretrained text-to-image models as a prior, and learns to generate multi-view
images in a single denoising process from real-world data. Concretely, we
propose to integrate 3D volume-rendering and cross-frame-attention layers into
each block of the existing U-Net network of the text-to-image model. Moreover,
we design an autoregressive generation that renders more 3D-consistent images
at any viewpoint. We train our model on real-world datasets of objects and
showcase its capabilities to generate instances with a variety of high-quality
shapes and textures in authentic surroundings. Compared to the existing
methods, the results generated by our method are consistent, and have favorable
visual quality (-30% FID, -37% KID). |
The paper proposes a method to generate high-quality, multi-view consistent images of 3D objects in authentic surroundings using pretrained text-to-image diffusion models fine-tuned on real-world multi-view data. |
This approach bridges the gap between the diversity of text-to-3D methods and the photorealism of diffusion models trained on smaller 3D datasets, enabling the generation of realistic and diverse 3D assets. |
The method augments the U-Net architecture of pretrained text-to-image models with cross-frame-attention layers and projection layers to encode 3D knowledge and ensure consistency. It employs an autoregressive generation scheme to render images from any viewpoint, enabling novel view synthesis. |
The method significantly improves FID and KID scores compared to existing multi-view diffusion models, demonstrating higher visual quality and similarity to real images.
The generated images are 3D-consistent, allowing for smooth novel view synthesis and enabling the optimization of NeRF or NeuS representations.
The method retains the diversity of pretrained text-to-image models, allowing for controllable generation based on text descriptions and combining attributes in novel ways. |
Slight inconsistencies, like view-dependent lighting and sharpness variations, may occur due to the nature of the real-world training data.
The current work focuses on object-level generation; extending it to scene-scale generation is a potential future direction. |
text-to-3d, diffusion models, multi-view consistency, novel view synthesis, 3d asset generation |
2403.01800
Report |
AtomoVideo: High Fidelity Image-to-Video Generation |
Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng |
Recently, video generation has developed rapidly, building
on superior text-to-image generation techniques. In this work, we propose a
high fidelity framework for image-to-video generation, named AtomoVideo. Based
on multi-granularity image injection, we achieve higher fidelity of the
generated video to the given image. In addition, thanks to high quality
datasets and training strategies, we achieve greater motion intensity while
maintaining superior temporal consistency and stability. Our architecture
extends flexibly to the video frame prediction task, enabling long sequence
prediction through iterative generation. Furthermore, due to the design of
adapter training, our approach can be well combined with existing personalized
models and controllable modules. In quantitative and qualitative
evaluations, AtomoVideo achieves superior results compared to popular methods.
More examples can be found on our project website:
https://atomo-video.github.io/. |
Presents AtomoVideo, a high-fidelity image-to-video generation framework that prioritizes fidelity to the input image and generates videos with greater motion intensity while maintaining temporal consistency. |
Addresses the limitations of existing image-to-video generation methods, which struggle to balance fidelity to the input image with coherent motion in the video. |
Leverages a pre-trained text-to-image model, injecting image information at multiple levels: low-level details are concatenated with input noise, while high-level semantics are introduced through cross-attention. Employs zero terminal SNR and v-prediction during training to enhance stability. |
Achieves state-of-the-art performance on several image-to-video generation benchmarks, demonstrating high fidelity to the input image and superior motion intensity.
Demonstrates the flexibility to be combined with personalized text-to-image models, enabling diverse video styles.
Extends to long video generation through iterative frame prediction. |
Slight underperformance in image consistency and video quality compared to commercial methods, potentially due to the use of a fixed base model and resolution limitations.
Limited exploration of stylistic variations, focusing primarily on realistic videos. |
image-to-video generation, diffusion models, video synthesis, high-fidelity generation, temporal consistency |
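The AtomoVideo method summary mentions training with zero terminal SNR and v-prediction. A short sketch can show why the two are usually paired. It assumes the common variance-preserving parameterization (alpha squared plus sigma squared equals 1) and the standard v-target, v = alpha * eps - sigma * x0; the record above does not give the formula, so both are assumptions, and the function names are hypothetical.

```python
import math

def noisy_sample(x0, eps, alpha, sigma):
    # Forward diffusion: x_t = alpha * x0 + sigma * eps
    return [alpha * x + sigma * e for x, e in zip(x0, eps)]

def v_target(x0, eps, alpha, sigma):
    # Assumed v-prediction target: v = alpha * eps - sigma * x0
    return [alpha * e - sigma * x for x, e in zip(x0, eps)]

def x0_from_v(x_t, v, alpha, sigma):
    # With alpha^2 + sigma^2 = 1, the clean sample is recovered as
    # x0 = alpha * x_t - sigma * v
    return [alpha * xt - sigma * vi for xt, vi in zip(x_t, v)]

x0 = [0.5, -1.0, 2.0]
eps = [0.1, 0.3, -0.2]

# Intermediate timestep.
alpha, sigma = 0.6, 0.8
x_t = noisy_sample(x0, eps, alpha, sigma)
recovered = x0_from_v(x_t, v_target(x0, eps, alpha, sigma), alpha, sigma)

# Terminal timestep with zero SNR: the input is pure noise.
x_T = noisy_sample(x0, eps, 0.0, 1.0)
v_T = v_target(x0, eps, 0.0, 1.0)
```

At zero terminal SNR the network input `x_T` equals the noise itself, so a noise-prediction target would carry no information about the data, while the v-target reduces to the negated clean sample and remains a meaningful regression target.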
2403.01779
Report |
OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on |
Yuhao Xu, Tao Gu, Weifeng Chen, Chengcai Chen |
We present OOTDiffusion, a novel network architecture for realistic and
controllable image-based virtual try-on (VTON). We leverage the power of
pretrained latent diffusion models, designing an outfitting UNet to learn the
garment detail features. Without a redundant warping process, the garment
features are precisely aligned with the target human body via the proposed
outfitting fusion in the self-attention layers of the denoising UNet. In order
to further enhance the controllability, we introduce outfitting dropout to the
training process, which enables us to adjust the strength of the garment
features through classifier-free guidance. Our comprehensive experiments on the
VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently
generates high-quality try-on results for arbitrary human and garment images,
which outperforms other VTON methods in both realism and controllability,
indicating an impressive breakthrough in virtual try-on. Our source code is
available at https://github.com/levihsu/OOTDiffusion. |
Proposes OOTDiffusion, an LDM-based network architecture with a novel outfitting UNet for realistic and controllable virtual try-on. |
Image-based virtual try-on (VTON) is vital for e-commerce, but existing methods struggle to balance realism with preserving garment details. |
Leverages pretrained LDMs for realism, employs an outfitting UNet to learn garment features, uses outfitting fusion to align features, and introduces outfitting dropout for controllable generation. |
Achieves state-of-the-art performance on VITON-HD and Dress Code datasets, surpassing GAN-based and other LDM-based methods in realism and detail preservation.
Demonstrates superior generalization ability in cross-dataset evaluations.
Outfitting dropout with classifier-free guidance effectively controls garment feature strength. |
May not perform well for cross-category virtual try-on due to training on paired data.
Minor details in the original human image might be altered after the try-on process. |
virtual try-on, latent diffusion models, outfitting fusion, classifier-free guidance, image generation |
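The pairing of outfitting dropout with classifier-free guidance in the OOTDiffusion record above can be sketched as follows. The guidance formula is the standard classifier-free-guidance combination; using an all-zero vector as the "no garment" condition, and the function names, are assumptions made for illustration and may differ from the released code.

```python
import random

def outfitting_dropout(garment_features, drop_prob, rng):
    # Training time: with probability drop_prob, replace the garment
    # features by a null condition (assumed here to be all zeros), so
    # the denoiser also learns an unconditional prediction.
    if rng.random() < drop_prob:
        return [0.0] * len(garment_features)
    return list(garment_features)

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    # Inference time: standard classifier-free guidance,
    # eps = eps_uncond + s * (eps_cond - eps_uncond)
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
features = [0.2, -0.7, 1.3]
always_dropped = outfitting_dropout(features, 1.0, rng)
never_dropped = outfitting_dropout(features, 0.0, rng)

eps_uncond = [0.0, 1.0]   # prediction without garment features
eps_cond = [1.0, 3.0]     # prediction with garment features
weak = guided_noise(eps_uncond, eps_cond, 1.0)
strong = guided_noise(eps_uncond, eps_cond, 2.0)
```

A scale of 1 reproduces the garment-conditioned prediction, and larger scales extrapolate further away from the unconditional one, which is how the guidance scale adjusts the strength of the garment features.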
2403.01693
Report |
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances |
Supreeth Narasimhaswamy, Uttaran Bhattacharya, Xiang Chen, Ishita Dasgupta, Saayan Mitra, Minh Hoai |
Text-to-image generative models can generate high-quality humans, but realism
is lost when generating hands. Common artifacts include irregular hand poses,
shapes, incorrect numbers of fingers, and physically implausible finger
orientations. To generate images with realistic hands, we propose a novel
diffusion-based architecture called HanDiffuser that achieves realism by
injecting hand embeddings in the generative process. HanDiffuser consists of
two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and
MANO-Hand parameters from input text prompts, and a Text-Guided
Hand-Params-to-Image diffusion model to synthesize images by conditioning on
the prompts and hand parameters generated by the previous component. We
incorporate multiple aspects of hand representation, including 3D shapes and
joint-level finger positions, orientations and articulations, for robust
learning and reliable performance during inference. We conduct extensive
quantitative and qualitative experiments and perform user studies to
demonstrate the efficacy of our method in generating images with high-quality
hands. |
This paper introduces a novel text-to-image generation model that produces images with realistic hand appearances by incorporating SMPL-H parameters. |
Existing text-to-image generation models often struggle to depict hands accurately. This method aims to address this limitation and enhance the realism of generated images. |
The model utilizes a two-component system: (1) a diffusion model generating SMPL-H parameters from text prompts, and (2) a text-to-image generation model conditioned on both the text and generated SMPL-H parameters. |
The model demonstrates superior performance in generating realistic hand appearances compared to baseline models.
User studies confirm the plausibility and relevance of the generated hand poses.
The method allows for creative control over the generated image by modifying the SMPL-H parameters. |
The model may face challenges generating complex hand-object interactions due to the lack of object information in the first component.
Further investigation is needed to quantitatively evaluate the diversity of generated images. |
text-to-image generation, hand pose estimation, smpl-h, diffusion models, generative models |
2403.01643
Report |
You Need to Pay Better Attention |
Mehran Hosseini, Peyman Hosseini |
We introduce three new attention mechanisms that outperform standard
multi-head attention in terms of efficiency and learning capabilities, thereby
improving the performance and broader deployability of Transformer models. Our
first contribution is Optimised Attention, which performs similarly to standard
attention, but has 3/4 as many parameters and one matrix multiplication fewer
per head. Next, we introduce Efficient Attention, which performs on par with
standard attention with only 1/2 as many parameters and two
matrix multiplications fewer per head and is up to twice as fast as standard
attention. Lastly, we introduce Super Attention, which surpasses standard
attention by a significant margin in both vision and natural language
processing tasks while having fewer parameters and matrix multiplications. In
addition to providing rigorous mathematical comparisons, we evaluate the
presented attention mechanisms on MNIST, CIFAR100, IMDB Movie Reviews, and
Amazon Reviews datasets. |
The paper introduces three novel attention mechanisms: Optimised Attention, Efficient Attention, and Super Attention, designed to improve the efficiency and performance of Transformer models. |
Large language models, while powerful, pose challenges in terms of computational cost, memory footprint, and deployability on resource-constrained devices. This paper addresses these limitations by optimizing the core attention mechanism. |
The authors mathematically analyze the standard attention mechanism, identifying redundancies and proposing optimizations based on three key principles: combining consecutive linear transformations, leveraging single-head attention, and introducing kernels between inputs. The proposed mechanisms are evaluated on image classification (MNIST, CIFAR100) and text sentiment analysis (IMDB, Amazon Reviews) tasks. |
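The first principle above, combining consecutive linear transformations, can be illustrated with a minimal NumPy sketch. It assumes a single head with square `d x d` weight matrices; the function names and the merged matrix `W_vo` are hypothetical and are not taken from the paper. It shows only the algebraic identity `(A (X W_v)) W_o = A (X (W_v W_o))`, not the authors' exact Optimised Attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_head(X, W_q, W_k, W_v, W_o):
    # Scaled dot-product attention followed by an output projection.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (A @ V) @ W_o

def merged_head(X, W_q, W_k, W_vo):
    # Same computation with the value and output projections pre-multiplied
    # into one matrix (assumption: W_vo = W_v @ W_o, both square).
    Q, K = X @ W_q, X @ W_k
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ (X @ W_vo)

rng = np.random.default_rng(0)
n, d = 5, 8  # toy sequence length and model width
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
W_vo = W_v @ W_o  # one d x d matrix replaces two

out_standard = standard_head(X, W_q, W_k, W_v, W_o)
out_merged = merged_head(X, W_q, W_k, W_vo)
same = np.allclose(out_standard, out_merged)
```

In this toy setting the head stores three weight matrices instead of four and performs one matrix multiplication fewer, which is consistent with the "3/4 as many parameters" figure quoted in the abstract.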
Optimised Attention reduces the attention layer size by 25% and computational cost, performing similarly to standard attention.
Efficient Attention, with half the parameters of standard attention, achieves comparable performance while being up to twice as fast.
Super Attention outperforms standard attention in both vision and language tasks, while being 25% smaller and up to 45% faster for specific context sizes. |
The paper primarily focuses on classification tasks due to computational constraints.
Future work could explore the application and optimization of these attention mechanisms in more complex tasks beyond classification, such as language generation or object detection. |
attention mechanism, transformer, efficiency, deep learning, natural language processing |
2403.01560
Report |
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition |
Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng |
Contrastive Language-Image Pretraining (CLIP) has shown remarkable
open-vocabulary abilities across various image understanding tasks. Building
upon this impressive success, recent pioneer works have proposed to adapt the
powerful CLIP to video data, leading to efficient and effective video learners
for open-vocabulary action recognition. Inspired by the fact that humans
perform actions in diverse environments, our work delves into an intriguing
question: Can CLIP-based video learners effectively generalize to video domains
they have not encountered during training? To answer this, we establish a
CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and
conduct a comprehensive evaluation of five state-of-the-art CLIP-based video
learners under various types of domain gaps. Our evaluation demonstrates that
previous methods exhibit limited action recognition performance in unseen video
domains, revealing potential challenges of the cross-domain open-vocabulary
action recognition task. To address this task, our work focuses on a critical
challenge, namely scene bias, and we accordingly contribute a novel scene-aware
video-text alignment method. Our key idea is to distinguish video
representations apart from scene-encoded text representations, aiming to learn
scene-agnostic video representations for recognizing actions across domains.
Extensive experimental results demonstrate the effectiveness of our method. The
benchmark and code will be available at
https://github.com/KunyuLin/XOV-Action/. |
This work introduces XOV-Action, a benchmark for cross-domain open-vocabulary action recognition, and proposes SATA, a novel Scene-Aware video-Text Alignment method to improve performance on this task. |
Generalizing to unseen video domains is crucial for real-world action recognition applications, but existing CLIP-based video learners struggle with domain shifts. |
XOV-Action benchmark comprises two source datasets for training and four target datasets with various domain gaps for testing. SATA mitigates scene bias by distinguishing video representations from scene-encoded text representations during training, encouraging the model to focus on action information rather than scene details. |
Existing CLIP-based video learners show limited performance in cross-domain open-vocabulary action recognition.
SATA outperforms state-of-the-art methods on XOV-Action by mitigating scene bias effectively.
Analysis of SATA components demonstrates the importance of scene-aware losses and text-adaptive aggregation. |
The current SATA method primarily addresses scene bias, but other factors contributing to domain gaps remain unexplored.
Future work will focus on tackling cross-category generalization in cross-domain settings. |
action recognition, open vocabulary, domain generalization, clip, benchmark |
2403.01444
Report |
3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos |
Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, Wei Xing |
Constructing photo-realistic Free-Viewpoint Videos (FVVs) of dynamic scenes
from multi-view videos remains a challenging endeavor. Despite the remarkable
advancements achieved by current neural rendering techniques, these methods
generally require complete video sequences for offline training and are not
capable of real-time rendering. To address these constraints, we introduce
3DGStream, a method designed for efficient FVV streaming of real-world dynamic
scenes. Our method achieves fast on-the-fly per-frame reconstruction within 12
seconds and real-time rendering at 200 FPS. Specifically, we utilize 3D
Gaussians (3DGs) to represent the scene. Instead of the naive approach of
directly optimizing 3DGs per-frame, we employ a compact Neural Transformation
Cache (NTC) to model the translations and rotations of 3DGs, markedly reducing
the training time and storage required for each FVV frame. Furthermore, we
propose an adaptive 3DG addition strategy to handle emerging objects in dynamic
scenes. Experiments demonstrate that 3DGStream achieves competitive performance
in terms of rendering speed, image quality, training time, and model storage
when compared with state-of-the-art methods. |
This paper proposes 3DGStream, a novel method for efficient free-viewpoint video streaming of dynamic scenes. |
Constructing photo-realistic free-viewpoint videos of dynamic scenes is crucial for VR/AR/XR applications but remains challenging due to limitations in existing methods that require complete video sequences for training and lack real-time rendering capabilities. |
The method leverages 3D Gaussians (3DG) for scene representation and employs a two-stage per-frame training pipeline. Stage 1 uses a Neural Transformation Cache (NTC) to efficiently model 3DG transformations, while Stage 2 introduces an adaptive 3DG addition strategy to handle emerging objects. |
3DGStream achieves competitive performance in terms of image quality and model storage compared to state-of-the-art methods.
The method achieves fast on-the-fly per-frame reconstruction within 12 seconds.
3DGStream enables real-time rendering of free-viewpoint videos at 200 FPS. |
The quality of the initial frame reconstruction using 3DGs heavily influences the overall performance.
The limited number of training iterations for efficiency may restrict the modeling of drastic motions or complex emerging objects. |
free-viewpoint video, neural rendering, 3d gaussian splatting, dynamic scene reconstruction, real-time rendering |
2403.01427
Report |
Logit Standardization in Knowledge Distillation |
Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, Xiaochun Cao |
Knowledge distillation involves transferring soft labels from a teacher to a
student using a shared temperature-based softmax function. However, the
assumption of a shared temperature between teacher and student implies a
mandatory exact match between their logits in terms of logit range and
variance. This side-effect limits the performance of student, considering the
capacity discrepancy between them and the finding that the innate logit
relations of teacher are sufficient for student to learn. To address this
issue, we propose setting the temperature as the weighted standard deviation of
logit and performing a plug-and-play Z-score pre-process of logit
standardization before applying softmax and Kullback-Leibler divergence. Our
pre-process enables student to focus on essential logit relations from teacher
rather than requiring a magnitude match, and can improve the performance of
existing logit-based distillation methods. We also show a typical case where
the conventional setting of sharing temperature between teacher and student
cannot reliably yield the authentic distillation evaluation; nonetheless, this
challenge is successfully alleviated by our Z-score. We extensively evaluate
our method for various student and teacher models on CIFAR-100 and ImageNet,
showing its significant superiority. The vanilla knowledge distillation powered
by our pre-process can achieve favorable performance against state-of-the-art
methods, and other distillation variants can obtain considerable gain with the
assistance of our pre-process. |
This paper proposes a novel knowledge distillation (KD) method that employs a logit z-score standardization process as a pre-processing step before applying softmax and calculating the KL divergence loss. This approach addresses the limitations of conventional KD methods that enforce a strict match in logit magnitude between the teacher and student models. |
Conventional KD methods, relying on a shared temperature for teacher and student softmax functions, implicitly force an exact match between their logits, neglecting the capacity gap between them and the finding that preserving inter-class relations in logits suffices for effective knowledge transfer. This can hinder the student's performance. |
The authors first derive the softmax function in KD from the principle of entropy maximization, demonstrating that temperature values can be different for the teacher and student. They then propose a z-score standardization process on the logits before applying softmax, using weighted logit standard deviation as an adaptive temperature. This allows the student to learn the essential logit relations from the teacher without being constrained by magnitude matching. |
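The z-score pre-processing described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the base temperature `tau`, the stabilizer `eps`, and the function names are assumptions introduced here.

```python
import numpy as np

def z_score(logits, tau=2.0, eps=1e-7):
    # Standardize each logit vector, then divide by a base temperature.
    # The effective temperature is tau * std, so it adapts per sample.
    mean = logits.mean(axis=-1, keepdims=True)
    std = logits.std(axis=-1, keepdims=True)
    return (logits - mean) / (std + eps) / tau

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=2.0):
    # KL(teacher || student) computed on standardized logits.
    p_t = softmax(z_score(teacher_logits, tau))
    p_s = softmax(z_score(student_logits, tau))
    return np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()

teacher = np.array([[2.0, 1.0, 0.1, -1.0]])
# Same inter-class relations, different scale and shift.
student_scaled = 3.0 * teacher + 5.0
# Different relations: class order reversed.
student_reversed = teacher[:, ::-1]

loss_scaled = kd_loss(student_scaled, teacher)
loss_reversed = kd_loss(student_reversed, teacher)
```

Because standardization removes shift and scale, `loss_scaled` is essentially zero even though the raw logits differ in magnitude, while `loss_reversed` stays clearly positive. This mirrors the claim that the student only needs to match logit relations, not magnitudes.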
The proposed logit standardization method consistently improves the performance of various existing logit-based KD approaches on CIFAR-100 and ImageNet datasets.
Vanilla KD equipped with the proposed pre-processing achieves comparable results to state-of-the-art feature-based KD methods.
The method effectively addresses the issue of inauthentic evaluation of student performance caused by shared temperatures in conventional KD pipelines. |
The pre-processing necessitates a larger weight for the KD loss compared to the cross-entropy loss, potentially requiring further investigation and optimization.
Future work includes exploring the application of the proposed logit standardization pre-process in other areas like confidence calibration and uncertainty estimation. |
knowledge distillation, logit standardization, z-score, softmax temperature, deep neural networks |
2403.01422
Report |
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies |
Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, Tao Chen |
The development of multimodal models has marked a significant step forward in
how machines understand videos. These models have shown promise in analyzing
short video clips. However, when it comes to longer formats like movies, they
often fall short. The main hurdles are the lack of high-quality, diverse video
data and the intensive work required to collect or annotate such data. In the
face of these challenges, we propose MovieLLM, a novel framework designed to
create synthetic, high-quality data for long videos. This framework leverages
the power of GPT-4 and text-to-image models to generate detailed scripts and
corresponding visuals. Our approach stands out for its flexibility and
scalability, making it a superior alternative to traditional data collection
methods. Our extensive experiments validate that the data produced by MovieLLM
significantly improves the performance of multimodal models in understanding
complex video narratives, overcoming the limitations of existing datasets
regarding scarcity and bias. |
This paper presents MovieLLM, a novel framework for generating synthetic data to improve the understanding of long videos in multimodal models. |
Current multimodal models struggle with long videos due to the lack of high-quality, diverse, and extensive video data, which is difficult and expensive to collect and annotate. |
The framework uses GPT-4 to generate detailed movie plots and text-to-image models (guided by textual inversion for style consistency) to produce corresponding keyframes. This data is then used to fine-tune existing long-form video understanding models like LLaMA-VID. |
MovieLLM generates consistent and high-quality keyframes, outperforming existing multi-concept customization methods in terms of frame consistency, text-image alignment, and image quality.
Models trained on MovieLLM's synthetic data show significant performance improvements in both short and long video understanding tasks, including zero-shot question answering and comprehension of video overview, plot, and temporal aspects.
A new benchmark for long-form video understanding is proposed based on the MovieNet database and human-generated question-answer pairs. |
The forgetting issue inherent in LLMs might lead to inconsistencies in the generated frame descriptions and discontinuities in video scenes.
Future work will focus on refining the text generation component to address this limitation. |
multimodal learning, video understanding, synthetic data generation, large language models, text-to-image synthesis |
2403.01306
Report |
ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation |
Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes |
Web-scale training on paired text-image data is becoming increasingly central
to multimodal learning, but is challenged by the highly noisy nature of
datasets in the wild. Standard data filtering approaches succeed in removing
mismatched text-image pairs, but permit semantically related but highly
abstract or subjective text. These approaches lack the fine-grained ability to
isolate the most concrete samples that provide the strongest signal for
learning in a noisy dataset. In this work, we propose a new metric, image
caption concreteness, that evaluates caption text without an image reference to
measure its concreteness and relevancy for use in multimodal learning. Our
approach leverages strong foundation models for measuring visual-semantic
information loss in multimodal representations. We demonstrate that this
strongly correlates with human evaluation of concreteness in both single-word
and sentence-level texts. Moreover, we show that curation using ICC complements
existing approaches: It succeeds in selecting the highest quality samples from
multimodal web-scale datasets to allow for efficient training in
resource-constrained settings. |
This paper introduces Image Caption Concreteness (ICC), a novel metric designed to assess the visual concreteness of image captions without relying on image references. |
Web-scale datasets used for training multimodal models often contain noisy and abstract captions that hinder effective learning. Existing filtering methods struggle to identify these problematic captions while maintaining semantically relevant ones. ICC addresses this challenge by focusing on caption concreteness, a crucial aspect for effective multimodal learning. |
ICC leverages the capabilities of large foundation models through two autoencoding pipelines: a Visual-Bottleneck Autoencoder (VBA) utilizing a text-to-image model and a captioning model, and a Semantic-Bottleneck Autoencoder (SBA) employing CLIP text embeddings and a large language model. These pipelines' reconstruction scores are then distilled into a smaller language model, enabling efficient ICC score generation for new text. |
ICC demonstrates superior performance in curating high-quality image-caption pairs from large datasets compared to existing filtering methods, leading to improved performance in downstream tasks like image captioning and representation learning.
There is a strong correlation observed between ICC scores and human judgments of concreteness for both single-word and sentence-level texts, highlighting its effectiveness in capturing human intuition about visual concreteness.
Combining both VBA and SBA pipelines proves crucial for ICC's effectiveness, as each approach compensates for the other's weaknesses in accurately identifying abstract or concrete captions. |
ICC may not be sensitive enough to grammatical inconsistencies in captions, potentially assigning high scores to poorly structured but semantically concrete sentences. This limitation could be addressed by training the distillation model on a more diverse range of caption styles.
The experiments were conducted on a relatively small dataset due to computational limitations. Future work could explore the impact of scaling up the dataset size and evaluating ICC's efficacy on a wider array of downstream tasks like VQA and caption ranking. |
multimodal learning, dataset curation, text concreteness, image captioning, representation learning |
2403.01212
Report |
TCIG: Two-Stage Controlled Image Generation with Quality Enhancement through Diffusion |
Salaheldin Mohamed |
In recent years, significant progress has been made in the development of
text-to-image generation models. However, these models still face limitations
when it comes to achieving full controllability during the generation process.
Often, specific training or the use of limited models is required, and even
then, they have certain restrictions. To address these challenges, a two-stage
method that effectively combines controllability and high quality in the
generation of images is proposed. This approach leverages the expertise of
pre-trained models to achieve precise control over the generated images, while
also harnessing the power of diffusion models to achieve state-of-the-art
quality. By separating controllability from high quality, this method achieves
outstanding results. It is compatible with both latent and image space
diffusion models, ensuring versatility and flexibility. Moreover, this approach
consistently produces comparable outcomes to the current state-of-the-art
methods in the field. Overall, this proposed method represents a significant
advancement in text-to-image generation, enabling improved controllability
without compromising on the quality of the generated images. |
This paper presents TCIG, a novel two-stage method for controllable text-to-image generation that leverages pre-trained models (segmentation and diffusion) without requiring training or fine-tuning. |
Existing text-to-image generation models often lack full controllability, struggling to incorporate user preferences beyond textual prompts. Existing solutions often involve costly training, fine-tuning, or are limited by model architectures. |
TCIG first generates a controlled image based on input segmentation masks and text using a pre-trained VQGAN guided by a CLIP network and segmentation models. The second stage refines the image for quality and detail using a pre-trained diffusion model (Img-to-Img). |
TCIG allows flexible and controllable image generation with diverse outputs.
Quantitative evaluation on the COCO dataset shows TCIG outperforms existing methods in terms of IoU.
Qualitative comparison highlights TCIG's superior adherence to input masks compared to other models. |
The development of this method faced limitations due to the computational power of GPUs.
Future work can explore separating control from high-quality image generation further. |
image generation, controllable generation, text-to-image, diffusion models, segmentation |
2403.01124
Report |
Text-guided Explorable Image Super-resolution |
Kanchana Vaishnavi Gandikota, Paramanand Chandramouli |
In this paper, we introduce the problem of zero-shot text-guided exploration
of the solutions to open-domain image super-resolution. Our goal is to allow
users to explore diverse, semantically accurate reconstructions that preserve
data consistency with the low-resolution inputs for different large
downsampling factors without explicitly training for these specific
degradations. We propose two approaches for zero-shot text-guided
super-resolution - i) modifying the generative process of text-to-image
(T2I) diffusion models to promote consistency with low-resolution
inputs, and ii) incorporating language guidance into zero-shot diffusion-based
restoration methods. We show that the proposed approaches result in diverse
solutions that match the semantic meaning provided by the text prompt while
preserving data consistency with the degraded inputs. We evaluate the proposed
baselines for the task of extreme super-resolution and demonstrate advantages
in terms of restoration quality, diversity, and explorability of solutions. |
This paper introduces zero-shot text-guided exploration of solutions for open-domain image super-resolution, enabling users to explore diverse, semantically accurate reconstructions that are consistent with low-resolution inputs using text prompts. |
This is important because it allows for more intuitive and flexible control over the super-resolution process, especially for high upscaling factors where the problem is ill-posed and has many possible solutions. |
The authors propose two approaches: 1) modifying the generative process of text-to-image (T2I) diffusion models to promote consistency with low-resolution inputs, and 2) incorporating language guidance into zero-shot diffusion-based restoration methods using CLIP. |
Text-guided super-resolution methods achieve comparable performance to specialized models trained on faces for neutral prompts.
Text-guided methods significantly improve image quality and semantic matching on open-domain images compared to unconditional diffusion models.
User studies show that Imagen-DDNM and unCLIP-DDNM produce more realistic and semantically consistent results compared to CLIP-guided DDNM. |
Generating realistic images consistently can be challenging, requiring multiple attempts to achieve the desired output.
The performance of all methods depends on the generative capabilities of the pre-trained generative model and inherits its biases. |
image super-resolution, text-guided image generation, diffusion models, zero-shot learning, clip |
2403.00939
Report |
G3DR: Generative 3D Reconstruction in ImageNet |
Pradyumna Reddy, Ismail Elezi, Jiankang Deng |
We introduce a novel 3D generative method, Generative 3D Reconstruction
(G3DR) in ImageNet, capable of generating diverse and high-quality 3D objects
from single images, addressing the limitations of existing methods. At the
heart of our framework is a novel depth regularization technique that enables
the generation of scenes with high-geometric fidelity. G3DR also leverages a
pretrained language-vision model, such as CLIP, to enable reconstruction in
novel views and improve the visual realism of generations. Additionally, G3DR
designs a simple but effective sampling procedure to further improve the
quality of generations. G3DR offers diverse and efficient 3D asset generation
based on class or text conditioning. Despite its simplicity, G3DR is able to
beat state-of-the-art methods, improving over them by up to 22% in perceptual
metrics and 90% in geometry scores, while needing only half of the training
time. Code is available at https://github.com/preddy5/G3DR |
Introduces G3DR, a novel 3D generative method that generates diverse and high-quality 3D objects from single images in ImageNet, addressing limitations of existing methods. |
3D asset generation is crucial for various applications, and G3DR enables this from diverse, unaligned datasets like ImageNet. |
Combines a latent diffusion model with a conditional triplane generator and a novel depth regularization technique to ensure geometric fidelity and improve visual realism. |
Achieves state-of-the-art results on ImageNet, improving FID score by 22% and Inception Score by 21.5% over previous methods.
Significantly outperforms competing methods in geometry evaluation, almost doubling the Non-Flatness Score and achieving better depth accuracy.
Demonstrates strong performance in fine-grained datasets and generalizes well to text-conditioned generation and out-of-domain examples. |
Reliance on pseudo-ground truth depth maps from an off-the-shelf estimator may limit geometry accuracy.
Exploring alternative novel view supervision methods beyond CLIP could further enhance generation quality. |
3d generation, imagenet, depth regularization, single-view reconstruction, generative models |
2403.00835
Report |
CLLMs: Consistency Large Language Models |
Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang |
Parallel decoding methods such as Jacobi decoding show promise for more
efficient LLM inference as it breaks the sequential nature of the LLM decoding
process and transforms it into parallelizable computation. However, in
practice, it achieves little speedup compared to traditional autoregressive
(AR) decoding, primarily because Jacobi decoding seldom accurately predicts
more than one token in a single fixed-point iteration step. To address this, we
develop a new approach aimed at realizing fast convergence from any state to
the fixed point on a Jacobi trajectory. This is accomplished by refining the
target LLM to consistently predict the fixed point given any state as input.
Extensive experiments demonstrate the effectiveness of our method, showing
2.4x to 3.4x improvements in generation speed while preserving
generation quality across both domain-specific and open-domain benchmarks. |
This paper proposes Consistency Large Language Models (CLLMs), a new method to adapt large language models (LLMs) for fast parallel decoding with Jacobi iteration. |
Traditional autoregressive decoding in LLMs leads to high latency, especially for lengthy responses. Existing solutions often require additional models or architectural changes with significant overhead. CLLMs address these limitations, achieving significant speedup with minimal performance degradation and without extra models or architectural modifications. |
CLLMs are trained on Jacobi trajectory datasets collected from a target LLM, employing a consistency loss to encourage the prediction of multiple correct tokens in each Jacobi iteration. This approach is inspired by consistency models used for accelerating diffusion models. |
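The Jacobi iteration that CLLMs build on can be sketched with a toy example. The `next_token` function below is a hypothetical deterministic stand-in for greedy decoding with an LLM, and all names are illustrative assumptions. The sketch shows only plain Jacobi decoding and its fixed point; it does not include the consistency-loss training.

```python
def next_token(prefix):
    # Hypothetical stand-in for a greedy LLM step: maps a prefix to one token id.
    return (sum(prefix) * 31 + len(prefix) * 7) % 50

def autoregressive_decode(prompt, n):
    # Baseline: one token per step, each step depends on the previous one.
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, init_token=0):
    # Start from an arbitrary guess for all n tokens and refine it.
    guess = [init_token] * n
    iterations = 0
    while True:
        iterations += 1
        # Every position reads only the previous iterate, so these n calls
        # are independent and could run in parallel in a real model.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:
            # Fixed point reached: no token changed in this iteration.
            return guess, iterations
        guess = new

prompt = [3, 14, 15]
n = 8
ar_tokens = autoregressive_decode(prompt, n)
jacobi_tokens, iterations = jacobi_decode(prompt, n)
```

The fixed point equals the autoregressive output, and it is reached in at most `n + 1` iterations because each iteration fixes at least one more leading token. With an unadapted model the iteration count tends to stay near that bound, which is the limited-speedup problem the abstract describes; the consistency training aims to make several tokens converge per iteration.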
CLLMs achieve 2.4x to 3.4x speedup with Jacobi decoding on various tasks, including GSM8K, CodeSearchNet Python, Spider, and MT-bench.
The acceleration is attributed to the *fast-forwarding* phenomenon, where CLLMs can predict multiple consecutive tokens correctly in a single iteration, and the emergence of *stationary tokens*, which are predicted correctly early on and remain unchanged despite preceding incorrect tokens.
CLLMs show advantages over existing methods like speculative decoding and Medusa with higher adaptability to existing LLMs and lower memory consumption. |
The efficiency of CLLMs heavily relies on the quality and size of the Jacobi trajectory dataset.
Current training procedure of CLLMs introduces extra overhead for collecting Jacobi trajectory dataset. |
large language models, efficient inference, parallel decoding, jacobi iteration, consistency models |
2403.00818
Report |
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models |
Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, Yunhe Wang |
Large language models (LLMs) face a daunting challenge due to the excessive
computational and memory requirements of the commonly used Transformer
architecture. While state space model (SSM) is a new type of foundational
network architecture offering lower computational complexity, their performance
has yet to fully rival that of Transformers. This paper introduces DenseSSM, a
novel approach to enhance the flow of hidden information between layers in
SSMs. By selectively integrating shallow-layer hidden states into deeper layers,
DenseSSM retains fine-grained information crucial for the final output. DenseSSM,
enhanced with dense connections, still maintains the training parallelizability
and inference efficiency. The proposed method can be widely applicable to
various SSM types like RetNet and Mamba. With similar model size, DenseSSM
achieves significant improvements, exemplified by DenseRetNet outperforming the
original RetNet with up to 5% accuracy improvement on public benchmarks. Code
is available at https://github.com/WailordHe/DenseSSM |
This paper introduces DenseSSM, a novel approach for enhancing state space models (SSMs) by selectively integrating hidden states from shallow layers into deeper layers to improve information flow and retain fine-grained information. |
Large language models (LLMs) based on Transformers face challenges with computational and memory demands. While SSMs offer lower complexity, their performance hasn't matched Transformers. DenseSSM aims to bridge this performance gap by enhancing information flow in SSMs. |
DenseSSM addresses hidden state degradation in deeper SSM layers by: 1) Collecting and projecting shallow layer hidden states to the target layer's subspace using a selective transition module. 2) Fusing these projected hidden states with the target layer's hidden state using a hidden fusion module. |
DenseSSM significantly improves the performance of various SSM architectures like RetNet and Mamba.
DenseRetNet, based on DenseSSM, outperforms the original RetNet by up to 5% accuracy on public benchmarks.
DenseSSM maintains the training parallelizability and inference efficiency of SSMs while achieving these improvements. |
The paper primarily focuses on evaluating DenseSSM on language modeling tasks, leaving exploration in other domains for future work.
Further investigation into different implementations of the selective transition and hidden fusion modules could yield additional performance gains. |
state space models, large language models, deep learning, natural language processing, dense connections |
2403.00762
Report |
Point Cloud Mamba: Point Cloud Learning via State Space Model |
Tao Zhang, Xiangtai Li, Haobo Yuan, Shunping Ji, Shuicheng Yan |
In this work, for the first time, we demonstrate that Mamba-based point cloud
methods can outperform point-based methods. Mamba exhibits strong global
modeling capabilities and linear computational complexity, making it highly
attractive for point cloud analysis. To enable more effective processing of 3-D
point cloud data by Mamba, we propose a novel Consistent Traverse Serialization
to convert point clouds into 1-D point sequences while ensuring that
neighboring points in the sequence are also spatially adjacent. Consistent
Traverse Serialization yields six variants by permuting the order of x, y, and
z coordinates, and the synergistic use of these variants aids Mamba in
comprehensively observing point cloud data. Furthermore, to assist Mamba in
handling point sequences with different orders more effectively, we introduce
point prompts to inform Mamba of the sequence's arrangement rules. Finally, we
propose positional encoding based on spatial coordinate mapping to inject
positional information into point cloud sequences better. Based on these
improvements, we construct a point cloud network named Point Cloud Mamba, which
combines local and global modeling. Point Cloud Mamba surpasses the SOTA
point-based method PointNeXt and achieves new SOTA performance on the
ScanObjectNN, ModelNet40, and ShapeNetPart datasets. |
This paper introduces Point Cloud Mamba (PCM), a novel framework for point cloud learning that leverages the strengths of state space models, specifically Mamba, to achieve global feature modeling with linear computational complexity. |
While state space models like Mamba have shown promise in sequence modeling, their application to 3D point cloud analysis remained unexplored. This paper bridges that gap by demonstrating the effectiveness of Mamba for point cloud tasks, offering a compelling alternative to computationally expensive Transformer-based methods. |
PCM employs a novel Consistent Traverse Serialization (CTS) method to convert 3D point clouds into 1D sequences suitable for Mamba. It introduces 'order prompts' to help Mamba discern the arrangement of points in sequences generated by different CTS variants and utilizes a spatial coordinate mapping-based positional encoding scheme. The overall architecture combines local geometric feature extraction with global modeling using Mamba layers. |
PCM surpasses the state-of-the-art point-based method PointNeXt on ScanObjectNN, ModelNet40, and ShapeNetPart datasets.
The Consistent Traverse Serialization strategy, combined with multiple serialization orders, is shown to be crucial for capturing spatial relationships within point clouds.
Order prompts and spatial coordinate mapping-based positional encoding significantly contribute to PCM's performance. |
For large-scale point clouds, the scan-based training of Mamba limits its applicability, necessitating point cloud cropping and creating a discrepancy between training and inference.
The throughput of PCM, while showing promise, is currently lower than PointMLP due to the computational overhead of multiple reorderings. |
3d point cloud, state space models, mamba, point cloud classification, point cloud segmentation |
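To make the Consistent Traverse Serialization idea in the Point Cloud Mamba report above concrete, here is a minimal sketch, not the authors' implementation. It assumes points are normalized to [0, 1)^3 and quantized onto a hypothetical n x n x n grid, and it uses a serpentine (boustrophedon) walk so that consecutive grid cells are always spatially adjacent. The names `serpentine_key`, `serialize`, and `all_variants` are illustrative only.

```python
from itertools import permutations


def serpentine_key(cell, n, order=(0, 1, 2)):
    """Rank of a grid cell along a serpentine walk over an n*n*n grid.

    `order` picks which axis varies slowest, middle, and fastest, so the
    six permutations of (0, 1, 2) give six serialization variants.
    """
    a, b, c = (cell[i] for i in order)
    # Reverse the middle axis on every other slab so the walk never jumps.
    b_snake = b if a % 2 == 0 else n - 1 - b
    t = a * n + b_snake  # position of the (a, b) column in the 2-D snake
    # Reverse the fastest axis on every other column for the same reason.
    c_snake = c if t % 2 == 0 else n - 1 - c
    return t * n + c_snake


def serialize(points, n, order=(0, 1, 2)):
    """Return point indices sorted along the serpentine walk (assumed inputs in [0, 1))."""
    def cell(p):
        return tuple(min(int(v * n), n - 1) for v in p)
    return sorted(range(len(points)),
                  key=lambda i: serpentine_key(cell(points[i]), n, order))


def all_variants(points, n):
    """One index sequence per axis permutation (six in total)."""
    return {order: serialize(points, n, order)
            for order in permutations(range(3))}
```

Feeding several of the six sequences to the sequence model is what the report describes as the synergistic use of variants; how they are combined inside the network is not shown here.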
2403.00729
Report |
Can Transformers Capture Spatial Relations between Objects? |
Chuan Wen, Dinesh Jayaraman, Yang Gao |
Spatial relationships between objects represent key scene information for
humans to understand and interact with the world. To study the capability of
current computer vision systems to recognize physically grounded spatial
relations, we start by proposing precise relation definitions that permit
consistently annotating a benchmark dataset. Despite the apparent simplicity of
this task relative to others in the recognition literature, we observe that
existing approaches perform poorly on this benchmark. We propose new approaches
exploiting the long-range attention capabilities of transformers for this task,
and evaluate key design principles. We identify a simple "RelatiViT"
architecture and demonstrate that it outperforms all current approaches. To our
knowledge, this is the first method to convincingly outperform naive baselines
on spatial relation prediction in in-the-wild settings. The code and datasets
are available at https://sites.google.com/view/spatial-relation. |
This paper introduces RelatiViT, a novel transformer-based architecture designed for precise and physically grounded spatial relation prediction in computer vision. |
Recognizing spatial relations between objects is crucial for scene understanding and robot manipulation, but existing methods struggle to surpass naive bounding-box-based baselines. |
The authors systematically explore different transformer designs, focusing on feature extraction, query localization, context aggregation, and pair interaction. They benchmark these designs on Rel3D and a refined version of SpatialSense, called SpatialSense+, featuring precise relation definitions and annotations. |
RelatiViT significantly outperforms all existing methods, including naive baselines and adapted visual relation detection models.
RelatiViT effectively leverages visual information, outperforming baselines on relations requiring depth, pose, and shape understanding.
Ablation studies confirm the importance of feature extraction, context aggregation, and pair interaction in RelatiViT's performance. |
The current study primarily focuses on pairwise relations and doesn't explicitly address higher-order relationships.
Future work could explore incorporating depth information or 3D object representations to further improve performance. |
spatial relation prediction, vision transformer, computer vision, scene understanding, benchmarking |
2403.00712
Report |
Rethinking Inductive Biases for Surface Normal Estimation |
Gwangbin Bae, Andrew J. Davison |
Despite the growing demand for accurate surface normal estimation models,
existing methods use general-purpose dense prediction models, adopting the same
inductive biases as other tasks. In this paper, we discuss the inductive biases
needed for surface normal estimation and propose to (1) utilize the per-pixel
ray direction and (2) encode the relationship between neighboring surface
normals by learning their relative rotation. The proposed method can generate
crisp - yet, piecewise smooth - predictions for challenging in-the-wild images
of arbitrary resolution and aspect ratio. Compared to a recent ViT-based
state-of-the-art model, our method shows a stronger generalization ability,
despite being trained on a dataset that is orders of magnitude smaller. The code is
available at https://github.com/baegwangbin/DSINE. |
This paper introduces a new method for single-image surface normal estimation that encodes per-pixel ray direction and models the pairwise relative rotation between nearby pixels. |
Existing surface normal estimation models rely on general-purpose dense prediction models, neglecting task-specific inductive biases. This limits accuracy, especially for images with out-of-distribution camera intrinsics. |
The proposed method encodes camera intrinsics via ray direction and utilizes a ray direction-based activation function for visibility. It recasts normal estimation as rotation estimation, learning relative rotations between neighboring pixels for piecewise smooth predictions. |
The method demonstrates strong generalization ability, outperforming state-of-the-art methods on challenging datasets with diverse scenes and camera intrinsics.
It achieves high sample efficiency, requiring a training dataset orders of magnitude smaller than that of a recent ViT-based state-of-the-art model.
Qualitative results showcase superior detail and sharpness, particularly near object boundaries. |
The method assumes prior knowledge of camera intrinsics, limiting its applicability to images without such information.
Future work could explore joint estimation of camera intrinsics and surface normals for improved generalization to in-the-wild images. |
surface normal estimation, inductive bias, ray direction, rotation estimation, piecewise smooth |
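The per-pixel ray direction used in the surface normal report above follows from the standard pinhole camera model. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a pinhole camera looking along +z with intrinsics `fx`, `fy`, `cx`, `cy`, and the helper names `pixel_ray` and `faces_camera` are hypothetical. The visibility check only illustrates the constraint that a visible surface must face the camera; the paper's ray direction-based activation function is not reproduced.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit ray through pixel (u, v): K^-1 [u, v, 1]^T, normalized."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def faces_camera(normal, ray):
    """True if the normal points back toward the camera along this ray.

    Assumed convention: the camera looks along +z, so a visible surface
    has a normal whose dot product with the viewing ray is negative.
    """
    return sum(n * r for n, r in zip(normal, ray)) < 0.0
```

Because the ray depends on the intrinsics, two images of the same scene taken with different focal lengths yield different per-pixel inputs, which is the information the report says general-purpose dense predictors lack.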
2403.00644
Report |
Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks |
Yuhao Liu, Zhanghan Ke, Fang Liu, Nanxuan Zhao, Rynson W. H. Lau |
Diffusion models trained on large-scale datasets have achieved remarkable
progress in image synthesis. However, due to the randomness in the diffusion
process, they often struggle with handling diverse low-level tasks that require
detail preservation. To overcome this limitation, we present a new Diff-Plugin
framework to enable a single pre-trained diffusion model to generate
high-fidelity results across a variety of low-level tasks. Specifically, we
first propose a lightweight Task-Plugin module with a dual branch design to
provide task-specific priors, guiding the diffusion process in preserving image
content. We then propose a Plugin-Selector that can automatically select
different Task-Plugins based on the text instruction, allowing users to edit
images by indicating multiple low-level tasks with natural language. We conduct
extensive experiments on 8 low-level vision tasks. The results demonstrate the
superiority of Diff-Plugin over existing methods, particularly in real-world
scenarios. Our ablations further validate that Diff-Plugin is stable,
schedulable, and supports robust training across different dataset sizes. |
This paper proposes Diff-Plugin, a novel framework that enhances pre-trained diffusion models for handling various low-level vision tasks requiring stringent detail preservation. |
Existing diffusion models struggle with detail preservation in low-level vision tasks due to the randomness in the diffusion process. Diff-Plugin addresses this by incorporating task-specific priors without retraining the entire model. |
Diff-Plugin consists of a lightweight, dual-branch Task-Plugin module to inject task-specific priors into the diffusion process and a Plugin-Selector that allows users to choose the desired Task-Plugin via text input. |
Diff-Plugin demonstrates superior performance over existing diffusion and regression-based methods, particularly in real-world scenarios.
The framework is flexible and scalable, adapting to new tasks and datasets without affecting existing trained plugins.
User studies confirm a preference for Diff-Plugin's output quality and content consistency. |
A current limitation is the inability to perform local editing.
Future work will explore integrating LLMs for region-specific task application. |
diffusion models, low-level vision, image editing, task-specific priors, text-driven editing |
2403.00587
Report |
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset |
Ander Salaberria, Gorka Azkune, Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre, Frank Keller |
Existing work has observed that current text-to-image systems do not
accurately reflect explicit spatial relations between objects such as 'left of'
or 'below'. We hypothesize that this is because explicit spatial relations
rarely appear in the image captions used to train these models. We propose an
automatic method that, given existing images, generates synthetic captions that
contain 14 explicit spatial relations. We introduce the Spatial Relation for
Generation (SR4G) dataset, which contains 9.9 million image-caption pairs for
training, and more than 60 thousand captions for evaluation. In order to test
generalization we also provide an 'unseen' split, where the set of objects in
the train and test captions are disjoint. SR4G is the first dataset that can be
used to spatially fine-tune text-to-image systems. We show that fine-tuning two
different Stable Diffusion models (denoted as SD$_{SR4G}$) yields improvements
of up to 9 points in the VISOR metric. The improvement holds in the 'unseen'
split, showing that SD$_{SR4G}$ is able to generalize to unseen objects.
SD$_{SR4G}$ improves the state-of-the-art with fewer parameters, and avoids
complex architectures. Our analysis shows that improvement is consistent for
all relations. The dataset and the code will be publicly available. |
This paper introduces SR4G, a new synthetic dataset for training and evaluating the ability of text-to-image models to understand and generate images from textual descriptions containing explicit spatial relations. |
Current text-to-image systems struggle to accurately represent explicit spatial relations, limiting their use in applications like text-based image editing. This is mainly because training datasets lack captions with explicit spatial relations. |
SR4G leverages object annotations from the COCO dataset and heuristic rules to automatically generate synthetic captions containing 14 explicit spatial relations, paired with real images. |
Fine-tuning Stable Diffusion models on SR4G leads to significant improvements in spatial relation understanding, as measured by the VISOR metric.
The fine-tuned models generalize to unseen objects, indicating a deeper understanding of spatial relations beyond object-specific correlations.
SR4G enables fine-tuned Stable Diffusion models to outperform larger and more complex state-of-the-art pipeline models in spatial relation generation. |
SR4G currently only supports English captions, limiting its applicability to other languages.
The dataset focuses on unambiguous spatial relations defined over bounding box information, excluding orientation and 3D relations. |
text-to-image generation, spatial relations, synthetic datasets, stable diffusion, computer vision |
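The SR4G report above describes deriving captions from bounding-box annotations with heuristic rules. A minimal sketch of that idea follows. It covers only four relations out of the 14, and the exact rules, thresholds, and caption templates are assumptions made for illustration, not the dataset's actual definitions. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` in image coordinates with y growing downward.

```python
def spatial_captions(objects):
    """Generate relation captions for every ordered pair of annotated objects.

    objects: list of (label, (x_min, y_min, x_max, y_max)).
    A relation is emitted only when the boxes are fully separated along
    the relevant axis, which keeps the relation unambiguous.
    """
    captions = []
    for i, (name_a, a) in enumerate(objects):
        for j, (name_b, b) in enumerate(objects):
            if i == j:
                continue
            if a[2] < b[0]:  # A ends before B starts horizontally
                captions.append(f"{name_a} to the left of {name_b}")
            if a[0] > b[2]:  # A starts after B ends horizontally
                captions.append(f"{name_a} to the right of {name_b}")
            if a[3] < b[1]:  # A ends before B starts vertically
                captions.append(f"{name_a} above {name_b}")
            if a[1] > b[3]:  # A starts after B ends vertically
                captions.append(f"{name_a} below {name_b}")
    return captions
```

Overlapping boxes produce no caption under these rules, which mirrors the report's note that the dataset restricts itself to unambiguous relations defined over bounding boxes.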
2403.00522
Report |
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks |
Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen |
Large language models are built on top of a transformer-based architecture to
process textual inputs. For example, LLaMA stands out among many
open-source implementations. Can the same transformer be used to process 2D
images? In this paper, we answer this question by unveiling a LLaMA-like vision
transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored
for this purpose. VisionLLaMA is a unified and generic modelling framework for
solving most vision tasks. We extensively evaluate its effectiveness using
typical pre-training paradigms in a good portion of downstream tasks of image
perception and especially image generation. In many cases, VisionLLaMA has
exhibited substantial gains over the previous state-of-the-art vision
transformers. We believe that VisionLLaMA can serve as a strong new baseline
model for vision generation and understanding. Our code will be released at
https://github.com/Meituan-AutoML/VisionLLaMA. |
The paper proposes VisionLLaMA, a vision transformer architecture inspired by the LLaMA architecture for large language models, aiming to bridge the architectural gap between vision and language modalities. |
The success of LLaMA in NLP motivates the exploration of a similar architecture for vision, potentially enabling unified architectures and shared deployment techniques for both modalities. |
The paper adapts the LLaMA architecture to process 2D images, investigates plain and pyramid transformer variants, and introduces AS2DRoPE (auto-scaled 2D RoPE) to handle variable input resolutions. |
VisionLLaMA significantly outperforms DiT and SiT, state-of-the-art vision transformers for image generation, across various model sizes and evaluation metrics.
In image classification, VisionLLaMA achieves competitive performance compared to DeiT3 and Twins under both supervised and self-supervised training settings.
VisionLLaMA demonstrates superiority in downstream tasks like semantic segmentation (ADE20K) and object detection (COCO), outperforming Swin and Twins in terms of mIoU and mAP. |
The paper primarily focuses on square image inputs, leaving the exploration of arbitrary aspect ratios as future work.
Further investigation into combining VisionLLaMA with modality-specific components like PEG is needed to maximize its potential. |
vision transformer, llama, image generation, image classification, positional encoding |
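As a rough illustration of the auto-scaled 2D RoPE mentioned in the VisionLLaMA report above, the sketch below rotates half of the channels by the x coordinate and half by the y coordinate, and rescales coordinates so that a larger grid maps onto an anchor grid. This is one plausible reading, not the paper's implementation: the channel split, the `base` frequency, and the helper names `rope_2d` and `auto_scale` are all assumptions.

```python
import math


def rope_2d(vec, x, y, base=10000.0, scale=1.0):
    """Rotate a feature vector by its 2-D position.

    The first half of the channels is rotated by the x coordinate and the
    second half by the y coordinate. `scale` rescales coordinates, which is
    how this sketch imitates auto-scaling to an anchor resolution.
    """
    d = len(vec)
    assert d % 4 == 0, "need an even number of channels per axis"
    half = d // 2
    out = []
    for offset, pos in ((0, x * scale), (half, y * scale)):
        for i in range(0, half, 2):
            theta = pos * base ** (-i / half)  # lower frequency for later pairs
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[offset + i], vec[offset + i + 1]
            out.extend((a * c - b * s, a * s + b * c))
    return out


def auto_scale(anchor_size, current_size):
    """Coordinate scale that maps a current_size grid onto the anchor grid."""
    return anchor_size / current_size
```

With this construction the dot product between a rotated query and a rotated key depends only on their relative 2D offset, and calling `rope_2d(v, x, y, scale=auto_scale(14, 28))` places a 28 x 28 grid on the coordinate range of a 14 x 14 anchor grid.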
2403.00483
Report |
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization |
Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, Yongdong Zhang |
Text-to-image customization, which aims to synthesize text-driven images for
the given subjects, has recently revolutionized content creation. Existing
works follow the pseudo-word paradigm, i.e., represent the given subjects as
pseudo-words and then compose them with the given text. However, the inherent
entangled influence scope of pseudo-words with the given text results in a
dual-optimum paradox, i.e., the similarity of the given subjects and the
controllability of the given text could not be optimal simultaneously. We
present RealCustom that, for the first time, disentangles similarity from
controllability by precisely limiting subject influence to relevant parts only,
achieved by gradually narrowing real text word from its general connotation to
the specific subject and using its cross-attention to distinguish relevance.
Specifically, RealCustom introduces a novel "train-inference" decoupled
framework: (1) during training, RealCustom learns general alignment between
visual conditions and original textual conditions by a novel adaptive scoring
module to adaptively modulate influence quantity; (2) during inference, a novel
adaptive mask guidance strategy is proposed to iteratively update the influence
scope and influence quantity of the given subjects to gradually narrow the
generation of the real text word. Comprehensive experiments demonstrate the
superior real-time customization ability of RealCustom in the open domain,
achieving both unprecedented similarity of the given subjects and
controllability of the given text for the first time. The project page is
https://corleone-huang.github.io/realcustom/. |
This paper presents RealCustom, a novel text-to-image customization paradigm that disentangles subject similarity from text controllability by limiting subject influence to relevant image regions. |
Existing pseudo-word based customization methods suffer from a dual-optimum paradox where optimizing for subject similarity often degrades controllability of the text prompt, and vice versa. This limits their ability to achieve high-quality customization in real-time and open-domain scenarios. |
RealCustom introduces a train-inference decoupled framework. During training, an adaptive scoring module learns general alignment between visual and textual conditions. During inference, an adaptive mask guidance strategy progressively narrows down the generation of a real text word (e.g., "toy") to the specific given subject by iteratively updating its influence scope and quantity based on cross-attention. |
RealCustom achieves superior simultaneous similarity and controllability compared to state-of-the-art methods.
The method enables real-time open-domain customization, generalizing to any given subject without requiring training on specific object datasets.
RealCustom exhibits high-quality generation with better aesthetics compared to existing methods. |
The influence scope of the given subject is limited to the top-k attention region of a single real word, which could be further improved.
RealCustom focuses on single-subject customization. Extending it to multiple subjects is an interesting future direction. |
text-to-image customization, generative models, diffusion models, cross-attention, open-domain customization |
2403.00459
Report |
Deformable One-shot Face Stylization via DINO Semantic Guidance |
Yang Zhou, Zichong Chen, Hui Huang |
This paper addresses the complex issue of one-shot face stylization, focusing
on the simultaneous consideration of appearance and structure, where previous
methods have fallen short. We explore deformation-aware face stylization that
diverges from traditional single-image style reference, opting for a real-style
image pair instead. The cornerstone of our method is the utilization of a
self-supervised vision transformer, specifically DINO-ViT, to establish a
robust and consistent facial structure representation across both real and
style domains. Our stylization process begins by adapting the StyleGAN
generator to be deformation-aware through the integration of spatial
transformers (STN). We then introduce two innovative constraints for generator
fine-tuning under the guidance of DINO semantics: i) a directional deformation
loss that regulates directional vectors in DINO space, and ii) a relative
structural consistency constraint based on DINO token self-similarities,
ensuring diverse generation. Additionally, style-mixing is employed to align
the color generation with the reference, minimizing inconsistent
correspondences. This framework delivers enhanced deformability for general
one-shot face stylization, achieving notable efficiency with a fine-tuning
duration of approximately 10 minutes. Extensive qualitative and quantitative
comparisons demonstrate our superiority over state-of-the-art one-shot face
stylization methods. Code is available at https://github.com/zichongc/DoesFS |
This paper introduces a novel deformable one-shot face stylization framework that leverages DINO semantic guidance to achieve both appearance and structure stylization using a single real-style image pair. |
Existing one-shot face stylization methods primarily focus on appearance transfer and struggle to accurately capture and reproduce structural deformations present in artistic styles, especially those with exaggerated features. |
The method uses a deformation-aware StyleGAN generator augmented with spatial transformers (STN). It employs DINO feature representations to guide the stylization process with two novel constraints: a directional deformation loss to regulate structural changes and a relative structural consistency constraint to preserve diversity. Style mixing is also employed for color alignment. |
The method generates high-quality stylized faces with convincing structural deformations, outperforming existing one-shot methods qualitatively and quantitatively.
DINO features prove effective in capturing consistent semantic representations across real and stylized face domains.
The framework allows for controllable facial deformation through interpolation of the STN warping fields. |
The reliance on existing generative models to produce paired training data may limit the framework's ability to learn from real-world style examples.
Further exploration of DINO features could lead to improved disentanglement of appearance and structure, potentially enhancing stylization control. |
face stylization, one-shot learning, deformation-aware, dino, stylegan |
2403.00437
Report |
LoMOE: Localized Multi-Object Editing via Multi-Diffusion |
Goirik Chakrabarty, Aditya Chandrasekar, Ramya Hebbalaguppe, Prathosh AP |
Recent developments in the field of diffusion models have demonstrated an
exceptional capacity to generate high-quality prompt-conditioned image edits.
Nevertheless, previous approaches have primarily relied on textual prompts for
image editing, which tend to be less effective when making precise edits to
specific objects or fine-grained regions within a scene containing
single/multiple objects. We introduce a novel framework for zero-shot localized
multi-object editing through a multi-diffusion process to overcome this
challenge. This framework empowers users to perform various operations on
objects within an image, such as adding, replacing, or editing many
objects in a complex scene in one pass. Our approach leverages
foreground masks and corresponding simple text prompts that exert localized
influences on the target regions resulting in high-fidelity image editing. A
combination of cross-attention and background preservation losses within the
latent space ensures that the characteristics of the object being edited are
preserved while simultaneously achieving a high-quality, seamless
reconstruction of the background with fewer artifacts compared to the current
methods. We also curate and release a dataset dedicated to multi-object
editing, named LoMOE-Bench. Our experiments against existing
state-of-the-art methods demonstrate the improved effectiveness of our approach
in terms of both image editing quality and inference speed. |
Introduces LoMOE, a zero-shot framework for localized multi-object editing using diffusion models, enabling simultaneous edits to multiple objects in an image guided by masks and text prompts. |
Addresses the limitations of existing text-based image editing methods that struggle with precise localized edits, particularly in complex scenes with multiple objects. |
Leverages a multi-diffusion process with cross-attention matching and background preservation losses to guide the editing process within specified regions while maintaining structural consistency and background fidelity. Introduces LoMOE-Bench, a dataset for multi-object editing. |
Achieves superior performance in neural image quality metrics compared to baseline methods, demonstrating realistic and faithful edits.
Enables single-pass multi-object editing, leading to significantly faster inference times compared to iterative approaches.
Demonstrates the effectiveness of cross-attention and background preservation losses in achieving a balance between realism and faithfulness in edits. |
Faces challenges in realism and blending for certain edits, suggesting avenues for improving fidelity and object integration.
Currently unable to handle object deletion or swapping within an image, presenting opportunities for future research. |
image editing, diffusion models, multi-object editing, localized image editing, generative ai |
2402.19481
Report |
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models |
Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han |
Diffusion models have achieved great success in synthesizing high-quality
images. However, generating high-resolution images with diffusion models is
still challenging due to the enormous computational costs, resulting in a
prohibitive latency for interactive applications. In this paper, we propose
DistriFusion to tackle this problem by leveraging parallelism across multiple
GPUs. Our method splits the model input into multiple patches and assigns each
patch to a GPU. However, naively implementing such an algorithm breaks the
interaction between patches and loses fidelity, while incorporating such an
interaction will incur tremendous communication overhead. To overcome this
dilemma, we observe the high similarity between the input from adjacent
diffusion steps and propose displaced patch parallelism, which takes advantage
of the sequential nature of the diffusion process by reusing the pre-computed
feature maps from the previous timestep to provide context for the current
step. Therefore, our method supports asynchronous communication, which can be
pipelined by computation. Extensive experiments show that our method can be
applied to recent Stable Diffusion XL with no quality degradation and achieve
up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. Our code is
publicly available at https://github.com/mit-han-lab/distrifuser. |
Introduces DistriFusion, a training-free algorithm leveraging multiple GPUs to accelerate diffusion model inference without sacrificing image quality. |
Generating high-resolution images with diffusion models is computationally expensive and slow, hindering interactive applications. DistriFusion tackles this by enabling efficient parallel processing on multiple GPUs. |
DistriFusion splits the input image into patches, assigns each to a GPU, and reuses activations from previous denoising steps (activation displacement) to maintain inter-patch interaction while minimizing communication overhead. |
Achieves up to 6.1x speedup on eight A100 GPUs compared to a single GPU on Stable Diffusion XL.
Maintains comparable image quality to the original model across various metrics (PSNR, LPIPS, FID).
Effectively hides communication overhead within computation using asynchronous communication and sparse operations. |
Speedups are limited for low-resolution images due to underutilized GPUs.
May not be suitable for extremely-few-step sampling methods due to rapid denoising state changes. |
diffusion models, parallel computing, image generation, gpu acceleration, activation displacement |
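The displaced patch parallelism described in the DistriFusion report above can be illustrated with a toy model. In the sketch below, each "device" owns one patch of activations, and a stand-in layer needs global context. Every device combines its own fresh patch with stale copies of the other patches from the previous step. The layer, the data layout, and all function names are hypothetical simplifications; the real method operates on feature maps inside a diffusion network and overlaps communication with computation.

```python
def global_layer(own_patch, full_context):
    """Toy layer that needs global context: add the context mean to each value."""
    mean = sum(full_context) / len(full_context)
    return [v + mean for v in own_patch]


def displaced_step(patches, stale_patches):
    """One step where each device sees fresh local and stale remote activations.

    patches[d] is device d's fresh patch at this step;
    stale_patches[d] is the copy of device d's patch from the previous step.
    """
    outputs = []
    for d, own in enumerate(patches):
        context = []
        for j, stale in enumerate(stale_patches):
            context.extend(own if j == d else stale)
        outputs.append(global_layer(own, context))
    return outputs


def exact_step(patches):
    """Reference: every device sees fully fresh activations (synchronous exchange)."""
    full = [v for p in patches for v in p]
    return [global_layer(p, full) for p in patches]
```

When activations change little between adjacent steps, the displaced result stays close to the exact one, which is the observation the report attributes the method to. The gap grows when consecutive steps differ a lot, consistent with the listed limitation for samplers with very few steps.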
2402.19479
Report |
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers |
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov |
The quality of the data and annotation upper-bounds the quality of a
downstream model. While there exist large text corpora and image-text pairs,
high-quality video-text data is much harder to collect. First of all, manual
labeling is more time-consuming, as it requires an annotator to watch an entire
video. Second, videos have a temporal dimension, consisting of several scenes
stacked together, and showing multiple actions. Accordingly, to establish a
video dataset with high-quality captions, we propose an automatic approach
leveraging multimodal inputs, such as textual video description, subtitles, and
individual video frames. Specifically, we curate 3.8M high-resolution videos
from the publicly available HD-VILA-100M dataset. We then split them into
semantically consistent video clips, and apply multiple cross-modality teacher
models to obtain captions for each video. Next, we finetune a retrieval model
on a small subset where the best caption of each video is manually selected and
then employ the model on the whole dataset to select the best caption as the
annotation. In this way, we get 70M videos paired with high-quality text
captions. We dub the dataset as Panda-70M. We show the value of the proposed
dataset on three downstream tasks: video captioning, video and text retrieval,
and text-driven video generation. The models trained on the proposed data score
substantially better on the majority of metrics across all the tasks. |
Introduces Panda-70M, a large-scale video dataset with 70 million video clips and high-quality text captions, generated using multiple cross-modality teacher models and a fine-grained retrieval model for annotation selection. |
High-quality video-text data is crucial for training robust video-language models, but manual annotation is expensive and time-consuming, limiting the scale and quality of existing datasets. |
1. Split 3.8M long videos into semantically coherent clips. 2. Generate multiple candidate captions per clip using eight cross-modality teacher models with different input modalities. 3. Train a fine-grained video-to-text retrieval model on a human-annotated subset to select the best caption as the final annotation. |
Pretraining on Panda-70M significantly improves performance on video captioning, video and text retrieval, and text-to-video generation tasks.
The proposed captioning pipeline, using multiple teacher models and fine-grained retrieval, outperforms single models and generates captions comparable to human annotations.
A student captioning model trained on Panda-70M with knowledge distillation outperforms any individual teacher model and benefits from multimodal inputs. |
The dataset primarily consists of vocal-intensive videos due to the source data.
The focus on fine-grained, semantically consistent clips limits both content diversity within a single video and average video length. |
video captioning, video-text retrieval, text-to-video generation, large-scale dataset, multimodal learning |
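The caption selection step in the Panda-70M report above amounts to scoring each candidate caption against the clip and keeping the highest-scoring one. The sketch below assumes the retrieval model exposes embeddings for the clip and for each caption and that cosine similarity is the score; both are assumptions for illustration, and `select_best_caption` is a hypothetical helper, not part of any released code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_best_caption(video_emb, caption_embs):
    """Return (index of best caption, list of scores) for one video clip.

    caption_embs holds one embedding per candidate caption, e.g. one per
    teacher model.
    """
    scores = [cosine(video_emb, c) for c in caption_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In the pipeline described above, the candidates would come from the eight teacher models and the embeddings from the retrieval model fine-tuned on the human-annotated subset.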
2402.19474
Report |
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World |
Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai |
We present the All-Seeing Project V2: a new model and dataset designed for
understanding object relations in images. Specifically, we propose the
All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation,
object localization, and relation comprehension into a relation conversation
(ReC) task. Leveraging this unified task, our model excels not only in
perceiving and recognizing all objects within the image but also in grasping
the intricate relation graph between them, diminishing the relation
hallucination often encountered by Multi-modal Large Language Models (MLLMs).
To facilitate training and evaluation of MLLMs in relation understanding, we
created the first high-quality ReC dataset (AS-V2) which is aligned with the
format of standard instruction tuning data. In addition, we design a new
benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for
comprehensively evaluating the relation comprehension capabilities of MLLMs.
Notably, our ASMv2 achieves an overall accuracy of 52.04 on this relation-aware
benchmark, surpassing the 43.14 of LLaVA-1.5 by a large margin. We hope that
our work can inspire more future research and contribute to the evolution
towards artificial general intelligence. Our project is released at
https://github.com/OpenGVLab/all-seeing. |
The paper introduces the All-Seeing Project V2, a model and dataset for enhancing relation comprehension in Multi-modal Large Language Models (MLLMs). |
Existing MLLMs struggle to accurately comprehend relations between objects in images, leading to hallucinations and reliance on language priors. |
The authors propose a novel task called Relation Conversation (ReC) that unifies text generation, object localization, and relation comprehension. They also create a high-quality ReC dataset (AS-V2) and a benchmark for evaluating relation comprehension (CRPE). |
The proposed All-Seeing Model V2 (ASMv2) achieves state-of-the-art performance on Open-ended Scene Graph Generation and various image-level and region-level vision-language tasks.
ASMv2 significantly outperforms existing MLLMs on CRPE, demonstrating superior relation comprehension ability.
The paper shows that training with relation conversation data significantly improves region-level visual information understanding and relation comprehension. |
The paper acknowledges the need for more appropriate metrics for evaluating open-ended scene graph generation.
Future work could explore more sophisticated methods for handling the imbalanced distribution of predicate labels in scene graph generation. |
multimodal large language model, relation comprehension, scene graph generation, grounded language understanding, vision-language reasoning |
2402.19469
Report |
Humanoid Locomotion as Next Token Prediction |
Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik |
We cast real-world humanoid control as a next token prediction problem, akin
to predicting the next word in language. Our model is a causal transformer
trained via autoregressive prediction of sensorimotor trajectories. To account
for the multi-modal nature of the data, we perform prediction in a
modality-aligned way, and for each input token predict the next token from the
same modality. This general formulation enables us to leverage data with
missing modalities, like video trajectories without actions. We train our model
on a collection of simulated trajectories coming from prior neural network
policies, model-based controllers, motion capture data, and YouTube videos of
humans. We show that our model enables a full-sized humanoid to walk in San
Francisco zero-shot. Our model can transfer to the real world even when trained
on only 27 hours of walking data, and can generalize to commands not seen
during training like walking backward. These findings suggest a promising path
toward learning challenging real-world control tasks by generative modeling of
sensorimotor trajectories. |
This paper presents a novel approach to real-world humanoid control by framing it as a next-token prediction problem, similar to language modeling. |
This method leverages the power of generative modeling with transformers, successfully applied in fields like language processing, to address the challenge of real-world robot control. |
A causal transformer model is trained to autoregressively predict sensorimotor trajectories, incorporating data from various sources like pre-trained policies, model-based controllers, motion capture, and even YouTube videos. |
The model enables zero-shot real-world walking on diverse terrains, demonstrated through successful deployment on a Digit humanoid robot in San Francisco.
The approach effectively incorporates incomplete trajectory data, such as video footage lacking action labels, leading to performance comparable to or exceeding state-of-the-art reinforcement learning methods.
The model exhibits promising scaling properties, with performance improving with larger datasets, longer context lengths, and increased model size. |
The reliance on simulated data for pre-training may limit the model's ability to handle certain real-world scenarios not well-represented in simulation.
The current study focuses on locomotion, and extending this approach to more complex manipulation tasks presents a future challenge. |
humanoid locomotion, generative modeling, transformer networks, next token prediction, real-world robotics |
2402.19427
Report |
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models |
Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre |
Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training. |
This paper presents Hawk and Griffin, novel recurrent neural network models for language modeling that address the scalability limitations of traditional RNNs and offer an efficient alternative to Transformers with global attention. |
Transformers, while dominant, struggle with long sequences due to the quadratic complexity of global attention. Recurrent models offer a solution by compressing sequences into a fixed-size state, but need to match Transformer performance and hardware efficiency. |
The authors develop the RG-LRU, a novel gated linear recurrent layer, and integrate it into Hawk, a pure RNN model. They also introduce Griffin, a hybrid model combining RG-LRU with local attention. The models are evaluated on language modeling tasks, scaling capabilities, training and inference speed, and long context modeling. |
Hawk and Griffin exhibit power-law scaling in held-out loss with increasing training FLOPs, achieving competitive performance with Transformers.
Both models demonstrate superior inference throughput compared to Transformers, particularly on long sequences, due to their smaller cache size.
Hawk and Griffin excel in long context modeling, extrapolating well to sequences longer than training data and efficiently learning copying and retrieval tasks. |
While showing promise in copying and retrieval tasks after training, pre-trained Hawk and Griffin models lag behind Transformers in these tasks without fine-tuning.
Further research is needed to improve the copying and retrieval capabilities of these models when evaluating pre-trained models. |
language modeling, recurrent neural networks, transformers, long sequence modeling, inference efficiency |
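The fixed-size state that the summary above credits for the smaller inference cache can be illustrated with a generic gated linear recurrence. This is a minimal sketch, not the RG-LRU layer itself: the update rule h_t = a_t * h_{t-1} + (1 - a_t) * x_t, the sigmoid gate, and the names `gated_linear_recurrence`, `gate_weight`, and `gate_bias` are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, used here as an input-dependent gate."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, gate_weight=1.0, gate_bias=0.0):
    """Scan a scalar sequence with h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The gate a_t = sigmoid(gate_weight * x_t + gate_bias) is hypothetical;
    it only shows that the state h stays a single number however long
    the sequence is.
    """
    h = 0.0
    states = []
    for x in xs:
        a = sigmoid(gate_weight * x + gate_bias)
        h = a * h + (1.0 - a) * x  # linear in h, so no nonlinearity on the state
        states.append(h)
    return states


states = gated_linear_recurrence([1.0, 1.0, 1.0])
```

Because each step is a convex combination of the previous state and the current input, the state stays within the range of the inputs, and the memory needed at inference time does not grow with sequence length.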
2402.19150
Report |
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model |
Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, Renjing Xu |
Large Vision-Language Models (LVLMs) rely on vision encoders and Large
Language Models (LLMs) to exhibit remarkable capabilities on various
multi-modal tasks in the joint space of vision and language. However, the
Typographic Attack, which disrupts vision-language models (VLMs) such as
Contrastive Language-Image Pretraining (CLIP), has also been expected to be a
security threat to LVLMs. Firstly, we verify typographic attacks on current
well-known commercial and open-source LVLMs and uncover the widespread
existence of this threat. Secondly, to better assess this vulnerability, we
propose the most comprehensive and largest-scale Typographic Dataset to date.
The Typographic Dataset not only considers the evaluation of typographic
attacks under various multi-modal tasks but also evaluates the effects of
typographic attacks, influenced by texts generated with diverse factors. Based
on the evaluation results, we investigate the causes why typographic attacks
may impact VLMs and LVLMs, leading to three highly insightful discoveries. By
the examination of our discoveries and experimental validation in the
Typographic Dataset, we reduce the performance degradation from 42.07% to
13.90% when LVLMs confront typographic attacks. |
This paper investigates the vulnerability of Large Vision-Language Models (LVLMs) to typographic attacks, proposing a comprehensive Typographic Dataset (TypoD) to evaluate this weakness across diverse multi-modal tasks and typographic factors. |
This research is crucial because LVLMs are increasingly used in real-world applications, and their susceptibility to typographic attacks poses significant security risks. |
The authors created TypoD, containing images with deliberately embedded misleading typographic text, to assess the performance degradation of LVLMs under different tasks. They analyzed the attention mechanisms of both vision encoders (like CLIP) and LLMs within LVLMs to understand the root cause of this vulnerability. |
Typographic attacks significantly degrade the performance of LVLMs across various tasks, including object recognition, visual attribute detection, enumeration, and commonsense reasoning.
The severity of typographic attacks is positively correlated with the visibility of the embedded typographic text, with larger and more opaque text causing more significant performance drops.
Augmenting the prompts given to LVLMs with additional information, particularly detailed image descriptions, can mitigate the impact of typographic attacks by redirecting the model's attention away from the misleading text. |
The research primarily focuses on open-source LVLMs, and further investigation is needed to assess the robustness of commercial LVLMs against typographic attacks.
While the proposed prompt engineering techniques effectively mitigate the impact of typographic attacks, there might be limitations in their generalizability and effectiveness against more sophisticated attack strategies. |
large vision-language models, typographic attack, multi-modal learning, robustness, vision-language tasks |
2402.18956
Report |
WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts |
Yong Hyun Ahn, Hyeon Bae Kim, Seong Tae Kim |
Recent advancements in neural networks have showcased their remarkable
capabilities across various domains. Despite these successes, the "black box"
problem still remains. Addressing this, we propose a novel framework, WWW, that
offers the 'what', 'where', and 'why' of the neural network decisions in
human-understandable terms. Specifically, WWW utilizes adaptive selection for
concept discovery, employing adaptive cosine similarity and thresholding
techniques to effectively explain 'what'. To address the 'where' and 'why', we
proposed a novel combination of neuron activation maps (NAMs) with Shapley
values, generating localized concept maps and heatmaps for individual inputs.
Furthermore, WWW introduces a method for predicting uncertainty, leveraging
heatmap similarities to estimate 'how' reliable the prediction is. Experimental
evaluations of WWW demonstrate superior performance in both quantitative and
qualitative metrics, outperforming existing methods in interpretability. WWW
provides a unified solution for explaining 'what', 'where', and 'why',
introducing a method for localized explanations from global interpretations and
offering a plug-and-play solution adaptable to various architectures. |
This paper proposes WWW, a novel framework that offers interpretability for neural network decisions by explaining 'what', 'where', and 'why' in human-understandable terms. |
The "black box" problem of neural networks hinders their wider adoption. WWW addresses this by providing clear explanations of decision-making processes, improving trust and reliability. |
WWW utilizes adaptive cosine similarity and adaptive selection for concept discovery ('what'), combines neuron activation maps with Shapley values for localized concept maps and heatmaps ('where' and 'why'), and introduces heatmap similarities for uncertainty prediction. |
WWW demonstrates superior quantitative performance in concept discovery compared to existing methods like CLIP-Dissect, MILAN, and FALCON.
WWW provides qualitative explanations by identifying important neurons and concepts, showing robust interpretations across different model layers.
Heatmap similarity analysis in WWW effectively predicts uncertainty, potentially enabling the identification of mispredictions. |
The current implementation of WWW focuses on image classification tasks.
Future work includes exploring the generalization of WWW for other data modalities. |
interpretable machine learning, concept-based explanations, neuron-concept association, uncertainty prediction, explainable ai |
2402.18929
Report |
Navigating Beyond Dropout: An Intriguing Solution Towards Generalizable Image Super Resolution |
Hongjun Wang, Jiyuan Chen, Yinqiang Zheng, Tieyong Zeng |
Deep learning has led to a dramatic leap on Single Image Super-Resolution
(SISR) performances in recent years.
While most existing work assumes a simple and fixed degradation model (e.g.,
bicubic downsampling), the research of Blind SR seeks to improve model
generalization ability with unknown degradation. Recently, Kong et al. pioneered
the investigation of a more suitable training strategy for Blind SR using
Dropout. Although this method indeed brings substantial generalization
improvements by mitigating overfitting, we argue that Dropout simultaneously
introduces an undesirable side effect that compromises the model's capacity to
faithfully reconstruct fine details. We show both the theoretical and
experimental analyses in our paper, and furthermore, we present another easy
yet effective training strategy that enhances the generalization ability of the
model by simply modulating its first- and second-order feature statistics.
Experimental results have shown that our method could serve as a model-agnostic
regularization and outperforms Dropout on seven benchmark datasets including
both synthetic and real-world scenarios. |
This paper proposes a simple statistical alignment method as a regularization technique for Blind Super-Resolution (SR) to enhance model generalization against unknown degradations. |
Current Blind SR models, even trained with diverse degradations, tend to overfit specific degradation types, limiting their ability to generalize to unseen degradation scenarios. |
The method aligns first and second order feature statistics (mean and covariance) of image pairs with identical content but different degradations. This encourages the model to learn degradation-invariant features, improving its generalization ability. |
The proposed method consistently outperforms Dropout regularization on seven benchmark datasets, demonstrating its effectiveness.
Significant performance improvements are observed, particularly in cases where test degradations deviate from the training distribution, indicating enhanced generalization.
The method integrates seamlessly with existing data-driven Blind SR methods and can be applied to various SR models. |
The choice of mean and covariance as statistical indicators, while showing empirical effectiveness, lacks a strong theoretical justification.
Further investigation into the impact of different statistical alignment strategies and their theoretical underpinnings is needed. |
blind super-resolution, regularization, degradation-invariant features, statistical alignment, generalization |
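The alignment of first- and second-order feature statistics described in the record above can be sketched as a small penalty term. This is a hedged illustration only: the squared-difference form of the penalty and the names `feature_mean`, `feature_cov`, and `alignment_loss` are assumptions, since the summary does not give the exact loss used in the paper.

```python
def feature_mean(feats):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n, d = len(feats), len(feats[0])
    return [sum(f[j] for f in feats) / n for j in range(d)]


def feature_cov(feats):
    """Population covariance matrix (d x d) of the feature vectors."""
    n, d = len(feats), len(feats[0])
    mu = feature_mean(feats)
    return [
        [sum((f[i] - mu[i]) * (f[j] - mu[j]) for f in feats) / n for j in range(d)]
        for i in range(d)
    ]


def alignment_loss(feats_a, feats_b):
    """Squared distance between means plus squared Frobenius distance
    between covariances of two feature sets (hypothetical loss form)."""
    mu_a, mu_b = feature_mean(feats_a), feature_mean(feats_b)
    cov_a, cov_b = feature_cov(feats_a), feature_cov(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        (ca - cb) ** 2
        for row_a, row_b in zip(cov_a, cov_b)
        for ca, cb in zip(row_a, row_b)
    )
    return mean_term + cov_term


# Features of the same content under two degradations (toy values).
feats_a = [[0.0, 0.0], [2.0, 2.0]]
feats_b = [[1.0, 1.0], [3.0, 3.0]]
loss = alignment_loss(feats_a, feats_b)
```

In this toy case the two feature sets share a covariance and differ only in their means, so the whole penalty comes from the mean term. Driving such a penalty toward zero is one way to encourage features that do not depend on the degradation.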
2402.18848
Report |
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting |
Hoon Kim, Minje Jang, Wonjun Yoon, Jisoo Lee, Donghyun Na, Sanghyun Woo |
We introduce a co-designed approach for human portrait relighting that
combines a physics-guided architecture with a pre-training framework. Drawing
on the Cook-Torrance reflectance model, we have meticulously configured the
architecture design to precisely simulate light-surface interactions.
Furthermore, to overcome the limitation of scarce high-quality lightstage data,
we have developed a self-supervised pre-training strategy. This novel
combination of accurate physical modeling and expanded training dataset
establishes a new benchmark in relighting realism. |
This paper proposes SwitchLight, a novel framework for human portrait relighting that combines a physics-guided architecture based on the Cook-Torrance reflectance model with a self-supervised pre-training framework called Multi-Masked Autoencoder (MMAE). |
Relighting human portraits is crucial for various applications, including VR/AR and digital content creation. Existing methods either lack realism or struggle with the limited availability of high-quality training data. This work addresses these limitations by enhancing physical accuracy and expanding training data through self-supervision. |
The SwitchLight architecture comprises multiple neural networks for predicting surface normals, lighting, diffuse rendering, specular attributes, and final relighting. MMAE, inspired by Masked Autoencoder, uses dynamic masking strategies and a generative target to pre-train the model on unlabeled data, improving feature representations for relighting. |
SwitchLight outperforms state-of-the-art methods in both quantitative metrics and qualitative comparisons, demonstrating enhanced realism in lighting, specular highlights, and skin tones.
MMAE pre-training significantly improves performance compared to training solely on labeled data, highlighting the benefits of self-supervision for relighting.
Ablation studies validate the advantages of predicting diffuse rendering over direct albedo prediction and demonstrate the effectiveness of MMAE's design choices. |
The model struggles with removing strong shadows and accurately relighting reflective surfaces or face paint.
Future work includes extending the framework to handle video and 3D data. |
image relighting, human portrait, cook-torrance model, self-supervised learning, masked autoencoder |
2402.18842
Report |
ViewFusion: Towards Multi-View Consistency via Interpolated Denoising |
Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, Anton van den Hengel |
Novel-view synthesis through diffusion models has demonstrated remarkable
potential for generating diverse and high-quality images. Yet, the independent
process of image generation in these prevailing methods leads to challenges in
maintaining multiple-view consistency. To address this, we introduce
ViewFusion, a novel, training-free algorithm that can be seamlessly integrated
into existing pre-trained diffusion models. Our approach adopts an
auto-regressive method that implicitly leverages previously generated views as
context for the next view generation, ensuring robust multi-view consistency
during the novel-view generation process. Through a diffusion process that
fuses known-view information via interpolated denoising, our framework
successfully extends single-view conditioned models to work in multiple-view
conditional settings without any additional fine-tuning. Extensive experimental
results demonstrate the effectiveness of ViewFusion in generating consistent
and detailed novel views. |
This paper introduces Interpolated Denoising Diffusion Model (IDDM), a training-free algorithm that improves multi-view consistency in novel view synthesis using pre-trained diffusion models. |
Existing diffusion-based novel view synthesis methods often produce inconsistent images across different viewpoints, hindering applications like 3D reconstruction. IDDM addresses this limitation without requiring additional training or fine-tuning. |
IDDM incorporates an auto-regressive process into the diffusion process. It uses a novel Interpolated Denoising technique that leverages previously generated views as context during the generation of subsequent views, thus improving consistency. |
IDDM significantly enhances multi-view consistency compared to baseline models, as evidenced by quantitative metrics (LPIPS, SIFT, CLIP) and qualitative comparisons.
It enables single-view conditioned diffusion models to operate effectively in multi-view conditioned settings, leading to improved novel-view synthesis and 3D reconstruction quality.
The method demonstrates strong performance on out-of-distribution datasets like ABO and GSO, showcasing its generalizability. |
IDDM's sequential generation process, while improving consistency, requires additional memory and can be more time-consuming than parallel generation methods.
The effectiveness of IDDM relies on the pre-trained base model (e.g., Zero-1-to-3); if the base model fails, IDDM might not fully compensate for those shortcomings. |
novel view synthesis, diffusion models, multi-view consistency, 3d reconstruction, auto-regressive models |
2402.18780
Report |
A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D |
Xiaohan Fei, Chethan Parameshwara, Jiawei Mo, Xiaolong Li, Ashwin Swaminathan, CJ Taylor, Paolo Favaro, Stefano Soatto |
The development of generative models that create 3D content from a text
prompt has made considerable strides thanks to the use of the score
distillation sampling (SDS) method on pre-trained diffusion models for image
generation. However, the SDS method is also the source of several artifacts,
such as the Janus problem, the misalignment between the text prompt and the
generated 3D model, and 3D model inaccuracies. While existing methods heavily
rely on the qualitative assessment of these artifacts through visual inspection
of a limited set of samples, in this work we propose more objective
quantitative evaluation metrics, which we cross-validate via human ratings, and
show analysis of the failure cases of the SDS technique. We demonstrate the
effectiveness of this analysis by designing a novel computationally efficient
baseline model that achieves state-of-the-art performance on the proposed
metrics while addressing all the above-mentioned artifacts. |
This paper introduces a novel evaluation protocol for text-to-3D generation models and proposes a new baseline method based on Gaussian Splatting. |
Existing evaluation methods for text-to-3D models lack objectivity and comprehensiveness, hindering systematic progress in the field. |
The authors propose quantitative metrics to evaluate the frequency of the "Janus problem" (object duplication across viewpoints), text and 3D alignment, and the realism of generated 3D models. They also present a two-stage generation method using MVDream and Gaussian Splatting for efficiency and realism. |
The proposed method achieves state-of-the-art performance on the introduced metrics, demonstrating its effectiveness in mitigating the Janus problem and generating high-fidelity 3D content.
Analysis of existing methods reveals a high prevalence of the Janus problem, highlighting the need for robust evaluation.
The study confirms a trade-off between realism and the Janus problem in refinement stages, emphasizing the need for balanced optimization. |
The paper relies on manual inspection for detecting the Janus problem, calling for future development of automatic evaluation methods.
Future work can explore further efficiency improvements and leverage real-world and synthetic data for enhanced diversity and realism in generated 3D content. |
text-to-3d generation, score distillation sampling, gaussian splatting, evaluation protocol, janus problem |
2402.18331
Report |
FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes |
Ziying Pan, Kun Wang, Gang Li, Feihong He, Xiwang Li, Yongxuan Lai |
The class-conditional image generation based on diffusion models is renowned
for generating high-quality and diverse images. However, most prior efforts
focus on generating images for general categories, e.g., 1000 classes in
ImageNet-1k. A more challenging task, large-scale fine-grained image
generation, remains the boundary to explore. In this work, we present a
parameter-efficient strategy, called FineDiffusion, to fine-tune large
pre-trained diffusion models scaling to large-scale fine-grained image
generation with 10,000 categories. FineDiffusion significantly accelerates
training and reduces storage overhead by only fine-tuning tiered class
embedder, bias terms, and normalization layers' parameters. To further improve
the image generation quality of fine-grained categories, we propose a novel
sampling method for fine-grained image generation, which utilizes
superclass-conditioned guidance, specifically tailored for fine-grained
categories, to replace the conventional classifier-free guidance sampling.
Compared to full fine-tuning, FineDiffusion achieves a remarkable 1.56x
training speed-up and requires storing merely 1.77% of the total model
parameters, while achieving state-of-the-art FID of 9.776 on image generation
of 10,000 classes. Extensive qualitative and quantitative experiments
demonstrate the superiority of our method compared to other parameter-efficient
fine-tuning methods. The code and more generated results are available at our
project website: https://finediffusion.github.io/. |
This paper presents FineDiffusion, a parameter-efficient fine-tuning strategy for large-scale (10,000 categories) fine-grained image generation using diffusion models. |
Large-scale fine-grained image generation is challenging and computationally expensive using traditional diffusion model training. This work offers a faster, more efficient approach. |
FineDiffusion leverages a pre-trained DiT model and fine-tunes only specific parameters: a proposed TieredEmbedder (encoding hierarchical class labels), bias terms, and normalization layers. A novel sampling method using superclass-conditioned guidance further improves generation. |
FineDiffusion achieves a 1.56x training speed-up and requires storing only 1.77% of total model parameters compared to full fine-tuning.
It achieves state-of-the-art FID of 9.776 on the iNaturalist 2021 mini dataset (10,000 classes).
FineDiffusion outperforms other parameter-efficient fine-tuning methods (BitFit, DiffFit) in FID and LPIPS scores on various fine-grained datasets. |
The current implementation focuses on image generation from class labels; exploring text-guided fine-grained generation is a potential future direction.
Investigating the impact of different pre-trained diffusion models and dataset scales on FineDiffusion's performance is of interest. |
fine-grained image generation, diffusion models, parameter-efficient fine-tuning, classifier-free guidance, hierarchical class embedding |
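The idea of replacing classifier-free guidance with superclass-conditioned guidance, mentioned in the record above, can be sketched by swapping the baseline prediction in the usual guidance extrapolation. This is a minimal sketch under stated assumptions: the formula `base + scale * (cond - base)` is the standard classifier-free guidance form, and using the superclass-conditioned prediction as `base` is an assumption about how the paper's sampler works. The function name and the toy vectors are hypothetical.

```python
def guided_prediction(eps_base, eps_cond, scale):
    """Extrapolate from a baseline noise prediction toward the
    class-conditioned one: base + scale * (cond - base)."""
    return [b + scale * (c - b) for b, c in zip(eps_base, eps_cond)]


# Toy noise predictions (hypothetical values, not model outputs).
eps_superclass = [0.0, 1.0]  # conditioned on the coarse superclass label
eps_fine_class = [2.0, 3.0]  # conditioned on the fine-grained class label

# Conventional classifier-free guidance would pass an unconditional
# prediction as eps_base; here the superclass prediction takes its place.
eps_guided = guided_prediction(eps_superclass, eps_fine_class, scale=2.0)
```

With `scale=1.0` the result is the fine-grained prediction itself; larger scales push the sample further along the direction that separates the fine-grained class from its superclass.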
2402.18192
Report |
Misalignment-Robust Frequency Distribution Loss for Image Transformation |
Zhangkai Ni, Juncheng Wu, Zian Wang, Wenhan Yang, Hanli Wang, Lin Ma |
This paper aims to address a common challenge in deep learning-based image
transformation methods, such as image enhancement and super-resolution, which
heavily rely on precisely aligned paired datasets with pixel-level alignments.
However, creating precisely aligned paired images presents significant
challenges and hinders the advancement of methods trained on such data. To
overcome this challenge, this paper introduces a novel and simple Frequency
Distribution Loss (FDL) for computing distribution distance within the
frequency domain. Specifically, we transform image features into the frequency
domain using Discrete Fourier Transformation (DFT). Subsequently, frequency
components (amplitude and phase) are processed separately to form the FDL loss
function. Our method is empirically proven effective as a training constraint
due to the thoughtful utilization of global information in the frequency
domain. Extensive experimental evaluations, focusing on image enhancement and
super-resolution tasks, demonstrate that FDL outperforms existing
misalignment-robust loss functions. Furthermore, we explore the potential of
our FDL for image style transfer that relies solely on completely misaligned
data. Our code is available at: https://github.com/eezkni/FDL |
This paper introduces Frequency Distribution Loss (FDL), a novel loss function designed to enhance the robustness of image transformation models when dealing with misaligned training data. |
Existing image transformation methods heavily rely on precisely aligned datasets, which are often challenging to obtain, particularly for tasks involving natural distortions like style transfer. |
FDL leverages the Discrete Fourier Transform (DFT) to transform image features into the frequency domain. It then calculates the Sliced Wasserstein Distance (SWD) between the amplitude and phase components of the predicted and target image features. |
FDL consistently outperforms existing misalignment-robust loss functions in image enhancement and super-resolution tasks.
The method effectively mitigates artifacts and preserves structural details even in the presence of significant geometric misalignments.
FDL demonstrates promising results for image style transfer, effectively capturing and transferring structural styles. |
The choice of feature extractor and the weighting parameter for different frequency components may require task-specific tuning.
Future work could explore assigning different attention weights to distinct frequency domain regions for further performance improvement. |
image transformation, misaligned data, frequency distribution loss, deep learning, computer vision |
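The frequency-domain comparison in the record above can be sketched on a 1D signal. This is a simplified illustration, not the released FDL code: it applies the DFT to raw signals rather than deep features, and it replaces the Sliced Wasserstein Distance with its 1D special case (sorting both samples and averaging absolute differences), since no random projections are needed in one dimension. The names `amplitude_and_phase`, `wasserstein_1d`, `frequency_distribution_distance`, and `phase_weight` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def amplitude_and_phase(signal):
    """Split the spectrum into amplitude and phase components."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def frequency_distribution_distance(pred, target, phase_weight=1.0):
    """Distribution distance on amplitudes plus weighted distance on phases."""
    amp_p, pha_p = amplitude_and_phase(pred)
    amp_t, pha_t = amplitude_and_phase(target)
    return wasserstein_1d(amp_p, amp_t) + phase_weight * wasserstein_1d(pha_p, pha_t)


signal = [1.0, 2.0, 0.0, -1.0]
shifted = [-1.0, 1.0, 2.0, 0.0]  # the same signal, circularly shifted by one
amp_gap = wasserstein_1d(amplitude_and_phase(signal)[0],
                         amplitude_and_phase(shifted)[0])
```

A circular shift changes only the phases of the DFT coefficients, so `amp_gap` is zero up to floating-point error. This gives some intuition for why comparing frequency-domain distributions tolerates spatial misalignment better than a pixel-wise loss does.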
2402.18068
Report |
SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model |
Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, Bo Zhao |
In the rapidly evolving area of image synthesis, a serious challenge is the
presence of complex artifacts that compromise perceptual realism of synthetic
images. To alleviate artifacts and improve the quality of synthetic images, we
fine-tune a Vision-Language Model (VLM) as an artifact classifier to automatically
identify and classify a wide range of artifacts and provide supervision for
further optimizing generative models. Specifically, we develop a comprehensive
artifact taxonomy and construct a dataset of synthetic images with artifact
annotations for fine-tuning VLM, named SynArtifact-1K. The fine-tuned VLM
exhibits superior ability of identifying artifacts and outperforms the baseline
by 25.66%. To our knowledge, this is the first time such end-to-end artifact
classification task and solution have been proposed. Finally, we leverage the
output of VLM as feedback to refine the generative model for alleviating
artifacts. Visualization results and user study demonstrate that the quality of
images synthesized by the refined diffusion model has been noticeably improved. |
This paper presents SynArtifact-1K, the first synthetic image dataset annotated with artifact categories, descriptions, and coordinates, to address the challenge of artifact classification and alleviation in synthetic images. |
Existing image synthesis methods often lack the ability to effectively identify and alleviate artifacts, hindering the realism of generated images. This work provides a solution by classifying various types of artifacts and using this information to improve generative models. |
The authors first create a comprehensive taxonomy of common artifacts. Then, they construct SynArtifact-1K and use it to fine-tune a Vision-Language Model (VLM) for artifact classification. Finally, they leverage the output of the VLM as AI feedback to guide the optimization of a diffusion model through Reinforcement Learning from AI Feedback (RLAIF). |
The fine-tuned VLM outperforms the baseline by 25.66% in classification accuracy on SynArtifact-1K, demonstrating its effectiveness in artifact identification.
The VLM demonstrates promising preliminary results for artifact detection, paving the way for more explainable quality assessment of synthetic images.
By integrating the VLM feedback, the refined diffusion model generates higher-quality images with fewer artifacts, as evidenced by visualizations and a user study. |
The size of the SynArtifact-1K dataset is limited, and a larger dataset could potentially further improve the performance of both artifact classification and alleviation.
The VLM used for artifact detection lacks inherent localization abilities, suggesting potential for future work exploring VLMs with stronger visual grounding. |
image synthesis, artifact classification, artifact alleviation, vision-language model, reinforcement learning from ai feedback |
2402.18039
Report |
ResLoRA: Identity Residual Mapping in Low-Rank Adaption |
Shuhua Shi, Shaohan Huang, Minghui Song, Zhoujun Li, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang |
As one of the most popular parameter-efficient fine-tuning (PEFT) methods,
low-rank adaptation (LoRA) is commonly applied to fine-tune large language
models (LLMs). However, updating the weights of LoRA blocks effectively and
expeditiously is challenging due to the long calculation path in the original
model. To address this, we propose ResLoRA, an improved framework of LoRA. By
adding residual paths during training and using merging approaches to eliminate
these extra paths during inference, our method can achieve better results in
fewer training steps without any extra trainable parameters or inference cost
compared to LoRA. The experiments on NLG, NLU, and text-to-image tasks
demonstrate the effectiveness of our method. To the best of our knowledge,
ResLoRA is the first work that combines the residual path with LoRA. The code
of our method is available at
https://github.com/microsoft/LMOps/tree/main/reslora . |
This paper introduces ResLoRA, a novel framework that enhances Low-Rank Adaptation (LoRA) for fine-tuning large language models (LLMs) by incorporating residual connections to expedite and stabilize the training process. |
LoRA, while effective, suffers from limitations in efficient weight updates due to long calculation paths. ResLoRA addresses this issue, aiming for faster convergence and improved performance. |
ResLoRA integrates residual paths within LoRA blocks during training, exploring three structures: input-shortcut (is), block-shortcut (bs), and middle-shortcut (ms). Merging approaches are then employed to eliminate the extra paths during inference, ensuring no additional computational cost. |
ResLoRA consistently outperforms standard LoRA and other variants in natural language generation (NLG) and natural language understanding (NLU) tasks, achieving accuracy improvements ranging from 1% to 20%.
Experiments demonstrate that incorporating residual paths accelerates training convergence, evidenced by significantly lower loss values compared to standard LoRA.
Analysis of trained matrix weights reveals that ResLoRA promotes more complex weight patterns, potentially contributing to its superior performance. |
While not adding trainable parameters, ResLoRA's training incurs higher computational cost than standard LoRA due to the use of previous blocks in calculations.
The merging approaches, while effective, introduce minor accuracy degradation, necessitating the development of more efficient merging strategies. |
parameter-efficient fine-tuning, large language models, low-rank adaptation, residual networks, deep learning |
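To make the ResLoRA entry above more concrete, the following is a minimal sketch, not the authors' code. It assumes the input-shortcut variant feeds the low-rank branch with the sum of the current and previous block inputs; all helper names and the toy 2x2 sizes are hypothetical. It also checks the well-known fact that a plain LoRA update can be folded into the frozen weight as W + BA.

```python
# Toy sketch of plain LoRA versus an assumed input-shortcut variant.
# Helper names and sizes are illustrative assumptions, not from the paper.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def lora_forward(W, A, B, x):
    """Standard LoRA: h = W x + B (A x)."""
    return add(matvec(W, x), matvec(B, matvec(A, x)))

def input_shortcut_forward(W, A, B, x, x_prev):
    """Assumed input-shortcut form: the low-rank branch also sees x_prev."""
    return add(matvec(W, x), matvec(B, matvec(A, add(x, x_prev))))

def merge(W, A, B):
    """Fold the low-rank update into the frozen weight: W' = W + B A."""
    return [add(w_row, ba_row) for w_row, ba_row in zip(W, matmul(B, A))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight
A = [[1.0, 2.0]]              # rank-1 down-projection (1x2)
B = [[0.5], [-1.0]]           # rank-1 up-projection (2x1)
x, x_prev = [1.0, 1.0], [2.0, 0.0]

plain = lora_forward(W, A, B, x)
merged = matvec(merge(W, A, B), x)
shortcut = input_shortcut_forward(W, A, B, x, x_prev)
```

In this sketch the shortcut output differs from the plain output, so the simple W + BA merge no longer reproduces it. That is consistent with the entry's remark that dedicated merging approaches are needed to remove the extra paths at inference.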
2402.17910
Report |
Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models |
Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Aref Miri Rekavandi, Hamid Laga, Farid Boussaid |
While latent diffusion models (LDMs) excel at creating imaginative images,
they often lack precision in semantic fidelity and spatial control over where
objects are generated. To address these deficiencies, we introduce the
Box-it-to-Bind-it (B2B) module - a novel, training-free approach for improving
spatial control and semantic accuracy in text-to-image (T2I) diffusion models.
B2B targets three key challenges in T2I: catastrophic neglect, attribute
binding, and layout guidance. The process encompasses two main steps: i) Object
generation, which adjusts the latent encoding to guarantee object generation
and directs it within specified bounding boxes, and ii) attribute binding,
guaranteeing that generated objects adhere to their specified attributes in the
prompt. B2B is designed as a compatible plug-and-play module for existing T2I
models, markedly enhancing model performance in addressing the key challenges.
We evaluate our technique using the established CompBench and TIFA score
benchmarks, demonstrating significant performance improvements compared to
existing methods. The source code will be made publicly available at
https://github.com/nextaistudio/BoxIt2BindIt. |
This paper introduces Box-it-to-Bind-it (B2B), a training-free plug-and-play module for enhancing spatial control and semantic accuracy in text-to-image diffusion models. |
Existing text-to-image models struggle with accurately binding attributes to objects and precisely controlling object placement according to a specified layout. |
B2B uses a two-step process: 1) Object generation: adjusts latent encoding to guarantee object generation within specified bounding boxes, utilizing LLMs for layout guidance. 2) Attribute binding: ensures generated objects adhere to their specified attributes in the prompt. |
B2B achieves state-of-the-art results on CompBench and TIFA benchmarks for color and texture binding.
It significantly improves spatial reasoning, enabling more precise object placement within specified layouts.
B2B's plug-and-play nature is demonstrated by its successful integration with both Stable Diffusion and GLIGEN models, enhancing their performance. |
The paper acknowledges the potential for further improvement in spatial reasoning.
Future work may explore extending B2B to handle more complex relationships between objects and attributes. |
text-to-image generation, diffusion models, attribute binding, spatial control, layout guidance |
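As a rough illustration of box-constrained generation in the B2B entry above, the sketch below measures how much of an attention map falls inside a target bounding box. The entry does not specify the objective, so using a cross-attention map and this ratio as the score is an assumption, and `box_attention_ratio` is a hypothetical name.

```python
def box_attention_ratio(attn, box):
    """Fraction of total attention inside box = (top, left, bottom, right).

    The box is end-exclusive, like Python slices. This score is an
    illustrative assumption, not the objective used in the paper.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return inside / total if total > 0 else 0.0

# Toy 4x4 attention map for one object token.
attn = [
    [0.0, 0.1, 0.1, 0.0],
    [0.0, 0.3, 0.3, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
full_box = box_attention_ratio(attn, (0, 1, 3, 3))
tight_box = box_attention_ratio(attn, (1, 1, 2, 3))
```

A training-free method could, under this assumption, nudge the latent in the direction that raises such a ratio for each object's box.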
2402.17863
Report |
Vision Transformers with Natural Language Semantics |
Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro |
Tokens or patches within Vision Transformers (ViT) lack essential semantic
information, unlike their counterparts in natural language processing (NLP).
Typically, ViT tokens are associated with rectangular image patches that lack
specific semantic context, making interpretation difficult and failing to
effectively encapsulate information. We introduce a novel transformer model,
Semantic Vision Transformers (sViT), which leverages recent progress on
segmentation models to design novel tokenizer strategies. sViT effectively
harnesses semantic information, creating an inductive bias reminiscent of
convolutional neural networks while capturing global dependencies and
contextual information within images that are characteristic of transformers.
Through validation using real datasets, sViT demonstrates superiority over ViT,
requiring less training data while maintaining similar or superior performance.
Furthermore, sViT demonstrates significant superiority in out-of-distribution
generalization and robustness to natural distribution shifts, attributed to its
scale invariance semantic characteristic. Notably, the use of semantic tokens
significantly enhances the model's interpretability. Lastly, the proposed
paradigm facilitates the introduction of new and powerful augmentation
techniques at the token (or segment) level, increasing training data diversity
and generalization capabilities. Just as sentences are made of words, images
are formed by semantic objects; our proposed methodology leverages recent
progress in object segmentation and takes an important and natural step toward
interpretable and robust vision transformers. |
This paper introduces Semantic Vision Transformers (sViT), a novel vision transformer model that leverages semantic segmentation for tokenization, enhancing performance and interpretability. |
Current Vision Transformers (ViT) lack semantic information in their tokens, hindering their interpretability and efficiency, especially compared to NLP transformers that process meaningful words. |
sViT utilizes the Segment Anything Model (SAM) for semantic segmentation, treating each segment as a token. It introduces positional and scale embeddings based on segment location and size. The model is trained on scene recognition and object-centric datasets. |
sViT outperforms ViT on non-object-centric datasets, especially with limited data.
sViT exhibits superior generalization to object-centric datasets, demonstrating scale invariance.
sViT significantly improves interpretability, highlighting semantically meaningful regions. |
sViT has higher computational cost during inference due to the additional segmentation step.
Future work could explore more efficient segmentation models to reduce computational overhead. |
vision transformer, semantic segmentation, interpretability, out-of-distribution generalization, data augmentation |
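The sViT entry above mentions positional and scale embeddings derived from segment location and size. A minimal sketch of the raw quantities such embeddings could be built from is shown here. The normalization (pixel centers, image-relative area) and the function name are assumptions, and the segmentation mask is taken as given rather than produced by SAM.

```python
def segment_descriptor(mask):
    """Return (cy, cx, area_fraction) for a binary mask (list of rows of 0/1).

    The centroid is normalized to [0, 1] using pixel centers; the area is
    the fraction of image pixels covered by the segment.
    """
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    if not pixels:
        raise ValueError("empty segment")
    n = len(pixels)
    cy = (sum(r for r, _ in pixels) / n + 0.5) / h
    cx = (sum(c for _, c in pixels) / n + 0.5) / w
    return cy, cx, n / (h * w)

# A 2x2 segment in the top-right corner of a 4x4 image.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
descriptor = segment_descriptor(mask)
```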
2402.17766
Report |
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction |
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, He Wang, Li Yi, Kaisheng Ma |
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model
(LLM) designed for embodied interaction, exploring a universal 3D object
understanding with 3D point clouds and languages. ShapeLLM is built upon an
improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view
image distillation for enhanced geometry understanding. By utilizing ReCon++ as
the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed
instruction-following data and tested on our newly human-curated evaluation
benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance
in 3D geometry understanding and language-unified 3D interaction tasks, such as
embodied visual grounding. |
Presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. |
To bridge the gap between LLMs and 3D object understanding, particularly in embodied interaction, where precise geometry and interaction knowledge are crucial. |
Leverages 3D point clouds as inputs, introduces selective multi-view distillation in the 3D encoder (ReCon++), and employs 3D visual instruction tuning with data constructed using GPT-4V. |
ReCon++ achieves state-of-the-art performance in 3D object recognition, surpassing previous best records on ScanObjectNN and ModelNet40.
ShapeLLM successfully unifies various downstream tasks, including 3D captioning, 3D VQA, embodied task planning & decomposition, and 3D embodied visual grounding.
On the newly constructed 3D MM-Vet benchmark, ShapeLLM outperforms previous 3D point cloud-based methods, achieving 49.3% total accuracy. |
ShapeLLM's training data for embodied interaction is limited to indoor articulated furniture.
Real-time deployment requires addressing efficiency concerns, potentially through model compression techniques. |
multimodal large language model, 3d object understanding, embodied interaction, 3d point cloud processing, visual instruction tuning |
2402.17726
Report |
VRP-SAM: SAM with Visual Reference Prompt |
Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li |
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that
empowers the Segment Anything Model (SAM) to utilize annotated reference images
as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM
can utilize annotated reference images to comprehend specific objects and
perform segmentation of specific objects in a target image. Note that the
VRP encoder can support a variety of annotation formats for reference images,
including point, box, scribble, and mask.
VRP-SAM achieves a breakthrough within the SAM framework by extending its
versatility and applicability while preserving SAM's inherent strengths, thus
enhancing user-friendliness. To enhance the generalization ability of VRP-SAM,
the VRP encoder adopts a meta-learning strategy. To validate the effectiveness
of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO
datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual
reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM
demonstrates strong generalization capabilities, allowing it to perform
segmentation of unseen objects and enabling cross-domain segmentation. The
source code and models will be available at
https://github.com/syp2ysy/VRP-SAM |
This paper proposes VRP-SAM, an extension of the Segment Anything Model (SAM) by incorporating a Visual Reference Prompt (VRP) encoder, enabling SAM to perform visual reference segmentation using annotated reference images as prompts. |
Existing prompt formats in SAM pose challenges for complex scenes and numerous images as they require user familiarity with target objects and custom prompts for each image. VRP-SAM addresses these limitations by using annotated reference images to guide segmentation, improving efficiency and reducing reliance on user input. |
VRP-SAM introduces a VRP encoder that processes annotated reference images and generates prompt embeddings. These embeddings guide SAM's mask decoder to segment target objects with similar semantics. The VRP encoder utilizes meta-learning by extracting object prototypes from reference images to enhance target object representation in both reference and target images. Learnable queries interact with enhanced features to generate prompt embeddings for the SAM decoder. |
VRP-SAM achieves state-of-the-art performance in visual reference segmentation with minimal learnable parameters, outperforming previous methods on PASCAL-5i and COCO-20i datasets.
VRP-SAM effectively addresses limitations of geometric prompts, demonstrating superior performance by avoiding false-positive prompts often generated by geometric approaches.
VRP-SAM shows strong generalization capability, effectively handling unknown objects and cross-domain scenarios, as evidenced by domain shift experiments and visualization on diverse image styles. |
The current work focuses on few-shot semantic segmentation, with future exploration aimed at extending VRP-SAM to a wider range of vision tasks such as video object segmentation and object tracking.
Further investigation is needed to explore the full potential of VRP-SAM in more complex real-world applications and diverse datasets. |
visual reference segmentation, segment anything model (sam), meta-learning, visual reference prompt, few-shot learning |
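The VRP-SAM entry above describes extracting object prototypes from annotated reference images. One common way to do this in few-shot segmentation is masked average pooling, sketched below. Whether VRP-SAM uses exactly this operation is an assumption, and the nested-list feature layout is chosen only to keep the example free of dependencies.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask is 1.

    features: H x W x C nested lists; mask: H x W of 0/1.
    Returns a length-C prototype vector.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for k in range(channels):
                    total[k] += feat[k]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

# Toy 2x2 feature map with 2 channels; the mask excludes the last pixel.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [9.0, 9.0]],
]
mask = [[1, 1], [1, 0]]
prototype = masked_average_pool(features, mask)
```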
2402.17563
Report |
Structure-Guided Adversarial Training of Diffusion Models |
Ling Yang, Haotian Qian, Zhilong Zhang, Jingwei Liu, Bin Cui |
Diffusion models have demonstrated exceptional efficacy in various generative
applications. While existing models focus on minimizing a weighted sum of
denoising score matching losses for data distribution modeling, their training
primarily emphasizes instance-level optimization, overlooking valuable
structural information within each mini-batch, indicative of pair-wise
relationships among samples. To address this limitation, we introduce
Structure-guided Adversarial training of Diffusion Models (SADM). In this
pioneering approach, we compel the model to learn manifold structures between
samples in each training batch. To ensure the model captures authentic manifold
structures in the data distribution, we advocate adversarial training of the
diffusion generator against a novel structure discriminator in a minimax game,
distinguishing real manifold structures from the generated ones. SADM
substantially improves existing diffusion transformers (DiT) and outperforms
existing methods in image generation and cross-domain fine-tuning tasks across
12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on
ImageNet for class-conditional image generation at resolutions of 256x256 and
512x512, respectively. |
Introduces Structure-guided Adversarial training of Diffusion Models (SADM) that compels the model to learn manifold structures between samples in each training batch, enhancing data distribution modeling. |
Existing diffusion models focus on instance-level optimization, neglecting valuable structural information within mini-batches that indicate pair-wise relationships among samples, hindering accurate data distribution modeling. |
Employs adversarial training between the diffusion generator and a novel structure discriminator. The discriminator distinguishes real manifold structures from generated ones, encouraging the generator to learn authentic data manifold structures. |
Significantly improves existing diffusion transformers and surpasses existing methods in image generation and cross-domain fine-tuning across 12 datasets.
Achieves state-of-the-art FID scores on ImageNet for class-conditional image generation (1.58 for 256x256 and 2.11 for 512x512 resolutions).
Demonstrates potential for rapid adaptation to new domains in cross-domain fine-tuning tasks. |
Reliance on pre-trained feature extractors for the structure discriminator.
Potential limitations in generalizing to highly complex or diverse datasets. |
diffusion models, generative models, adversarial training, manifold learning, image generation |
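The SADM entry above refers to pair-wise relationships among samples in a mini-batch. One simple way to encode such structure is a pairwise distance matrix, sketched below. In the paper the comparison between real and generated structure is made by a learned discriminator, so the fixed L1 gap used here is only a stand-in assumption for illustration.

```python
import math

def pairwise_distances(batch):
    """Euclidean distance between every pair of feature vectors in a batch."""
    return [[math.dist(u, v) for v in batch] for u in batch]

def structure_gap(batch_a, batch_b):
    """Sum of absolute differences between two pairwise distance matrices."""
    d_a, d_b = pairwise_distances(batch_a), pairwise_distances(batch_b)
    return sum(abs(a - b) for row_a, row_b in zip(d_a, d_b)
               for a, b in zip(row_a, row_b))

# Toy feature batches: the "generated" batch collapses two samples together,
# which instance-level losses alone would not necessarily penalize.
real = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
generated = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
gap = structure_gap(real, generated)
```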
2402.17485
Report |
EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions |
Linrui Tian, Qi Wang, Bang Zhang, Liefeng Bo |
In this work, we tackle the challenge of enhancing the realism and
expressiveness in talking head video generation by focusing on the dynamic and
nuanced relationship between audio cues and facial movements. We identify the
limitations of traditional techniques that often fail to capture the full
spectrum of human expressions and the uniqueness of individual facial styles.
To address these issues, we propose EMO, a novel framework that utilizes a
direct audio-to-video synthesis approach, bypassing the need for intermediate
3D models or facial landmarks. Our method ensures seamless frame transitions
and consistent identity preservation throughout the video, resulting in highly
expressive and lifelike animations. Experimental results demonstrate that EMO is
able to produce not only convincing speaking videos but also singing videos in
various styles, significantly outperforming existing state-of-the-art
methodologies in terms of expressiveness and realism. |
This paper proposes EMO, an expressive audio-driven portrait-video generation framework that generates portrait videos with expressive facial expressions and head poses from a single reference image and audio. |
Existing talking head generation methods often lack realism and expressiveness, especially when it comes to capturing subtle facial movements and diverse speaking styles. |
EMO leverages a direct audio-to-video synthesis approach based on diffusion models. It utilizes audio embeddings for motion and expression, reference image features for identity preservation, and weak control mechanisms for stability and consistency. |
EMO generates high-quality talking head videos with natural head movements and vivid expressions synchronized with the input audio.
The framework can generate videos of any duration and adapt to various portrait styles, including realistic, anime, and 3D.
Quantitative evaluations on the HDTF dataset show that EMO outperforms state-of-the-art methods in terms of video quality (FVD), frame quality (FID), identity preservation (F-SIM), and expressiveness (E-FID). |
The method is more computationally expensive than non-diffusion-based approaches.
The lack of explicit control signals for body parts can sometimes lead to artifacts. |
diffusion models, video generation, talking head, audio-to-video synthesis, expressive facial animation |
2402.17412
Report |
DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models |
Shyam Marjit, Harshit Singh, Nityanand Mathur, Sayak Paul, Chia-Mu Yu, Pin-Yu Chen |
In the realm of subject-driven text-to-image (T2I) generative models, recent
developments like DreamBooth and BLIP-Diffusion have led to impressive results
yet encounter limitations due to their intensive fine-tuning demands and
substantial parameter requirements. While the low-rank adaptation (LoRA) module
within DreamBooth offers a reduction in trainable parameters, it introduces a
pronounced sensitivity to hyperparameters, leading to a compromise between
parameter efficiency and the quality of T2I personalized image synthesis.
Addressing these constraints, we introduce DiffuseKronA, a
novel Kronecker product-based adaptation module that not only significantly
reduces the parameter count by 35% and 99.947% compared to LoRA-DreamBooth
and the original DreamBooth, respectively, but also enhances the quality of
image synthesis. Crucially, DiffuseKronA mitigates the issue of
hyperparameter sensitivity, delivering consistent high-quality generations
across a wide range of hyperparameters, thereby diminishing the necessity for
extensive fine-tuning. Furthermore, a more controllable decomposition makes
DiffuseKronA more interpretable and can even achieve up to a 50%
reduction with results comparable to LoRA-DreamBooth. Evaluated against diverse
and complex input images and text prompts, DiffuseKronA consistently
outperforms existing models, producing diverse images of higher quality with
improved fidelity and a more accurate color distribution of objects, all the
while upholding exceptional parameter efficiency, thus presenting a substantial
advancement in the field of T2I generative modeling. Our project page,
consisting of links to the code, and pre-trained checkpoints, is available at
https://diffusekrona.github.io/. |
Introduces DiffuseKronA, a novel Kronecker product-based adaptation module for fine-tuning text-to-image diffusion models that significantly reduces parameter count while enhancing image synthesis quality. |
Addresses limitations of existing methods like DreamBooth and LoRA-DreamBooth, which suffer from high parameter requirements, hyperparameter sensitivity, and a trade-off between parameter efficiency and image quality. |
Leverages the Kronecker product to capture structured relationships in weight matrices, enabling more efficient and expressive parameter updates compared to low-rank decomposition methods. |
Reduces trainable parameters by 35% compared to LoRA-DreamBooth and 99.947% compared to DreamBooth.
Demonstrates enhanced stability across a wide range of hyperparameters, mitigating the need for extensive fine-tuning.
Produces higher-quality images with improved fidelity, more accurate color distribution, and better text alignment compared to existing state-of-the-art methods. |
Optimal Kronecker factor configuration requires manual exploration.
Further research can explore applying DiffuseKronA to other diffusion model architectures beyond SDXL. |
text-to-image generation, diffusion models, parameter-efficient fine-tuning, kronecker product, image synthesis |
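For the DiffuseKronA entry above, a short sketch of why a Kronecker-product update can be compact: two small factors of shapes (a1, a2) and (b1, b2) produce an (a1*b1, a2*b2) update while storing only a1*a2 + b1*b2 numbers. The layer size, rank, and factor shapes below are hypothetical and are not the configurations reported in the paper.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]

def kron_param_count(a_shape, b_shape):
    """Parameters stored by the two Kronecker factors."""
    return a_shape[0] * a_shape[1] + b_shape[0] * b_shape[1]

def lora_param_count(d_out, d_in, rank):
    """Parameters stored by LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
delta_w = kron(A, B)  # 4x4 update built from two 2x2 factors

# Hypothetical 320x320 layer: rank-4 LoRA versus 16x16 and 20x20 factors.
lora_params = lora_param_count(320, 320, rank=4)
kron_params = kron_param_count((16, 16), (20, 20))
```

The counts depend entirely on the chosen factor shapes, which matches the entry's note that the Kronecker factor configuration currently requires manual exploration.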
2402.17403
Report |
Sora Generates Videos with Stunning Geometrical Consistency |
Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng |
The recently developed Sora model [1] has exhibited remarkable capabilities
in video generation, sparking intense discussions regarding its ability to
simulate real-world phenomena. Despite its growing popularity, there is a lack
of established metrics to evaluate its fidelity to real-world physics
quantitatively. In this paper, we introduce a new benchmark that assesses the
quality of the generated videos based on their adherence to real-world physics
principles. We employ a method that transforms the generated videos into 3D
models, leveraging the premise that the accuracy of 3D reconstruction is
heavily contingent on the video quality. From the perspective of 3D
reconstruction, we use the fidelity of the geometric constraints satisfied by
the constructed 3D models as a proxy to gauge the extent to which the generated
videos conform to real-world physics rules. Project page:
https://sora-geometrical-consistency.github.io/ |
This paper introduces a novel benchmark to evaluate the physical realism, specifically the geometric consistency, of videos generated by the state-of-the-art text-to-video model, Sora. |
Existing metrics for evaluating video generation models fail to capture the adherence of generated content to real-world physics, a crucial aspect of realism, especially in light of models like Sora demonstrating such capabilities. |
The authors leverage the principles of 3D reconstruction, using the quality of 3D models generated from the videos as a proxy for their geometric consistency. They employ traditional computer vision techniques like Structure-from-Motion (SfM) and Gaussian Splatting, along with metrics based on feature matching and reprojection errors. |
Sora-generated videos exhibit significantly higher geometric consistency compared to videos generated by Pika Labs and Gen-2, evidenced by better 3D reconstruction quality and more accurate feature matching.
Sora maintains this geometric consistency over longer video durations, indicating its ability to preserve physical and geometric properties over time.
Visualizations of point clouds, Gaussian Splatting renderings, and stereo matching results further confirm the superior geometric fidelity of Sora-generated videos. |
The study primarily focuses on geometric consistency and acknowledges the need to incorporate additional physics-based metrics like texture authenticity and object interaction logic in future work.
The use of traditional computer vision techniques for 3D reconstruction might be complemented by exploring deep learning-based methods like NeRF for potentially more robust evaluations. |
video generation, geometric consistency, text-to-video synthesis, 3d reconstruction, sora |
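The Sora benchmark entry above lists reprojection error among its metrics. The sketch below shows the standard definition under a simple pinhole camera with points already in the camera frame. The intrinsics and data are made up, and the paper's full SfM pipeline (pose estimation, feature matching) is not reproduced here.

```python
import math

def project(point, focal, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixels."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + cx, focal * y / z + cy)

def mean_reprojection_error(points, observations, focal, cx, cy):
    """Mean pixel distance between projected 3D points and observed keypoints."""
    errors = [
        math.dist(project(p, focal, cx, cy), obs)
        for p, obs in zip(points, observations)
    ]
    return sum(errors) / len(errors)

# Two reconstructed points and the 2D keypoints they were matched to.
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
observations = [(320.0, 240.0), (573.0, 244.0)]
err = mean_reprojection_error(points, observations, focal=500.0, cx=320.0, cy=240.0)
```

Lower values mean the reconstructed geometry explains the observed frames better, which is the sense in which reconstruction quality serves as a proxy for geometric consistency.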
2402.17323
Report |
SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection |
Junsu Kim, Hoseong Cho, Jihyeon Kim, Yihalem Yimolal Tiruneh, Seungryul Baek |
In the field of class incremental learning (CIL), generative replay has
become increasingly prominent as a method to mitigate catastrophic
forgetting, alongside the continuous improvements in generative models.
However, its application in class incremental object detection (CIOD) has been
significantly limited, primarily due to the complexities of scenes involving
multiple labels. In this paper, we propose a novel approach called stable
diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a
diffusion-based generative model with pre-trained text-to-diffusion networks to
generate realistic and diverse synthetic images. SDDGR incorporates an
iterative refinement strategy to produce high-quality images encompassing old
classes. Additionally, we adopt an L2 knowledge distillation technique to
improve the retention of prior knowledge in synthetic images. Furthermore, our
approach includes pseudo-labeling for old objects within new task images,
preventing misclassification as background elements. Extensive experiments on
the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing
algorithms, achieving a new state-of-the-art in various CIOD scenarios. The
source code will be made available to the public. |
This paper introduces Stable Diffusion Deep Generative Replay (SDDGR), a novel method leveraging pre-trained text-to-image diffusion models to generate synthetic images for mitigating catastrophic forgetting in class incremental object detection (CIOD). |
Existing CIOD methods struggle to retain knowledge of previous classes when learning new ones, hindering their ability to handle complex, multi-label scenes. SDDGR addresses this by utilizing the power of diffusion models for realistic image synthesis and knowledge preservation. |
SDDGR utilizes a pre-trained text-to-image diffusion model with grounding inputs (classes and bounding boxes) to generate images of past objects. It then uses iterative refinement and L2 knowledge distillation to improve image quality and transfer knowledge to the updated model. Additionally, it employs pseudo-labeling on new task images to prevent misclassification of old objects as background. |
SDDGR outperforms existing CIOD methods, achieving state-of-the-art accuracy on the COCO dataset in both two-phase and multi-phase learning scenarios.
Ablation studies show the significance of each component (refinement, distillation, pseudo-labeling) in boosting performance and reducing forgetting.
The method demonstrates robustness to variations in hyperparameters like generated image count and refinement threshold. |
The method assumes access to a powerful pre-trained diffusion model, which may not always be readily available.
Future work could explore different prompt engineering techniques and diffusion model architectures for further performance improvement in CIOD. |
class incremental learning, object detection, generative replay, diffusion models, catastrophic forgetting |
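Two components named in the SDDGR entry above, pseudo-labeling of old objects and L2 knowledge distillation, can be sketched in a few lines. The detection dictionary layout, the 0.5 threshold, and the function names are assumptions for illustration; the paper's actual detector outputs and loss weighting are not shown in the entry.

```python
def pseudo_label(detections, old_classes, score_threshold=0.5):
    """Keep confident old-class detections from the previous-task detector."""
    return [
        d for d in detections
        if d["label"] in old_classes and d["score"] >= score_threshold
    ]

def l2_distillation(old_outputs, new_outputs):
    """Mean squared difference between old-model and new-model outputs."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("outputs must have the same length")
    return sum((o - n) ** 2 for o, n in zip(old_outputs, new_outputs)) / len(old_outputs)

old_classes = {"cat", "dog"}
detections = [
    {"label": "cat", "score": 0.9, "box": (10, 10, 50, 50)},
    {"label": "dog", "score": 0.3, "box": (60, 20, 90, 70)},  # too uncertain
    {"label": "car", "score": 0.8, "box": (0, 0, 30, 30)},    # not an old class
]
ground_truth = [{"label": "bike", "box": (5, 5, 40, 40)}]     # new-task annotation

pseudo = pseudo_label(detections, old_classes)
training_targets = ground_truth + [
    {"label": d["label"], "box": d["box"]} for d in pseudo
]
loss = l2_distillation([1.0, 2.0], [1.0, 4.0])
```

Adding the confident old-class boxes to the targets keeps those objects from being treated as background while the detector learns the new classes.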
2402.17298
Report |
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks |
Yang Liu, Xiaomin Yu, Gongyu Zhang, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin |
In this study, we address the challenging task of bridging the modality gap
between learning from language and inference for visual tasks, including Visual
Question Answering (VQA), Image Captioning (IC) and Visual Entailment (VE). We
train models for these tasks in a zero-shot cross-modal transfer setting, a
domain where the previous state-of-the-art method relied on fixed-scale
noise injection, often compromising the semantic content of the original
modality embedding. To address this, we propose a novel method called Adaptive
ranged cosine Similarity injected noise (ArcSin). First, we introduce an
innovative adaptive noise scale that effectively generates the textual elements
with more variability while preserving the original text feature's integrity.
Second, a similarity pool strategy is employed, expanding the domain
generalization potential by broadening the overall noise scale. This dual
strategy effectively widens the scope of the original domain while safeguarding
content integrity. Our empirical results demonstrate that these models closely
rival those trained on images in terms of performance. Specifically, our method
exhibits substantial improvements over the previous state-of-the-art, achieving
gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively.
Additionally, we observe increases of 1.5 percentage points (pp), 1.4 pp, and
1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries
of what is achievable within the constraints of image-trained model benchmarks.
The code will be released. |
This paper introduces ArcSin, a novel adaptive noise injection technique for language-driven visual tasks that effectively bridges the modality gap between text and image data. |
Bridging the modality gap is crucial for zero-shot cross-modal transfer learning in vision-language tasks, enabling models to understand and interpret visual information using only textual data, which is abundant and cost-effective to acquire. |
ArcSin utilizes adaptive ranged noise injection based on cosine similarity and feature magnitude to expand the text feature domain while preserving semantic content. It also employs a similarity pool strategy to further broaden the noise scale and enhance domain generalization. |
ArcSin outperforms previous state-of-the-art methods in zero-shot cross-modal transfer learning for various vision-language tasks, including image captioning, visual question answering, and visual entailment.
The adaptive noise injection technique proves more effective than fixed-scale noise injection, demonstrating the importance of content preservation during domain generalization.
The performance improvement is consistent across various contrastive and language backbone models, highlighting the robustness and generalizability of the proposed method. |
ArcSin may struggle with disentangling and interpreting intricate visual details or differentiating between similar foreground and background elements.
Future work will focus on enhancing the comprehension of complex visual features through exclusively text-based learning. |
cross-modal transfer learning, vision-language tasks, modality gap, noise injection, zero-shot learning |
2402.17292
Report |
DivAvatar: Diverse 3D Avatar Generation with a Single Prompt |
Weijing Tao, Biwen Lei, Kunhao Liu, Shijian Lu, Miaomiao Cui, Xuansong Xie, Chunyan Miao |
Text-to-Avatar generation has recently made significant strides due to
advancements in diffusion models. However, most existing work remains
constrained by limited diversity, producing avatars with subtle differences in
appearance for a given text prompt. We design DivAvatar, a novel framework that
generates diverse avatars, empowering 3D creatives with a multitude of distinct
and richly varied 3D avatars from a single text prompt. Different from most
existing work that exploits scene-specific 3D representations such as NeRF,
DivAvatar finetunes a 3D generative model (i.e., EVA3D), allowing diverse
avatar generation simply from noise sampling at inference time. DivAvatar has
two key designs that help achieve generation diversity and visual quality. The
first is a noise sampling technique during the training phase, which is critical in
generating diverse appearances. The second is a semantic-aware zoom mechanism
and a novel depth loss, the former producing appearances of high textual
fidelity by separate fine-tuning of specific body parts and the latter
improving geometry quality greatly by smoothing the generated mesh in the
feature space. Extensive experiments show that DivAvatar is highly versatile
in generating avatars of diverse appearances. |
DivAvatar: a novel framework for generating diverse 3D avatars from a single text prompt. |
Most existing text-to-avatar methods lack diversity, producing avatars with subtle differences. DivAvatar addresses this by enabling the generation of a variety of distinct and realistic avatars, crucial for inclusivity and efficiency in virtual environments. |
DivAvatar finetunes a pretrained 3D generative model (EVA3D) with a novel noise sampling technique during training, a semantic-aware zoom mechanism for textual fidelity, and a feature-based depth loss for geometry refinement. It leverages the inherent diversity of GANs and incorporates a diffusion prior (SDS) for text-guided generation. |
DivAvatar generates significantly more diverse avatars compared to existing methods like Stable Dreamfusion and AvatarCraft.
The noise sampling technique is crucial for achieving diversity, while the semantic zoom and depth loss improve texture fidelity and geometry quality respectively.
The method allows for flexible control over diversity levels by adjusting the probability of random noise sampling during training. |
Generated textures lack photorealistic details, requiring additional mesh optimization.
Limited diversity observed for specific uniforms, possibly due to the training dataset bias of the underlying generative model (EVA3D). |
text-to-3d, avatar generation, generative models, diversity, diffusion models |
2402.17245
Report |
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation |
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, Suhail Doshi |
In this work, we share three insights for achieving state-of-the-art
aesthetic quality in text-to-image generative models. We focus on three
critical aspects for model improvement: enhancing color and contrast, improving
generation across multiple aspect ratios, and improving human-centric fine
details. First, we delve into the significance of the noise schedule in
training a diffusion model, demonstrating its profound impact on realism and
visual fidelity. Second, we address the challenge of accommodating various
aspect ratios in image generation, emphasizing the importance of preparing a
balanced bucketed dataset. Lastly, we investigate the crucial role of aligning
model outputs with human preferences, ensuring that generated images resonate
with human perceptual expectations. Through extensive analysis and experiments,
Playground v2.5 demonstrates state-of-the-art performance in terms of aesthetic
quality under various conditions and aspect ratios, outperforming both
widely used open-source models like SDXL and Playground v2, and closed-source
commercial systems such as DALL·E 3 and Midjourney v5.2. Our model is
open-source, and we hope the development of Playground v2.5 provides valuable
guidelines for researchers aiming to elevate the aesthetic quality of
diffusion-based image generation models. |
Presents Playground v2.5, a text-to-image model with state-of-the-art aesthetic quality achieved by focusing on color and contrast enhancement, multi-aspect ratio generation, and human-centric detail refinement. |
Addresses the limitations of existing models in producing visually compelling images that align with human preferences, crucial for real-world applications and user satisfaction. |
Employs the EDM framework for enhanced noise scheduling and color vibrancy, implements a balanced dataset for multi-aspect ratio generation, and utilizes a human-in-the-loop approach for aligning outputs with human preferences. |
Outperforms state-of-the-art models, including Midjourney v5.2 and DALL·E 3, in aesthetic quality based on user studies.
Generates high-quality images across various aspect ratios, overcoming limitations of previous models.
Exhibits superior performance in rendering human-centric details, such as facial features and overall lighting. |
Future work will focus on improving text-to-image alignment and model variation capabilities.
Exploration of new architectures for enhanced image generation and editing. |
text-to-image generation, diffusion models, aesthetic quality, human preference alignment, multi-aspect ratio generation |
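The "balanced bucketed dataset" in the method summary above can be pictured with a minimal aspect-ratio bucketing sketch. The bucket list, the function names, and the per-bucket cap used for balancing are illustrative assumptions, not the actual Playground v2.5 data pipeline.

```python
import math
from collections import defaultdict

# Hypothetical (width, height) buckets with roughly equal pixel counts.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest in log space."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


def balanced_buckets(image_sizes, cap_per_bucket):
    """Group image indices by bucket and cap every bucket at the same size.

    Capping is only one (assumed) way to keep buckets balanced.
    """
    groups = defaultdict(list)
    for index, (width, height) in enumerate(image_sizes):
        groups[nearest_bucket(width, height)].append(index)
    return {bucket: indices[:cap_per_bucket] for bucket, indices in groups.items()}
```

Comparing ratios in log space treats a 2:1 and a 1:2 image symmetrically, which a plain difference of ratios would not.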
2402.17214
Report |
CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization |
Hao-Yang Peng, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu |
In the field of digital content creation, generating high-quality 3D
characters from single images is challenging, especially given the complexities
of various body poses and the issues of self-occlusion and pose ambiguity. In
this paper, we present CharacterGen, a framework developed to efficiently
generate 3D characters. CharacterGen introduces a streamlined generation
pipeline along with an image-conditioned multi-view diffusion model. This model
effectively calibrates input poses to a canonical form while retaining key
attributes of the input image, thereby addressing the challenges posed by
diverse poses. A transformer-based, generalizable sparse-view reconstruction
model is the other core component of our approach, facilitating the creation of
detailed 3D models from multi-view images. We also adopt a
texture-back-projection strategy to produce high-quality texture maps.
Additionally, we have curated a dataset of anime characters, rendered in
multiple poses and views, to train and evaluate our model. Our approach has
been thoroughly evaluated through quantitative and qualitative experiments,
showing its proficiency in generating 3D characters with high-quality shapes
and textures, ready for downstream applications such as rigging and animation. |
Presents CharacterGen, an efficient framework for generating high-quality 3D character models in a canonical pose from single images, overcoming challenges posed by diverse body poses and self-occlusion. |
Generating high-quality 3D characters from single images is crucial for various applications, but existing methods struggle with diverse poses and self-occlusion. This work offers a solution to these problems and streamlines the creation process. |
CharacterGen uses a two-stage approach: 1) an image-conditioned multi-view diffusion model to canonicalize input poses to a standard 'A-pose' while generating consistent multi-view images, and 2) a transformer-based sparse-view reconstruction model to create the 3D character from these images. |
Generates high-quality 3D characters in a canonical pose, suitable for rigging and animation.
Successfully addresses the challenges of self-occlusion and pose ambiguity in character generation.
Outperforms existing methods in terms of generation quality and speed, as evidenced by quantitative and qualitative comparisons. |
May not perfectly capture information from extreme poses or uncommon viewpoints.
Could further enhance texture quality by incorporating non-photorealistic rendering techniques. |
3d character generation, multi-view diffusion model, pose canonicalization, sparse-view reconstruction, texture refinement |
2402.17177
Report |
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models |
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun |
Sora is a text-to-video generative AI model, released by OpenAI in February
2024. The model is trained to generate videos of realistic or imaginative
scenes from text instructions and shows potential in simulating the physical
world. Based on public technical reports and reverse engineering, this paper
presents a comprehensive review of the model's background, related
technologies, applications, remaining challenges, and future directions of
text-to-video AI models. We first trace Sora's development and investigate the
underlying technologies used to build this "world simulator". Then, we describe
in detail the applications and potential impact of Sora in multiple industries
ranging from film-making and education to marketing. We discuss the main
challenges and limitations that need to be addressed to widely deploy Sora,
such as ensuring safe and unbiased video generation. Lastly, we discuss the
future development of Sora and video generation models in general, and how
advancements in the field could enable new ways of human-AI interaction,
boosting productivity and creativity of video generation. |
Sora is a text-to-video generative AI model that can produce high-quality videos up to one minute long from text instructions. |
Sora is a breakthrough in AI-powered vision generation with the potential to revolutionize various fields, including film-making, education, and accessibility. |
Sora employs a diffusion transformer pre-trained on a massive dataset of text-video pairs. It utilizes spacetime latent patches to compress and process video data efficiently. It also leverages techniques like caption improvement for enhanced instruction following and prompt engineering for guiding video generation. |
Sora can generate high-quality videos of up to 1 minute in length from text prompts, including complex scenes with multiple characters and intricate backgrounds.
It demonstrates emergent abilities in simulating aspects of the physical world and digital environments without explicit 3D modeling.
Sora allows for flexible video generation, accommodating variable durations, resolutions, and aspect ratios. |
Challenges remain in accurately simulating complex physical interactions and maintaining spatial and temporal consistency in intricate scenes.
Sora currently cannot generate videos longer than one minute, restricting its application in scenarios requiring extended content. |
text-to-video generation, ai-powered vision, diffusion models, transformer models, generative ai |
2402.17139
Report |
Video as the New Language for Real-World Decision Making |
Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans |
Both text and video data are abundant on the internet and support large-scale
self-supervised learning through next token or frame prediction. However, they
have not been equally leveraged: language models have had significant
real-world impact, whereas video generation has remained largely limited to
media entertainment. Yet video data captures important information about the
physical world that is difficult to express in language. To address this gap,
we discuss an under-appreciated opportunity to extend video generation to solve
tasks in the real world. We observe how, akin to language, video can serve as a
unified interface that can absorb internet knowledge and represent diverse
tasks. Moreover, we demonstrate how, like language models, video generation can
serve as planners, agents, compute engines, and environment simulators through
techniques such as in-context learning, planning and reinforcement learning. We
identify major impact opportunities in domains such as robotics, self-driving,
and science, supported by recent work that demonstrates how such advanced
capabilities in video generation are plausibly within reach. Lastly, we
identify key challenges in video generation that hinder progress. Addressing
these challenges will enable video generation models to demonstrate unique
value alongside language models in a wider array of AI applications. |
This paper argues that video generation will be as impactful for the physical world as language modeling is for the digital world, serving as planners, agents, compute engines, and simulators. |
Video captures crucial physical world information difficult to express in text, offering potential benefits to robotics, self-driving, and science. |
The authors analyze how video generation, like language modeling, provides a unified representation and task interface, enabling techniques like in-context learning, planning, and reinforcement learning. |
Video generation can solve diverse vision tasks, answer questions with detailed actions, and exhibit visual reasoning capabilities.
Action-conditioned video generation can simulate complex game environments and generate novel ones from image prompts.
Video generation serves as a simulator for robotics, self-driving (with domain randomization), and scientific processes, enabling policy optimization and mitigating hardware limitations. |
Current video datasets have limited coverage and lack sufficient annotations.
Lack of a single best model architecture for video generation hinders progress, requiring exploration of hybrid approaches. |
video generation, language modeling, embodied ai, simulation, real-world applications |
2402.17128
Report |
OSCaR: Object State Captioning and State Change Representation |
Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu |
The capability of intelligent models to extrapolate and comprehend changes in
object states is a crucial yet demanding aspect of AI research, particularly
through the lens of human interaction in real-world settings. This task
involves describing complex visual environments, identifying active objects,
and interpreting their changes as conveyed through language. Traditional
methods, which isolate object captioning and state change detection, offer a
limited view of dynamic environments. Moreover, relying on a small set of
symbolic words to represent changes has restricted the expressiveness of the
language. To address these challenges, in this paper, we introduce the Object
State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique
objects from various egocentric video collections. It sets a new testbed for
evaluating multimodal large language models (MLLMs). Our experiments
demonstrate that while MLLMs show some skill, they lack a full understanding of
object state changes. The benchmark includes a fine-tuned model that, despite
initial capabilities, requires significant improvements in accuracy and
generalization ability for effective understanding of these changes. Our code
and dataset are available at https://github.com/nguyennm1024/OSCaR. |
This paper introduces a new task and benchmark, OSCaR, for understanding object states and their changes using natural language, leveraging a GPT-4V-assisted data generation pipeline. |
Understanding object state change is crucial for AI agents to reason, learn, and interact with the physical world, bridging the gap between human and machine perception. |
The authors collected egocentric videos from EPIC-KITCHENS and Ego4D, used GPT-4V and human annotations to generate captions, questions, and conversations about object states, and fine-tuned LLaVA on this data. |
OSCaR outperforms previous state-of-the-art models in text generation metrics (BLEU, ROUGE) for describing object states.
Human evaluation shows that OSCaR achieves near-parity with GPT-4V in caption quality.
The benchmark includes open-world evaluations on objects unseen during training, highlighting the challenge of generalizability in object state understanding. |
The study lacks audio integration, limiting its applicability to scenarios where sound is crucial.
Tracking long-term state transitions remains a challenge due to the limitations of current models in capturing long-term information. |
object state understanding, egocentric vision, multimodal large language models, gpt-4v, benchmarking |
2402.17113
Report |
Transparent Image Layer Diffusion using Latent Transparency |
Lvmin Zhang, Maneesh Agrawala |
We present LayerDiffuse, an approach enabling large-scale pretrained latent
diffusion models to generate transparent images. The method allows generation
of single transparent images or of multiple transparent layers. The method
learns a "latent transparency" that encodes alpha channel transparency into the
latent manifold of a pretrained latent diffusion model. It preserves the
production-ready quality of the large diffusion model by regulating the added
transparency as a latent offset with minimal changes to the original latent
distribution of the pretrained model. In this way, any latent diffusion model
can be converted into a transparent image generator by finetuning it with the
adjusted latent space. We train the model with 1M transparent image layer pairs
collected using a human-in-the-loop collection scheme. We show that latent
transparency can be applied to different open source image generators, or be
adapted to various conditional control systems to achieve applications like
foreground/background-conditioned layer generation, joint layer generation,
structural control of layer contents, etc. A user study finds that in most
cases (97%) users prefer our natively generated transparent content over
previous ad-hoc solutions such as generating and then matting. Users also
report the quality of our generated transparent images is comparable to real
commercial transparent assets like Adobe Stock. |
Presents LayerDiffuse, enabling large-scale pretrained latent diffusion models to generate single or multiple transparent image layers. |
Addresses the lack of research in layered/transparent content generation despite its high demand in visual content editing. |
Encodes transparency as a "latent transparency" offset in the latent space of a pretrained model, preserving its quality. Trains with 1M transparent image layer pairs collected via human-in-the-loop. |
Generates high-quality transparent images with diverse content and effects (glass, hair, fire, etc.).
Produces harmonious compositions of multiple layers with consistent illumination and geometry.
Integrates with control models (e.g., ControlNet) for enhanced functionality (e.g., structure control). |
Trade-off exists between generating "clean" transparent elements and achieving "harmonious blending".
Generating backgrounds for clean transparent elements without specific illumination or shadow can be challenging. |
transparent image generation, layered image generation, latent diffusion models, stable diffusion, image synthesis |
2402.16991
Report |
A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data |
Antonio Sclocchi, Alessandro Favero, Matthieu Wyart |
Understanding the structure of real data is paramount in advancing modern
deep-learning methodologies. Natural data such as images are believed to be
composed of features organised in a hierarchical and combinatorial manner,
which neural networks capture during learning. Recent advancements show that
diffusion models can generate high-quality images, hinting at their ability to
capture this underlying structure. We study this phenomenon in a hierarchical
generative model of data. We find that the backward diffusion process acting
after a time $t$ is governed by a phase transition at some threshold time,
where the probability of reconstructing high-level features, like the class of
an image, suddenly drops. Instead, the reconstruction of low-level features,
such as specific details of an image, evolves smoothly across the whole
diffusion process. This result implies that at times beyond the transition, the
class has changed but the generated sample may still be composed of low-level
elements of the initial image. We validate these theoretical insights through
numerical experiments on class-unconditional ImageNet diffusion models. Our
analysis characterises the relationship between time and scale in diffusion
models and puts forward generative models as powerful tools to model
combinatorial data properties. |
This paper studies how reversing time in denoising diffusion models reveals the hierarchical and compositional nature of data, particularly in image generation. |
Understanding this interplay between time and feature hierarchy in diffusion models can shed light on their remarkable success, including generalization abilities and data efficiency. |
The authors use a combination of theoretical analysis with a hierarchical generative model (Random Hierarchy Model) and empirical experiments on ImageNet. They analyze the denoising dynamics, specifically the probability of reconstructing features at different hierarchical levels as a function of time and noise. |
A phase transition exists in the denoising process where the probability of maintaining the original image class sharply drops at a specific time/noise level.
Low-level features of an image can change even at early denoising times, while the class remains stable.
Beyond the class transition, the model may still utilize low-level features from the original image to compose a new image belonging to a different class. |
The theoretical analysis assumes a simplified noise model and mean-field approximation.
Future work can explore these phenomena in other data domains like text, using diffusion language models. |
diffusion models, generative models, hierarchical data, compositionality, phase transition |
2402.16936
Report |
Disentangled 3D Scene Generation with Layout Learning |
Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros, Aleksander Holynski |
We introduce a method to generate 3D scenes that are disentangled into their
component objects. This disentanglement is unsupervised, relying only on the
knowledge of a large pretrained text-to-image model. Our key insight is that
objects can be discovered by finding parts of a 3D scene that, when rearranged
spatially, still produce valid configurations of the same scene. Concretely,
our method jointly optimizes multiple NeRFs from scratch - each representing
its own object - along with a set of layouts that composite these objects into
scenes. We then encourage these composited scenes to be in-distribution
according to the image generator. We show that despite its simplicity, our
approach successfully generates 3D scenes decomposed into individual objects,
enabling new capabilities in text-to-3D content creation. For results and an
interactive demo, see our project page at https://dave.ml/layoutlearning/ |
This paper introduces a novel method, called layout learning, for generating 3D scenes that are disentangled into their component objects using pretrained text-to-image models. |
Disentangling objects in 3D scenes is crucial for enabling object-level manipulation and editing, facilitating more controllable and interactive 3D content creation. |
The method optimizes multiple NeRFs, each representing a different object, along with a set of layouts defining their spatial arrangements. These NeRFs are jointly trained to produce realistic scenes evaluated by a pretrained text-to-image model. |
Layout learning successfully generates 3D scenes where individual NeRFs correspond to distinct objects, enabling object-level manipulation.
Quantitative evaluation using CLIP scores demonstrates that the generated scenes exhibit high visual quality and object disentanglement, nearing supervised per-object rendering performance.
The method facilitates several applications, including conditional scene generation around a given object, arranging 3D assets into semantically valid configurations, and decomposing existing scenes into objects. |
The model may encounter difficulties with object segmentation, occasionally grouping objects that always appear together or struggling with scenes containing many small objects.
Despite measures to ensure diversity, learned layouts can converge to overly similar configurations, limiting the variability of object arrangements. |
text-to-3d, disentanglement, unsupervised learning, object discovery, 3d scene generation |
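The idea of compositing per-object fields through a layout, described in the method summary above, can be sketched with toy density functions standing in for NeRFs. Everything here is an assumption made for illustration: the rigid (rotation, translation) layout parameterization, the summing of densities, and the names `rotation_z`, `unit_ball`, and `composite_density`.

```python
import numpy as np


def rotation_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def unit_ball(points):
    """Toy stand-in for one object's NeRF density: 1 inside the unit ball, else 0."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(float)


def composite_density(points, object_densities, layout):
    """Sum per-object densities at scene-space query points.

    layout holds one (R, t) pair per object, mapping object space to scene
    space as x_scene = R @ x_local + t. Each query point is mapped back into
    the object's own frame before that object's field is evaluated.
    """
    total = np.zeros(len(points))
    for density_fn, (rotation, translation) in zip(object_densities, layout):
        local = (points - translation) @ rotation  # row-vector form of R.T @ (x - t)
        total += density_fn(local)
    return total


layout = [(np.eye(3), np.array([2.0, 0.0, 0.0])),
          (rotation_z(np.pi / 2), np.array([-2.0, 0.0, 0.0]))]
queries = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [-2.0, 0.5, 0.0]])
densities = composite_density(queries, [unit_ball, unit_ball], layout)
```

Swapping in a different `layout` rearranges the same objects without touching their fields, which is the property the summary attributes to layout learning.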
2402.16889
Report |
Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation |
Aditya Desu, Xuanli He, Qiongkai Xu, Wei Lu |
As machine- and AI-generated content proliferates, protecting the
intellectual property of generative models has become imperative, yet verifying
data ownership poses formidable challenges, particularly in cases of
unauthorized reuse of generated data. The challenge of verifying data ownership
is further amplified by using Machine Learning as a Service (MLaaS), which
often functions as a black-box system.
Our work is dedicated to detecting data reuse from even an individual sample.
Traditionally, watermarking has been leveraged to detect AI-generated content.
However, unlike watermarking techniques that embed additional information as
triggers into models or generated content, potentially compromising output
quality, our approach identifies latent fingerprints inherently present within
the outputs through re-generation. We propose an explainable verification
procedure that attributes data ownership through re-generation, and further
amplifies these fingerprints in the generative models through iterative data
re-generation. This methodology is theoretically grounded and demonstrates
viability and robustness using recent advanced text and image generative
models. Our methodology is significant as it goes beyond protecting the
intellectual property of APIs and addresses important issues such as the spread
of misinformation and academic misconduct. It provides a useful tool to ensure
the integrity of sources and authorship, expanding its application in different
scenarios where authenticity and ownership verification are essential. |
The paper proposes a novel approach for verifying data ownership in generative models, particularly in black-box settings, by leveraging inherent model fingerprints through re-generation. |
The increasing use of generative AI models raises concerns about unauthorized data reuse and plagiarism. Existing watermarking techniques can impact output quality, and classification-based methods may lack robustness. The proposed method addresses these challenges by utilizing the unique characteristics of generative models for verification. |
The methodology involves two stages: Generation and Verification. The Generation stage uses iterative re-generation to amplify model fingerprints in outputs. The Verification stage compares the distance between the original data and re-generated versions using authentic and contrasting models. This is grounded in fixed-point theory, ensuring convergence and distinct fingerprint separation. |
Iterative re-generation effectively enhances model fingerprints, leading to converging distances between consecutive re-generations.
Authentic models consistently exhibit smaller re-generation distances compared to contrasting models, facilitating ownership verification.
The method achieves high precision and recall in verifying data ownership across various text and image generation models and tasks. |
Robustness against sophisticated paraphrasing attacks is limited.
The effectiveness may be compromised by significant alterations to the generated content. |
generative models, data ownership verification, model fingerprints, re-generation, intellectual property protection |
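The two stages in the method summary above (iterative re-generation, then comparing re-generation distances under an authentic and a contrasting model) can be illustrated with a toy sketch. The "models" here are simple contraction maps on vectors, chosen only because they have a fixed point. The helper names, the Euclidean distance, and the decision rule are assumptions, not the paper's implementation.

```python
import numpy as np


def make_toy_model(fixed_point, rate=0.5):
    """Toy 'generative model': a contraction that pulls inputs toward fixed_point."""
    fixed_point = np.asarray(fixed_point, dtype=float)

    def regenerate(x):
        return fixed_point + rate * (np.asarray(x, dtype=float) - fixed_point)

    return regenerate


def iterate_regeneration(model, x, steps):
    """Generation stage: re-generate the sample `steps` times with the same model."""
    for _ in range(steps):
        x = model(x)
    return x


def regeneration_distance(model, x):
    """Verification stage: distance between a sample and its one-step re-generation."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - model(x)))


authentic = make_toy_model([1.0, 0.0])
contrast = make_toy_model([-1.0, 2.0])

# A sample produced by the authentic model after five re-generation rounds.
sample = iterate_regeneration(authentic, [5.0, 5.0], steps=5)
d_authentic = regeneration_distance(authentic, sample)
d_contrast = regeneration_distance(contrast, sample)
is_authentic = d_authentic < d_contrast
```

In this toy setting each extra round halves the authentic model's re-generation distance, mirroring the converging distances listed in the findings, while the contrasting model's distance stays large.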
2402.16843
Report |
Multi-LoRA Composition for Image Generation |
Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, Weizhu Chen |
Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models
for the accurate rendition of specific elements like distinct characters or
unique styles in generated images. Nonetheless, existing methods face
challenges in effectively composing multiple LoRAs, especially as the number of
LoRAs to be integrated grows, thus hindering the creation of complex imagery.
In this paper, we study multi-LoRA composition through a decoding-centric
perspective. We present two training-free methods: LoRA Switch, which
alternates between different LoRAs at each denoising step, and LoRA Composite,
which simultaneously incorporates all LoRAs to guide more cohesive image
synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new
comprehensive testbed as part of this research. It features a diverse range of
LoRA categories with 480 composition sets. Utilizing an evaluation framework
based on GPT-4V, our findings demonstrate a clear improvement in performance
with our methods over the prevalent baseline, particularly evident when
increasing the number of LoRAs in a composition. |
This paper introduces two novel training-free methods, LoRA Switch and LoRA Composite, for composing multiple Low-Rank Adaptations (LoRAs) in text-to-image generation, improving the accuracy and quality of composing multiple user-specified elements in generated images. |
Existing LoRA composition methods struggle with effectively integrating multiple elements, especially as the number of LoRAs increases, limiting the controllability and complexity of generated images. This paper addresses these limitations by focusing on the denoising process of diffusion models. |
LoRA Switch alternates between activating different LoRAs at each denoising step, ensuring each element receives dedicated attention. LoRA Composite leverages all LoRAs simultaneously, drawing inspiration from classifier-free guidance to provide balanced guidance throughout image generation. |
Both LoRA Switch and LoRA Composite outperform the conventional LoRA merging approach, particularly when composing a higher number of LoRAs.
LoRA Switch demonstrates superior performance in composition quality, while LoRA Composite excels in overall image quality.
The study reveals a style dependency, with LoRA Switch excelling in realistic styles and LoRA Composite proving more effective in anime styles. |
Composable image generation, particularly with multiple elements, remains challenging despite the improvements offered by the proposed methods.
The evaluation using GPT-4V, while effective, reveals a positional bias that necessitates averaging scores across different input orders to ensure fairness. |
image generation, composable image generation, low-rank adaptation (lora), diffusion models, multimodal evaluation |
2402.16828
Report |
Training Neural Networks from Scratch with Parallel Low-Rank Adapters |
Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal |
The scalability of deep learning models is fundamentally limited by computing
resources, memory, and communication. Although methods like low-rank adaptation
(LoRA) have reduced the cost of model finetuning, their application in model
pre-training remains largely unexplored. This paper explores extending LoRA to
model pre-training, identifying the inherent constraints and limitations of
standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel
bi-level optimization algorithm designed to enable parallel training of
multiple low-rank heads across computing nodes, thereby reducing the need for
frequent synchronization. Our approach includes extensive experimentation on
vision transformers using various vision datasets, demonstrating that LTE is
competitive with standard pre-training. |
This paper proposes LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm, to enable the pre-training of large neural networks from scratch using low-rank adapters, addressing the limitations of standard LoRA in this context. |
Training large models is challenging due to hardware constraints (compute, memory, communication). LTE aims to mitigate these issues by utilizing parallel low-rank updates, making large model pre-training feasible on low-memory devices. |
LTE employs multiple low-rank adapter heads trained in parallel on different data shards with infrequent synchronization. It leverages low-precision storage for main weights and efficient communication of only the LoRA parameters. |
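The infrequent synchronization step can be pictured as folding the heads' low-rank updates back into the main weights. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it averages the products B_n A_n of all heads and adds them to the weight matrix with a hypothetical `scale` factor, using plain nested lists in place of tensors.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_heads(weight: Matrix,
                heads: List[Tuple[Matrix, Matrix]],
                scale: float = 1.0) -> Matrix:
    """Return weight + scale * mean_n(B_n @ A_n) for low-rank heads (B_n, A_n).

    Averaging over heads and the `scale` factor are assumptions for this
    sketch; the source only states that heads are merged infrequently.
    """
    if not heads:
        raise ValueError("at least one head is required")
    merged = [row[:] for row in weight]
    for b, a in heads:
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value / len(heads)
    return merged
```

Between merges, each head would train only its own (B_n, A_n) pair on its data shard, so only these small matrices need to be communicated.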
LTE achieves comparable performance to standard pre-training across various vision tasks and datasets.
Infrequent merging of LoRA heads is crucial for performance, striking a balance between accuracy and communication cost.
Parallel LTE heads explore diverse subspaces, contributing to its effectiveness. |
LTE currently exhibits slower convergence in the final stages of training on ImageNet-1K compared to standard training.
Further optimization is needed to determine the ideal number of ranks/heads and explore heterogeneous LoRA parameterization.
Future work will focus on smarter merging strategies for improved efficiency with larger local steps. |
model pre-training, low-rank adaptation, parallel training, low-memory devices, federated learning |
2402.16806
Report |
Multi-Human Mesh Recovery with Transformers |
Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy |
Conventional approaches to human mesh recovery predominantly employ a
region-based strategy. This involves initially cropping out a human-centered
region as a preprocessing step, with subsequent modeling focused on this
zoomed-in image. While effective for single figures, this pipeline poses
challenges when dealing with images featuring multiple individuals, as
different people are processed separately, often leading to inaccuracies in
relative positioning. Despite the advantages of adopting a whole-image-based
approach to address this limitation, early efforts in this direction have
fallen short in performance compared to recent region-based methods. In this
work, we advocate for this under-explored area of modeling all people at once,
emphasizing its potential for improved accuracy in multi-person scenarios
through considering all individuals simultaneously and leveraging the overall
context and interactions. We introduce a new model with a streamlined
transformer-based design, featuring three critical design choices: multi-scale
feature incorporation, focused attention mechanisms, and relative joint
supervision. Our proposed model demonstrates a significant performance
improvement, surpassing state-of-the-art region-based and whole-image-based
methods on various benchmarks involving multiple individuals. |
This paper introduces a novel whole-image-based human mesh recovery (HMR) method that addresses limitations of conventional region-based approaches in multi-person scenarios. |
Region-based HMR methods, while effective for single figures, struggle to accurately capture relative positioning in multi-person images due to independent processing. Whole-image-based methods offer a solution by processing all individuals simultaneously, but have lagged behind in performance. |
The proposed model employs a streamlined transformer-based architecture with multi-scale feature incorporation, focused attention mechanisms using deformable attention, and a novel relative joint loss function to supervise relative joint locations. |
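One plausible form of a relative joint loss is sketched below. The exact formulation used in the paper is not given in this summary, so the choice of an L1 penalty over all pairwise joint offsets, as well as the function name and data layout, are assumptions made purely for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Joint = Tuple[float, float, float]


def relative_joint_loss(pred: List[List[Joint]], gt: List[List[Joint]]) -> float:
    """Mean L1 error on pairwise joint offsets across all people in an image.

    pred and gt are lists of people, each a list of (x, y, z) joints in the
    same order. Only differences between joints are compared, so a global
    translation of the whole scene leaves the loss unchanged.
    """
    p = [j for person in pred for j in person]
    g = [j for person in gt for j in person]
    if len(p) != len(g) or len(p) < 2:
        raise ValueError("pred and gt must hold the same joints (at least two)")
    total = 0.0
    pairs = 0
    for a, b in combinations(range(len(p)), 2):
        for axis in range(3):
            total += abs((p[a][axis] - p[b][axis]) - (g[a][axis] - g[b][axis]))
        pairs += 1
    return total / pairs
```

Because the loss looks only at offsets between joints, it penalizes a person placed at the wrong distance from another person even when each individual pose is correct.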
The method outperforms state-of-the-art region-based and whole-image-based methods on multiple multi-person benchmarks (CHI3D, Hi4D, BEDLAM).
Significant improvements are observed in the joint PA-MPJPE metric, highlighting its superior ability to model relative human positions.
Ablation studies confirm the importance of multi-scale features, focused attention, and relative joint loss for achieving superior performance. |
The method, while showing promise, still faces limitations in handling mesh penetration during close interactions, a common challenge for regression-based methods.
Future work could explore incorporating contact optimization strategies to further enhance the fidelity of reconstructions in such scenarios. |
human mesh recovery, multi-person pose estimation, whole-image-based modeling, transformer networks, deformable attention |
2402.16641
Report |
Towards Open-ended Visual Quality Comparison |
Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu, Guangtao Zhai, Shiqi Wang, Weisi Lin |
Comparative settings (e.g. pairwise choice, listwise ranking) have been
adopted by a wide range of subjective studies for image quality assessment
(IQA), as they inherently standardize the evaluation criteria across different
observers and offer more clear-cut responses. In this work, we extend the edge
of emerging large multi-modality models (LMMs) to further advance visual
quality comparison into open-ended settings, that 1) can respond to open-range
questions on quality comparison; 2) can provide detailed reasoning beyond
direct answers. To this end, we propose Co-Instruct. To train this
first-of-its-kind open-source open-ended visual quality comparer, we collect
the Co-Instruct-562K dataset, from two sources: (a) LLM-merged single image
quality description, (b) GPT-4V "teacher" responses on unlabeled data.
Furthermore, to better evaluate this setting, we propose the MICBench, the
first benchmark on multi-image comparison for LMMs. We demonstrate that
Co-Instruct not only achieves on average 30% higher accuracy than
state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher),
on both existing related benchmarks and the proposed MICBench. Our model is
published at https://huggingface.co/q-future/co-instruct. |
This paper introduces Co-Instruct, a novel instruction-tuning dataset designed for open-ended visual quality comparison, along with Co-Instruct-Comparer, an LMM trained on this dataset, and MICBench, a benchmark for evaluating multi-image quality comparison in LMMs. |
This work addresses the limitations of existing IQA methods that struggle with open-ended questions and detailed reasoning, particularly in comparative settings. It leverages the strengths of LMMs to provide more human-like and informative quality assessments. |
The authors construct the Co-Instruct-562K dataset by merging single-image quality descriptions using LLMs (Merge2Compare) and leveraging GPT-4V responses on unlabeled images (Teach2Compare). They propose Co-Instruct-Comparer, an LMM with reduced visual tokens and an image-text interleaved format, trained on Co-Instruct-562K. Finally, they introduce MICBench, a benchmark with 2,000 MCQs, to evaluate multi-image quality comparison. |
Co-Instruct-Comparer surpasses all existing LMMs, including GPT-4V, on various visual quality comparison benchmarks.
The model achieves human-level accuracy on Q-BenchPAIR-A1 and even outperforms non-expert humans on specific settings.
Analysis reveals that Co-Instruct-Comparer's detailed reasoning capabilities match GPT-4V while significantly exceeding other LMMs. |
GPT evaluation, used in Q-BenchPAIR-A2, might be biased towards longer text outputs, potentially underestimating Co-Instruct-Comparer's performance.
Future research could explore better evaluation metrics and datasets for fine-grained comparisons, particularly for highly similar image pairs. |
large multi-modality models (lmms), visual quality assessment, visual quality comparison, visual question answering, benchmarking |
2402.16627
Report |
Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing |
Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui |
Conditional diffusion models have exhibited superior performance in
high-fidelity text-guided visual generation and editing. Nevertheless,
prevailing text-guided visual diffusion models primarily focus on incorporating
text-visual relationships exclusively into the reverse process, often
disregarding their relevance in the forward process. This inconsistency between
forward and reverse processes may limit the precise conveyance of textual
semantics in visual synthesis results. To address this issue, we propose a
novel and general contextualized diffusion model (ContextDiff) by incorporating
the cross-modal context encompassing interactions and alignments between text
condition and visual sample into forward and reverse processes. We propagate
this context to all timesteps in the two processes to adapt their trajectories,
thereby facilitating cross-modal conditional modeling. We generalize our
contextualized diffusion to both DDPMs and DDIMs with theoretical derivations,
and demonstrate the effectiveness of our model in evaluations with two
challenging tasks: text-to-image generation, and text-to-video editing. In each
task, our ContextDiff achieves new state-of-the-art performance, significantly
enhancing the semantic alignment between text condition and generated samples,
as evidenced by quantitative and qualitative evaluations. Our code is available
at https://github.com/YangLing0818/ContextDiff |
This paper introduces ContextDiff, a novel conditional diffusion model designed to enhance text-guided visual generation and editing by incorporating cross-modal context into both forward and reverse processes. |
Existing text-guided visual diffusion models primarily incorporate text-visual relationships only in the reverse process, limiting their ability to precisely convey textual semantics in generated visuals. ContextDiff aims to address this limitation by leveraging cross-modal context for improved semantic alignment. |
ContextDiff utilizes a relational network (e.g., cross-attention) to model cross-modal interactions between text and visual data. This context is then propagated to all timesteps of both the forward and reverse diffusion processes, acting as a context-aware trajectory adapter. The method is generalized and theoretically derived for both DDPMs and DDIMs, benefiting both cross-modal generation and editing tasks. |
ContextDiff achieves state-of-the-art performance in text-to-image generation, outperforming dominant diffusion models like Stable Diffusion, DALL-E 2, and Imagen.
In text-to-video editing, ContextDiff surpasses existing methods in textual alignment and temporal consistency, as evidenced by quantitative metrics and user studies.
The context-aware adapter in ContextDiff generalizes well to other text-guided video diffusion models, consistently improving their generation quality. |
The theoretical analysis focuses on optimal estimation and doesn't fully address convergence behavior due to the complexity of neural network optimization.
Future work could explore incorporating more sophisticated relational networks or alternative cross-modal interaction modeling techniques. |
diffusion models, text-to-image generation, text-to-video editing, cross-modal learning, contextualized diffusion |
2402.16607
Report |
GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos |
Xinqi Liu, Chenming Wu, Jialun Liu, Xing Liu, Jinbo Wu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang |
In this paper, we present a novel method that facilitates the creation of
vivid 3D Gaussian avatars from monocular video inputs (GVA). Our innovation
lies in addressing the intricate challenges of delivering high-fidelity human
body reconstructions and aligning 3D Gaussians with human skin surfaces
accurately. The key contributions of this paper are twofold. Firstly, we
introduce a pose refinement technique to improve hand and foot pose accuracy by
aligning normal maps and silhouettes. Precise pose is crucial for correct shape
and appearance reconstruction. Secondly, we address the problems of unbalanced
aggregation and initialization bias that previously diminished the quality of
3D Gaussian avatars, through a novel surface-guided re-initialization method
that ensures accurate alignment of 3D Gaussian points with avatar surfaces.
Experimental results demonstrate that our proposed method achieves
high-fidelity and vivid 3D Gaussian avatar reconstruction. Extensive
experimental analyses validate the performance qualitatively and
quantitatively, demonstrating that it achieves state-of-the-art performance in
photo-realistic novel view synthesis while offering fine-grained control over
the human body and hand pose. Project page: https://3d-aigc.github.io/GVA/. |
This paper proposes GVA, a novel method to reconstruct high-fidelity, hand-controllable 3D Gaussian avatars from monocular videos. |
Existing methods struggle with accurate hand and foot pose estimation, leading to limitations in avatar expressiveness and controllability, especially for hand movements. |
The method uses a two-stage pose refinement technique aligning normal maps and silhouettes for accurate pose estimation. It then introduces a surface-guided re-initialization mechanism to address unbalanced aggregation and initialization bias in 3D Gaussian point distribution. |
The method achieves high-fidelity avatar reconstruction with detailed hand movements, as demonstrated on the ZJU-MoCap, People-Snapshot, and GVA-Snapshot datasets.
It outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations, showing better accuracy in shape, appearance, and perceptual quality.
The ablation study confirms the effectiveness of each proposed component, including pose refinement and surface-guided re-initialization. |
The method currently lacks facial expression control and struggles with very loose clothing.
Future work will explore incorporating learnable blendshapes for facial expressions and physics-based deformation priors for handling loose garments. |
3d gaussian avatar, monocular reconstruction, pose refinement, surface-guided re-initialization, hand controllable |
2402.16506
Report |
Stochastic Conditional Diffusion Models for Semantic Image Synthesis |
Juyeon Ko, Inho Kong, Dogyun Park, Hyunwoo J. Kim |
Semantic image synthesis (SIS) is a task to generate realistic images
corresponding to semantic maps (labels). It can be applied to diverse
real-world practices such as photo editing or content creation. However, in
real-world applications, SIS often encounters noisy user inputs. To address
this, we propose Stochastic Conditional Diffusion Model (SCDM), which is a
robust conditional diffusion model that features novel forward and generation
processes tailored for SIS with noisy labels. It enhances robustness by
stochastically perturbing the semantic label maps through Label Diffusion,
which diffuses the labels with discrete diffusion. Through the diffusion of
labels, the noisy and clean semantic maps become similar as the timestep
increases, eventually becoming identical at $t=T$. This facilitates the
generation of an image close to a clean image, enabling robust generation.
Furthermore, we propose a class-wise noise schedule to differentially diffuse
the labels depending on the class. We demonstrate that the proposed method
generates high-quality samples through extensive experiments and analyses on
benchmark datasets, including a novel experimental setup simulating human
errors during real-world applications. |
This paper proposes Stochastic Conditional Diffusion Model (SCDM), a robust conditional diffusion model for semantic image synthesis (SIS) that can handle noisy user inputs (labels). |
SIS is important for real-world applications like photo editing, but user-provided labels are often noisy, leading to poor image generation. SCDM addresses this challenge by improving robustness to noisy labels. |
SCDM introduces 'Label Diffusion,' a discrete diffusion process that stochastically perturbs semantic labels during training. This makes the model robust to discrepancies between clean training labels and noisy user-provided labels during inference. Additionally, SCDM uses a class-wise noise schedule to preserve information for small and rare objects. |
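A toy version of the label perturbation step is sketched below. It assumes an absorbing-state form of discrete diffusion, in which each pixel label is either kept or replaced by a dedicated absorbing class; this is one way to make noisy and clean maps coincide at the final timestep, but the summary does not confirm the transition actually used. The function name, the `keep_prob` dictionary, and `absorb_id` are hypothetical.

```python
import random
from typing import Dict, List


def perturb_labels(labels: List[List[int]],
                   keep_prob: Dict[int, float],
                   absorb_id: int,
                   rng: random.Random) -> List[List[int]]:
    """Stochastically perturb a semantic label map at one timestep.

    Each pixel keeps its class c with probability keep_prob[c] and otherwise
    moves to the absorbing class `absorb_id`. Giving small or rare classes a
    higher keep probability mimics a class-wise noise schedule.
    """
    return [[c if rng.random() < keep_prob[c] else absorb_id for c in row]
            for row in labels]
```

With all keep probabilities at 1.0 the map is returned unchanged, and with all at 0.0 every pixel becomes `absorb_id`, which corresponds to the fully diffused state in this simplified setting.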
SCDM outperforms existing GAN-based and diffusion-based SIS models on noisy label benchmarks, showing significant FID improvements.
SCDM demonstrates strong performance even with highly corrupted labels, generating images similar to those produced with clean labels.
The proposed class-wise noise schedule significantly improves the generation quality of small and rare objects. |
The reliance on pre-trained segmentation models for mIoU evaluation might not perfectly reflect the true semantic correspondence.
Exploring different noise schedule hyperparameters could further improve performance. |
diffusion models, semantic image synthesis, conditional generation, robustness, noisy labels |
2402.16421
Report |
Outline-Guided Object Inpainting with Diffusion Models |
Markus Pobitzer, Filip Janicki, Mattia Rigotti, Cristiano Malossi |
Instance segmentation datasets play a crucial role in training accurate and
robust computer vision models. However, obtaining accurate mask annotations to
produce high-quality segmentation datasets is a costly and labor-intensive
process. In this work, we show how this issue can be mitigated by starting with
small annotated instance segmentation datasets and augmenting them to
effectively obtain a sizeable annotated dataset. We achieve that by creating
variations of the available annotated object instances in a way that preserves
the provided mask annotations, thereby resulting in new image-mask pairs to be
added to the set of annotated images. Specifically, we generate new images
using a diffusion-based inpainting model to fill out the masked area with a
desired object class by guiding the diffusion through the object outline. We
show that the object outline provides a simple, but also reliable and
convenient training-free guidance signal for the underlying inpainting model
that is often sufficient to fill out the mask with an object of the correct
class without further text guidance and preserve the correspondence between
generated images and the mask annotations with high precision. Our experimental
results reveal that our method successfully generates realistic variations of
object instances, preserving their shape characteristics while introducing
diversity within the augmented area. We also show that the proposed method can
naturally be combined with text guidance and other image augmentation
techniques. |
This paper presents a novel data augmentation method for instance segmentation datasets leveraging diffusion-based inpainting guided by object outlines. |
Annotating instance segmentation datasets is expensive and time-consuming. This method offers a way to augment existing datasets and potentially improve model performance. |
The method erodes object masks to create outlines, then uses a diffusion-based inpainting model (Stable Diffusion) to generate variations of the original object within the outline, optionally guided by text prompts (object class). |
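The erosion step can be illustrated with a small pure-Python example. This is a sketch under stated assumptions: it uses a 4-neighbour structuring element and treats the outline as the mask pixels removed by erosion, while the eroded interior is the area handed to the inpainting model. The kernel, the number of iterations, and the function names are not taken from the paper.

```python
from typing import List, Tuple

Mask = List[List[int]]

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def erode(mask: Mask) -> Mask:
    """One step of binary erosion with a 4-neighbour structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and all(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                for dy, dx in NEIGHBOURS
            ):
                out[y][x] = 1
    return out


def outline_and_interior(mask: Mask, iterations: int = 1) -> Tuple[Mask, Mask]:
    """Split an object mask into a thin outline band and an eroded interior."""
    interior = mask
    for _ in range(iterations):
        interior = erode(interior)
    outline = [[1 if m and not i else 0 for m, i in zip(mrow, irow)]
               for mrow, irow in zip(mask, interior)]
    return outline, interior
```

For a 3 x 3 square object, one erosion step leaves a single interior pixel and an 8-pixel outline band; keeping the original pixels in the band is what would tie the generated object to the existing mask annotation.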
Augmenting a few-shot instance segmentation dataset with generated images improved segmentation average precision (AP).
The method achieved state-of-the-art Fréchet Inception Distance (FID) scores, indicating high-quality generated images.
Object outlines proved to be a strong guidance signal for the inpainting model, enabling realistic and diverse object variations. |
The method can fail when objects are severely occluded or out-of-distribution.
Further research is needed to understand the complex relationship between scene context, object outline, and generated image quality. |
image inpainting, data augmentation, instance segmentation, diffusion models, stable diffusion |
2402.16370
Report |
DEYO: DETR with YOLO for End-to-End Object Detection |
Haodong Ouyang |
The training paradigm of DETRs is heavily contingent upon pre-training their
backbone on the ImageNet dataset. However, the limited supervisory signals
provided by the image classification task and one-to-one matching strategy
result in an inadequately pre-trained neck for DETRs. Additionally, the
instability of matching in the early stages of training engenders
inconsistencies in the optimization objectives of DETRs. To address these
issues, we have devised an innovative training methodology termed step-by-step
training. Specifically, in the first stage of training, we employ a classic
detector, pre-trained with a one-to-many matching strategy, to initialize the
backbone and neck of the end-to-end detector. In the second stage of training,
we freeze the backbone and neck of the end-to-end detector, necessitating the
training of the decoder from scratch. Through the application of step-by-step
training, we have introduced the first real-time end-to-end object detection
model that utilizes a purely convolutional structure encoder, DETR with YOLO
(DEYO). Without reliance on any supplementary training data, DEYO surpasses all
existing real-time object detectors in both speed and accuracy. Moreover, the
comprehensive DEYO series can complete its second-phase training on the COCO
dataset using a single 8GB RTX 4060 GPU, significantly reducing the training
expenditure. Source code and pre-trained models are available at
https://github.com/ouyanghaodong/DEYO. |
The paper introduces DEYO, the first real-time end-to-end object detector that uses a purely convolutional encoder, and a novel step-by-step training method for DETRs that eliminates the need for pre-training on additional datasets like ImageNet. |
Existing DETR models rely on pre-training on datasets like ImageNet, limiting flexibility and increasing development costs. Additionally, limited supervisory signals in DETR training lead to inadequately pre-trained necks and unstable optimization. |
The step-by-step training method involves two stages: 1) Pre-train a classic detector (YOLOv8) with a one-to-many matching strategy to initialize the backbone and neck. 2) Freeze the backbone and neck and train the decoder from scratch. DEYO utilizes this method and incorporates a YOLO backbone and neck with a lightweight convolutional encoder and a Transformer-based decoder. |
DEYO surpasses state-of-the-art real-time object detectors in both speed and accuracy without any additional training data.
The step-by-step training method significantly improves performance compared to conventional DETR training.
DEYO demonstrates superior performance in dense scenarios, achieving 92.3 AP and 43.3 mMR on the CrowdHuman dataset. |
The neck of YOLOv8 and the model scaling strategy are not fully optimized for DEYO, leading to diminishing performance gains with increasing model size.
The mismatch between the output dimensions of YOLOv8's neck and the hidden dimensions of DEYO's decoder needs further investigation. |
object detection, detr, yolo, step-by-step training, real-time |
2402.16366
Report |
SPC-NeRF: Spatial Predictive Compression for Voxel Based Radiance Field |
Zetian Song, Wenhong Duan, Yuhuai Zhang, Shiqi Wang, Siwei Ma, Wen Gao |
Representing the Neural Radiance Field (NeRF) with the explicit voxel grid
(EVG) is a promising direction for improving NeRFs. However, the EVG
representation is not efficient for storage and transmission because of the
enormous memory cost. Current methods for compressing EVG mainly inherit the
methods designed for neural network compression, such as pruning and
quantization, which do not take full advantage of the spatial correlation of
voxels. Inspired by mature digital image compression techniques, this paper
proposes SPC-NeRF, a novel framework applying spatial predictive coding in EVG
compression. The proposed framework can remove spatial redundancy efficiently
for better compression performance. Moreover, we model the bitrate and design a
novel form of the loss function, where we can jointly optimize compression
ratio and distortion to achieve higher coding efficiency. Extensive experiments
demonstrate that our method can achieve 32% bit saving compared to the
state-of-the-art method VQRF on multiple representative test datasets, with
comparable training time. |
This paper introduces SPC-NeRF, a novel framework for compressing Explicit Voxel Grid (EVG) represented Neural Radiance Fields (NeRFs) using spatial predictive coding, leading to significant bitrate reduction without substantial quality loss. |
EVG NeRFs offer fast training and rendering but have large memory footprints, hindering storage and transmission. Existing compression methods often ignore spatial correlation among voxels. |
1) Importance pruning and identification of critical voxels. 2) Construction of a reference graph for spatial prediction. 3) Scalar quantization and prediction on the feature grid. 4) Joint rate-distortion optimization during finetuning with entropy modeling. 5) Two-step finetuning with coarse and fine quantization for critical voxels. |
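The quantization and prediction idea in step 3 can be shown with a deliberately simplified one-dimensional example. Instead of the reference graph described above, this sketch predicts each quantized value from its immediate predecessor and stores only the residual; the function names and the previous-neighbour predictor are assumptions chosen to keep the example short.

```python
from typing import List


def encode(values: List[float], step: float) -> List[int]:
    """Scalar-quantize values, then store residuals against the previous one.

    Spatially correlated inputs give residuals clustered near zero, which an
    entropy coder can represent with fewer bits than the raw indices.
    """
    quantized = [round(v / step) for v in values]
    return [q - (quantized[i - 1] if i else 0) for i, q in enumerate(quantized)]


def decode(residuals: List[int], step: float) -> List[float]:
    """Invert the prediction by accumulating residuals, then de-quantize."""
    out = []
    acc = 0
    for r in residuals:
        acc += r
        out.append(acc * step)
    return out
```

The only loss comes from quantization: decoding reproduces the quantized values exactly, so the quantization step size is the knob that trades bitrate against distortion in this toy setting.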
SPC-NeRF achieves 32% bit saving compared to VQRF on the Synthetic-NeRF dataset.
The method demonstrates over 100x compression on uncompressed DVGO with negligible quality degradation.
SPC-NeRF generates a smooth, approximate logarithmic rate-distortion curve by adjusting a single trade-off factor. |
The current implementation only uses complete voxels for prediction, limiting potential compression gains.
Future work can explore more efficient entropy coding and block-based prediction modes to further reduce bitrate. |
neural radiance fields, nerf compression, explicit voxel grid, spatial predictive coding, rate-distortion optimization |
2402.16359
Report |
Feedback Efficient Online Fine-Tuning of Diffusion Models |
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Sergey Levine, Tommaso Biancalani |
Diffusion models excel at modeling complex data distributions, including
those of images, proteins, and small molecules. However, in many cases, our
goal is to model parts of the distribution that maximize certain properties:
for example, we may want to generate images with high aesthetic quality, or
molecules with high bioactivity. It is natural to frame this as a reinforcement
learning (RL) problem, in which the objective is to fine-tune a diffusion model
to maximize a reward function that corresponds to some property. Even with
access to online queries of the ground-truth reward function, efficiently
discovering high-reward samples can be challenging: they might have a low
probability in the initial distribution, and there might be many infeasible
samples that do not even have a well-defined reward (e.g., unnatural images or
physically impossible molecules). In this work, we propose a novel
reinforcement learning procedure that efficiently explores the manifold of
feasible samples. We present a theoretical analysis providing a regret
guarantee, as well as empirical validation across three domains: images,
biological sequences, and molecules. |
This paper proposes SEIKO, a feedback-efficient online fine-tuning approach for diffusion models, tailored for scenarios where querying the reward function is expensive. |
Fine-tuning diffusion models with RL often requires numerous expensive queries to the true reward function. SEIKO minimizes this by efficiently exploring the space of valid samples to quickly discover high-reward designs. |
SEIKO interleaves reward learning and diffusion model updates. It leverages KL regularization to preserve information from a pre-trained diffusion model, ensuring exploration within the manifold of feasible designs. Additionally, it employs an uncertainty model to guide exploration towards novel, potentially high-reward regions. |
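The role of the uncertainty model can be illustrated with an optimistic reward estimate. The sketch below assumes, purely for illustration, that uncertainty is measured as the disagreement of an ensemble of learned reward models; the summary does not say how the paper's uncertainty model is built, and the KL regularization term is omitted here. `optimistic_reward` and `bonus_weight` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Sequence


def optimistic_reward(ensemble_predictions: Sequence[float],
                      bonus_weight: float = 1.0) -> float:
    """Predicted reward plus an exploration bonus for uncertain samples.

    The bonus is the population standard deviation of the ensemble's
    predictions, so samples the reward models disagree on score higher and
    are more likely to be sent for ground-truth feedback.
    """
    return mean(ensemble_predictions) + bonus_weight * pstdev(ensemble_predictions)
```

A sample on which all reward models agree receives no bonus, while a sample with the same mean prediction but high disagreement is ranked above it, steering queries toward regions where feedback is most informative.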
SEIKO demonstrates superior feedback efficiency compared to non-adaptive baselines and naive online fine-tuning methods, highlighting the importance of adaptive data collection and KL regularization.
Empirical validation across image generation (aesthetic quality), protein sequence design (fluorescence), and molecule generation (QED score) confirms SEIKO's ability to efficiently discover high-reward designs.
Theoretical analysis provides a regret guarantee for SEIKO, demonstrating its provable feedback efficiency. |
The current work focuses on Markovian diffusion models. Extending the approach to non-Markovian settings could be explored.
Future research can investigate the application of SEIKO to diffusion models specifically designed for biological or chemical applications, such as those generating molecular graphs or protein structures. |
diffusion models, reinforcement learning, online learning, feedback efficiency, generative models |
2402.16124
Report |
AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation |
Yasheng Sun, Wenqing Chu, Hang Zhou, Kaisiyuan Wang, Hideki Koike |
While considerable progress has been made in achieving accurate lip
synchronization for 3D speech-driven talking face generation, the task of
incorporating expressive facial detail synthesis aligned with the speaker's
speaking status remains challenging. Our goal is to directly leverage the
inherent style information conveyed by human speech for generating an
expressive talking face that aligns with the speaking status. In this paper, we
propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking
face generation. This system harnesses the robust contextual reasoning and
hallucination capability offered by Large Language Models (LLMs) to instruct
the realistic synthesis of 3D talking faces. Instead of directly learning
facial movements from human speech, our two-stage strategy involves the LLMs
first comprehending audio information and generating instructions implying
expressive facial details seamlessly corresponding to the speech. Subsequently,
a diffusion-based generative network executes these instructions. This
two-stage process, coupled with the incorporation of LLMs, enhances model
interpretability and provides users with flexibility to comprehend instructions
and specify desired operations or modifications. Extensive experiments showcase
the effectiveness of our approach in producing vivid talking faces with
expressive facial movements and consistent emotional status. |
This paper presents AVI-Talking, an Audio-Visual Instruction system for generating expressive 3D talking faces by leveraging the inherent style information in human speech. |
This work addresses the challenge of incorporating expressive facial details aligned with speaking status in 3D talking face generation, which previous methods struggle to achieve. |
A two-stage strategy is employed: 1) LLMs comprehend audio and generate instructions for expressive facial details, 2) A diffusion-based network synthesizes talking faces following these instructions. |
Generates vivid 3D talking faces with expressive facial movements and consistent emotional status.
Outperforms previous state-of-the-art methods in subjective user studies on aspects of lip sync quality, movement expressiveness, and expression consistency.
Demonstrates the ability to generate diverse facial expressions for a given speech input and handle out-of-distribution instructions to some extent. |
Model's performance depends on the quality and diversity of the training dataset, potentially leading to insensitivity to certain speaking styles.
Effectiveness of instruction following is limited to instructions similar to the dataset distribution. |
talking face generation, 3d facial animation, audio-visual instruction, large language models (llms), diffusion models |
2402.16013
Report |
Semi-supervised Open-World Object Detection |
Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal |
Conventional open-world object detection (OWOD) problem setting first
distinguishes known and unknown classes and then later incrementally learns the
unknown objects when introduced with labels in the subsequent tasks. However,
the current OWOD formulation heavily relies on the external human oracle for
knowledge input during the incremental learning stages. Such reliance at
run-time makes this formulation less realistic in a real-world deployment. To
address this, we introduce a more realistic formulation, named semi-supervised
open-world detection (SS-OWOD), that reduces the annotation cost by casting the
incremental learning stages of OWOD in a semi-supervised manner. We demonstrate
that the performance of the state-of-the-art OWOD detector dramatically
deteriorates in the proposed SS-OWOD setting. Therefore, we introduce a novel
SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme
to better align the object query representations between the original and
augmented images to leverage the large unlabeled and few labeled data. We
further introduce a pseudo-labeling scheme for unknown detection that exploits
the inherent capability of decoder object queries to capture object-specific
information. We demonstrate the effectiveness of our SS-OWOD problem setting
and approach for remote sensing object detection, proposing carefully curated
splits and baseline performance evaluations. Our experiments on 4 datasets
including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of
our approach. Our source code, models and splits are available here -
https://github.com/sahalshajim/SS-OWFormer |
Introduces Semi-supervised Open-World Object Detection (SS-OWOD) setting and SS-OWFormer, a novel transformer-based detector, to reduce annotation reliance in open-world object detection. |
Existing OWOD methods depend heavily on human oracles for labeling unknown objects, which is costly and impractical in real-world applications. |
SS-OWFormer utilizes a feature alignment scheme to align object queries between original and augmented images, leveraging unlabeled data. It also employs an object query-guided pseudo-labeling scheme for improved unknown object detection. |
SS-OWFormer with 10% labeled data outperforms state-of-the-art OW-DETR with 50% labeled data on COCO.
SS-OWFormer achieves a 4.8% absolute gain in unknown recall over OW-DETR.
Demonstrated effectiveness of SS-OWOD and SS-OWFormer on remote sensing object detection with curated splits and baseline evaluations. |
SS-OWFormer lacks an explicit mechanism for forgetting previously seen categories.
Performance can be further improved for challenging scenarios in satellite imagery with overlapping objects. |
open-world object detection, semi-supervised learning, transformer, pseudo-labeling, remote sensing |
2402.15870
Report |
Spec-Gaussian: Anisotropic View-Dependent Appearance for 3D Gaussian Splatting |
Ziyi Yang, Xinyu Gao, Yangtian Sun, Yihua Huang, Xiaoyang Lyu, Wen Zhou, Shaohui Jiao, Xiaojuan Qi, Xiaogang Jin |
The recent advancements in 3D Gaussian splatting (3D-GS) have not only
facilitated real-time rendering through modern GPU rasterization pipelines but
have also attained state-of-the-art rendering quality. Nevertheless, despite
its exceptional rendering quality and performance on standard datasets, 3D-GS
frequently encounters difficulties in accurately modeling specular and
anisotropic components. This issue stems from the limited ability of spherical
harmonics (SH) to represent high-frequency information. To overcome this
challenge, we introduce Spec-Gaussian, an approach that utilizes an anisotropic
spherical Gaussian (ASG) appearance field instead of SH for modeling the
view-dependent appearance of each 3D Gaussian. Additionally, we have developed
a coarse-to-fine training strategy to improve learning efficiency and eliminate
floaters caused by overfitting in real-world scenes. Our experimental results
demonstrate that our method surpasses existing approaches in terms of rendering
quality. Thanks to ASG, we have significantly improved the ability of 3D-GS to
model scenes with specular and anisotropic components without increasing the
number of 3D Gaussians. This improvement extends the applicability of 3D GS to
handle intricate scenarios with specular and anisotropic surfaces. |
Introduces Spec-Gaussian, a 3D Gaussian splatting approach that models anisotropic view-dependent appearance with an ASG appearance field and uses a coarse-to-fine training mechanism to eliminate floaters in rendered scenes. |
Addresses limitations of 3D Gaussian Splatting (3D-GS) in modeling specular and anisotropic components, which are common in real-world scenes and crucial for photorealistic rendering. |
Replaces spherical harmonics in 3D-GS with an anisotropic spherical Gaussian (ASG) appearance field to model high-frequency information. Employs a hybrid approach with anchor Gaussians to reduce computational and storage overhead. Introduces a coarse-to-fine training scheme to learn global information and reduce overfitting, minimizing floaters. |
Achieves state-of-the-art rendering quality on multiple benchmarks, including NeRF, NSVF, and anisotropic scenes.
Significantly improves 3D-GS's ability to model complex specular reflections and anisotropic materials, exceeding NeRF-based methods in some cases.
Maintains fast rendering speeds comparable to other 3D-GS-based methods while improving visual quality. |
Faces challenges in handling reflections due to the lack of explicit geometry in 3D-GS.
Reliance on ground truth geometry for better reflection modeling can lead to a decline in overall rendering quality. |
3d gaussian splatting, neural rendering, anisotropy, specular highlights, real-time rendering |
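For reference, an anisotropic spherical Gaussian is commonly written as ASG(v) = xi * max(v.z, 0) * exp(-lambda (v.x)^2 - mu (v.y)^2), where x, y, z are orthonormal axes, lambda and mu are sharpness parameters, and xi is an amplitude. The sketch below evaluates that standard form in plain Python. The function name and the example values are illustrative assumptions and are not taken from the Spec-Gaussian code, which additionally feeds ASG outputs into a learned appearance field.

```python
import math

def asg(nu, x_axis, y_axis, z_axis, lam, mu, xi=1.0):
    """Evaluate an anisotropic spherical Gaussian for a unit direction nu."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    smooth = max(dot(nu, z_axis), 0.0)  # zero on the back hemisphere
    falloff = math.exp(-lam * dot(nu, x_axis) ** 2 - mu * dot(nu, y_axis) ** 2)
    return xi * smooth * falloff

# Orthonormal frame with the lobe pointing along +z (illustrative values).
X, Y, Z = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
peak = asg(Z, X, Y, Z, lam=8.0, mu=2.0)

# Tilt the view direction by the same angle toward x and toward y.
s, c = math.sin(0.5), math.cos(0.5)
toward_x = asg((s, 0.0, c), X, Y, Z, lam=8.0, mu=2.0)
toward_y = asg((0.0, s, c), X, Y, Z, lam=8.0, mu=2.0)
```

Because lam is larger than mu here, the lobe falls off faster toward x than toward y, which is the anisotropy that spherical harmonics of low degree struggle to represent.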
2402.15784
Report |
IRConStyle: Image Restoration Framework Using Contrastive Learning and Style Transfer |
Dongqi Fan, Xin Zhao, Liang Chang |
Recently, the contrastive learning paradigm has achieved remarkable success
in high-level tasks such as classification, detection, and segmentation.
However, contrastive learning applied in low-level tasks, like image
restoration, is limited, and its effectiveness is uncertain. This raises a
question: Why does the contrastive learning paradigm not yield satisfactory
results in image restoration? In this paper, we conduct in-depth analyses and
propose three guidelines to address the above question. In addition, inspired
by style transfer and based on contrastive learning, we propose a novel module
for image restoration called ConStyle, which can be efficiently
integrated into any U-Net structure network. By leveraging the flexibility of
ConStyle, we develop a general restoration network for image
restoration. ConStyle and the general restoration network together form an
image restoration framework, namely IRConStyle. To demonstrate the
capability and compatibility of ConStyle, we replace the general restoration
network with transformer-based, CNN-based, and MLP-based networks,
respectively. We perform extensive experiments on various image restoration
tasks, including denoising, deblurring, deraining, and dehazing. The results on
19 benchmarks demonstrate that ConStyle can be integrated with any U-Net-based
network and significantly enhance performance. For instance, ConStyle NAFNet
significantly outperforms the original NAFNet on SOTS outdoor (dehazing) and
Rain100H (deraining) datasets, with PSNR improvements of 4.16 dB and 3.58 dB
with 85% fewer parameters. |
This paper analyzes the limitations of contrastive learning (CL) in image restoration (IR) and proposes a novel plug-and-play module called ConStyle, integrated into a general IR framework (IRConStyle) to enhance IR performance. |
Contrastive learning, highly successful in high-level vision tasks, has shown limited effectiveness in low-level tasks like IR. This paper addresses this gap by analyzing the reasons behind this limitation and proposing a novel framework to leverage CL for improved IR. |
The paper proposes three guidelines for enhancing CL in IR: using additional data structures for storing samples, utilizing encoder's latent features, and adopting a suitable pretext task. It introduces ConStyle, a module inspired by style transfer, and integrates it into a general U-Net based restoration network (IRConStyle). Experiments are conducted by replacing the restoration network with transformer-based, CNN-based, and MLP-based networks for various IR tasks. |
ConStyle significantly improves the performance of existing IR models on various benchmarks, including denoising, deblurring, dehazing, and deraining.
ConStyle NAFNet, for instance, achieves significant PSNR improvements over the original NAFNet on SOTS outdoor (dehazing) and Rain100H (deraining) datasets, with 4.16 dB and 3.58 dB improvements respectively, while using 85% fewer parameters.
Ablation studies demonstrate the effectiveness of the proposed guidelines and the individual components of ConStyle. |
Replacing the upsampling and downsampling methods within the restoration network increases computational complexity.
Other pretext tasks that make better use of CL in the IR domain remain to be explored. |
image restoration, contrastive learning, style transfer, deep learning, computer vision |
2402.15648
Report |
MambaIR: A Simple Baseline for Image Restoration with State-Space Model |
Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, Shu-Tao Xia |
Recent years have seen significant advancements in image restoration, largely
attributed to the development of modern deep neural networks, such as CNNs and
Transformers. However, existing restoration backbones often face the dilemma
between global receptive fields and efficient computation, hindering their
application in practice. Recently, the Selective Structured State Space Model,
especially the improved version Mamba, has shown great potential for long-range
dependency modeling with linear complexity, which offers a way to resolve the
above dilemma. However, the standard Mamba still faces certain challenges in
low-level vision such as local pixel forgetting and channel redundancy. In this
work, we introduce a simple but effective baseline, named MambaIR, which
introduces both local enhancement and channel attention to improve the vanilla
Mamba. In this way, our MambaIR takes advantage of the local pixel similarity
and reduces the channel redundancy. Extensive experiments demonstrate the
superiority of our method, for example, MambaIR outperforms SwinIR by up to
0.45dB on image SR, using similar computational cost but with a global
receptive field. Code is available at https://github.com/csguoh/MambaIR. |
This paper introduces MambaIR, a novel image restoration model based on the Mamba state-space model, aiming to address the trade-off between computational efficiency and global receptive fields in existing methods. |
Current image restoration methods, employing CNNs or Transformers, struggle to simultaneously achieve global receptive fields for high-quality reconstruction and efficient computation for practical application. MambaIR, leveraging the strengths of the Mamba model, provides a solution to overcome this limitation. |
MambaIR consists of three stages: shallow feature extraction, deep feature extraction using stacked Residual State Space Blocks (RSSBs), and high-quality image reconstruction. RSSB, as the core component, incorporates local convolution to mitigate local pixel forgetting and channel attention to reduce channel redundancy in the standard Mamba model. |
MambaIR consistently outperforms SwinIR, a state-of-the-art Transformer-based method, on various image super-resolution benchmarks, achieving up to 0.45dB PSNR improvement with similar computational cost.
The ablation study validates the effectiveness of local enhancement and channel attention in RSSB, highlighting their contribution to MambaIR's superior performance.
MambaIR exhibits strong performance on image denoising tasks, both for synthetic Gaussian noise and real-world noise, demonstrating its robustness and generalization ability. |
The current implementation of MambaIR primarily focuses on single-image restoration tasks, and extending it to video restoration could be a potential future direction.
Further exploration of more efficient and effective unfolding strategies in the 2D Selective Scan Module (2D-SSM) could further enhance MambaIR's performance. |
image restoration, state space model, mamba, global receptive field, efficient computation |
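The linear-complexity claim for state-space models can be illustrated with the basic discrete recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t. The sketch below is a scalar, time-invariant toy under stated assumptions: Mamba makes its parameters input-dependent, and MambaIR scans 2D feature maps rather than a single 1D sequence. The name `ssm_scan` and all values are hypothetical.

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t in one pass over the sequence."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x  # the state summarizes all earlier inputs
        outputs.append(c * h)
    return outputs

# Response to a single impulse: the first input keeps influencing later outputs.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Each step costs constant work, so the whole scan is linear in sequence length, while every output still depends on all earlier inputs through the state.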
2402.15555
Report |
Deep Networks Always Grok and Here is Why |
Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk |
Grokking, or delayed generalization, is a phenomenon where generalization in
a deep neural network (DNN) occurs long after achieving near zero training
error. Previous studies have reported the occurrence of grokking in specific
controlled settings, such as DNNs initialized with large-norm parameters or
transformers trained on algorithmic datasets. We demonstrate that grokking is
actually much more widespread and materializes in a wide range of practical
settings, such as training of a convolutional neural network (CNN) on CIFAR10
or a Resnet on Imagenette. We introduce the new concept of delayed robustness,
whereby a DNN groks adversarial examples and becomes robust, long after
interpolation and/or generalization. We develop an analytical explanation for
the emergence of both delayed generalization and delayed robustness based on a
new measure of the local complexity of a DNN's input-output mapping. Our local
complexity measures the density of the so-called 'linear regions' (aka, spline
partition regions) that tile the DNN input space, and serves as a utile
progress measure for training. We provide the first evidence that for
classification problems, the linear regions undergo a phase transition during
training whereafter they migrate away from the training samples (making the DNN
mapping smoother there) and towards the decision boundary (making the DNN
mapping less smooth there). Grokking occurs post phase transition as a robust
partition of the input space emerges thanks to the linearization of the DNN
mapping around the training points. Website: https://bit.ly/grok-adversarial |
This paper demonstrates that grokking, a phenomenon where deep neural networks (DNNs) exhibit delayed generalization, is more widespread than previously thought and occurs in various practical settings. The paper also introduces the concept of 'delayed robustness,' where DNNs achieve robustness to adversarial examples long after generalization. |
Understanding grokking is crucial as it challenges the conventional understanding of DNN training and generalization. This work provides a novel perspective on grokking by linking it to the dynamics of DNNs' input space partitioning during training. |
The authors leverage the interpretation of DNNs as continuous piecewise affine spline operators. They introduce 'local complexity,' a new progress measure that quantifies the density of linear regions in the DNN's input space partition. By analyzing the evolution of local complexity throughout training, the authors reveal a consistent pattern leading to grokking. |
DNNs exhibit 'delayed robustness,' achieving robustness to adversarial examples long after generalization occurs.
Local complexity, a measure of non-linearity density in the DNN's input space, follows a double descent pattern during training, with grokking occurring during the final descent phase.
During the final descent phase, the DNN's linear regions migrate away from training data points and concentrate around the decision boundary, forming a 'robust partition'. |
The study primarily relies on empirical analysis, lacking a complete theoretical justification for the observed double descent behavior in local complexity.
Future work could explore the connection between region migration and other phenomena like neural collapse, as well as investigate the impact of different optimizers and sharpness-aware minimization techniques on the training dynamics. |
grokking, delayed generalization, adversarial robustness, deep neural networks, spline theory |
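As a rough illustration of what a density of linear regions means, consider a single hidden ReLU layer: each neuron contributes a hyperplane w.x + b = 0 across which the network switches linear piece. The sketch below counts hyperplanes passing within a given radius of a point. This is a simplified proxy under stated assumptions and not the paper's estimator, which handles deep networks and uses its own neighborhood construction; all names and values are hypothetical.

```python
import math

def local_complexity(x, weights, biases, radius):
    """Count ReLU hyperplanes w.x + b = 0 that pass within `radius` of x."""
    count = 0
    for w, b in zip(weights, biases):
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            continue  # a zero weight vector defines no hyperplane
        distance = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        if distance <= radius:
            count += 1
    return count

# Three hidden neurons in a 2D input space (illustrative values).
W = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
B = [0.0, -2.0, -0.1]
near_origin = local_complexity((0.0, 0.0), W, B, radius=0.5)
far_point = local_complexity((5.0, 5.0), W, B, radius=0.5)
```

A low count around training points and a high count near the decision boundary would correspond to the region migration described in the report.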
2402.15504
Report |
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition |
Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H. T. Kung, Yubei Chen |
Recent text-to-image diffusion models are able to learn and synthesize images
containing novel, personalized concepts (e.g., their own pets or specific
items) with just a few examples for training. This paper tackles two
interconnected issues within this realm of personalizing text-to-image
diffusion models. First, current personalization techniques fail to reliably
extend to multiple concepts -- we hypothesize this to be due to the mismatch
between complex scenes and simple text descriptions in the pre-training dataset
(e.g., LAION). Second, given an image containing multiple personalized
concepts, there is no holistic metric that evaluates performance on not just
the degree of resemblance of personalized concepts, but also whether all
concepts are present in the image and whether the image accurately reflects the
overall text description. To address these issues, we introduce Gen4Gen, a
semi-automated dataset creation pipeline utilizing generative models to combine
personalized concepts into complex compositions along with text-descriptions.
Using this, we create a dataset called MyCanvas, that can be used to benchmark
the task of multi-concept personalization. In addition, we design a
comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better
quantifying the performance of multi-concept, personalized text-to-image
diffusion methods. We provide a simple baseline built on top of Custom
Diffusion with empirical prompting strategies for future researchers to
evaluate on MyCanvas. We show that by improving data quality and prompting
strategies, we can significantly increase multi-concept personalized image
generation quality, without requiring any modifications to model architecture
or training algorithms. |
This paper introduces Gen4Gen, a semi-automated pipeline for building personalized image datasets with complex multi-concept compositions and detailed text descriptions, and uses it to create the MyCanvas dataset. MyCanvas, along with a novel evaluation metric, addresses limitations in existing personalized text-to-image generation methods and benchmarks. |
Existing personalization techniques struggle with multiple concepts, particularly when semantically similar, due to limitations in pre-training datasets like LAION. Existing benchmarks also lack a holistic approach to evaluate multi-concept personalization. |
Gen4Gen leverages object detectors, LLMs, inpainting models, and MLLMs to compose user-provided concept images into new scenes with aligned descriptions. It utilizes prompt engineering to enhance training and proposes a novel metric combining Composition-Personalization-CLIP (CP-CLIP) and Text-Image Alignment CLIP (TI-CLIP) scores. |
MyCanvas significantly improves multi-concept personalization performance in models like Custom Diffusion and DreamBooth.
Proposed prompting strategies further enhance generation quality, particularly in complex compositions.
The study highlights the importance of high-quality, well-aligned datasets for personalized image generation. |
Gen4Gen's reliance on LLMs and diffusion inpainting can sometimes lead to unrealistic compositions or artifacts, requiring manual filtering.
Future work could explore automating the filtering process and leveraging richer multi-modal understanding in MLLMs for better composition guidance. |
text-to-image generation, personalization, dataset creation, multi-concept composition, evaluation metric |
2402.15429
Report |
ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation |
Yi Zhang, Yun Tang, Wenjie Ruan, Xiaowei Huang, Siddartha Khastgir, Paul Jennings, Xingyu Zhao |
Text-to-Image (T2I) Diffusion Models (DMs) have shown impressive abilities in
generating high-quality images based on simple text descriptions. However, as
is common with many Deep Learning (DL) models, DMs are subject to a lack of
robustness. While there are attempts to evaluate the robustness of T2I DMs as a
binary or worst-case problem, they cannot answer how robust in general the
model is whenever an adversarial example (AE) can be found. In this study, we
first introduce a probabilistic notion of T2I DMs' robustness; and then
establish an efficient framework, ProTIP, to evaluate it with statistical
guarantees. The main challenges stem from: i) the high computational cost of
the generation process; and ii) determining if a perturbed input is an AE
involves comparing two output distributions, which is fundamentally harder
compared to other DL tasks like classification where an AE is identified upon
misprediction of labels. To tackle the challenges, we employ sequential
analysis with efficacy and futility early stopping rules in the statistical
testing for identifying AEs, and adaptive concentration inequalities to
dynamically determine the "just-right" number of stochastic perturbations
whenever the verification target is met. Empirical experiments validate the
effectiveness and efficiency of ProTIP over common T2I DMs. Finally, we
demonstrate an application of ProTIP to rank commonly used defence methods. |
This paper introduces ProTIP, the first probabilistic robustness verification framework for text-to-image diffusion models against stochastic perturbations. |
Existing robustness evaluations of these models are binary or worst-case, failing to quantify overall robustness and posing scalability issues for large models. |
ProTIP employs sequential analysis with early stopping rules for efficient identification of adversarial examples, and adaptive concentration inequalities to dynamically determine the necessary number of perturbations. |
ProTIP accurately estimates probabilistic robustness, converging to the approximated ground truth with sufficient perturbations.
Early stopping rules significantly reduce computation by up to 4 times compared to fixed-sample methods.
Adaptive sample sizing in ProTIP proves more efficient than using a predetermined sample size with Hoeffding's inequality. |
Ground truth robustness is approximated due to the infeasibility of exhaustive testing.
Exploration of more sophisticated text perturbation methods beyond character-level is left for future work. |
diffusion models, probabilistic robustness, safe ai, text-to-image generation, adversarial examples |
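The fixed-sample baseline mentioned above can be made concrete with Hoeffding's inequality: for n independent samples bounded in [0, 1], P(|mean - mu| >= eps) <= 2*exp(-2*n*eps^2), so n >= ln(2/delta) / (2*eps^2) samples suffice for error eps at confidence 1 - delta. The sketch below computes that predetermined sample size. How ProTIP sets eps, delta, and the bounded quantity is not stated here, so the values are assumptions.

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n with 2*exp(-2*n*epsilon**2) <= delta for [0, 1]-valued samples."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# Illustrative target: estimate within 0.05 with 95% confidence.
n = hoeffding_sample_size(epsilon=0.05, delta=0.05)  # 738
```

This count is fixed before any perturbation is sampled, which is why an adaptive stopping rule can finish earlier when the observed outcomes are informative.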
2402.15194
Report |
Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control |
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, Sergey Levine |
Diffusion models excel at capturing complex data distributions, such as those
of natural images and proteins. While diffusion models are trained to represent
the distribution in the training dataset, we often are more concerned with
other properties, such as the aesthetic quality of the generated images or the
functional properties of generated proteins. Diffusion models can be finetuned
in a goal-directed way by maximizing the value of some reward function (e.g.,
the aesthetic quality of an image). However, these approaches may lead to
reduced sample diversity, significant deviations from the training data
distribution, and even poor sample quality due to the exploitation of an
imperfect reward function. The last issue often occurs when the reward function
is a learned model meant to approximate a ground-truth "genuine" reward, as is
the case in many practical applications. These challenges, collectively termed
"reward collapse," pose a substantial obstacle. To address this reward
collapse, we frame the finetuning problem as entropy-regularized control
against the pretrained diffusion model, i.e., directly optimizing
entropy-enhanced rewards with neural SDEs. We present theoretical and empirical
evidence that demonstrates our framework is capable of efficiently generating
diverse samples with high genuine rewards, mitigating the overoptimization of
imperfect reward models. |
This paper introduces ELEGANT, a novel method for fine-tuning diffusion models using entropy-regularized control, addressing limitations of existing techniques. |
Fine-tuning diffusion models with reward functions often leads to reward collapse, sacrificing sample diversity and quality due to over-optimization of imperfect reward signals. |
ELEGANT frames fine-tuning as entropy-regularized control against a pre-trained diffusion model, learning both the drift term and initial distribution using neural SDEs to generate samples from a target distribution balancing reward maximization and proximity to the original data. |
ELEGANT effectively mitigates reward collapse, generating high-reward samples that are diverse and stay close to the training data distribution.
Compared to KL-penalized RL fine-tuning, ELEGANT demonstrates superior performance in terms of reward, KL divergence, and diversity across image and biological sequence generation tasks.
The paper provides theoretical results demonstrating the effectiveness of ELEGANT in targeting the desired distribution and maintaining bridges with the pre-trained diffusion model. |
The effectiveness of ELEGANT relies on the accuracy of neural SDE solvers and the expressiveness of neural networks used for value function estimation.
Future work includes exploring the application of ELEGANT to more specialized diffusion models in biology and chemistry. |
diffusion models, fine-tuning, entropy regularization, stochastic control, reward collapse |
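In a discrete toy setting, the entropy-regularized objective E_p[r] - alpha * KL(p || p_pre) has the closed-form maximizer p*(x) proportional to p_pre(x) * exp(r(x) / alpha). The sketch below checks this numerically against a greedy, reward-only distribution. It illustrates the balance between reward and closeness to the pretrained model only; it is not the neural-SDE procedure used by ELEGANT, and all names and numbers are hypothetical.

```python
import math

def tilted_distribution(p_pre, rewards, alpha):
    """Return p*(x) proportional to p_pre(x) * exp(r(x) / alpha)."""
    r_max = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - r_max) / alpha) for p, r in zip(p_pre, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def objective(p, p_pre, rewards, alpha):
    """Expected reward minus alpha times KL(p || p_pre)."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_pre) if pi > 0)
    return expected_reward - alpha * kl

p_pre = [0.5, 0.3, 0.2]      # stand-in for the pretrained distribution
rewards = [0.0, 1.0, 3.0]    # stand-in for a learned reward model
p_star = tilted_distribution(p_pre, rewards, alpha=1.0)
greedy = [0.0, 0.0, 1.0]     # all mass on the highest-reward outcome
```

Larger alpha keeps p_star closer to p_pre, while alpha near zero approaches the greedy distribution that the report associates with reward collapse.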
2402.15120
Report |
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing |
Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang |
Contrastive language-image pre-training (CLIP) models have demonstrated
considerable success across various vision-language tasks, such as
text-to-image retrieval, where the model is required to effectively process
natural language input to produce an accurate visual output. However, current
models still face limitations in dealing with linguistic variations in input
queries, such as paraphrases, making it challenging to handle a broad range of
user queries in real-world applications. In this study, we introduce a
straightforward fine-tuning approach to enhance the representations of CLIP
models for paraphrases. Our approach involves a two-step paraphrase generation
process, where we automatically create two categories of paraphrases from
web-scale image captions by leveraging large language models. Subsequently, we
fine-tune the CLIP text encoder using these generated paraphrases while
freezing the image encoder. Our resulting model, which we call ParaCLIP,
exhibits significant improvements over baseline CLIP models across various
tasks, including paraphrased retrieval (with rank similarity scores improved by
up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven
semantic textual similarity tasks. |
This paper introduces ParaCLIP, a fine-tuning approach using synthetic paraphrases to enhance the representation robustness of CLIP's text encoder against linguistic variations in input queries. |
Current CLIP models struggle with linguistic variations like paraphrases, hindering their effectiveness in real-world applications with diverse user queries. |
The method involves a two-step paraphrase generation process from web-scale image captions using LLMs. Then, CLIP's text encoder is fine-tuned with these paraphrases while freezing the image encoder. |
ParaCLIP significantly outperforms baseline CLIP models on tasks like paraphrased retrieval, Visual Genome Relation and Attribution, and semantic textual similarity.
The approach demonstrates the effectiveness of leveraging synthetic paraphrases for improving CLIP's robustness to linguistic variations.
ParaCLIP maintains competitive performance on standard tasks like zero-shot image classification and text/image retrieval. |
The method can sometimes degrade performance on certain vision and vision-language tasks, potentially due to the sensitivity of the InfoNCE loss to batch size variations.
Future work includes investigating factors contributing to performance degradation and exploring the approach's potential for addressing limitations in compositional understanding. |
clip, paraphrase, fine-tuning, vision-language model, text-to-image retrieval |
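The batch-size sensitivity noted in the limitations follows from the form of the InfoNCE loss, where every other item in the batch acts as a negative: loss_i = -log(exp(s_ii / tau) / sum_j exp(s_ij / tau)). The sketch below is a one-directional version over small hand-written embeddings. How ParaCLIP pairs captions with paraphrases and whether it uses symmetric terms is not specified here, so the pairing and the temperature are assumptions.

```python
import math

def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: anchor i should match positive i within the batch."""
    def normalize(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    losses = []
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every positive in the batch.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

captions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # paraphrase i is close to caption i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # paraphrases matched to the wrong caption
loss_aligned = info_nce(captions, aligned)
loss_swapped = info_nce(captions, swapped)
```

The denominator sums over the whole batch, so changing the batch size changes the number of negatives and therefore the scale of the loss.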
2402.14797
Report |
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis |
Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov |
Contemporary models for generating images show remarkable quality and
versatility. Swayed by these advantages, the research community repurposes them
to generate videos. Since video content is highly redundant, we argue that
naively bringing advances of image models to the video generation domain
reduces motion fidelity, visual quality and impairs scalability. In this work,
we build Snap Video, a video-first model that systematically addresses these
challenges. To do that, we first extend the EDM framework to take into account
spatially and temporally redundant pixels and naturally support video
generation. Second, we show that a U-Net - a workhorse behind image generation
- scales poorly when generating videos, requiring significant computational
overhead. Hence, we propose a new transformer-based architecture that trains
3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us
to efficiently train a text-to-video model with billions of parameters for the
first time, reach state-of-the-art results on a number of benchmarks, and
generate videos with substantially higher quality, temporal consistency, and
motion complexity. The user studies showed that our model was favored by a
large margin over the most recent methods. See our website at
https://snap-research.github.io/snapvideo/. |
Introduces Snap Video, a scalable, video-first text-to-video generation model that leverages a compressed video representation and joint spatiotemporal modeling to achieve state-of-the-art performance in terms of generation quality, temporal consistency, and motion complexity. |
Existing text-to-video generation models, often adapted from image models, struggle with motion fidelity, scalability, and maintaining visual quality in videos. This work addresses these limitations by proposing a video-centric approach. |
The authors propose a modified EDM diffusion framework tailored for high-resolution videos and introduce a scalable transformer-based architecture inspired by FITs, which learns a compressed video representation for efficient joint spatiotemporal modeling. They train their model on a large internal dataset of images and videos. |
The proposed FIT-based architecture trains 3.31 times faster than U-Nets and performs inference 4.49 times faster, while achieving better generation quality.
Snap Video outperforms previous state-of-the-art models on UCF101 and MSR-VTT benchmarks, particularly in metrics evaluating motion quality.
User studies show a strong preference for Snap Video over other state-of-the-art methods in terms of photorealism, text alignment, and motion quality. |
The model exhibits limitations in text rendering accuracy, object count control, complex positional understanding, and handling negations in prompts.
Further research can explore higher resolution generation, improved text rendering, and mitigating potential biases present in the training data. |
text-to-video generation, diffusion models, transformer, compressed video representation, joint spatiotemporal modeling |
2402.14792
Report |
Consolidating Attention Features for Multi-view Image Editing |
Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre |
Large-scale text-to-image models enable a wide range of image editing
techniques, using text prompts or even spatial controls. However, applying
these editing methods to multi-view images depicting a single scene leads to
3D-inconsistent results. In this work, we focus on spatial control-based
geometric manipulations and introduce a method to consolidate the editing
process across various views. We build on two insights: (1) maintaining
consistent features throughout the generative process helps attain consistency
in multi-view editing, and (2) the queries in self-attention layers
significantly influence the image structure. Hence, we propose to improve the
geometric consistency of the edited images by enforcing the consistency of the
queries. To do so, we introduce QNeRF, a neural radiance field trained on the
internal query features of the edited images. Once trained, QNeRF can render
3D-consistent queries, which are then softly injected back into the
self-attention layers during generation, greatly improving multi-view
consistency. We refine the process through a progressive, iterative method that
better consolidates queries across the diffusion timesteps. We compare our
method to a range of existing techniques and demonstrate that it can achieve
better multi-view consistency and higher fidelity to the input scene. These
advantages allow us to train NeRFs that have fewer visual artifacts and are better
aligned with the target geometry. |
This paper introduces a novel method for consistent multi-view image editing, enabling significant articulations and shape changes in objects while preserving visual consistency across different views. |
Existing multi-view editing methods often struggle with maintaining consistency when dealing with complex geometric changes, particularly in tasks involving significant shape modifications. |
The method leverages a query feature space neural radiance field (QNeRF) trained on the internal query features of edited images generated by a diffusion model. QNeRF consolidates these queries, enhancing consistency during a progressive, iterative denoising process. |
The proposed method achieves superior visual quality and multi-view consistency compared to alternative approaches like IN2N and TokenFlow.
Evaluations based on KID and FID metrics demonstrate that the method retains higher fidelity to the original scene with fewer visual artifacts.
User study results show a strong preference for the proposed method, indicating better alignment with the desired edits and higher visual quality in the generated 3D representations. |
The method inherits limitations of text-to-image models, such as struggling with complex structures like hands and generating inconsistent fine details, particularly in high-frequency textures.
The black-box optimization of QNeRF may lead to averaging outlier data, suggesting potential improvements through robust statistics techniques or alternative 3D representations like Gaussian Splats. |
multi-view image editing, neural radiance fields (nerf), diffusion models, self-attention, 3d consistency |
2402.14780
Report |
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models |
Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava |
Image customization has been extensively studied in text-to-image (T2I)
diffusion models, leading to impressive outcomes and applications. With the
emergence of text-to-video (T2V) diffusion models, its temporal counterpart,
motion customization, has not yet been well investigated. To address the
challenge of one-shot motion customization, we propose Customize-A-Video, which
models the motion from a single reference video and adapts it to new subjects
and scenes with both spatial and temporal varieties. It leverages low-rank
adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V
diffusion model for specific motion modeling from the reference videos. To
disentangle the spatial and temporal information during the training pipeline,
we introduce a novel concept of appearance absorbers that detach the original
appearance from the single reference video prior to motion learning. Our
proposed method can be easily extended to various downstream tasks, including
custom video generation and editing, video appearance customization, and
multiple motion combination, in a plug-and-play fashion. Our project page can
be found at https://anonymous-314.github.io. |
This paper proposes Customize-A-Video, a novel one-shot motion customization method for videos. It leverages the motion learned from a single reference video and applies it to new subjects and scenes with both spatial and temporal variations. |
Existing text-to-video generation models struggle with precise motion control, while video editing methods often lack temporal variability in motion transfer. This method addresses the need for one-shot motion customization with plausible variations. |
The method utilizes Temporal LoRA (T-LoRA) applied to temporal attention layers of pre-trained T2V diffusion models. To disentangle spatial and temporal information, an 'Appearance Absorber' module is introduced. This module, trained on unordered video frames, detaches the original appearance from the reference video before motion learning. |
Customize-A-Video successfully transfers motion from a single reference video to new subjects and scenes with variations in motion intensity, position, and camera view.
The proposed T-LoRA effectively captures temporal motion dynamics, outperforming LoRA applications on non-temporal layers.
Appearance Absorbers, such as Spatial LoRA (S-LoRA) and Textual Inversion, successfully decompose spatial information, leading to better motion modeling. |
The standalone finetuning of spatial layers using appearance absorbers may lead to domain shift if overfitting occurs.
The model may struggle to learn and transfer motions intrinsically tied to static poses, as these are primarily captured by appearance absorbers. |
motion customization, text-to-video generation, diffusion models, temporal lora, appearance absorber |
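As a rough illustration of the low-rank adaptation (LoRA) that the Customize-A-Video entry above applies to temporal attention layers, the sketch below forms an adapted weight W + (alpha / r) * B A in plain Python. The function names, the tiny matrices, and the `alpha` value are illustrative assumptions and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, lora_a, lora_b, alpha):
    """Return w + (alpha / r) * lora_b @ lora_a.

    w:      frozen pre-trained weight, shape (d_out, d_in)
    lora_a: trainable down-projection, shape (r, d_in)
    lora_b: trainable up-projection, shape (d_out, r)
    """
    r = len(lora_a)              # rank of the adapter
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical 2x2 weight with a rank-1 adapter (values chosen for illustration).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]
lora_b = [[1.0], [0.0]]
adapted = lora_weight(w, lora_a, lora_b, alpha=2.0)
```

Only `lora_a` and `lora_b` would be trained while `w` stays frozen; with `lora_b` set to zeros the adapted weight equals the original one.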
2402.14767
Report |
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models |
Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, Jiaqi Wang |
We present DualFocus, a novel framework for integrating macro and micro
perspectives within multi-modal large language models (MLLMs) to enhance
vision-language task performance. Current MLLMs typically focus solely on
inputs at a predefined resolution, resulting in deficiencies in detailed
questions involving local regions. We introduce a DualFocus mechanism where
the model concentrates on the image from a macro perspective, responds to the
question, and identifies suitable sub-regions to zoom in on for subsequent micro
perspective analysis. Via the integration of answers from both macro and micro
perspectives, the model is adept at addressing tasks that encompass global,
detailed, and combined considerations. To endow MLLMs with the DualFocus
mechanism, we curated a tailored dataset derived from the Visual Genome (VG) and
adapted it to align with the training regimen of DualFocus. Through comparative
studies across different model sizes and benchmarks, we demonstrate DualFocus's
superiority in balancing detailed examination with holistic insight,
significantly reducing hallucination instances in MLLMs and improving their
performance in various vision-language tasks. |
Introduces DualFocus, a novel framework that integrates macro and micro perspectives within multi-modal large language models (MLLMs) to enhance vision-language task performance. |
Current MLLMs struggle to balance detailed examination with holistic insight, often failing on questions requiring understanding of both global context and local details. |
DualFocus first analyzes the entire image to grasp the macro context. It then identifies and zooms into important sub-regions for detailed examination, combining insights from both perspectives using Perplexity (PPL) for answer selection. |
DualFocus consistently improves performance across different MLLM architectures (LLaVA, Qwen-VL) and various benchmarks (SEED, MMBench, GQA, TextVQA).
It significantly enhances accuracy in tasks requiring detailed perception, like instance attributes and text understanding.
DualFocus effectively mitigates hallucination in MLLMs, as demonstrated by improved performance on the POPE benchmark. |
The current implementation relies on a two-stage training process, which could be streamlined in future work.
The effectiveness of DualFocus across a broader range of visual-language tasks, such as image captioning, remains to be explored. |
multi-modal learning, large language models, visual question answering, fine-grained visual recognition, hallucination mitigation |
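The DualFocus entry above mentions using perplexity (PPL) to choose between the macro and micro answers. A minimal sketch of that selection step follows. It assumes each candidate answer comes with per-token natural-log probabilities from the model and that the lower-perplexity answer is preferred. The function names and sample values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_answer(candidates):
    """Pick the candidate with the lowest perplexity.

    candidates maps a label to (answer_text, token_logprobs).
    Returns (label, answer_text).
    """
    best = min(candidates, key=lambda label: perplexity(candidates[label][1]))
    return best, candidates[best][0]


# Hypothetical outputs from the macro pass and the zoomed-in micro pass.
candidates = {
    "macro": ("a red sign", [-1.2, -0.9, -1.5]),
    "micro": ("a stop sign", [-0.3, -0.2, -0.4]),
}
chosen_label, chosen_answer = select_answer(candidates)
```

Here the micro answer is selected because its tokens are, on average, more probable under the model.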
2402.14654
Report |
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot |
Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas |
We present Multi-HMR, a strong single-shot model for multi-person 3D human
mesh recovery from a single RGB image. Predictions encompass the whole body,
i.e., including hands and facial expressions, using the SMPL-X parametric model
and spatial location in the camera coordinate system. Our model detects people
by predicting coarse 2D heatmaps of person centers, using features produced by
a standard Vision Transformer (ViT) backbone. It then predicts their whole-body
pose, shape and spatial location using a new cross-attention module called the
Human Prediction Head (HPH), with one query per detected center token,
attending to the entire set of features. As direct prediction of SMPL-X
parameters yields suboptimal results, we introduce CUFFS, the Close-Up Frames
of Full-Body Subjects dataset, containing humans close to the camera with
diverse hand poses. We show that incorporating this dataset into training
further enhances predictions, particularly for hands, enabling us to achieve
state-of-the-art performance. Multi-HMR also optionally accounts for camera
intrinsics, if available, by encoding camera ray directions for each image
token. This simple design achieves strong performance on whole-body and
body-only benchmarks simultaneously. We train models with various backbone
sizes and input resolutions. In particular, using a ViT-S backbone and
$448\times448$ input images already yields a fast and competitive model with
respect to state-of-the-art methods, while larger models and higher
resolutions further improve performance. |
This paper introduces Multi-HMR, the first single-shot method for multi-person whole-body human mesh recovery from a single RGB image, which accurately estimates expressive 3D meshes (body, face and hands) and 3D positions in the scene, optionally adapting to camera information. |
Recovering whole-body human meshes from monocular images is important for various applications, including virtual/augmented reality, human-robot interaction, and human understanding from images and videos. Existing methods are limited to either single-person whole-body or multi-person body-only estimations. |
Multi-HMR employs a Vision Transformer (ViT) backbone to extract image features and uses a CenterNet-like framework for human detection at the patch level. A novel Human Perception Head (HPH), based on cross-attention, then predicts SMPL-X parameters and depth for each detected individual. Optionally, camera intrinsics can be incorporated via Fourier-encoded ray directions. |
Multi-HMR outperforms state-of-the-art methods in multi-person body-only mesh recovery, achieving significant gains on benchmarks like 3DPW, MuPoTs, CMU Panoptic, and AGORA.
It achieves competitive performance in whole-body mesh recovery compared to single-person methods, demonstrating its ability to accurately estimate hand and facial poses alongside body pose.
The model effectively leverages camera intrinsics for accurate 3D position estimation, outperforming previous approaches in human depth estimation on several benchmarks. |
The patch-level detection may lead to collisions when multiple person-centers fall within the same patch, limiting detection accuracy in crowded scenes.
The use of a relative rotation representation for the SMPL-X pose can lead to error accumulation, particularly in extreme body parts like hands and feet. |
human mesh recovery, whole-body pose estimation, single-shot detection, vision transformer, cross-attention |
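The Multi-HMR entry above says camera intrinsics can be incorporated through Fourier-encoded ray directions. The sketch below shows one common way to do this under a pinhole camera model: back-project a pixel to a unit ray, then apply sine and cosine at several frequencies. The intrinsics, the number of bands, and the 2^k * pi frequency schedule are assumptions for illustration. The paper's exact encoding may differ.

```python
import math


def ray_direction(u, v, fx, fy, cx, cy):
    """Unit ray direction through pixel (u, v) for a pinhole camera."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def fourier_encode(vec, num_bands):
    """Concatenate sin and cos of each component at frequencies 2^k * pi."""
    features = []
    for value in vec:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features


# Hypothetical intrinsics for a 448x448 image, queried at the principal point.
center_ray = ray_direction(224.0, 224.0, fx=500.0, fy=500.0, cx=224.0, cy=224.0)
encoding = fourier_encode(center_ray, num_bands=4)
```

In a ViT-style pipeline, one such encoding per image token (for example at each patch center) could be combined with the token features.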
2402.14650
Report |
GaussianPro: 3D Gaussian Splatting with Progressive Propagation |
Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, Xuejin Chen |
The advent of 3D Gaussian Splatting (3DGS) has recently brought about a
revolution in the field of neural rendering, facilitating high-quality
renderings at real-time speed. However, 3DGS heavily depends on the initialized
point cloud produced by Structure-from-Motion (SfM) techniques. When tackling
large-scale scenes that unavoidably contain texture-less surfaces, SfM
techniques always fail to produce enough points on these surfaces and cannot
provide good initialization for 3DGS. As a result, 3DGS suffers from difficult
optimization and low-quality renderings. In this paper, inspired by classical
multi-view stereo (MVS) techniques, we propose GaussianPro, a novel method that
applies a progressive propagation strategy to guide the densification of the 3D
Gaussians. Compared to the simple split and clone strategies used in 3DGS, our
method leverages the priors of the existing reconstructed geometries of the
scene and patch matching techniques to produce new Gaussians with accurate
positions and orientations. Experiments on both large-scale and small-scale
scenes validate the effectiveness of our method, where our method significantly
surpasses 3DGS on the Waymo dataset, exhibiting an improvement of 1.15dB in
terms of PSNR. |
Introduces GaussianPro, a novel progressive propagation strategy to guide Gaussian densification in 3D Gaussian Splatting (3DGS) for improved rendering quality and compactness, especially in texture-less regions. |
3DGS relies on initialized point clouds from SfM, which often fails in texture-less regions, leading to difficulties in optimization and low-quality renderings. |
The method utilizes a hybrid 3D-2D representation and iteratively propagates depth and normal information from neighboring pixels via patch matching. New Gaussians are initialized based on pixels with significant depth differences between rendered and propagated depth maps. Additionally, a planar loss is incorporated to regularize the geometry of Gaussians. |
Significant improvement over 3DGS on the Waymo dataset, with a 1.15dB PSNR increase.
Comparable results to state-of-the-art methods on the MipNeRF360 dataset, with improvements in weak-texture regions.
Robustness against sparse training images, outperforming 3DGS with different training view ratios. |
Lacks specific modeling for dynamic objects, leading to potential artifacts.
Future work includes incorporating dynamic Gaussian techniques to handle dynamic objects. |
3d gaussian splatting, neural rendering, novel view synthesis, progressive propagation, gaussian densification |
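The GaussianPro entry above describes initializing new Gaussians at pixels where the rendered and propagated depth maps differ significantly. A minimal sketch of that pixel-selection test is given below. The relative-difference criterion, the `rel_threshold` value, and the function name are assumptions for illustration and are not taken from the paper.

```python
def select_pixels_for_new_gaussians(rendered_depth, propagated_depth,
                                    rel_threshold=0.1):
    """Return (row, col) pixels whose relative depth gap exceeds rel_threshold."""
    selected = []
    for r, (rend_row, prop_row) in enumerate(zip(rendered_depth, propagated_depth)):
        for c, (d_rend, d_prop) in enumerate(zip(rend_row, prop_row)):
            if d_prop <= 0.0:
                continue  # skip pixels without a valid propagated depth
            if abs(d_rend - d_prop) / d_prop > rel_threshold:
                selected.append((r, c))
    return selected


# Hypothetical 2x2 depth maps (arbitrary units).
rendered = [[2.0, 2.0], [5.0, 3.0]]
propagated = [[2.0, 2.1], [3.0, 0.0]]
pixels = select_pixels_for_new_gaussians(rendered, propagated)
```

Only pixel (1, 0) is selected here: its rendered depth disagrees strongly with the propagated depth, while the other pixels either agree closely or lack a valid propagated value.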
2402.14586
Report |
FrameNeRF: A Simple and Efficient Framework for Few-shot Novel View Synthesis |
Yan Xing, Pan Wang, Ligang Liu, Daolun Li, Li Zhang |
We present a novel framework, called FrameNeRF, designed to apply
off-the-shelf fast high-fidelity NeRF models with fast training speed and high
rendering quality for few-shot novel view synthesis tasks. The training
stability of fast high-fidelity models is typically constrained to dense views,
making them unsuitable for few-shot novel view synthesis tasks. To address this
limitation, we utilize a regularization model as a data generator to produce
dense views from sparse inputs, facilitating subsequent training of fast
high-fidelity models. Since these dense views are pseudo ground truth generated
by the regularization model, original sparse images are then used to fine-tune
the fast high-fidelity model. This process helps the model learn realistic
details and correct artifacts introduced in earlier stages. By leveraging an
off-the-shelf regularization model and a fast high-fidelity model, our approach
achieves state-of-the-art performance across various benchmark datasets. |
Introduces FrameNeRF, a novel framework that leverages off-the-shelf regularization and fast high-fidelity NeRF models for few-shot novel view synthesis. |
Fast high-fidelity NeRF models struggle with few-shot scenarios due to overfitting. This work introduces a framework to utilize their strengths in rendering quality and training speed for few-shot tasks. |
Three-stage training process: 1) Train a regularization model on sparse views and generate dense pseudo-ground-truth images. 2) Train a fast high-fidelity model on these dense views. 3) Fine-tune the high-fidelity model on the original sparse views. |
Achieves state-of-the-art performance on Blender, LLFF, and DTU datasets for few-shot novel view synthesis.
Demonstrates the effectiveness of the three-stage training process through ablation studies.
Shows flexibility in choosing sub-modules and their impact on handling artifacts and reconstructing details. |
The choice of sub-modules (regularization and high-fidelity models) impacts the performance and requires careful selection.
The framework's reliance on existing models might limit its performance improvement compared to developing novel, specialized models. |
novel view synthesis, neural radiance fields (nerf), few-shot learning, regularization, 3d reconstruction |
2402.14577
Report |
Debiasing Text-to-Image Diffusion Models |
Ruifei He, Chuhui Xue, Haoru Tan, Wenqing Zhang, Yingchen Yu, Song Bai, Xiaojuan Qi |
Learning-based Text-to-Image (TTI) models like Stable Diffusion have
revolutionized the way visual content is generated in various domains. However,
recent research has shown that non-negligible social bias exists in current
state-of-the-art TTI systems, which raises important concerns. In this work, we
aim to resolve the social bias in TTI diffusion models. We begin by
formalizing the problem setting and use the text descriptions of bias groups to
establish an unsafe direction for guiding the diffusion process. Next, we
simplify the problem into a weight optimization problem and attempt a
reinforcement learning solver, Policy Gradient, which shows sub-optimal performance with
slow convergence. Further, to overcome these limitations, we propose an iterative
distribution alignment (IDA) method. Despite its simplicity, we show that IDA
is efficient and converges quickly in resolving the social bias in TTI
diffusion models. Our code will be released. |
This paper proposes an iterative distribution alignment (IDA) method to resolve social bias (gender and ethnicity) in text-to-image diffusion models. |
Current text-to-image models exhibit significant social biases, raising ethical concerns regarding the generation of millions of biased synthetic images. |
The method utilizes text descriptions of bias groups to guide the diffusion process, iteratively adjusting weights assigned to these descriptions to achieve a balanced distribution in generated images. |
IDA successfully reduces gender and ethnic bias in generated images, achieving a more balanced representation.
The method demonstrates fast convergence, typically requiring only 1-3 iterations to achieve significant debiasing.
IDA effectively mitigates gender bias across various occupations, even those with extreme initial biases. |
The algorithm needs to be re-run for each new prompt, potentially limiting its practical application.
While effective, the method lacks a formal explanation for its success, warranting further investigation. |
text-to-image synthesis, diffusion models, social bias, debiasing, ethics in ai |
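The debiasing entry above describes iteratively adjusting the weights assigned to bias-group descriptions until generated images follow a balanced distribution. The toy loop below illustrates that general idea only. The multiplicative update rule, the `simulate_generation` stand-in for a diffusion model, and all numbers are hypothetical and do not reproduce the paper's IDA algorithm.

```python
def simulate_generation(weights, prior_bias):
    """Toy stand-in for a generator: group share is proportional to bias * weight."""
    raw = {g: prior_bias[g] * weights[g] for g in weights}
    total = sum(raw.values())
    return {g: raw[g] / total for g in raw}


def align_weights(prior_bias, target, iterations=3):
    """Iteratively rescale group weights toward the target distribution."""
    groups = list(target)
    weights = {g: 1.0 / len(groups) for g in groups}
    for _ in range(iterations):
        observed = simulate_generation(weights, prior_bias)
        # Hypothetical update: boost under-represented groups, damp over-represented ones.
        updated = {g: weights[g] * target[g] / observed[g] for g in groups}
        total = sum(updated.values())
        weights = {g: updated[g] / total for g in groups}
    return weights, simulate_generation(weights, prior_bias)


# Hypothetical two-group setting with a strong initial skew.
prior_bias = {"group_a": 0.8, "group_b": 0.2}
target = {"group_a": 0.5, "group_b": 0.5}
weights, final_distribution = align_weights(prior_bias, target)
```

In a real setting, the observed distribution would be estimated by classifying a batch of generated images rather than computed in closed form.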
2402.14401
Report |
Diffusion Model Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality Assessment |
Zhaoyang Wang, Bo Hu, Mingyang Zhang, Jie Li, Leida Li, Maoguo Gong, Xinbo Gao |
Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA)
methods still struggle to balance learning feature
information at the pixel level of the image with capturing high-level feature
information, and the efficient utilization of the obtained high-level feature
information remains a challenge. As a novel class of state-of-the-art (SOTA)
generative models, the diffusion model exhibits the capability to model
intricate relationships, enabling a comprehensive understanding of images and
better learning of both high-level and low-level visual features.
In view of this, we pioneer the exploration of the diffusion model in the
domain of NR-IQA. Firstly, we devise a new diffusion restoration network that
leverages the produced enhanced image and noise-containing images,
incorporating nonlinear features obtained during the denoising process of the
diffusion model, as high-level visual information. Secondly, two visual
evaluation branches are designed to comprehensively analyze the obtained
high-level feature information. These include the visual compensation guidance
branch, grounded in the transformer architecture and noise embedding strategy,
and the visual difference analysis branch, built on the ResNet architecture and
the residual transposed attention block. Extensive experiments are conducted on
seven public NR-IQA datasets, and the results demonstrate that the proposed
model outperforms SOTA methods for NR-IQA. |
This paper proposes DiffV$^2$IQA, a novel NR-IQA model that leverages a diffusion model for image restoration and introduces two visual evaluation branches for enhanced quality assessment. |
Existing NR-IQA methods struggle to balance pixel-level and high-level feature learning, particularly in authentic distortion scenarios. This work addresses these limitations by employing the intricate modeling capabilities of diffusion models. |
The method employs a diffusion restoration network to generate an enhanced image and noise-containing images. Two branches then analyze this information: a visual compensation guidance branch (ViT-based with noise embedding) and a visual difference analysis branch (ResNet-based with a novel RTAB module). |
DiffV$^2$IQA outperforms SOTA NR-IQA methods on several synthetic distortion datasets (LIVE, CSIQ, TID2013, Kadid10k).
The model demonstrates strong generalization ability, achieving top performance in cross-database evaluations.
Ablation studies validate the contribution of each component, highlighting the importance of the diffusion model, noise embedding, and the dual-branch evaluation strategy. |
The pre-training requirement of the diffusion restoration network adds complexity and introduces dataset dependency.
The iterative nature of the diffusion model increases inference time. |
no-reference image quality assessment, diffusion model, transformer, visual compensation guidance, visual difference analysis |
2402.14327
Report |
Subobject-level Image Tokenization |
Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung |
Transformer-based vision models typically tokenize images into fixed-size
square patches as input units, which lacks the adaptability to image content
and overlooks the inherent pixel grouping structure. Inspired by the subword
tokenization widely adopted in language models, we propose an image tokenizer
at a subobject level, where the subobjects are represented by semantically
meaningful image segments obtained by segmentation models (e.g., segment
anything models). To implement a learning system based on subobject
tokenization, we first introduce a Direct Segment Anything Model (DirectSAM)
that efficiently produces comprehensive segmentation of subobjects, then embed
subobjects into compact latent vectors and feed them into a large language model
for vision language learning. Empirical results demonstrate that our
subobject-level tokenization significantly facilitates efficient learning of
translating images into object and attribute descriptions compared to the
traditional patch-level tokenization. Code and models are open-sourced at
https://github.com/ChenDelong1999/subobjects. |
This paper introduces "subobject"-level image tokenization for vision-language learning, leveraging semantically meaningful image segments instead of fixed-size patches. |
Current Transformer-based vision models rely on patch-level tokenization, ignoring semantic boundaries and leading to inefficient learning. |
The authors propose DirectSAM for efficient subobject segmentation and a Sequence-to-sequence AutoEncoder (SeqAE) for embedding subobjects into compact vectors. These embeddings are then integrated into a Large Language Model (LLM) for vision-language tasks. |
Subobject-level tokenization significantly accelerates vision-language learning compared to patch-level tokenization.
Models with subobject tokenization achieve higher accuracy in object counting.
Subobject-based models demonstrate superior performance in recognizing visual attributes like size, material, and shape. |
The current implementation relies on synthetic datasets for evaluation.
Exploration of different subobject segmentation methods and their impact on downstream tasks. |
image tokenization, vision-language learning, subobject segmentation, large language models, segment anything model |
2402.14316
Report |
Place Anything into Any Video |
Ziling Liu, Jinyu Yang, Mingqi Gao, Feng Zheng |
Controllable video editing has demonstrated remarkable potential across
diverse applications, particularly in scenarios where capturing or re-capturing
real-world videos is either impractical or costly. This paper introduces a
novel and efficient system named Place-Anything, which facilitates the
insertion of any object into any video solely based on a picture or text
description of the target object or element. The system comprises three
modules: 3D generation, video reconstruction, and 3D target insertion. This
integrated approach offers an efficient and effective solution for producing
and editing high-quality videos by seamlessly inserting realistic objects.
Through a user study, we demonstrate that our system can effortlessly place any
object into any video using just a photograph of the object. Our demo video can
be found at https://youtu.be/afXqgLLRnTE. Please also visit our project page
https://place-anything.github.io to get access. |
Introduces "Place-Anything," a novel system for inserting objects into any video using only a picture or text description, enabling easy video editing and creation without 3D modeling expertise. |
Addresses the challenge of expensive and time-consuming video editing by enabling users to easily insert virtual objects into videos using simple inputs like photos or text descriptions, opening possibilities for various applications like product advertisements and VR experiences. |
Uses a three-module approach: (1) 3D model generation from image/text using a diffusion-based Gaussian model; (2) Video reconstruction to estimate camera parameters and depth maps via optical flow and bundle adjustment; (3) 3D target insertion, projecting the selected region to 3D space and rendering the 3D model into the video. |
Generates 3D models with high visual fidelity to input images or text.
Accurately inserts 3D objects even in textureless regions by leveraging optical flow for precise tracking.
Successfully infers camera parameters and seamlessly integrates 3D models into diverse video footage. |
Current implementation requires user intervention to select object placement region.
Further exploration of automatic object placement and interaction with the environment. |
video editing, 3d model generation, object insertion, computer vision, deep learning |
2402.14253
Report |
MVD$^2$: Efficient Multiview 3D Reconstruction for Multiview Diffusion |
Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, Yang Liu |
As a promising 3D generation technique, multiview diffusion (MVD) has
received a lot of attention due to its advantages in terms of generalizability,
quality, and efficiency. By finetuning pretrained large image diffusion models
with 3D data, the MVD methods first generate multiple views of a 3D object
based on an image or text prompt and then reconstruct 3D shapes with multiview
3D reconstruction. However, the sparse views and inconsistent details in the
generated images make 3D reconstruction challenging. We present MVD$^2$, an
efficient 3D reconstruction method for multiview diffusion (MVD) images.
MVD$^2$ aggregates image features into a 3D feature volume by projection and
convolution and then decodes volumetric features into a 3D mesh. We train
MVD$^2$ with 3D shape collections and MVD images prompted by rendered views of
3D shapes. To address the discrepancy between the generated multiview images
and ground-truth views of the 3D shapes, we design a simple-yet-efficient
view-dependent training scheme. MVD$^2$ improves the 3D generation quality of
MVD and is fast and robust to various MVD methods. After training, it can
efficiently decode 3D meshes from multiview images within one second. We train
MVD$^2$ with Zero-123++ and the Objaverse-LVIS 3D dataset and demonstrate its
superior performance in generating 3D models from multiview images generated by
different MVD methods, using both synthetic and real images as prompts. |
This paper presents MVD$^2$, an efficient multiview 3D reconstruction method specifically designed to address the challenges of sparse views and inconsistent details in images generated by Multiview Diffusion (MVD) models. |
Existing 3D reconstruction techniques struggle with the unique characteristics of MVD-generated images, leading to low-quality 3D models. MVD$^2$ aims to improve the quality and efficiency of 3D generation using MVD. |
MVD$^2$ employs a lightweight neural network that aggregates image features from multiple views into a 3D feature volume. It then decodes this volume into a differentiable 3D mesh. To address inconsistencies, a view-dependent training scheme is introduced, prioritizing pixel-level alignment at the reference view and structural similarity at other views. |
MVD$^2$ significantly improves the quality of 3D reconstruction from MVD images, outperforming methods like NeuS in metrics such as SSIM and LPIPS.
The method is highly efficient, capable of decoding a 3D mesh from MVD images within one second.
Demonstrating strong generalizability, MVD$^2$ effectively reconstructs 3D shapes from images generated by various MVD models, including those conditioned on text and images. |
Limitations: Struggles with reconstructing unseen geometry if hidden in all input views. Performance degrades with significant inconsistencies between input MVD images.
Future Work: Explore higher grid resolutions for finer detail reconstruction. Address inconsistencies and inpainting challenges in texture mapping. |
3d reconstruction, multiview diffusion, view synthesis, deep learning, computer vision |
2402.14167
Report |
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching |
Zizheng Pan, Bohan Zhuang, De-An Huang, Weili Nie, Zhiding Yu, Chaowei Xiao, Jianfei Cai, Anima Anandkumar |
Sampling from diffusion probabilistic models (DPMs) is often expensive for
high-quality image generation and typically requires many steps with a large
model. In this paper, we introduce sampling Trajectory Stitching (T-Stitch), a
simple yet efficient technique to improve the sampling efficiency with little
or no generation degradation. Instead of solely using a large DPM for the
entire sampling trajectory, T-Stitch first leverages a smaller DPM in the
initial steps as a cheap drop-in replacement of the larger DPM and switches to
the larger DPM at a later stage. Our key insight is that different diffusion
models learn similar encodings under the same training data distribution and
smaller models are capable of generating good global structures in the early
steps. Extensive experiments demonstrate that T-Stitch is training-free,
generally applicable for different architectures, and complements most existing
fast sampling techniques with flexible speed and quality trade-offs. On DiT-XL,
for example, 40% of the early timesteps can be safely replaced with a 10x
faster DiT-S without performance drop on class-conditional ImageNet generation.
We further show that our method can also be used as a drop-in technique to not
only accelerate the popular pretrained stable diffusion (SD) models but also
improve the prompt alignment of stylized SD models from the public model zoo.
Code is released at https://github.com/NVlabs/T-Stitch |
Introduces T-Stitch, a technique to accelerate diffusion model sampling by using smaller models in early denoising steps and larger models in later steps. |
Sampling from large diffusion models is computationally expensive, limiting practical applications. |
Leverages the observation that different diffusion models learn similar latent representations, allowing direct stitching of models at different timesteps. Allocates smaller models to early steps and larger models to later steps. |
Achieves up to 1.7x speedup with negligible performance drop on DiT models for ImageNet generation.
Demonstrates general applicability across architectures (DiT, U-Net) and samplers (DDPM, DDIM, DPM-Solver).
Shows compatibility and improvement with Stable Diffusion, including acceleration and enhanced prompt alignment for stylized models. |
Relies on the availability of a smaller model trained on the same data distribution.
Introduces a slight increase in memory usage due to loading an additional model. |
diffusion models, sampling acceleration, trajectory stitching, model compression, text-to-image generation |
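The stitching idea in the T-Stitch entry above can be illustrated with a minimal sketch. Everything named here is hypothetical: `stitched_sampling`, `small_fraction`, and the two toy denoisers are illustrative stand-ins, and each callable is assumed to represent one full sampler step with the corresponding network. This is not the released T-Stitch code.

```python
from typing import Callable, List

# A denoiser here stands in for "one sampler step using this network":
# it takes the current sample and the timestep and returns the next sample.
Denoiser = Callable[[float, int], float]


def stitched_sampling(
    x: float,
    timesteps: List[int],
    small_model: Denoiser,
    large_model: Denoiser,
    small_fraction: float,
) -> float:
    """Use small_model for the first `small_fraction` of the steps,
    then switch to large_model for the remaining steps."""
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must be in [0, 1]")
    switch_index = round(small_fraction * len(timesteps))
    for i, t in enumerate(timesteps):
        model = small_model if i < switch_index else large_model
        x = model(x, t)
    return x


# Toy stand-ins that record which model handled each step.
calls: List[str] = []


def toy_small(x: float, t: int) -> float:
    calls.append("small")
    return 0.9 * x


def toy_large(x: float, t: int) -> float:
    calls.append("large")
    return 0.8 * x


timesteps = list(range(10, 0, -1))  # t = 10 (noisiest) down to t = 1
result = stitched_sampling(1.0, timesteps, toy_small, toy_large, small_fraction=0.4)
```

With `small_fraction=0.4` and ten steps, the first four steps go to the small model and the last six to the large one, which mirrors the 40% figure quoted in the abstract.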
2402.14000
Report |
Real-time 3D-aware Portrait Editing from a Single Image |
Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, Qifeng Chen |
This work presents 3DPE, a practical method that can efficiently edit a face
image following given prompts, like reference images or text descriptions, in a
3D-aware manner. To this end, a lightweight module is distilled from a 3D
portrait generator and a text-to-image model, which provide prior knowledge of
face geometry and superior editing capability, respectively. Such a design
brings two compelling advantages over existing approaches. First, our system
achieves real-time editing with a feedforward network (i.e., ~0.04s per image),
over 100x faster than the second competitor. Second, thanks to the powerful
priors, our module could focus on the learning of editing-related variations,
such that it manages to handle various types of editing simultaneously in the
training phase and further supports fast adaptation to user-specified
customized types of editing during inference (e.g., with ~5min fine-tuning per
style). The code, the model, and the interface will be made publicly available
to facilitate future research. |
Presents 3DPE, a real-time 3D-aware portrait editing method that uses image or text prompts for editing face images in a 3D-consistent manner. |
Real-time 3D portrait editing is crucial for AR/VR, 3D telepresence, and video conferencing, but existing methods are either slow or lack 3D consistency. |
Distills knowledge from a 3D portrait generator (Live3D) and a text-guided image editing model (InstructPix2Pix) into a lightweight module, allowing for real-time editing while maintaining 3D consistency. |
Achieves real-time editing speed of 40ms per image on a standard GPU.
Exhibits superior 3D consistency, accurate texture alignment, and better identity preservation compared to baselines.
Supports fast adaptation to user-specified editing prompts in just 5 minutes using 10 image pairs. |
Novel view rendering can have inconsistencies in details due to reliance on a super-resolution module.
Video editing can have flickering artifacts as the model is designed for image editing.
Future work can focus on addressing these limitations and exploring higher-quality 3D representations. |
3d-aware portrait editing, real-time editing, knowledge distillation, single image editing, customized prompt adaptation |
2402.13929
Report |
SDXL-Lightning: Progressive Adversarial Diffusion Distillation |
Shanchuan Lin, Anran Wang, Xiao Yang |
We propose a diffusion distillation method that achieves new state-of-the-art
in one-step/few-step 1024px text-to-image generation based on SDXL. Our method
combines progressive and adversarial distillation to achieve a balance between
quality and mode coverage. In this paper, we discuss the theoretical analysis,
discriminator design, model formulation, and training techniques. We
open-source our distilled SDXL-Lightning models both as LoRA and full UNet
weights. |
This paper introduces SDXL-Lightning, a novel progressive adversarial diffusion distillation method that achieves state-of-the-art one-step/few-step 1024px text-to-image generation. |
Diffusion models are computationally expensive due to the iterative sampling procedure. This work significantly reduces the required steps for fast, high-quality image generation. |
This work combines progressive distillation with a novel adversarial objective that utilizes the diffusion model's U-Net encoder as the discriminator backbone. It also introduces several techniques for stable training, schedule modification, and mode coverage relaxation. |
The proposed method achieves superior image quality compared to other state-of-the-art distillation methods like SDXL-Turbo and LCM, especially in high-resolution details.
The method allows for flexible control over the generated images, demonstrated through compatibility with ControlNet for conditioning on canny edges and depth maps.
The authors open-source SDXL-Lightning, offering both full UNet weights and lightweight LoRA modules for plug-and-play use with other base models. |
The current method requires separate checkpoints for each inference step setting, unlike some other approaches that utilize a single checkpoint.
The authors believe the UNet architecture might not be optimal for one-step generation, suggesting exploration of more efficient architectures as future work. |
diffusion models, text-to-image generation, model distillation, adversarial training, sdxl |
2402.13729
Report |
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation |
Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Kwanghee Lee, Seungryong Kim, Jaejun Yoo |
Generating high-quality videos that synthesize desired realistic content is a
challenging task due to the intricate high dimensionality and complexity of
videos. Several recent diffusion-based methods have shown comparable
performance by compressing videos to a lower-dimensional latent space, using
a traditional video autoencoder architecture. However, such methods, which employ
standard frame-wise 2D and 3D convolutions, fail to fully exploit the
spatio-temporal nature of videos. To address this issue, we propose a novel
hybrid video diffusion model, called HVDM, which can capture spatio-temporal
dependencies more effectively. The HVDM is trained by a hybrid video
autoencoder which extracts a disentangled representation of the video
including: (i) global context information captured by a 2D projected latent,
(ii) local volume information captured by 3D convolutions with wavelet
decomposition, and (iii) frequency information for improving the video
reconstruction. Based on this disentangled representation, our hybrid
autoencoder provides a more comprehensive video latent, enriching the generated
videos with fine structures and details. Experiments on video generation
benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed
approach achieves state-of-the-art video generation quality, showing a wide
range of video applications (e.g., long video generation, image-to-video, and
video dynamics control). |
This paper presents HVDM, a novel hybrid video diffusion model for high-quality video generation. HVDM leverages a hybrid video autoencoder combining 2D triplane projections for global context and 3D wavelet representations for local volume information, enhancing video encoding and generation. |
Generating high-quality videos is challenging due to their high dimensionality and complexity. Existing methods struggle to balance efficiency and the ability to capture spatio-temporal dependencies effectively. HVDM addresses these challenges by combining the strengths of 2D and 3D representations in a novel autoencoder architecture. |
HVDM employs a hybrid video autoencoder that extracts a disentangled representation: (1) global context via 2D projected latents from triplane representations, (2) local volume information via 3D CNNs with wavelet decomposition, and (3) frequency information for improved reconstruction. A diffusion model trained on this latent space generates videos. |
HVDM achieves state-of-the-art video generation quality on benchmarks like UCF101, SkyTimelapse, and TaiChi, outperforming existing methods in both quantitative metrics and qualitative visual fidelity.
The hybrid autoencoder effectively captures both global context and local details, leading to more realistic and coherent video generation.
The use of wavelet decomposition and frequency matching loss contributes to preserving finer details and improving reconstruction quality. |
The paper acknowledges limitations in applying the model to large-scale text-to-video generation tasks due to computational resources.
Future work will explore diffusion model architectures specifically designed for the hybrid latent space and investigate more efficient wavelet filter banks for video. |
video generation, diffusion models, video autoencoders, triplane representation, wavelet transform |
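To make the wavelet decomposition mentioned in the HVDM entry above concrete, here is a one-level 1D Haar transform. The choice of the Haar wavelet and the function names are assumptions for illustration only; the entry does not say which wavelet or how many levels HVDM uses, and a video model would apply such a transform along the temporal and spatial axes of a 3D volume.

```python
import math
from typing import List, Tuple


def haar_decompose(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Split an even-length signal into low-frequency (pairwise sums)
    and high-frequency (pairwise differences) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low: List[float], high: List[float]) -> List[float]:
    """Invert haar_decompose."""
    s = 1.0 / math.sqrt(2.0)
    out: List[float] = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out


frames = [4.0, 4.0, 2.0, 6.0]          # e.g. one pixel's value over four frames
low, high = haar_decompose(frames)     # coarse trend and fine detail
restored = haar_reconstruct(low, high)
```

The high band is zero wherever neighbouring samples agree, so it isolates the fine detail that a frequency-aware loss could target.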
2402.13616
Report |
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information |
Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao |
Today's deep learning methods focus on how to design the most appropriate
objective functions so that the prediction results of the model can be closest
to the ground truth. Meanwhile, an appropriate architecture that can facilitate
acquisition of enough information for prediction has to be designed. Existing
methods ignore the fact that when input data undergoes layer-by-layer feature
extraction and spatial transformation, a large amount of information will be
lost. This paper delves into the important issues of data loss when data is
transmitted through deep networks, namely information bottleneck and reversible
functions. We propose the concept of programmable gradient information (PGI)
to cope with the various changes required by deep networks to achieve multiple
objectives. PGI can provide complete input information for the target task to
calculate objective function, so that reliable gradient information can be
obtained to update network weights. In addition, a new lightweight network
architecture -- Generalized Efficient Layer Aggregation Network (GELAN), based
on gradient path planning is designed. GELAN's architecture confirms that PGI
has gained superior results on lightweight models. We verified the proposed
GELAN and PGI on MS COCO dataset based object detection. The results show that
GELAN only uses conventional convolution operators to achieve better parameter
utilization than the state-of-the-art methods developed based on depth-wise
convolution. PGI can be used for a variety of models from lightweight to large.
It can be used to obtain complete information, so that train-from-scratch
models can achieve better results than state-of-the-art models pre-trained
using large datasets; the comparison results are shown in Figure 1. The source
codes are at: https://github.com/WongKinYiu/yolov9. |
Proposes YOLOv9, a new object detection system leveraging Programmable Gradient Information (PGI) and a novel Generalized Efficient Layer Aggregation Network (GELAN) architecture. |
Addresses information loss during feedforward in deep networks (information bottleneck), enabling reliable gradient generation and efficient training even for lightweight models. |
Introduces PGI, comprising a main branch for inference, an auxiliary reversible branch for reliable gradient generation, and multi-level auxiliary information to handle error accumulation in deep supervision. Also designs GELAN, generalizing ELAN architecture to support diverse computational blocks for flexibility and efficiency. |
YOLOv9 achieves state-of-the-art performance on MS COCO, outperforming existing real-time object detectors across various model sizes.
GELAN demonstrates strong and stable performance with diverse computational blocks and depths, enabling flexible model design for various hardware.
PGI effectively mitigates information bottleneck and improves accuracy in both lightweight and deep models, enabling better gradient utilization and accurate data-target mapping. |
Further exploration of reversible architectures and integration networks for PGI can potentially yield additional performance gains.
The study primarily focuses on object detection; applying PGI to other computer vision tasks can further validate its effectiveness. |
object detection, deep learning, information bottleneck, reversible architectures, auxiliary supervision |
2402.13573
Report |
ToDo: Token Downsampling for Efficient Generation of High-Resolution Images |
Ethan Smith, Nayan Saxena, Aninda Saha |
Attention mechanisms have been crucial for image diffusion models; however,
their quadratic computational complexity limits the sizes of images we can
process within reasonable time and memory constraints. This paper investigates
the importance of dense attention in generative image models, which often
contain redundant features, making them suitable for sparser attention
mechanisms. We propose a novel training-free method ToDo that relies on token
downsampling of key and value tokens to accelerate Stable Diffusion inference
by up to 2x for common sizes and up to 4.5x or more for high resolutions like
2048x2048. We demonstrate that our approach outperforms previous methods in
balancing efficient throughput and fidelity. |
This paper proposes ToDo, a training-free token downsampling method to accelerate Stable Diffusion inference by leveraging the inherent spatial redundancy in images to reduce the computational burden of attention. |
The quadratic computational complexity of attention in image diffusion models limits the image sizes that can be processed efficiently. Sparse attention mechanisms offer a solution but often require training-time modifications, introducing logistical overheads. |
ToDo downsamples key and value tokens using a Nearest-Neighbor algorithm based on spatial contiguity, reducing the token count while preserving query tokens, and eliminating the need for computationally expensive similarity calculations. |
ToDo achieves up to 2x speedup for common image sizes and up to 4.5x or more for high resolutions (e.g., 2048x2048) compared to standard Stable Diffusion.
ToDo outperforms previous methods like ToMe in balancing inference speed and generated image fidelity, as demonstrated by lower MSE and comparable HPF values.
Analysis of latent features in Stable Diffusion's U-Net reveals high redundancy among spatially adjacent tokens, supporting the principle behind ToDo. |
The differentiability of ToDo and its potential for efficient fine-tuning of Stable Diffusion at larger image dimensions remain unexplored.
Further investigation is needed to determine the generalizability of ToDo's benefits to other attention-based generative image models. |
image generation, diffusion models, stable diffusion, attention mechanism, sparse attention |
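The key/value downsampling described in the ToDo entry above can be sketched in plain Python. This is a simplified illustration under stated assumptions: tokens are used directly as queries, keys, and values (a real attention layer applies learned projections), "nearest-neighbor" is taken to mean keeping the top-left token of each `factor x factor` cell, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def downsample_tokens(tokens: List[Vector], height: int, width: int, factor: int) -> List[Vector]:
    """Keep one token per factor x factor cell of a row-major token grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [
        tokens[r * width + c]
        for r in range(0, height, factor)
        for c in range(0, width, factor)
    ]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention with a numerically stable softmax."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
        )
    return outputs


def downsampled_attention(tokens: List[Vector], height: int, width: int, factor: int) -> List[Vector]:
    """All queries are kept; only keys and values are downsampled."""
    kv = downsample_tokens(tokens, height, width, factor)
    return attention(tokens, kv, kv)


tokens = [[float(i), 1.0] for i in range(16)]      # a 4 x 4 grid of 2-d tokens
kv_tokens = downsample_tokens(tokens, 4, 4, 2)     # 4 key/value tokens remain
output = downsampled_attention(tokens, 4, 4, 2)    # still 16 output tokens
```

Because the query count is unchanged, the output keeps the full spatial resolution, while the score matrix shrinks by a factor of `factor ** 2`.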
2402.13490
Report |
Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models |
Chen Wu, Fernando De la Torre |
Text-to-image diffusion models have achieved remarkable performance in image
synthesis, while the text interface does not always provide fine-grained
control over certain image factors. For instance, changing a single token in
the text can have unintended effects on the image. This paper shows a simple
modification of classifier-free guidance can help disentangle image factors in
text-to-image models. The key idea of our method, Contrastive Guidance, is to
characterize an intended factor with two prompts that differ in minimal tokens:
the positive prompt describes the image to be synthesized, and the baseline
prompt serves as a "baseline" that disentangles other factors. Contrastive
Guidance is a general method whose benefits we illustrate in three scenarios:
(1) to guide domain-specific diffusion models trained on an object class, (2)
to gain continuous, rig-like controls for text-to-image generation, and (3) to
improve the performance of zero-shot image editors. |
This paper proposes a simple but effective method, Contrastive Guidance, which leverages contrastive prompts to disentangle image factors in text-to-image diffusion models, leading to fine-grained control over image generation. |
Text-to-image diffusion models often lack fine-grained control, as changing even a single token can lead to unintended consequences in the generated image. This method addresses this challenge by allowing for more precise manipulation of specific image factors. |
The method introduces a baseline prompt alongside the positive prompt, where the baseline prompt helps to isolate the intended image factor by providing a contrasting reference. The difference between the score functions of these prompts guides the denoising process, enhancing control over the desired image aspect. |
Contrastive Guidance shows improved disentanglement compared to classifier-free guidance, enabling more precise control over image attributes, backgrounds, and objects.
The method effectively guides domain-specific diffusion models, improving realism and domain specificity while maintaining consistency with text prompts.
Contrastive Guidance proves beneficial for zero-shot image editing, strengthening intended edits and improving content preservation in tasks like style transfer and object manipulation. |
The assumption of an adaptive temperature parameter to simplify calculations might not hold true across all domains.
Further research is needed to understand the impact of different prompt pair choices on the performance and potential biases. |
text-to-image synthesis, diffusion models, disentanglement, contrastive learning, image editing |
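The guidance step described in the Contrastive Guidance entry above can be sketched as a classifier-free-guidance-style combination of two prompt-conditioned noise predictions. The exact formulation in the paper may differ; this sketch assumes the baseline-prompt prediction simply takes the place of the unconditional branch, and `contrastive_guidance` and `guidance_scale` are hypothetical names.

```python
from typing import List


def contrastive_guidance(
    eps_positive: List[float],
    eps_baseline: List[float],
    guidance_scale: float,
) -> List[float]:
    """Start from the baseline-prompt prediction and move along the
    difference between the positive and baseline predictions.
    Assumed form: eps = eps_baseline + w * (eps_positive - eps_baseline)."""
    if len(eps_positive) != len(eps_baseline):
        raise ValueError("predictions must have the same length")
    return [b + guidance_scale * (p - b) for p, b in zip(eps_positive, eps_baseline)]


# Toy predictions that differ only in the first component, mimicking two
# prompts that differ in a single token.
eps_pos = [1.0, 2.0]
eps_base = [0.5, 2.0]
guided = contrastive_guidance(eps_pos, eps_base, guidance_scale=3.0)
```

Components on which the two prompts agree are left untouched, and only the component that separates them is amplified. That is the intuition behind using a minimally different baseline prompt.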
2402.13404
Report |
Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control |
Denis Lukovnikov, Asja Fischer |
While text-to-image diffusion models can generate high-quality images from
textual descriptions, they generally lack fine-grained control over the visual
composition of the generated images. Some recent works tackle this problem by
training the model to condition the generation process on additional input
describing the desired image layout. Arguably the most popular among such
methods, ControlNet, enables a high degree of control over the generated image
using various types of conditioning inputs (e.g. segmentation maps). However,
it still lacks the ability to take into account localized textual descriptions
that indicate which image region is described by which phrase in the prompt. In
this work, we show the limitations of ControlNet for the layout-to-image task
and enable it to use localized descriptions using a training-free approach that
modifies the cross-attention scores during generation. We adapt and investigate
several existing cross-attention control methods in the context of ControlNet
and identify shortcomings that cause failure (concept bleeding) or image
degradation under specific conditions. To address these shortcomings, we
develop a novel cross-attention manipulation method in order to maintain image
quality while improving control. Qualitative and quantitative experimental
studies focusing on challenging cases are presented, demonstrating the
effectiveness of the investigated general approach, and showing the
improvements obtained by the proposed cross-attention control method. |
This paper extends ControlNet for layout-to-image generation so that it can use localized textual descriptions, i.e., indications of which image region is described by which phrase in the prompt, via a training-free cross-attention control method. |
Text-to-image diffusion models generally lack fine-grained control over visual composition. ControlNet enables a high degree of control through conditioning inputs such as segmentation maps, but it cannot take localized descriptions into account. |
The approach is training-free and modifies the cross-attention scores during generation. The authors adapt and investigate several existing cross-attention control methods in the context of ControlNet, then develop a novel cross-attention manipulation method that addresses the shortcomings they identify. |
Shows the limitations of ControlNet for the layout-to-image task.
Identifies shortcomings of existing cross-attention control methods that cause failure (concept bleeding) or image degradation under specific conditions.
Qualitative and quantitative studies focusing on challenging cases demonstrate the effectiveness of the general approach and the improvements obtained by the proposed method. |
The experimental studies focus on challenging cases.
The approach is investigated specifically in the context of ControlNet. |
layout-to-image generation, controlnet, cross-attention control, text-to-image diffusion models, localized descriptions |
2402.13369
Report |
The Uncanny Valley: A Comprehensive Analysis of Diffusion Models |
Karam Ghanem, Danilo Bzdok |
Through Diffusion Models (DMs), we have made significant advances in
generating high-quality images. Our exploration of these models delves deeply
into their core operational principles by systematically investigating key
aspects across various DM architectures: i) noise schedules, ii) samplers, and
iii) guidance. Our comprehensive examination of these models sheds light on
their hidden fundamental mechanisms, revealing the concealed foundational
elements that are essential for their effectiveness. Our analyses emphasize the
hidden key factors that determine model performance, offering insights that
contribute to the advancement of DMs. Past findings show that the configuration
of noise schedules, samplers, and guidance is vital to the quality of generated
images; however, models reach a stable level of quality across different
configurations at a remarkably similar point, revealing that the decisive
factors for optimal performance predominantly reside in the diffusion process
dynamics and the structural design of the model's network, rather than the
specifics of configuration details. Our comparative analysis reveals that
Denoising Diffusion Probabilistic Model (DDPM)-based diffusion dynamics
consistently outperform the Noise Conditioned Score Network (NCSN)-based ones,
not only when evaluated in their original forms but also when made continuous
through Stochastic Differential Equation (SDE)-based implementations. |
This paper presents a comprehensive analysis of Diffusion Models (DMs), focusing on noise schedules, samplers, and guidance to understand their impact on image generation quality. |
DMs have revolutionized image generation but their complex dynamics are not fully understood. This work aims to clarify the key drivers of DM performance for future model development. |
The authors conduct systematic benchmarking of various DM architectures (DDPMs, NCSNs, SDE-based) trained on CIFAR10 and FFHQ datasets. They analyze the impact of different noise schedules, samplers, and guidance mechanisms on Inception Score (IS) and visual quality of generated images. |
DDPM-based diffusion dynamics consistently outperform NCSN-based ones across different configurations and datasets.
The choice of noise schedule and sampler influences convergence speed, but DDPM-based schedules (cosine, sigmoid) generally excel.
Classifier Guidance does not inherently enhance overall image quality and its impact is negligible compared to the diffusion process and network design. |
The study primarily focuses on IS and visual inspection, which might not fully capture all aspects of image quality.
Future work could explore the interplay of network design and diffusion process in more depth, potentially leading to novel DM architectures. |
diffusion models, image generation, noise schedules, samplers, classifier guidance |
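For readers unfamiliar with the noise schedules compared in the entry above, the sketch below computes the widely used linear and cosine beta schedules from the DDPM literature. The default constants (`1e-4`, `0.02`, `s=0.008`, clipping at `0.999`) are the commonly cited ones and are assumptions here; the entry does not state which settings the authors used, and the sigmoid schedule it mentions is not shown.

```python
import math
from typing import List


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Betas spaced evenly between beta_start and beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cosine_betas(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> List[float]:
    """Betas derived from a squared-cosine cumulative signal level."""
    def f(t: int) -> float:
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    return [min(1 - f(i + 1) / f(i), max_beta) for i in range(num_steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative product of (1 - beta): the fraction of signal left at each step."""
    out: List[float] = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        out.append(product)
    return out


lin = linear_betas(1000)
cos = cosine_betas(1000)
lin_bar = alpha_bars(lin)
cos_bar = alpha_bars(cos)
```

Comparing `lin_bar` and `cos_bar` at the midpoint shows that the cosine schedule retains more signal through the middle of the trajectory, which is the usual argument for it.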
2402.13349
Report |
Aria Everyday Activities Dataset |
Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, Kiran Somasundaram, Luis Pesqueira, Mark Schwesinger, Omkar Parkhi, Qiao Gu, Renzo De Nardi, Shangyi Cheng, Steve Saarinen, Vijay Baiyya, Yuyang Zou, Richard Newcombe, Jakob Julian Engel, Xiaqing Pan, Carl Ren |
We present Aria Everyday Activities (AEA) Dataset, an egocentric multimodal
open dataset recorded using Project Aria glasses. AEA contains 143 daily
activity sequences recorded by multiple wearers in five geographically diverse
indoor locations. Each recording contains multimodal sensor data
recorded through the Project Aria glasses. In addition, AEA provides machine
perception data including high frequency globally aligned 3D trajectories,
scene point cloud, per-frame 3D eye gaze vector and time aligned speech
transcription. In this paper, we demonstrate a few exemplar research
applications enabled by this dataset, including neural scene reconstruction and
prompted segmentation. AEA is an open source dataset that can be downloaded
from https://www.projectaria.com/datasets/aea/. We are also providing
open-source implementations and examples of how to use the dataset in Project
Aria Tools https://github.com/facebookresearch/projectaria_tools. |
The Aria Everyday Activities (AEA) dataset is an open dataset of egocentric multimodal data captured using Project Aria glasses. It contains 143 daily activity sequences in diverse indoor locations, featuring high-frequency 6DoF trajectories, scene point clouds, 3D eye gaze vectors, and time-aligned speech transcriptions. |
AEA facilitates research in contextual AI and AR by providing rich, realistic, and spatially-temporally aligned data, addressing limitations of existing egocentric datasets that lack sensor modalities or precise 3D information. |
Multiple wearers recorded daily activities in five indoor locations using Project Aria glasses, capturing RGB video, monochrome scene videos, eyetracking videos, IMU data, spatial audio, and other sensor data. Machine Perception Services (MPS) provided precise 3D localization, eye gaze vectors, and time synchronization across devices. |
The dataset enables accurate 3D scene reconstruction using methods like Gaussian Splatting, leveraging the precise trajectory and point cloud data.
AEA facilitates research in multimodal understanding, demonstrated through examples of eye gaze-prompted segmentation and speech-grounded segmentation using EfficientSAM and GroundingDino.
AEA provides a valuable resource for studying real-world human activities with spatial-temporal context, enabling the development of personalized and context-aware AI assistants. |
Current reconstruction methods may not handle dynamic motions in the recordings optimally.
Future work includes reconstructing the AEA dataset using NeRFstudio and exploring advanced methods for activity and scene understanding. |
egocentric vision, multimodal ai, 3d reconstruction, eye tracking, dataset |
2402.13252
Report |
Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields |
Bo-Yu Cheng, Wei-Chen Chiu, Yu-Lun Liu |
In this paper, we propose an algorithm that allows joint refinement of camera
pose and scene geometry represented by a decomposed low-rank tensor, using only
2D images as supervision. First, we conduct a pilot study based on a 1D signal
and relate our findings to 3D scenarios, where the naive joint pose
optimization on voxel-based NeRFs can easily lead to sub-optimal solutions.
Moreover, based on the analysis of the frequency spectrum, we propose to apply
convolutional Gaussian filters on 2D and 3D radiance fields for a
coarse-to-fine training schedule that enables joint camera pose optimization.
Leveraging the decomposition property of the decomposed low-rank tensor, our method
achieves an equivalent effect to brute-force 3D convolution while incurring only
little computational overhead. To further improve the robustness and stability
of joint optimization, we also propose techniques of smoothed 2D supervision,
randomly scaled kernel parameters, and edge-guided loss mask. Extensive
quantitative and qualitative evaluations demonstrate that our proposed
framework achieves superior performance in novel view synthesis as well as
rapid convergence for optimization. |
This paper proposes a method for joint refinement of camera pose and scene geometry represented by a decomposed low-rank tensor using only 2D images. |
Existing methods for joint pose optimization struggle with voxel-based NeRFs due to their tendency to overemphasize sharp edges, leading to sub-optimal solutions. |
The authors conduct a spectral analysis on a 1D signal alignment task and draw parallels to 3D joint optimization. Based on their findings, they propose a coarse-to-fine training schedule with separable component-wise convolution of Gaussian filters applied to both 2D and 3D radiance fields. Additionally, techniques like smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss masks are introduced to enhance robustness. |
The method achieves superior performance in novel view synthesis compared to previous approaches.
It exhibits faster convergence, requiring only 50k training iterations compared to 200k iterations needed by other methods.
The approach is shown to be effective and robust for both synthetic and real-world scenes. |
The current implementation relies on PyTorch and could potentially achieve faster speeds with custom CUDA acceleration.
Future work could explore the applicability of the proposed techniques to other compressed voxel-based architectures like multi-resolution hash encoding. |
neural rendering, novel view synthesis, joint pose optimization, decomposed low-rank tensor, gaussian filtering |
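The claim in the entry above that component-wise filtering matches brute-force convolution can be checked on a toy case. The sketch uses a single rank-1 2D component (an outer product of two vectors) and zero padding; these choices and all function names are assumptions for illustration. The paper's tensor has three axes and several components, to which the same argument extends by linearity.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Same-size 1D convolution with zero padding at the borders."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - radius
            if 0 <= idx < n:
                acc += k * signal[idx]
        out.append(acc)
    return out


def outer(u: List[float], v: List[float]) -> List[List[float]]:
    return [[a * b for b in v] for a in u]


def blur_2d(grid: List[List[float]], kernel: List[float]) -> List[List[float]]:
    """Brute-force path: blur every row, then every column, of the full grid."""
    rows = [blur_1d(row, kernel) for row in grid]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


u = [0.0, 1.0, 4.0, 2.0, 0.5]
v = [3.0, 0.0, 1.0, 2.0]
kernel = gaussian_kernel(sigma=1.0, radius=2)

dense = blur_2d(outer(u, v), kernel)                        # filter the full grid
factored = outer(blur_1d(u, kernel), blur_1d(v, kernel))    # filter the factors only
max_diff = max(abs(a - b) for ra, rb in zip(dense, factored) for a, b in zip(ra, rb))
```

Filtering the two factors touches `len(u) + len(v)` values instead of `len(u) * len(v)`, which is where the small overhead comes from.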
2402.13251
Report |
FlashTex: Fast Relightable Mesh Texturing with LightControlNet |
Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, Maneesh Agrawala |
Manually creating textures for 3D meshes is time-consuming, even for expert
visual content creators. We propose a fast approach for automatically texturing
an input 3D mesh based on a user-provided text prompt. Importantly, our
approach disentangles lighting from surface material/reflectance in the
resulting texture so that the mesh can be properly relit and rendered in any
lighting environment. We introduce LightControlNet, a new text-to-image model
based on the ControlNet architecture, which allows the specification of the
desired lighting as a conditioning image to the model. Our text-to-texture
pipeline then constructs the texture in two stages. The first stage produces a
sparse set of visually consistent reference views of the mesh using
LightControlNet. The second stage applies a texture optimization based on Score
Distillation Sampling (SDS) that works with LightControlNet to increase the
texture quality while disentangling surface material from lighting. Our
algorithm is significantly faster than previous text-to-texture methods, while
producing high-quality and relightable textures. |
This paper introduces a novel approach for rapid and automatic texturing of 3D meshes based on user-provided text prompts, enabling relighting by separating lighting from surface material. |
Creating realistic textures for 3D models is crucial in various industries, but manual methods are time-consuming and require expertise. Existing automatic methods are slow, prone to visual artifacts, and often bake lighting into the texture, limiting their usability. |
The proposed two-stage pipeline utilizes a new illumination-aware text-to-image model, LightControlNet. Stage 1 generates consistent reference views of the mesh under fixed lighting using multi-view visual prompting. Stage 2 optimizes texture quality and disentangles lighting using an improved Score Distillation Sampling (SDS) method with LightControlNet. |
The method generates high-quality, relightable textures significantly faster than previous approaches.
Quantitative evaluations demonstrate superior performance over existing baselines in FID and KID metrics.
User studies confirm preference for the method's output in realism, consistency with text prompts, and plausibility under varying lighting. |
Limitations include occasional baked-in lighting, imperfect material map disentanglement, and potential failure to fully adhere to complex text prompts.
Future work involves addressing these limitations and exploring applications in related 3D content creation tasks. |
text-to-texture, 3d mesh texturing, relightable texture, diffusion models, controlnet |
2402.13217
Report |
VideoPrism: A Foundational Visual Encoder for Video Understanding |
Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong |
We introduce VideoPrism, a general-purpose video encoder that tackles diverse
video understanding tasks with a single frozen model. We pretrain VideoPrism on
a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M
video clips with noisy parallel text (e.g., ASR transcripts). The pretraining
approach improves upon masked autoencoding by global-local distillation of
semantic video embeddings and a token shuffling scheme, enabling VideoPrism to
focus primarily on the video modality while leveraging the invaluable text
associated with videos. We extensively test VideoPrism on four broad groups of
video understanding tasks, from web video question answering to CV for science,
achieving state-of-the-art performance on 30 out of 33 video understanding
benchmarks. |
VideoPrism, a general-purpose video encoder pretrained on a large-scale dataset of video-text pairs and video-only clips, achieves state-of-the-art performance on a wide range of video understanding tasks using a single frozen model. |
Existing video foundation models often struggle with balancing appearance-heavy tasks and motion-centric reasoning, and building a truly foundational video model that excels across diverse tasks remains a challenge. |
VideoPrism is pretrained in two stages: 1) contrastive learning aligns a video encoder and a text encoder on video-text pairs, 2) masked video modeling with global-local distillation and token shuffling trains the video encoder on video-only data, leveraging knowledge from the first stage. |
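The first-stage alignment can be sketched with a symmetric contrastive loss. This is a minimal sketch under stated assumptions: the loss form (InfoNCE over a batch similarity matrix), the temperature value, and the function names are illustrative choices, not details confirmed by the report.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (batch, batch) cosine similarities
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))                  # stand-in for encoder outputs
aligned = contrastive_loss(emb, emb)            # every pair matches
mismatched = contrastive_loss(emb, emb[::-1])   # pairs deliberately misaligned
print(aligned < mismatched)
```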
Outperforms previous video foundation models on 30 out of 33 video understanding benchmarks, including VideoGLUE, zero-shot video-text retrieval, and CV for science tasks.
Demonstrates robust generalizability, excelling on both appearance- and motion-focused tasks across diverse video sources.
Shows strong scaling capabilities with both model size and data size, achieving substantial improvements with larger models and datasets. |
Reliance on noisy text data in the pretraining corpus might introduce potential biases and limitations.
The current focus on short video clips limits the model's applicability to long video understanding. |
video foundation model, vision-language model, self-supervised learning, contrastive learning, masked video modeling |
2402.13185
Report |
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing |
Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, Jiang Bian |
Recent advances in text-guided video editing have showcased promising results
in appearance editing (e.g., stylization). However, video motion editing in the
temporal dimension (e.g., from eating to waving), which distinguishes video
editing from image editing, is underexplored. In this work, we present UniEdit,
a tuning-free framework that supports both video motion and appearance editing
by harnessing the power of a pre-trained text-to-video generator within an
inversion-then-generation framework. To realize motion editing while preserving
source video content, based on the insights that temporal and spatial
self-attention layers encode inter-frame and intra-frame dependency
respectively, we introduce auxiliary motion-reference and reconstruction
branches to produce text-guided motion and source features respectively. The
obtained features are then injected into the main editing path via temporal and
spatial self-attention layers. Extensive experiments demonstrate that UniEdit
covers video motion editing and various appearance editing scenarios, and
surpasses the state-of-the-art methods. Our code will be publicly available. |
Introduces UniEdit, a tuning-free framework for video motion and appearance editing utilizing a pre-trained text-to-video generator. |
Addresses limitations in existing video editing methods by enabling motion editing in addition to appearance editing without fine-tuning. |
Employs an inversion-then-generation pipeline with auxiliary branches for reconstruction and motion reference. Features from these branches are injected into the main editing path via spatial and temporal self-attention layers to achieve content preservation and motion control. |
Achieves superior performance compared to state-of-the-art methods in both qualitative and quantitative evaluations.
Demonstrates the ability to edit various aspects of videos, including motion, style, object replacement, and background.
Enables text-image-to-video generation by combining image animation techniques with UniEdit's editing capabilities. |
Simultaneous motion and appearance editing within a single iteration requires further exploration.
Developing an automatic scheme for determining optimal hyper-parameters is an area for future work. |
video editing, motion editing, appearance editing, diffusion models, text-to-video generation |
2402.13144
Report |
Neural Network Diffusion |
Kai Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, Yang You |
Diffusion models have achieved remarkable success in image and video
generation. In this work, we demonstrate that diffusion models can also
generate high-performing neural network parameters. Our approach is
simple, utilizing an autoencoder and a standard latent diffusion model. The
autoencoder extracts latent representations of a subset of the trained network
parameters. A diffusion model is then trained to synthesize these latent
parameter representations from random noise. It then generates new
representations that are passed through the autoencoder's decoder, whose
outputs are ready to use as new subsets of network parameters. Across various
architectures and datasets, our diffusion process consistently generates models
of comparable or improved performance over trained networks, with minimal
additional cost. Notably, we empirically find that the generated models perform
differently from the trained networks. Our results encourage more exploration
on the versatile use of diffusion models. |
This paper introduces 'neural network diffusion (p-diff),' a novel approach using diffusion models to generate high-performing neural network parameters. |
This work explores the under-explored potential of diffusion models beyond visual generation, offering a new paradigm for generating effective network parameters. |
P-diff utilizes an autoencoder to learn latent representations of a subset of trained network parameters and employs a standard latent diffusion model to synthesize new representations from random noise. The synthesized representations are then decoded to obtain new network parameters. |
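The data flow around the diffusion model can be sketched as follows. This is a hedged illustration rather than the paper's code: the autoencoder is omitted, so the flattened parameter vector stands in for the latent; the parameter dictionary `bn_params`, the linear beta schedule, and all function names are assumptions. The noising step is the standard DDPM forward process.

```python
import numpy as np

def flatten_params(params):
    """Concatenate a dict of arrays into one vector, remembering the shapes."""
    shapes = {name: p.shape for name, p in params.items()}
    vec = np.concatenate([p.ravel() for p in params.values()])
    return vec, shapes

def unflatten_params(vec, shapes):
    """Inverse of flatten_params: slice the vector back into named arrays."""
    out, start = {}, 0
    for name, shape in shapes.items():
        size = int(np.prod(shape))
        out[name] = vec[start:start + size].reshape(shape)
        start += size
    return out

def forward_noise(z0, t, alphas_cumprod, rng):
    """Sample z_t from q(z_t | z_0) for a DDPM-style forward process."""
    eps = rng.normal(size=z0.shape)
    a = alphas_cumprod[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
# Hypothetical subset of trained parameters (e.g. one normalisation layer).
bn_params = {"weight": rng.normal(size=(4,)), "bias": rng.normal(size=(4,))}
vec, shapes = flatten_params(bn_params)

betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
z_t, eps = forward_noise(vec, t=500, alphas_cumprod=alphas_cumprod, rng=rng)

restored = unflatten_params(vec, shapes)         # shapes survive the round trip
```

In a full pipeline a denoiser would be trained to predict `eps` from `z_t`, and sampled vectors would pass through the decoder before `unflatten_params`.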
P-diff consistently achieves comparable or even superior performance to models trained by SGD across diverse datasets and architectures.
The generation process is efficient, generating new models within seconds.
Analysis reveals that the generated models exhibit distinct prediction patterns compared to the original training models, indicating genuine parameter synthesis rather than mere memorization. |
Current limitations include constraints in generating entire parameters of large architectures due to GPU memory.
Future work will focus on addressing memory limitations, enhancing structure design efficiency, and improving performance stability. |
diffusion models, parameter generation, neural networks, deep learning, generative models |
2402.13126
Report |
VGMShield: Mitigating Misuse of Video Generative Models |
Yan Pang, Yang Zhang, Tianhao Wang |
With the rapid advancement in video generation, people can conveniently
utilize video generation models to create videos tailored to their specific
desires. Nevertheless, there are also growing concerns about their potential
misuse in creating and disseminating false information.
In this work, we introduce VGMShield: a set of three straightforward but
pioneering mitigations through the lifecycle of fake video generation. We start
from fake video detection, trying to understand whether there is
uniqueness in generated videos and whether we can differentiate them from real
videos; then, we investigate the tracing problem, which maps a fake
video back to a model that generates it. Towards these, we propose to leverage
pre-trained models that focus on spatial-temporal dynamics as the
backbone to identify inconsistencies in videos. Through experiments on seven
state-of-the-art open-source models, we demonstrate that current models still
cannot perfectly handle spatial-temporal relationships, and thus, we can
accomplish detection and tracing with nearly perfect accuracy.
Furthermore, anticipating future generative model improvements, we propose a
prevention method that adds invisible perturbations to images to make the
generated videos look unreal. Together with fake video detection and tracing,
our multi-faceted set of solutions can effectively mitigate misuse of video
generative models. |
This paper introduces the first defense pipeline, called VGMShield, specifically designed to address misuse issues in video generation models. |
The rapid advancement of video generation models raises concerns about their potential misuse in creating and spreading misinformation. VGMShield aims to mitigate these concerns by providing tools for detecting, tracing, and preventing the generation of fake videos. |
VGMShield leverages pre-trained video recognition models (I3D, X-CLIP, VideoMAE) to detect spatial-temporal inconsistencies in generated videos for both fake video detection and tracing the source model. Additionally, it introduces two misuse prevention methods based on adversarial examples that disrupt video generation by adding imperceptible perturbations to images. |
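The perturbation-based prevention idea can be sketched with a projected sign-gradient loop. This is an illustration under strong assumptions, not the paper's method: a random linear map `W` stands in for the image encoder of a video generator, and the budget `eps`, step size, and function name are hypothetical. With no target the loop pushes the embedding away from the clean one (undirected); with a target it pulls the embedding toward the target image's embedding (directed).

```python
import numpy as np

def perturb_image(image, encoder_w, target=None, eps=4 / 255, step=1 / 255,
                  iters=20, seed=0):
    """PGD-style perturbation inside an L-infinity ball of radius eps."""
    rng = np.random.default_rng(seed)
    goal_emb = encoder_w @ (image if target is None else target)
    direction = 1.0 if target is None else -1.0   # ascend or descend the distance
    # Random start: the undirected gradient is exactly zero at the clean image.
    adv = image + rng.uniform(-eps, eps, size=image.shape)
    for _ in range(iters):
        diff = encoder_w @ adv - goal_emb
        grad = 2.0 * encoder_w.T @ diff           # gradient of ||W adv - goal||^2
        adv = adv + direction * step * np.sign(grad)
        adv = np.clip(adv, image - eps, image + eps)   # stay within the budget
        adv = np.clip(adv, 0.0, 1.0)                   # stay a valid image
    return adv

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 64))                     # stand-in linear encoder
image = rng.uniform(0.1, 0.9, size=64)
target = rng.uniform(0.1, 0.9, size=64)

undirected = perturb_image(image, W)
directed = perturb_image(image, W, target=target)
```

A real defense would differentiate through the actual encoder with an autodiff framework; the linear stand-in only keeps the gradient explicit.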
VideoMAE-based detection and tracing models achieve high accuracy (over 90%) in various realistic scenarios, demonstrating the presence of model-specific 'fingerprints' in generated videos.
Analysis using Grad-CAM reveals that the VideoMAE-based model is particularly adept at identifying temporal anomalies, outperforming I3D which mainly focuses on spatial distortions.
Both directed and undirected defense strategies successfully disrupt video generation by introducing imperceptible perturbations to images, effectively preventing misuse. |
The effectiveness of the proposed methods might be challenged as video generation models evolve to produce more realistic videos.
Directed defense, while effective, requires careful selection of target images for optimal performance. |
video generation, misinformation detection, source tracing, adversarial defense, video forensics |
2402.12974
Report |
Visual Style Prompting with Swapping Self-Attention |
Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, Youngjung Uh |
In the evolving domain of text-to-image generation, diffusion models have
emerged as powerful tools in content creation. Despite their remarkable
capability, existing models still face challenges in achieving controlled
generation with a consistent style, requiring costly fine-tuning or often
inadequately transferring the visual elements due to content leakage. To
address these challenges, we propose a novel approach, visual style prompting, to produce a
diverse range of images while maintaining specific style elements and nuances.
During the denoising process, we keep the query from original features while
swapping the key and value with those from reference features in the late
self-attention layers. This approach allows for the visual style prompting
without any fine-tuning, ensuring that generated images maintain a faithful
style. Through extensive evaluation across various styles and text prompts, our
method demonstrates superiority over existing approaches, best reflecting the
style of the references and ensuring that resulting images match the text
prompts most accurately. Our project page is available at
https://curryjung.github.io/VisualStylePrompt/. |
This paper introduces Visual Style Prompting with Swapping Self-Attention, a novel method to generate images that reflect the style of a reference image while adhering to the content specified in a text prompt, all without requiring fine-tuning. |
Existing text-to-image generation models struggle to achieve controlled generation with consistent styles. This new approach aims to overcome limitations of existing methods that require costly fine-tuning and often suffer from content leakage. |
The method leverages a swapping self-attention mechanism. It maintains the queries from original image features while swapping the keys and values with those from reference image features in the late self-attention layers of a diffusion model. |
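The key-value swap can be sketched in a few lines. This is a minimal single-head sketch with assumed projection matrices and function names; it omits the diffusion model, the choice of late layers, and multi-head details.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def swapped_self_attention(orig_feats, ref_feats, w_q, w_k, w_v):
    """Queries come from the original features; keys and values come from
    the reference features, so outputs are mixtures of reference values."""
    q = orig_feats @ w_q
    k = ref_feats @ w_k
    v = ref_feats @ w_v
    return attention(q, k, v)

rng = np.random.default_rng(0)
orig = rng.normal(size=(5, 8))     # 5 tokens of the image being generated
ref = rng.normal(size=(7, 8))      # 7 tokens of the style reference
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

out = swapped_self_attention(orig, ref, w_q, w_k, w_v)
print(out.shape)
```

Each output row is a convex combination of reference value vectors, which is how style enters while the query keeps the layout of the original features.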
The approach successfully generates images reflecting the style of reference images while minimizing content leakage.
It outperforms existing methods in terms of style fidelity, text prompt alignment, and diversity of generated images.
The method is versatile and compatible with other techniques like ControlNet and Dreambooth-LoRA. |
The method is limited by the capabilities of the pre-trained diffusion model used.
Future work includes exploring better inversion methods for real images and extending the approach to other domains like video. |
text-to-image generation, diffusion models, visual style prompting, swapping self-attention, content leakage |
2402.12927
Report |
CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection |
Sohail Ahmed Khan, Duc-Tien Dang-Nguyen |
The recent advancements in Generative Adversarial Networks (GANs) and the
emergence of Diffusion models have significantly streamlined the production of
highly realistic and widely accessible synthetic content. As a result, there is
a pressing need for effective general purpose detection mechanisms to mitigate
the potential risks posed by deepfakes. In this paper, we explore the
effectiveness of pre-trained vision-language models (VLMs) when paired with
recent adaptation methods for universal deepfake detection. Following previous
studies in this domain, we employ only a single dataset (ProGAN) in order to
adapt CLIP for deepfake detection. However, in contrast to prior research,
which relies solely on the visual part of CLIP while ignoring its textual
component, our analysis reveals that retaining the text part is crucial.
Consequently, the simple and lightweight Prompt Tuning based adaptation
strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and
6.61% accuracy while utilizing less than one third of the training data (200k
images as compared to 720k). To assess the real-world applicability of our
proposed models, we conduct a comprehensive evaluation across various
scenarios. This involves rigorous testing on images sourced from 21 distinct
datasets, including those generated by GAN-based, diffusion-based, and
commercial tools. |
This paper investigates the effectiveness of adapting pre-trained vision-language models (VLMs), specifically CLIP, for universal deepfake detection by leveraging both visual and textual information. |
Existing deepfake detection models often struggle to generalize across different data distributions due to their focus on detecting specific artifacts. This work explores the potential of VLMs, which are trained on diverse datasets and possess strong zero-shot capabilities, to overcome this limitation. |
The authors adapt CLIP for deepfake detection using four transfer learning strategies: Linear Probing, Fine-tuning, Prompt Tuning (CoOp), and Adapter Network. They train the models on the ProGAN dataset and evaluate them on a comprehensive test set of 21 different image generators, including GANs, Diffusion models, and commercial tools. |
Adapting CLIP using both visual and textual components significantly outperforms methods relying solely on visual features.
Prompt Tuning with CoOp achieves state-of-the-art performance, surpassing previous methods in both mAP and average accuracy while using less training data.
CLIP-based detectors demonstrate robust performance even with limited training data and in the presence of post-processing operations. |
The paper focuses on single-image deepfake detection and does not explore video-based deepfakes.
Further research is needed to investigate the performance of the proposed methods on emerging deepfake generation techniques. |
deepfake detection, transfer learning, vision-language models, clip, prompt tuning |
2402.12908
Report |
RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models |
Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui |
Diffusion models have achieved remarkable advancements in text-to-image
generation. However, existing models still have many difficulties when faced
with multiple-object compositional generation. In this paper, we propose
RealCompo, a new training-free and transferred-friendly text-to-image
generation framework, which aims to leverage the respective advantages of
text-to-image models and spatial-aware image diffusion models (e.g., layout,
keypoints and segmentation maps) to enhance both realism and compositionality
of the generated images. An intuitive and novel balancer is proposed to
dynamically balance the strengths of the two models in denoising process,
allowing plug-and-play use of any model without extra training. Extensive
experiments show that our RealCompo consistently outperforms state-of-the-art
text-to-image models and spatial-aware image diffusion models in
multiple-object compositional generation while keeping satisfactory realism and
compositionality of the generated images. Notably, our RealCompo can be
seamlessly extended with a wide range of spatial-aware image diffusion models
and stylized diffusion models. Our code is available at:
https://github.com/YangLing0818/RealCompo |
This paper proposes RealCompo, a training-free and transferred-friendly text-to-image generation framework that balances realism and compositionality by dynamically combining the strengths of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints, segmentation maps). |
Existing text-to-image models struggle with accurately aligning with prompts involving multiple objects or complex relationships, highlighting the need for improved compositional generation while maintaining realism. |
RealCompo utilizes a novel "balancer" that dynamically adjusts the influence of predicted noise from both a text-to-image model and a spatial-aware image diffusion model based on their cross-attention maps during the denoising process. |
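The combination step of such a balancer can be sketched as a softmax-weighted mix of the two noise predictions. This is a simplified assumption: how the coefficients are updated from cross-attention maps is not reproduced here, and the function name and tensor shapes are hypothetical.

```python
import numpy as np

def balance_noise(eps_t2i, eps_spatial, coef_t2i, coef_spatial):
    """Mix two predicted-noise tensors with element-wise softmax weights."""
    coefs = np.stack([coef_t2i, coef_spatial])
    coefs = coefs - coefs.max(axis=0, keepdims=True)      # numerical stability
    weights = np.exp(coefs) / np.exp(coefs).sum(axis=0, keepdims=True)
    return weights[0] * eps_t2i + weights[1] * eps_spatial

rng = np.random.default_rng(0)
eps_a = rng.normal(size=(4, 8, 8))   # noise from the text-to-image model
eps_b = rng.normal(size=(4, 8, 8))   # noise from the spatial-aware model
zeros = np.zeros_like(eps_a)

even_mix = balance_noise(eps_a, eps_b, zeros, zeros)       # equal influence
t2i_only = balance_noise(eps_a, eps_b, zeros + 50.0, zeros)  # T2I dominates
```

Equal coefficients give a plain average; raising one coefficient shifts influence toward that model at the affected locations.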
RealCompo outperforms state-of-the-art text-to-image and layout-to-image models in compositional generation benchmarks (T2I-CompBench).
RealCompo exhibits superior image realism and aesthetic quality compared to baselines, as evidenced by higher CLIP and aesthetic scores.
RealCompo demonstrates strong generalizability and can be extended to various spatial-aware conditions and stylized image generation tasks. |
RealCompo's computational cost is slightly higher than single-model approaches.
Future work includes exploring more computationally efficient methods and extending RealCompo to text-to-video or text-to-3D generation. |
text-to-image generation, compositionality, diffusion models, spatial awareness, controllable generation |
2402.12760
Report |
A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis |
Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang |
Well-designed prompts have demonstrated the potential to guide text-to-image
models in generating amazing images. Although existing prompt engineering
methods can provide high-level guidance, it is challenging for novice users to
achieve the desired results by manually entering prompts due to a discrepancy
between novice-user-input prompts and the model-preferred prompts. To bridge
the distribution gap between user input behavior and model training datasets,
we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and
propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG)
for automated prompt optimization. For CFP, we construct a novel dataset for
text-to-image tasks that combines coarse and fine-grained prompts to facilitate
the development of automated prompt generation methods. For UF-FGTG, we propose
a novel framework that automatically translates user-input prompts into
model-preferred prompts. Specifically, we propose a prompt refiner that
continually rewrites prompts to empower users to select results that align with
their unique needs. Meanwhile, we integrate image-related loss functions from
the text-to-image model into the training process of text generation to
generate model-preferred prompts. Additionally, we propose an adaptive feature
extraction module to ensure diversity in the generated results. Experiments
demonstrate that our approach is capable of generating more visually appealing
and diverse images than previous state-of-the-art methods, achieving an average
improvement of 5% across six quality and aesthetic metrics. |
This paper introduces CFP, a novel dataset bridging the gap between user input and model-preferred prompts for text-to-image generation, and proposes UF-FGTG, a user-friendly framework for automated prompt optimization. |
Novice users often struggle to craft effective prompts for text-to-image models. This work addresses this by aligning user input with model preferences and improving image generation quality. |
The UF-FGTG framework utilizes a prompt refiner to transform coarse-grained prompts into fine-grained ones, incorporates image-related loss functions for model-preferred prompts, and employs an adaptive feature extraction module for result diversity. |
UF-FGTG generates visually appealing images superior to existing language models like GPT-4.
The framework consistently outperforms other methods in image quality and aesthetic assessments, demonstrating a 5% improvement.
The adaptive feature extraction module effectively enhances the diversity of generated images. |
The study primarily focuses on Stable Diffusion, potentially limiting generalizability to other text-to-image models.
Exploration of alternative adaptive feature extraction modules and prompt refinement techniques could further enhance performance. |
text-to-image generation, prompt engineering, dataset creation, deep learning, computer vision |
2402.12741
Report |
MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion |
Sen Li, Ruochen Wang, Cho-Jui Hsieh, Minhao Cheng, Tianyi Zhou |
Existing text-to-image models still struggle to generate images of multiple
objects, especially in handling their spatial positions, relative sizes,
overlapping, and attribute bindings. In this paper, we develop a training-free
Multimodal-LLM agent (MuLan) to address these challenges by progressive
multi-object generation with planning and feedback control, like a human
painter. MuLan harnesses a large language model (LLM) to decompose a prompt to
a sequence of sub-tasks, each generating only one object conditioned on
previously generated objects by stable diffusion. Unlike existing LLM-grounded
methods, MuLan only produces a high-level plan at the beginning while the exact
size and location of each object are determined by an LLM and attention
guidance upon each sub-task. Moreover, MuLan adopts a vision-language model
(VLM) to provide feedback to the image generated in each sub-task and control
the diffusion model to re-generate the image if it violates the original
prompt. Hence, each model in every step of MuLan only needs to address an easy
sub-task it is specialized for. We collect 200 prompts containing multi-objects
with spatial relationships and attribute bindings from different benchmarks to
evaluate MuLan. The results demonstrate the superiority of MuLan in generating
multiple objects over baselines. The code is available at
https://github.com/measure-infinity/mulan-code. |
This paper introduces MuLan, a training-free Multimodal-LLM Agent, designed to enhance the quality of images generated from intricate text prompts containing multiple objects, particularly by improving spatial relationships and attribute bindings, commonly challenging for existing text-to-image models. |
Current text-to-image models struggle to accurately represent complex prompts involving multiple objects with specific attributes and spatial relationships. MuLan addresses this limitation by utilizing the strengths of LLMs, diffusion models, and VLMs in a collaborative framework. |
MuLan decomposes a complex prompt into a sequence of simpler sub-prompts using an LLM planner. It then progressively generates one object per stage, guided by an LLM-generated rough mask and refined by attention guidance within a diffusion model. A VLM feedback loop ensures each stage aligns with the prompt, allowing for adjustments before proceeding. |
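The plan, generate, and check loop can be sketched as plain control flow. This is a hedged sketch: the callables `plan`, `generate`, and `check` are hypothetical stand-ins for the LLM planner, the diffusion step, and the VLM feedback, and the retry limit is an assumption.

```python
def progressive_generate(prompt, plan, generate, check, max_retries=3):
    """Generate one object per stage, retrying a stage when the check fails."""
    canvas = None
    for sub_prompt in plan(prompt):
        for attempt in range(max_retries):
            candidate = generate(canvas, sub_prompt, attempt)
            if check(candidate, sub_prompt):
                break
        canvas = candidate            # keep the last attempt even if unchecked
    return canvas

# Toy stand-ins so the loop can be exercised without any model.
calls = []

def toy_plan(prompt):
    return prompt.split(" and ")

def toy_generate(canvas, sub_prompt, attempt):
    calls.append((sub_prompt, attempt))
    return (canvas or []) + [sub_prompt]

def toy_check(candidate, sub_prompt):
    # Reject the first attempt for each object, accept the second.
    return len([c for c in calls if c[0] == sub_prompt]) >= 2

result = progressive_generate("a red cube and a blue ball",
                              toy_plan, toy_generate, toy_check)
print(result)
```

Each stage only sees the canvas produced by earlier stages, which mirrors the one-object-at-a-time conditioning described above.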
MuLan significantly outperforms baseline models (including SDXL, PixArt-α) in generating images from complex prompts containing multiple objects with specific attributes and spatial arrangements, as evaluated by both GPT-4V and human assessors.
The integration of VLM feedback control is crucial, leading to substantial performance improvements compared to a version of MuLan without this component.
MuLan demonstrates flexibility by effectively using various VLMs (LLaVA, GPT-4V, Gemini-Pro) without significant performance differences. |
The multi-stage generation process in MuLan, while allowing for fine-grained control, can be more time-consuming than single-stage generation methods.
Potential errors in prompt decomposition by the LLM could cascade through the generation process. Future work could explore LLM-based prompt rewriting to minimize such errors. |
text-to-image generation, multimodal-llm, diffusion models, controllable generation, vlm feedback |
2402.12712
Report |
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction |
Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, Rakesh Ranjan |
This paper presents a neural architecture MVDiffusion++ for 3D object
reconstruction that synthesizes dense and high-resolution views of an object
given one or a few images without camera poses. MVDiffusion++ achieves superior
flexibility and scalability with two surprisingly simple ideas: 1) A
``pose-free architecture'' where standard self-attention among 2D latent
features learns 3D consistency across an arbitrary number of conditional and
generation views without explicitly using camera pose information; and 2) A
``view dropout strategy'' that discards a substantial number of output views
during training, which reduces the training-time memory footprint and enables
dense and high-resolution view synthesis at test time. We use the Objaverse for
training and the Google Scanned Objects for evaluation with standard novel view
synthesis and 3D reconstruction metrics, where MVDiffusion++ significantly
outperforms the current state of the art. We also demonstrate a text-to-3D
application example by combining MVDiffusion++ with a text-to-image generative
model. The project page is at https://mvdiffusion-plusplus.github.io. |
Presents MVDiffusion++, a novel multi-view diffusion model for reconstructing dense, high-resolution 3D objects from single or sparse unposed images. |
Addresses limitations of existing methods that struggle with high-resolution outputs and rely on accurate camera pose estimation, enabling more flexible and scalable 3D object reconstruction. |
Introduces a pose-free architecture with self-attention among 2D latent features to learn 3D consistency across views. Employs a view dropout strategy during training to reduce memory footprint and enable high-resolution image generation. |
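The memory effect of view dropout can be illustrated by sampling a subset of output views per training step. This is a minimal sketch: the view counts, the tokens-per-view figure, and the function name are assumed values for illustration, not numbers from the report.

```python
import numpy as np

def sample_output_views(num_views, num_kept, rng):
    """Pick a random subset of output views to supervise in one step."""
    return np.sort(rng.choice(num_views, size=num_kept, replace=False))

rng = np.random.default_rng(0)
num_views, num_kept = 32, 8            # hypothetical dense set and kept subset
kept = sample_output_views(num_views, num_kept, rng)

tokens_per_view = 32 * 32              # hypothetical latent grid per view
train_tokens = len(kept) * tokens_per_view     # tokens attended during training
test_tokens = num_views * tokens_per_view      # tokens attended at test time
print(train_tokens, test_tokens)
```

Since self-attention cost grows with the square of the token count, supervising a quarter of the views per step cuts the attention footprint by far more than a factor of four.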
Achieves state-of-the-art performance on single-view reconstruction, outperforming SyncDreamer by 0.1552 in Vol. IoU on Google Scanned Objects dataset.
Significantly improves novel view synthesis quality in sparse view settings, surpassing LEAP by 8.19 PSNR.
Demonstrates successful text-to-3D applications by integrating with text-to-image generative models. |
Struggles with reconstructing thin object structures.
May generate implausible images for occluded views. |
3d reconstruction, diffusion models, multi-view image generation, pose-free, view synthesis |
2402.12550
Report |
Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization |
James Oldfield, Markos Georgopoulos, Grigorios G. Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Jiankang Deng, Ioannis Patras |
The Mixture of Experts (MoE) paradigm provides a powerful way to decompose
inscrutable dense layers into smaller, modular computations often more amenable
to human interpretation, debugging, and editability. A major problem however
lies in the computational cost of scaling the number of experts to achieve
sufficiently fine-grained specialization. In this paper, we propose the
Multilinear Mixture of Experts (MMoE) layer to address this, focusing on vision
models. MMoE layers perform an implicit computation on prohibitively large
weight tensors entirely in factorized form. Consequently, MMoEs both (1) avoid
the issues incurred through the discrete expert routing in the popular 'sparse'
MoE models, yet (2) do not incur the restrictively high inference-time costs of
'soft' MoE alternatives. We present both qualitative and quantitative evidence
(through visualization and counterfactual interventions respectively) that
scaling MMoE layers when fine-tuning foundation models for vision tasks leads
to more specialized experts at the class-level whilst remaining competitive
with the performance of parameter-matched linear layer counterparts. Finally,
we show that learned expert specialism further facilitates manual correction of
demographic bias in CelebA attribute classification. Our MMoE model code is
available at https://github.com/james-oldfield/MMoE. |
This paper introduces the Multilinear Mixture of Experts (MMoE) layer, a novel architecture for deep learning models that allows for the efficient computation and fusion of a large number of expert operations. |
The MMoE layer addresses the limitations of traditional MoEs (Mixture of Experts) in scaling to a large number of experts while promoting expert specialization and enabling interpretability and editability of the model. |
MMoE leverages tensor factorization techniques (CP, Tucker, Tensor Train, Tensor Ring) to represent the weight tensor of experts in a compressed form, enabling efficient computation with tens of thousands of experts. The model learns to specialize experts towards subtasks by fine-tuning MMoE layers on various image classification tasks. |
Scaling up the number of experts in MMoE leads to increased expert specialization, where individual experts learn to process specific classes or categories of images.
MMoE's factorized architecture allows for manual editing of expert combinations to mitigate demographic bias in image classification, leading to improved fairness metrics.
MMoE layers achieve competitive performance compared to parameter-matched linear layers when fine-tuning foundation models (CLIP, DINO) for image classification on various datasets. |
The evaluation of expert behavior is primarily focused on in-domain data, and further investigation is needed to assess the generalization of MMoEs under domain shift.
Future work could explore the application of MMoEs to natural language processing tasks and investigate their performance in broader settings. |
mixture of experts, tensor factorization, interpretability, model editing, fairness |
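The factorized computation summarized above can be illustrated with a minimal sketch. It assumes a CP factorization of an expert weight tensor W[n, i, o] = sum_r A[n][r] * B[i][r] * C[o][r] and softmax gate coefficients; the names `cp_moe_forward`, `dense_moe_forward`, `A`, `B`, `C` and the toy sizes are hypothetical and are not taken from the authors' code. It only shows that the output can be computed without ever building the full weight tensor.

```python
import math
import random


def cp_moe_forward(x, a, A, B, C):
    """Factorized forward pass: never materializes the (N, I, O) weight tensor."""
    rank = len(A[0])
    # Mix the expert factors with the gate coefficients: shape (rank,).
    gate = [sum(a[n] * A[n][r] for n in range(len(a))) for r in range(rank)]
    # Project the input onto the input factors: shape (rank,).
    proj = [sum(x[i] * B[i][r] for i in range(len(x))) for r in range(rank)]
    # Combine per rank component and map to the output space: shape (O,).
    return [sum(C[o][r] * gate[r] * proj[r] for r in range(rank)) for o in range(len(C))]


def dense_moe_forward(x, a, A, B, C):
    """Reference: rebuild every expert weight and sum the gated expert outputs."""
    rank = len(A[0])
    y = [0.0] * len(C)
    for n in range(len(a)):
        for i in range(len(x)):
            for o in range(len(C)):
                w = sum(A[n][r] * B[i][r] * C[o][r] for r in range(rank))
                y[o] += a[n] * x[i] * w
    return y


random.seed(0)
N, I, O, R = 16, 6, 4, 3  # experts, input dim, output dim, CP rank (toy sizes)
A = [[random.gauss(0, 1) for _ in range(R)] for _ in range(N)]
B = [[random.gauss(0, 1) for _ in range(R)] for _ in range(I)]
C = [[random.gauss(0, 1) for _ in range(R)] for _ in range(O)]
x = [random.gauss(0, 1) for _ in range(I)]

# Softmax gate over experts (an assumption made for this sketch).
logits = [random.gauss(0, 1) for _ in range(N)]
m = max(logits)
exps = [math.exp(v - m) for v in logits]
a = [e / sum(exps) for e in exps]

y_fact = cp_moe_forward(x, a, A, B, C)
y_dense = dense_moe_forward(x, a, A, B, C)
```

The factorized path costs on the order of R * (N + I + O) multiplications, whereas the dense reference scales with N * I * O * R, which is why the number of experts can grow without the cost of a full weight tensor.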
2402.12377
Report |
Binary Opacity Grids: Capturing Fine Geometric Detail for Mesh-Based View Synthesis |
Christian Reiser, Stephan Garbin, Pratul P. Srinivasan, Dor Verbin, Richard Szeliski, Ben Mildenhall, Jonathan T. Barron, Peter Hedman, Andreas Geiger |
While surface-based view synthesis algorithms are appealing due to their low
computational requirements, they often struggle to reproduce thin structures.
In contrast, more expensive methods that model the scene's geometry as a
volumetric density field (e.g. NeRF) excel at reconstructing fine geometric
detail. However, density fields often represent geometry in a "fuzzy" manner,
which hinders exact localization of the surface. In this work, we modify
density fields to encourage them to converge towards surfaces, without
compromising their ability to reconstruct thin structures. First, we employ a
discrete opacity grid representation instead of a continuous density field,
which allows opacity values to discontinuously transition from zero to one at
the surface. Second, we anti-alias by casting multiple rays per pixel, which
allows occlusion boundaries and subpixel structures to be modelled without
using semi-transparent voxels. Third, we minimize the binary entropy of the
opacity values, which facilitates the extraction of surface geometry by
encouraging opacity values to binarize towards the end of training. Lastly, we
develop a fusion-based meshing strategy followed by mesh simplification and
appearance model fitting. The compact meshes produced by our model can be
rendered in real-time on mobile devices and achieve significantly higher view
synthesis quality compared to existing mesh-based approaches. |
This paper presents a novel method for reconstructing compact triangle meshes from multi-view images, capable of capturing fine geometric detail like leaves and branches for real-time view synthesis. |
Surface-based view synthesis, while efficient, struggles to reproduce thin structures, unlike computationally expensive volumetric methods. This work bridges this gap by enhancing surface-based methods to reconstruct fine details. |
The method utilizes a high-resolution opacity grid, encouraging binary opacity values (0 or 1) through an entropy loss and supersampling. This allows precise surface localization, converting the grid into a simplified, real-time renderable mesh. |
The approach achieves higher quality than existing mesh-based methods, especially for thin structures.
The resulting meshes are compact enough for real-time rendering on mobile devices.
The method outperforms BakedSDF, the previous state-of-the-art in mesh-based view synthesis, in both quality and compactness. |
Training-time supersampling introduces significant computational overhead.
Background reconstruction can be noisy, leading to larger mesh sizes, potentially mitigated by smoothness regularization. |
novel view synthesis, differentiable rendering, neural radiance fields, multiview-to-3d, real-time rendering |
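The entropy term mentioned in the abstract above can be sketched as follows. This is a minimal illustration of binary entropy as a regularizer that is smallest when opacities are exactly 0 or 1; the function names, the clamping constant `eps`, and the averaging over voxels are assumptions made here, not details from the paper.

```python
import math


def binary_entropy(o, eps=1e-7):
    """Binary entropy (in nats) of an opacity value o in [0, 1]."""
    # Clamp away from 0 and 1 so that log() stays finite.
    o = min(max(o, eps), 1.0 - eps)
    return -(o * math.log(o) + (1.0 - o) * math.log(1.0 - o))


def mean_entropy_loss(opacities):
    """Average binary entropy over a collection of opacity values."""
    return sum(binary_entropy(o) for o in opacities) / len(opacities)


fuzzy = [0.4, 0.6, 0.5, 0.3]     # semi-transparent voxels: high penalty
binary = [0.0, 1.0, 1.0, 0.0]    # binarized voxels: penalty close to zero
loss_fuzzy = mean_entropy_loss(fuzzy)
loss_binary = mean_entropy_loss(binary)
```

Minimizing such a term pushes each opacity toward 0 or 1, which matches the stated goal of making the surface easy to extract at the end of training.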
2402.12376
Report |
FiT: Flexible Vision Transformer for Diffusion Model |
Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai |
Nature is infinitely resolution-free. In the context of this reality,
existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside of their trained domain.
To overcome this limitation, we present the Flexible Vision Transformer (FiT),
a transformer architecture specifically designed for generating images with
unrestricted resolutions and aspect ratios. Unlike traditional methods that
perceive images as static-resolution grids, FiT conceptualizes images as
sequences of dynamically-sized tokens. This perspective enables a flexible
training strategy that effortlessly adapts to diverse aspect ratios during both
training and inference phases, thus promoting resolution generalization and
eliminating biases induced by image cropping. Enhanced by a meticulously
adjusted network structure and the integration of training-free extrapolation
techniques, FiT exhibits remarkable flexibility in resolution extrapolation
generation. Comprehensive experiments demonstrate the exceptional performance
of FiT across a broad range of resolutions, showcasing its effectiveness both
within and beyond its training resolution distribution. Repository available at
https://github.com/whlzy/FiT. |
This paper introduces FiT, a Flexible Vision Transformer tailored for diffusion models, capable of generating images at any resolution and aspect ratio. |
Existing diffusion models struggle to generalize across arbitrary resolutions and aspect ratios. FiT addresses this limitation by conceptualizing images as sequences of variable-length tokens, unlike traditional methods that rely on fixed-resolution grids. |
The paper presents a three-pronged approach: 1) a flexible training pipeline that eliminates the need for cropping by resizing high-resolution images to a maximum token limit, 2) a unique transformer architecture utilizing 2D Rotary Positional Embedding (RoPE) and Masked MHSA to handle dynamic token lengths, and 3) a training-free resolution extrapolation method inspired by techniques used in large language models. |
FiT significantly outperforms previous state-of-the-art models on class-conditional image generation benchmarks across various resolutions and aspect ratios.
Flexible training with dynamic token lengths proves crucial for resolution generalization and surpasses the performance of fixed-resolution training.
Training-free resolution extrapolation methods, specifically VisionNTK and VisionYaRN, further enhance FiT's ability to generate high-quality images at resolutions exceeding those seen during training. |
Limited computational resources restricted the training of the largest FiT model, potentially hindering performance at the 256x256 resolution.
The generative capabilities of FiT with higher resolution training and alternative resolution extrapolation techniques requiring additional training remain unexplored. |
vision transformers, diffusion models, image generation, resolution extrapolation, arbitrary aspect ratio |
2402.12336
Report |
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models |
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein |
Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are
increasingly used for various real-world tasks. Prior work has shown that these
models are highly vulnerable to adversarial attacks on the vision modality.
These attacks can be leveraged to spread fake information or defraud users, and
thus pose a significant risk, which makes the robustness of large multi-modal
foundation models a pressing problem. The CLIP model, or one of its variants,
is used as a frozen vision encoder in many vision-language models (VLMs), e.g.
LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning
scheme to obtain a robust CLIP vision encoder, which yields robustness on all
vision down-stream tasks (VLMs, zero-shot classification) that rely on CLIP. In
particular, we show that stealth-attacks on users of VLMs by a malicious third
party providing manipulated images are no longer possible once one replaces the
original CLIP model with our robust one. No retraining or fine-tuning of the
VLM is required. The code and robust models are available at
https://github.com/chs20/RobustVLM |
The paper proposes FARE, an unsupervised adversarial fine-tuning scheme for the vision encoder of CLIP, to make vision-language models (VLMs) robust against adversarial attacks on images. |
Large multi-modal foundation models are vulnerable to adversarial attacks, posing significant risks such as spreading misinformation and defrauding users. Robustness is crucial for their safe deployment. |
FARE fine-tunes the vision encoder by minimizing the difference between its embeddings of perturbed and original images, preserving feature similarity to the original CLIP for clean inputs. |
FARE makes VLMs like OpenFlamingo and LLaVA robust to imperceptible targeted attacks while maintaining high performance on clean data.
FARE outperforms the supervised method TeCoA in terms of both robustness and clean performance across various downstream tasks.
Robust CLIP models trained with FARE exhibit lower hallucination rates and better performance in chain-of-thought tasks. |
The study focuses on CLIP-based VLMs and doesn't explore applicability to other architectures.
The defense focuses on the vision modality; language-side robustness is left for future work. |
adversarial robustness, vision-language models, clip, unsupervised adversarial training, multi-modal foundation models |
2402.12259
Report |
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships |
Sebastian Koch, Narunas Vaskevicius, Mirco Colosi, Pedro Hermosilla, Timo Ropinski |
Current approaches for 3D scene graph prediction rely on labeled datasets to
train models for a fixed set of known object classes and relationship
categories. We present Open3DSG, an alternative approach to learn 3D scene
graph prediction in an open world without requiring labeled scene graph data.
We co-embed the features from a 3D scene graph prediction backbone with the
feature space of powerful open world 2D vision language foundation models. This
enables us to predict 3D scene graphs from 3D point clouds in a zero-shot
manner by querying object classes from an open vocabulary and predicting the
inter-object relationships from a grounded LLM with scene graph features and
queried object classes as context. Open3DSG is the first 3D point cloud method
to predict not only explicit open-vocabulary object classes, but also open-set
relationships that are not limited to a predefined label set, making it
possible to express rare as well as specific objects and relationships in the
predicted 3D scene graph. Our experiments show that Open3DSG is effective at
predicting arbitrary object classes as well as their complex inter-object
relationships describing spatial, supportive, semantic and comparative
relationships. |
This paper introduces the first approach for predicting open-vocabulary 3D scene graphs from point clouds, enabling the representation of scenes with arbitrary object classes and relationships. |
Existing 3D scene graph prediction methods are limited to a fixed set of object and relationship labels, hindering their applicability in real-world scenarios requiring broader semantic understanding. |
The method distills knowledge from 2D vision-language models into a 3D graph neural network. It uses CLIP for open-vocabulary object prediction and a grounded LLM for relationship prediction based on predicted object classes and learned relationship features. |
The method outperforms existing methods on predicting rare object and predicate classes.
It achieves comparable performance to fully supervised methods on a closed-set benchmark.
Qualitative results demonstrate the capability to predict specific object classes and relationships. |
Predicting diverse open-vocabulary relationships remains a challenge.
Systematic evaluation of open-vocabulary 3D scene graphs is an open problem. |
3d scene graph, open vocabulary, zero-shot learning, vision-language models, graph neural networks |
2402.12121
Report |
Evaluating Image Review Ability of Vision Language Models |
Shigeki Saito, Kazuki Hayashi, Yusuke Ide, Yusuke Sakai, Kazuma Onishi, Toma Suzuki, Seiji Gobara, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe |
Large-scale vision language models (LVLMs) are language models that are
capable of processing images and text inputs by a single model. This paper
explores the use of LVLMs to generate review texts for images. The ability of
LVLMs to review images is not fully understood, highlighting the need for a
methodical evaluation of their review abilities. Unlike image captions, review
texts can be written from various perspectives such as image composition and
exposure. This diversity of review perspectives makes it difficult to uniquely
determine a single correct review for an image. To address this challenge, we
introduce an evaluation method based on rank correlation analysis, in which
review texts are ranked by humans and LVLMs, and the correlation between these
rankings is then measured. We further validate this approach by creating a
benchmark dataset aimed at assessing the image review ability of recent LVLMs.
Our experiments with the dataset reveal that LVLMs, particularly those with
proven superiority in other evaluative contexts, excel at distinguishing
between high-quality and substandard image reviews. |
This paper introduces a novel method for evaluating the ability of Large-scale Vision Language Models (LVLMs) to generate review texts for images, addressing the challenge of subjective review perspectives. |
This evaluation is crucial for understanding LVLMs' capacity to provide detailed and objective feedback on images, potentially replacing human judges in assessment contexts. |
The method involves ranking review texts generated by LVLMs and human annotators, then calculating the rank correlation to assess alignment. A new benchmark dataset with ranked reviews was created to validate this approach. |
LVLMs, especially those excelling in other evaluation tasks, show increasing ability to distinguish high-quality from substandard reviews.
The proposed evaluation method, based on rank correlation, proves effective in assessing LVLMs' review generation capabilities.
Newer LVLMs demonstrate better support for multiple languages compared to earlier models. |
The current method does not incorporate domain-specific knowledge for evaluation.
The dataset, sourced from English Wikipedia, may contain inherent biases. |
large-scale vision language models, image review generation, evaluation method, rank correlation analysis, benchmark dataset |
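The rank-correlation evaluation described above can be illustrated with a small sketch. The abstract does not say which coefficient is used, so Spearman's rho for rankings without ties is assumed here; the function name and the example rankings are hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have the same length, at least 2")
    # Sum of squared rank differences, item by item.
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical ranks (1 = best) assigned to five review texts for one image.
human_ranks = [1, 2, 3, 4, 5]
model_ranks = [1, 3, 2, 4, 5]
rho = spearman_rho(human_ranks, model_ranks)
```

A value near 1 means the model orders the reviews much as the human annotators do, while a value near -1 means the orderings are reversed.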
2402.12004
Report |
Direct Consistency Optimization for Compositional Text-to-Image Personalization |
Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin |
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal
images, are able to generate visuals with a high degree of consistency.
However, they still fall short in synthesizing images of different scenarios or
styles that are possible in the original pretrained models. To address this, we
propose to fine-tune the T2I model by maximizing consistency to reference
images, while penalizing the deviation from the pretrained model. We devise a
novel training objective for T2I diffusion models that minimally fine-tunes the
pretrained model to achieve consistency. Our method, dubbed Direct
Consistency Optimization, is as simple as regular diffusion loss, while
significantly enhancing the compositionality of personalized T2I models. Also,
our approach induces a new sampling method that controls the tradeoff between
image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using
a comprehensive caption for reference images to further enhance the image-text
alignment. We show the efficacy of the proposed method on the T2I
personalization for subject, style, or both. In particular, our method results
in a superior Pareto frontier to the baselines. Generated examples and code
are available on our project page (https://dco-t2i.github.io/). |
This paper proposes Direct Consistency Optimization (DCO), a novel fine-tuning objective for Text-to-Image (T2I) diffusion models that enhances compositionality in personalized image generation. |
Existing T2I personalization methods, while effective in learning new concepts from few images, often suffer from reduced textual alignment and compositional generation capability due to knowledge forgetting and concept collapse. |
DCO casts fine-tuning as a constrained policy optimization problem. It maximizes consistency to reference images while minimizing deviation from the pretrained model. This approach preserves the compositionality of the original model while incorporating new concepts. |
DCO outperforms baselines like DreamBooth in subject and style personalization, demonstrating superior image-text alignment and subject fidelity.
The paper introduces “reward guidance”, a sampling method that allows users to control the tradeoff between image fidelity and prompt fidelity.
The authors emphasize the importance of using comprehensive captions for reference images to enhance model disentanglement and prevent concept collapse. |
The current implementation of DCO increases computational cost due to additional inference steps during training and sampling. Future work could explore efficient fine-tuning methods to address this.
While reward guidance sampling allows control over subject fidelity and textual alignment, finding the optimal guidance scale for a given dataset or prompt requires further investigation. |
text-to-image synthesis, diffusion models, personalization, compositionality, fine-tuning |
2402.11929
Report |
DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation |
Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, Xin Tong |
This paper presents a novel method for exerting fine-grained lighting control
during text-driven diffusion-based image generation. While existing diffusion
models already have the ability to generate images under any lighting
condition, without additional guidance these models tend to correlate image
content and lighting. Moreover, text prompts lack the necessary expressional
power to describe detailed lighting setups. To provide the content creator with
fine-grained control over the lighting during image generation, we augment the
text-prompt with detailed lighting information in the form of radiance hints,
i.e., visualizations of the scene geometry with a homogeneous canonical
material under the target lighting. However, the scene geometry needed to
produce the radiance hints is unknown. Our key observation is that we only need
to guide the diffusion process, hence exact radiance hints are not necessary;
we only need to point the diffusion model in the right direction. Based on this
observation, we introduce a three-stage method for controlling the lighting
during image generation. In the first stage, we leverage a standard pretrained
diffusion model to generate a provisional image under uncontrolled lighting.
Next, in the second stage, we resynthesize and refine the foreground object in
the generated image by passing the target lighting to a refined diffusion
model, named DiLightNet, using radiance hints computed on a coarse shape of the
foreground object inferred from the provisional image. To retain the texture
details, we multiply the radiance hints with a neural encoding of the
provisional synthesized image before passing it to DiLightNet. Finally, in the
third stage, we resynthesize the background to be consistent with the lighting
on the foreground object. We demonstrate and validate our lighting controlled
diffusion model on a variety of text prompts and lighting conditions. |
This paper presents DiLightNet, a novel method for fine-grained lighting control during text-driven diffusion-based image generation by augmenting text prompts with radiance hints. |
Existing diffusion models struggle to decouple lighting from image content and text prompts lack the expressive power for detailed lighting descriptions, limiting creative control over lighting. |
The method involves three stages: (1) generating a provisional image with uncontrolled lighting from a text prompt, (2) resynthesizing the foreground using DiLightNet guided by radiance hints computed from a coarse depth estimate and the target lighting, and (3) inpainting a consistent background. |
DiLightNet successfully controls lighting in generated images, enabling diverse lighting conditions for the same text prompt.
Appearance-seed allows exploring plausible material interpretations, while prompt specialization offers additional control over material properties.
Ablation study validates the importance of provisional image encoding, radiance hint selection, foreground masking, and data augmentation. |
Material-light interactions might not perfectly align with the prompt due to limitations in text-based material control.
Reliance on off-the-shelf depth and mask estimation can impact results when estimation is inaccurate. |
diffusion models, image generation, lighting control, radiance hints, controlnet |
2402.11849
Report |
ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image |
Yan Hong, Jianfu Zhang |
Recent advancements in personalizing text-to-image (T2I) diffusion models
have shown the capability to generate images based on personalized visual
concepts using a limited number of user-provided examples. However, these
models often struggle with maintaining high visual fidelity, particularly in
manipulating scenes as defined by textual inputs. Addressing this, we introduce
ComFusion, a novel approach that leverages pretrained models to generate
compositions of a few user-provided subject images and predefined-text scenes,
effectively fusing visual-subject instances with textual-specific scenes,
resulting in the generation of high-fidelity instances within diverse scenes.
ComFusion integrates a class-scene prior preservation regularization, which
leverages composites of the subject class and scene-specific knowledge from
pretrained models to enhance generation fidelity. Additionally, ComFusion uses
coarse generated images, ensuring they align effectively with both the instance
image and scene texts. Consequently, ComFusion maintains a delicate balance
between capturing the essence of the subject and maintaining scene
fidelity. Extensive evaluations of ComFusion against various baselines in T2I
personalization have demonstrated its qualitative and quantitative superiority. |
This paper introduces ComFusion, a novel two-stream fine-tuning approach for personalized text-to-image generation that balances instance fidelity and scene fidelity across diverse scenes. |
Existing methods struggle to maintain both instance fidelity (visual congruence with the instance image) and scene fidelity (aligning generated scenes with text prompts), especially in few-shot personalized generation. |
ComFusion employs a composite stream with class-scene prior loss to preserve class and scene knowledge from the pretrained model, and a fusion stream with visual-textual matching loss to fuse instance visual features with textual scene information. |
ComFusion outperforms baselines in quantitative metrics (CLIP-I, DINO, CLIP-T) demonstrating superior instance and scene fidelity.
Human perceptual studies confirm ComFusion generates images with significantly better instance and scene fidelity compared to baselines.
Ablation studies validate the contribution of both the composite and fusion streams, highlighting the importance of class-scene prior preservation and visual-textual feature fusion. |
ComFusion shows limitations in understanding and rendering creative scenes, material properties, and complex composite semantics.
Future work will focus on addressing these limitations to enhance the model's ability to handle more complex and nuanced scene descriptions. |
text-to-image generation, personalized image generation, diffusion models, few-shot learning, instance fidelity, scene fidelity |
2402.11846
Report |
UnlearnCanvas: A Stylized Image Dataset to Benchmark Machine Unlearning for Diffusion Models |
Yihua Zhang, Yimeng Zhang, Yuguang Yao, Jinghan Jia, Jiancheng Liu, Xiaoming Liu, Sijia Liu |
The rapid advancement of diffusion models (DMs) has not only transformed
various real-world industries but has also introduced negative societal
concerns, including the generation of harmful content, copyright disputes, and
the rise of stereotypes and biases. To mitigate these issues, machine
unlearning (MU) has emerged as a potential solution, demonstrating its ability
to remove undesired generative capabilities of DMs in various applications.
However, by examining existing MU evaluation methods, we uncover several key
challenges that can result in incomplete, inaccurate, or biased evaluations for
MU in DMs. To address them, we enhance the evaluation metrics for MU, including
the introduction of an often-overlooked retainability measurement for DMs
post-unlearning. Additionally, we introduce UnlearnCanvas, a comprehensive
high-resolution stylized image dataset that enables us to evaluate the
unlearning of artistic painting styles in conjunction with associated image
objects. We show that this dataset plays a pivotal role in establishing a
standardized and automated evaluation framework for MU techniques on DMs,
featuring 7 quantitative metrics to address various aspects of unlearning
effectiveness. Through extensive experiments, we benchmark 5 state-of-the-art
MU methods, revealing novel insights into their pros and cons, and the
underlying unlearning mechanisms. Furthermore, we demonstrate the potential of
UnlearnCanvas to benchmark other generative modeling tasks, such as style
transfer. The UnlearnCanvas dataset, benchmark, and the codes to reproduce all
the results in this work can be found at
https://github.com/OPTML-Group/UnlearnCanvas. |
This paper introduces UnlearnCanvas, a large-scale, high-resolution dataset designed to benchmark machine unlearning (MU) in diffusion models, specifically focusing on unlearning artistic styles and objects. |
Existing MU evaluation methods for diffusion models suffer from limitations such as limited target diversity, imprecise evaluation, and a lack of retainability assessment, hindering the development and understanding of MU techniques. |
The authors curate UnlearnCanvas with dual style-object supervision and high stylistic consistency. They also propose an evaluation pipeline that includes metrics for unlearning effectiveness, in-domain and cross-domain retainability, generation quality, and efficiency. |
Retainability metrics are crucial for a comprehensive MU assessment, revealing significant performance differences not captured by unlearning accuracy alone.
Cross-domain retainability is harder to maintain than in-domain retainability, highlighting a previously overlooked challenge.
No single MU method excels in all aspects, indicating room for improvement and the need for a balanced approach. |
The study primarily focuses on Stable Diffusion v1.5; evaluating other diffusion models is left for future work.
Exploring the impact of varying dataset sizes and prompt complexities on unlearning performance is an area for further investigation. |
machine unlearning, diffusion models, benchmarking, style transfer, generative ai |
2402.11487
Report |
Visual Concept-driven Image Generation with Text-to-Image Diffusion Model |
Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal |
Text-to-image (TTI) diffusion models have demonstrated impressive results in
generating high-resolution images of complex and imaginative scenes. Recent
approaches have further extended these methods with personalization techniques
that allow them to integrate user-illustrated concepts (e.g., the user
him/herself) using a few sample image illustrations. However, the ability to
generate images with multiple interacting concepts, such as human subjects, as
well as concepts that may be entangled in one, or across multiple, image
illustrations remains elusive. In this work, we propose a concept-driven TTI
personalization framework that addresses these core challenges. We build on
existing works that learn custom tokens for user-illustrated concepts, allowing
those to interact with existing text tokens in the TTI model. However,
importantly, to disentangle and better learn the concepts in question, we
jointly learn (latent) segmentation masks that disentangle these concepts in
user-provided image illustrations. We do so by introducing an Expectation
Maximization (EM)-like optimization procedure where we alternate between
learning the custom tokens and estimating masks encompassing corresponding
concepts in user-supplied images. We obtain these masks based on
cross-attention, from within the U-Net parameterized latent diffusion model and
subsequent Dense CRF optimization. We illustrate that such joint alternating
refinement leads to the learning of better tokens for concepts and, as a
bi-product, latent masks. We illustrate the benefits of the proposed approach
qualitatively and quantitatively (through user studies) with a number of
examples and use cases that can combine up to three entangled concepts. |
This paper proposes a concept-driven text-to-image (TTI) personalization framework that disentangles multiple concepts from a single or multiple images for generating novel compositions and interactions. |
Existing TTI personalization methods struggle to generate images with multiple interacting user-specified concepts, especially when entangled within a single illustration. |
The method uses an Expectation Maximization (EM)-like optimization to jointly learn: 1) concept-specific tokens and 2) latent binary masks for each concept. It leverages cross-attention maps within the diffusion model to generate and refine these masks. |
Quantitative user studies show that the proposed method significantly outperforms baselines in generating faithful and controllable images with multiple interacting concepts.
The approach can effectively disentangle concepts from a single image, removing the need for user-provided masks.
It demonstrates strong performance in generating complex scenarios with interactions between user-specified concepts, including both cartoon and real-world instances. |
The current method focuses on generating interactions between a limited number of concepts (up to three).
Future work could explore extending this framework to handle a wider array of interactions and more complex compositions. |
text-to-image generation, personalization, diffusion models, concept disentanglement, cross-attention |
2402.11303
Report |
FViT: A Focal Vision Transformer with Gabor Filter |
Yulong Shi, Mingwei Sun, Yongshuai Wang, Rui Wang, Hui Sun, Zengqiang Chen |
Vision transformers have achieved encouraging progress in various computer
vision tasks. A common belief is that this is attributed to the competence of
self-attention in modeling the global dependencies among feature tokens.
Unfortunately, self-attention still faces some challenges in dense prediction
tasks, such as the high computational complexity and absence of desirable
inductive bias. To address these issues, we revisit the potential benefits of
integrating vision transformers with Gabor filters, and propose a Learnable Gabor
Filter (LGF) by using convolution. As an alternative to self-attention, we
employ LGF to simulate the response of simple cells in the biological visual
system to input images, prompting models to focus on discriminative feature
representations of targets from various scales and orientations. Additionally,
we design a Bionic Focal Vision (BFV) block based on the LGF. This block draws
inspiration from neuroscience and introduces a Multi-Path Feed Forward Network
(MPFFN) to emulate the way the biological visual cortex processes
information in parallel. Furthermore, we develop a unified and efficient
pyramid backbone network family called Focal Vision Transformers (FViTs) by
stacking BFV blocks. Experimental results show that FViTs exhibit highly
competitive performance in various vision tasks. Especially in terms of
computational efficiency and scalability, FViTs show significant advantages
compared with other counterparts. Code is available at
https://github.com/nkusyl/FViT |
This paper proposes Focal Vision Transformers (FViTs), a family of efficient vision backbone networks that replace self-attention with a Learnable Gabor Filter (LGF) and introduce a Multi-Path Feed Forward Network (MPFFN) inspired by neuroscience. |
Self-attention in vision transformers suffers from high computational complexity, lack of local sensitivity, and absence of inductive bias. FViTs aim to address these issues by providing an efficient and scalable alternative. |
The paper designs LGF using convolution to simulate simple cell responses in the visual system. MPFFN emulates parallel information processing in the visual cortex. These components are combined in a hierarchical pyramid backbone network. |
FViTs achieve competitive performance on ImageNet classification compared to CNNs and vision transformers, demonstrating a good balance between accuracy and efficiency.
Experiments on COCO object detection and instance segmentation show FViTs outperform ResNet and achieve competitive results with PVT and PoolFormer.
Evaluations on ADE20K semantic segmentation task further confirm the effectiveness of FViTs in dense prediction tasks. |
The paper primarily focuses on image-level tasks and could explore more challenging video-related tasks.
Further investigation into the combination of LGF and self-attention for potential synergistic effects is warranted. |
vision transformer, gabor filter, image classification, object detection, semantic segmentation |
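For readers unfamiliar with Gabor filters, the following is a minimal sketch of the classical real-valued 2D Gabor kernel, the kind of function a learnable variant such as LGF would parameterize. The function name, parameter names, and default values are illustrative assumptions and are not taken from the FViT code; in a learnable setting, the wavelength, orientation, phase, and envelope width would presumably be trainable and the resulting kernel used as convolution weights.

```python
import math


def gabor_kernel(size=5, wavelength=4.0, theta=0.0, psi=0.0, sigma=2.0, gamma=0.5):
    """Return a size x size real Gabor kernel as a list of rows (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope; gamma controls the aspect ratio.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            # Sinusoidal carrier with the given wavelength and phase offset psi.
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

Varying `theta` and `wavelength` yields filters that respond to edges at different orientations and scales, which is the property the summary above attributes to LGF.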
2402.11281
Report |
Can Large Multimodal Models Uncover Deep Semantics Behind Images? |
Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui |
Understanding the deep semantics of images is essential in the era dominated
by social media. However, current research focuses primarily on the superficial
description of images, revealing a notable deficiency in the systematic
investigation of the inherent deep semantics. In this work, we introduce
DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs)
capacities of visual deep semantics. DEEPEVAL includes a human-annotated dataset
and three progressive subtasks: fine-grained description selection, in-depth
title matching, and deep semantics understanding. Utilizing DEEPEVAL, we
evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a
substantial gap between the deep semantic comprehension capabilities of
existing LMMs and humans. For example, GPT-4V is 30% behind humans in
understanding deep semantics, even though it achieves human-comparable
performance in image description. Further analysis indicates that the
integration of description texts during the inference process notably enhances
LMMs' ability to perceive deep semantics. Furthermore, our dataset is divided
into multiple categories, and we conducted a more detailed analysis within
these categories. |
This paper introduces DEEPEVAL, a benchmark designed to assess the capabilities of Large Multimodal Models (LMMs) in understanding the deep semantics of images. |
Existing research primarily focuses on superficial image descriptions, neglecting the crucial aspect of deep semantic understanding, which is vital for comprehending the deeper meaning and message conveyed in visual content. |
DEEPEVAL comprises a human-annotated dataset of cartoon images and three progressive subtasks: Fine-grained Description Selection, In-depth Title Matching, and Deep Semantics Understanding. The authors evaluate nine open-source LMMs and GPT-4V(ision) using these tasks. |
There's a significant gap between the deep semantic comprehension abilities of current LMMs and humans.
Integrating description texts during inference notably improves LMMs' ability to perceive deep semantics.
LMMs exhibit varying performance across different image categories, with certain categories, such as 'Satirical,' posing greater challenges. |
The dataset is limited in terms of image categories and currently only includes cartoons.
Images with potentially controversial deep semantics are excluded to ensure annotator consensus. |
large multimodal models, deep semantics, image understanding, benchmarking, visual reasoning |
2402.11248
Report |
CoLLaVO: Crayon Large Language and Vision mOdel |
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro |
The remarkable success of Large Language Models (LLMs) and instruction tuning
drives the evolution of Vision Language Models (VLMs) towards a versatile
general-purpose model. Yet, it remains unexplored whether current VLMs
genuinely possess quality object-level image understanding capabilities
determined from 'what objects are in the image?' or 'which object corresponds
to a specified bounding box?'. Our findings reveal that the image understanding
capabilities of current VLMs are strongly correlated with their zero-shot
performance on vision language (VL) tasks. This suggests that prioritizing
basic image understanding is crucial for VLMs to excel at VL tasks. To enhance
object-level image understanding, we propose Crayon Large Language and Vision
mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a
new visual prompt tuning scheme based on panoptic color maps. Furthermore, we
present a learning strategy of Dual QLoRA to preserve object-level image
understanding without forgetting it during visual instruction tuning, thereby
achieving a significant leap in numerous VL benchmarks in a zero-shot setting. |
The paper introduces CoLLaVO, a new large language and vision model that leverages a novel visual prompt tuning scheme called Crayon Prompt and a learning strategy called Dual QLoRA to significantly enhance object-level image understanding and achieve state-of-the-art zero-shot performance on various vision language tasks. |
Current Vision Language Models (VLMs) often lack sufficient object-level image understanding, which limits their performance on complex vision language tasks. This paper aims to address this issue by improving the object recognition and understanding capabilities of VLMs. |
The authors propose two key techniques: 1) Crayon Prompt: Inspired by panoptic segmentation maps, this method injects object-level semantic and numbering information into image embedding features at every attention layer. 2) Dual QLoRA: This learning strategy utilizes two QLoRA modules to efficiently train the model on both crayon instructions for object-level understanding and visual instruction tuning datasets for complex VL tasks, preventing catastrophic forgetting. |
CoLLaVO achieves state-of-the-art zero-shot performance on various VL benchmarks, including MME, MM-Bench, MM-Bench-Chinese, and Q-Bench.
The Crayon Prompt, particularly the semantic embedding component, significantly improves object-level image understanding, as demonstrated by improved scores on tasks like MME-P.
Dual QLoRA effectively integrates both crayon instructions and visual instruction tuning datasets, leading to superior performance compared to using either approach alone. |
The performance of Crayon Prompts relies on the accuracy and object class coverage of the external panoptic segmentation model.
Future work includes exploring the integration of diverse visual prompts from various sources like object classification, captioning models, and open-object detection. |
vision language models, object-level image understanding, visual prompt tuning, crayon prompt, dual qlora |
2402.11148
Report |
Knowledge Distillation Based on Transformed Teacher Matching |
Kaixiang Zheng, En-Hui Yang |
As a technique to bridge logit matching and probability distribution
matching, temperature scaling plays a pivotal role in knowledge distillation
(KD). Conventionally, temperature scaling is applied to both teacher's logits
and student's logits in KD. Motivated by some recent works, in this paper, we
instead drop temperature scaling on the student side, and systematically study
the resulting variant of KD, dubbed transformed teacher matching (TTM). By
reinterpreting temperature scaling as a power transform of probability
distribution, we show that in comparison with the original KD, TTM has an
inherent Rényi entropy term in its objective function, which serves as an
extra regularization term. Extensive experimental results demonstrate that thanks
to this inherent regularization, TTM leads to trained students with better
generalization than the original KD. To further enhance the student's capability to
match the teacher's power-transformed probability distribution, we introduce a
sample-adaptive weighting coefficient into TTM, yielding a novel distillation
approach dubbed weighted TTM (WTTM). It is shown, by comprehensive experiments,
that although WTTM is simple, it is effective, improves upon TTM, and achieves
state-of-the-art accuracy performance. Our source code is available at
https://github.com/zkxufo/TTM. |
This paper introduces Transformed Teacher Matching (TTM), a knowledge distillation (KD) variant that removes temperature scaling from the student model, leading to improved generalization due to an inherent Rényi entropy regularization. |
This work provides a novel understanding of temperature scaling in KD, showing it's better to apply it only to the teacher model. This leads to improved generalization and provides a new theoretical framework for KD. |
The authors reinterpret temperature scaling as a probability distribution power transform. By removing temperature scaling from the student in KD, they derive TTM and show it embeds a Rényi entropy regularizer, improving generalization. They further enhance TTM with sample-adaptive weighting, resulting in Weighted TTM (WTTM). |
TTM consistently outperforms KD in image classification tasks on CIFAR-100 and ImageNet.
WTTM further improves upon TTM by adaptively weighting the distillation loss based on sample difficulty.
WTTM achieves state-of-the-art accuracy, even surpassing many complex feature-based distillation methods. |
The selection of the sample-adaptive weight in WTTM could be further optimized.
Exploration of alternative probability distribution transforms beyond the power transform could yield additional benefits. |
knowledge distillation, temperature scaling, rényi entropy, regularization, image classification |
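The difference between conventional KD and TTM described above can be illustrated with a small sketch. It shows (a) that temperature scaling of logits is equivalent to a normalized power transform of the softmax probabilities, and (b) the two distillation terms, which differ only in whether the student side is temperature-scaled. All function names are hypothetical; the cross-entropy term against ground-truth labels, the loss weighting, the commonly used T² factor in KD, and the sample-adaptive weight of WTTM are deliberately omitted, so this is not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def power_transform(probs, temperature):
    """Raise each probability to the power 1/T and renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_term(teacher_logits, student_logits, temperature):
    """Conventional KD term: temperature applied to teacher and student."""
    return kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))


def ttm_term(teacher_logits, student_logits, temperature):
    """TTM-style term: temperature applied to the teacher only."""
    return kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, 1.0))
```

With identical teacher and student logits and a temperature above 1, `kd_term` is zero while `ttm_term` is positive: matching the transformed teacher pushes the student toward a smoother output distribution, which is consistent with the regularization effect the report describes.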
2402.10882
Report |
Universal Prompt Optimizer for Safe Text-to-Image Generation |
Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang |
Text-to-Image (T2I) models have shown great performance in generating images
based on textual prompts. However, these models are vulnerable to unsafe inputs
that lead them to generate unsafe content such as sexual, harassment, and illegal-activity images.
Existing approaches based on image checkers, model fine-tuning, and embedding
blocking are impractical in real-world applications. Hence, we propose the
first universal prompt optimizer for safe T2I (POSI) generation in black-box
scenario. We first construct a dataset consisting of toxic-clean prompt pairs
by GPT-3.5 Turbo. To teach the optimizer to convert toxic prompts into
clean prompts while preserving semantic information, we design a
novel reward function measuring toxicity and text alignment of generated images
and train the optimizer through Proximal Policy Optimization. Experiments show
that our approach can effectively reduce the likelihood of various T2I models
generating inappropriate images, with no significant impact on text
alignment. It can also be flexibly combined with other methods to achieve better
performance. Our code is available at https://github.com/wzongyu/POSI. |
This paper proposes POSI, the first universal prompt optimizer for safe Text-to-Image (T2I) generation in a black-box scenario. POSI revises potentially harmful prompts to generate safe images while preserving semantic content. |
Existing safety measures for T2I models, like image checkers and model fine-tuning, have limitations in real-world applications. POSI offers a universal and flexible solution for enhancing the safety of black-box T2I models without requiring access to their internal structure. |
The methodology involves: (1) Constructing a toxic-clean prompt pairs dataset using GPT-3.5 Turbo. (2) Supervised fine-tuning (SFT) of a language model (LLaMA) on the dataset for basic prompt rewriting. (3) Designing a novel reward function that considers both the toxicity (using Q16 classifier) and text alignment (using CLIP similarity) of generated images. (4) Further training the language model using Proximal Policy Optimization (PPO) to maximize the reward and improve safe image generation. |
POSI effectively reduces the likelihood of generating inappropriate images across various T2I models, including SD versions and black-box models like DALL-E 3 and Midjourney.
It maintains good text alignment, ensuring the generated images stay relevant to the user's original (though potentially harmful) prompt.
The framework is flexible and can be combined with existing safety methods like SLD and SD-NP to further enhance their effectiveness. |
Balancing the trade-off between image safety and text alignment remains a challenge.
Constructing datasets tailored to produce inappropriate images on specific T2I models like DALL-E 3 and Midjourney is crucial for future research and algorithm development. |
text-to-image generation, safe ai, prompt engineering, reinforcement learning, black-box optimization |
2402.10855
Report |
Control Color: Multimodal Diffusion-based Interactive Image Colorization |
Zhexin Liang, Zhaochen Li, Shangchen Zhou, Chongyi Li, Chen Change Loy |
Despite the existence of numerous colorization methods, several limitations
still exist, such as lack of user interaction, inflexibility in local
colorization, unnatural color rendering, insufficient color variation, and
color overflow. To solve these issues, we introduce Control Color (CtrlColor),
a multi-modal colorization method that leverages the pre-trained Stable
Diffusion (SD) model, offering promising capabilities in highly controllable
interactive image colorization. While several diffusion-based methods have been
proposed, supporting colorization in multiple modalities remains non-trivial.
In this study, we aim to tackle both unconditional and conditional image
colorization (text prompts, strokes, exemplars) and address color overflow and
incorrect color within a unified framework. Specifically, we present an
effective way to encode user strokes to enable precise local color manipulation
and employ a practical way to constrain the color distribution similar to
exemplars. Apart from accepting text prompts as conditions, these designs add
versatility to our approach. We also introduce a novel module based on
self-attention and a content-guided deformable autoencoder to address the
long-standing issues of color overflow and inaccurate coloring. Extensive
comparisons show that our model outperforms state-of-the-art image colorization
methods both qualitatively and quantitatively. |
This paper proposes CtrlColor, a novel multi-modal diffusion-based colorization framework that unifies unconditional, prompt-, stroke-, and exemplar-based image colorization. |
Existing colorization methods have limitations such as lack of user interaction, inflexibility in local colorization, unnatural color rendering, insufficient color variation, and color overflow. |
The framework leverages the pre-trained Stable Diffusion model, introduces a novel module for stroke encoding, employs a method to constrain color distribution similar to exemplars, and utilizes self-attention guidance and a content-guided deformable autoencoder to address color overflow and inaccurate coloring. |
CtrlColor outperforms state-of-the-art methods in terms of color richness, stability, and visual quality.
The method effectively addresses color overflow and miscoloring issues.
It offers highly precise and flexible control, enabling users to modify image color locally using strokes. |
Region coloring may not generate very colorful results for small regions in grayscale images.
Exemplar-based colorization might not perfectly replicate complex color distributions from exemplars. |
image colorization, diffusion models, multi-modal learning, stable diffusion, interactive image editing |
2402.10821
Report |
Training Class-Imbalanced Diffusion Model Via Overlap Optimization |
Divin Yan, Lu Qi, Vincent Tao Hu, Ming-Hsuan Yang, Meng Tang |
Diffusion models have made significant advances recently in high-quality
image synthesis and related tasks. However, diffusion models trained on
real-world datasets, which often follow long-tailed distributions, yield
inferior fidelity for tail classes. Deep generative models, including diffusion
models, are biased towards classes with abundant training images. To address
the observed appearance overlap between synthesized images of rare classes and
tail classes, we propose a method based on contrastive learning to minimize the
overlap between distributions of synthetic images for different classes. We
show variants of our probabilistic contrastive learning method can be applied
to any class conditional diffusion model. We show significant improvement in
image synthesis using our loss for multiple datasets with long-tailed
distribution. Extensive experimental results demonstrate that the proposed
method can effectively handle imbalanced data for diffusion-based generation
and classification models. Our code and datasets will be publicly available at
https://github.com/yanliang3612/DiffROP. |
This paper proposes DiffROP, a novel framework to train class-imbalanced diffusion models by minimizing distribution overlap between head and tail classes using probabilistic contrastive learning. |
Diffusion models trained on real-world, long-tailed datasets often generate low-fidelity images for tail classes due to bias towards data-abundant head classes. |
The method introduces a probabilistic contrastive learning (PCL) loss to penalize the KL divergence between conditional image distributions of different classes, effectively minimizing it using estimated noise from image pairs. |
DiffROP significantly improves FID scores and other metrics on CIFAR10LT and CIFAR100LT datasets, indicating better image fidelity and diversity.
The method consistently enhances performance across different class categories, particularly for tail classes, showing its robustness to dataset imbalances.
Integrating DiffROP for data augmentation in long-tailed classification tasks leads to notable improvements in accuracy, precision, and recall. |
The study primarily focuses on image synthesis; further exploration is needed for other data modalities.
Fine-tuning the classifier-free guidance strength (ω) is crucial for optimal performance and requires careful consideration. |
diffusion models, class imbalance, long-tailed distribution, probabilistic contrastive learning, image synthesis |
2402.10739
Report |
PointMamba: A Simple State Space Model for Point Cloud Analysis |
Dingkang Liang, Xin Zhou, Xinyu Wang, Xingkui Zhu, Wei Xu, Zhikang Zou, Xiaoqing Ye, Xiang Bai |
Transformers have become one of the foundational architectures in point cloud
analysis tasks due to their excellent global modeling ability. However, the
attention mechanism has quadratic complexity and is difficult to extend to
long-sequence modeling under limited computational resources. Recently,
state space models (SSM), a new family of deep sequence models, have presented
great potential for sequence modeling in NLP tasks. In this paper, taking
inspiration from the success of SSM in NLP, we propose PointMamba, a framework
with global modeling and linear complexity. Specifically, by taking embedded
point patches as input, we propose a reordering strategy to enhance SSM's
global modeling ability by providing a more logical geometric scanning order.
The reordered point tokens are then sent to a series of Mamba blocks to
causally capture the point cloud structure. Experimental results show our
proposed PointMamba outperforms the transformer-based counterparts on different
point cloud analysis datasets, while saving about 44.3% of
parameters and 25% of FLOPs, demonstrating a potential option for constructing
foundational 3D vision models. We hope our PointMamba can provide a new
perspective for point cloud analysis. The code is available at
https://github.com/LMD0311/PointMamba. |
This paper introduces PointMamba, a novel state space model (SSM) designed for point cloud analysis, achieving global modeling capabilities with linear complexity, making it a potential cornerstone for 3D vision foundation models. |
Existing Transformer-based models, while effective for point cloud analysis, suffer from quadratic complexity, hindering their scalability to long sequences. PointMamba addresses this limitation by leveraging the efficiency of SSMs while maintaining global receptive fields. |
PointMamba utilizes a point tokenizer to generate point tokens from input point clouds. A reordering strategy then organizes these tokens based on geometric coordinates, facilitating causal structure capturing by the subsequent Mamba blocks. The model is pre-trained using an asymmetric autoencoder with a masked point reconstruction objective. |
PointMamba demonstrates competitive performance against Transformer-based counterparts on ModelNet40 and ShapeNetPart datasets, achieving comparable or superior accuracy with significantly reduced parameters and FLOPs.
It outperforms Point-MAE in various ScanObjectNN benchmark tasks, showcasing its robustness in real-world object classification.
The model exhibits superior memory efficiency for processing lengthy sequences compared to ViT-based approaches, making it suitable for large-scale point cloud analysis. |
The current reordering strategy, while effective, involves tripling the sequence length, which may limit the model's capacity to handle extremely long sequences.
The pre-training strategy adopted from Point-MAE is not specifically tailored for the unidirectional nature of SSMs, leaving room for further optimization. |
point cloud analysis, state space model, mamba, global modeling, linear complexity |
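The reordering step in the PointMamba report above can be sketched as follows. The report states only that tokens are ordered by geometric coordinates and that the sequence length triples; the sketch assumes this means sorting the tokens once along each of the x, y, and z axes of their patch centers and concatenating the three orderings. The function name and the data layout are hypothetical.

```python
def reorder_point_tokens(centers, tokens):
    """Concatenate three copies of the tokens, sorted by x, then y, then z.

    centers: list of (x, y, z) patch-center coordinates, one per token.
    tokens:  list of token embeddings (any objects), aligned with centers.
    """
    assert len(centers) == len(tokens)
    reordered = []
    for axis in range(3):
        # Indices of the tokens sorted by the current coordinate axis.
        order = sorted(range(len(centers)), key=lambda i: centers[i][axis])
        reordered.extend(tokens[i] for i in order)
    return reordered
```

The output has three times as many entries as the input, which matches the limitation noted in the report: a unidirectional sequence model sees each token under three different scan orders, at the cost of a longer sequence.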
2402.10636
Report |
PEGASUS: Personalized Generative 3D Avatars with Composable Attributes |
Hyunsoo Cha, Byungjun Kim, Hanbyul Joo |
We present PEGASUS, a method for constructing a personalized generative 3D
face avatar from monocular video sources. Our generative 3D avatar enables
disentangled controls to selectively alter the facial attributes (e.g., hair or
nose) while preserving the identity. Our approach consists of two stages:
synthetic database generation and constructing a personalized generative
avatar. We generate a synthetic video collection of the target identity with
varying facial attributes, where the videos are synthesized by borrowing the
attributes from monocular videos of diverse identities. Then, we build a
person-specific generative 3D avatar that can modify its attributes
continuously while preserving its identity. Through extensive experiments, we
demonstrate that our method of generating a synthetic database and creating a
3D generative avatar is the most effective in preserving identity while
achieving high realism. Subsequently, we introduce a zero-shot approach to
achieve the same goal of generative modeling more efficiently by leveraging a
previously constructed personalized generative model. |
PEGASUS is a novel method for creating personalized, generative 3D face avatars from monocular videos, allowing for disentangled control over facial attributes (e.g., hair, nose) while preserving identity. |
Personalized and controllable 3D avatars are important for various applications, including AR/VR and the metaverse. Existing methods often lack the ability to alter facial attributes or struggle to maintain identity. |
The method involves two stages: (1) generating a synthetic database of the target individual with varying facial attributes by swapping parts from other videos, and (2) training a personalized generative 3D avatar model using this database. Additionally, a zero-shot transfer approach leverages previously constructed models for efficient avatar creation. |
PEGASUS outperforms baseline methods in preserving identity and naturalness when transferring hairstyles.
The synthetic database generation with part-swapping leads to better generative performance compared to using original videos directly.
The zero-shot transfer approach efficiently creates personalized avatars without additional training, showing high identity preservation. |
The quality of generated avatars does not yet reach photorealistic levels and exhibits artifacts.
The reliance on non-physical-based methods for synthetic database generation limits physical accuracy. |
3d face avatar, generative model, personalized avatar, part swapping, zero-shot transfer |
2402.10491
Report |
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation |
Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen |
Diffusion models have proven to be highly effective in image and video
generation; however, they still face composition challenges when generating
images of varying sizes due to single-scale training data. Adapting large
pre-trained diffusion models for higher resolution demands substantial
computational and optimization resources, yet achieving a generation capability
comparable to low-resolution models remains elusive. This paper proposes a
novel self-cascade diffusion model that leverages the rich knowledge gained
from a well-trained low-resolution model for rapid adaptation to
higher-resolution image and video generation, employing either tuning-free or
cheap upsampler tuning paradigms. Integrating a sequence of multi-scale
upsampler modules, the self-cascade diffusion model can efficiently adapt to a
higher resolution, preserving the original composition and generation
capabilities. We further propose a pivot-guided noise re-schedule strategy to
speed up the inference process and improve local structural details. Compared
to full fine-tuning, our approach achieves a 5X training speed-up and requires
only an additional 0.002M tuning parameters. Extensive experiments demonstrate
that our approach can quickly adapt to higher resolution image and video
synthesis by fine-tuning for just 10k steps, with virtually no additional
inference time. |
This paper presents a novel self-cascade diffusion model for rapid adaptation of pre-trained models to higher resolutions for image and video generation. |
Existing diffusion models face challenges in generating images of varying sizes due to single-scale training data, and adapting them to higher resolutions is computationally expensive and often results in poor composition and generation quality. |
The method utilizes a pivot-guided noise re-scheduling strategy to progressively synthesize higher resolution images by reusing the knowledge from a well-trained low-resolution model. It introduces lightweight, learnable upsampling modules to further improve the adaptation with minimal fine-tuning on a small amount of high-resolution data. |
The approach achieves a 5x training speed-up compared to full fine-tuning and requires only 0.002M additional parameters.
It demonstrates state-of-the-art performance in both tuning-free and tuning settings across various scale adaptations for both image and video generation.
The method efficiently adapts to higher resolutions with minimal additional inference time. |
The performance of the method may be limited when the scale gap is too large due to the small number of parameters in the upsampling modules.
Future work will explore the trade-off between adaptation efficiency and generalization ability. |
diffusion models, image generation, video generation, resolution adaptation, self-cascade |
2402.10401
Report |
ManiFPT: Defining and Analyzing Fingerprints of Generative Models |
Hae Jin Song, Mahyar Khayatkhoei, Wael AbdAlmageed |
Recent works have shown that generative models leave traces of their
underlying generative process on the generated samples, broadly referred to as
fingerprints of a generative model, and have studied their utility in detecting
synthetic images from real ones. However, the extent to which these
fingerprints can distinguish between various types of synthetic images and help
identify the underlying generative process remains under-explored. In
particular, the very definition of a fingerprint remains unclear, to our
knowledge. To that end, in this work, we formalize the definition of artifact
and fingerprint in generative models, propose an algorithm for computing them
in practice, and finally study its effectiveness in distinguishing a large
array of different generative models. We find that using our proposed
definition can significantly improve the performance on the task of identifying
the underlying generative process from samples (model attribution) compared to
existing methods. Additionally, we study the structure of the fingerprints, and
observe that it is very predictive of the effect of different design choices on
the generative process. |
This work proposes a formal definition of fingerprints in generative models, based on the deviation of generated samples from the true data manifold. |
Identifying the source of synthetic data is crucial for various applications, including differentiating authorized from malicious personification and detecting digital copyright infringement. Existing works lack a clear definition of generative model fingerprints, hindering systematic study and comparison. |
The authors define an artifact as the difference between a generated sample and its closest point on the true data manifold. The fingerprint of a generative model is then defined as the set of all its artifacts. They propose an algorithm to compute these artifacts by estimating the data manifold from real samples in a chosen embedding space (RGB, Frequency, or learned spaces). |
The proposed fingerprint definition, when used as features for model attribution, outperforms existing methods on four different datasets.
Analysis of feature spaces shows that the proposed fingerprint representations exhibit better separability compared to baselines.
The clustering structure of the fingerprints reveals a strong alignment with the choice of upsampling methods and loss functions used in generative models, confirming common intuitions about model limitations. |
The estimation of the true data manifold relies on finite samples, which might not perfectly represent the underlying manifold.
Future work includes investigating the impact of different embedding spaces and distance metrics on fingerprint quality and exploring techniques to improve manifold estimation. |
generative models, fingerprinting, model attribution, data manifold, deep learning |
2402.10294
Report |
LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing |
Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, Raj Sodhi |
Video creation has become increasingly popular, yet the expertise and effort
required for editing often pose barriers to beginners. In this paper, we
explore the integration of large language models (LLMs) into the video editing
workflow to reduce these barriers. Our design vision is embodied in LAVE, a
novel system that provides LLM-powered agent assistance and language-augmented
editing features. LAVE automatically generates language descriptions for the
user's footage, serving as the foundation for enabling the LLM to process
videos and assist in editing tasks. When the user provides editing objectives,
the agent plans and executes relevant actions to fulfill them. Moreover, LAVE
allows users to edit videos through either the agent or direct UI manipulation,
providing flexibility and enabling manual refinement of agent actions. Our user
study, which included eight participants ranging from novices to proficient
editors, demonstrated LAVE's effectiveness. The results also shed light on user
perceptions of the proposed LLM-assisted editing paradigm and its impact on
users' creativity and sense of co-creation. Based on these findings, we propose
design implications to inform the future development of agent-assisted content
editing. |
This paper presents LAVE, a video editing tool that integrates Large Language Models (LLMs) to provide agent assistance and language-augmented editing features, aiming to lower editing barriers for beginners and enhance the editing workflow. |
Video creation is popular, but the complexity of editing poses challenges for beginners. LAVE addresses these challenges by leveraging LLMs' linguistic capabilities to assist users throughout the editing process, from ideation to execution. |
LAVE combines a language-augmented video gallery with an LLM-based plan-and-execute agent. It automatically generates textual descriptions of videos, enabling the agent to understand and manipulate them based on user instructions. |
User study participants successfully produced videos using LAVE and found it enjoyable and efficient.
Users appreciated the novelty of LAVE's language-driven interaction and its potential to democratize video editing.
The study revealed varying preferences for agent assistance, emphasizing the need for adaptive support tailored to user needs and task types. |
LAVE's current agent design and editing functions could be further enhanced, for example, by incorporating multi-agent systems and more fine-grained editing controls.
Future work can address limitations related to LLM capabilities, such as the limited context window and occasional factual inaccuracies. |
video editing, llms, agents, human-ai co-creation, language augmentation |
2402.10210
Report |
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation |
Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, Quanquan Gu |
Fine-tuning Diffusion Models remains an underexplored frontier in generative
artificial intelligence (GenAI), especially when compared with the remarkable
progress made in fine-tuning Large Language Models (LLMs). While cutting-edge
diffusion models such as Stable Diffusion (SD) and SDXL rely on supervised
fine-tuning, their performance inevitably plateaus after seeing a certain
volume of data. Recently, reinforcement learning (RL) has been employed to
fine-tune diffusion models with human preference data, but it requires at least
two images ("winner" and "loser" images) for each text prompt. In this paper,
we introduce an innovative technique called self-play fine-tuning for diffusion
models (SPIN-Diffusion), where the diffusion model engages in competition with
its earlier versions, facilitating an iterative self-improvement process. Our
approach offers an alternative to conventional supervised fine-tuning and RL
strategies, significantly improving both model performance and alignment. Our
experiments on the Pick-a-Pic dataset reveal that SPIN-Diffusion outperforms
the existing supervised fine-tuning method in aspects of human preference
alignment and visual appeal right from its first iteration. By the second
iteration, it exceeds the performance of RLHF-based methods across all metrics,
achieving these results with less data. |
This paper introduces SPIN-Diffusion, a novel self-play fine-tuning method for diffusion models that effectively utilizes datasets with only one image per text prompt. |
Fine-tuning diffusion models like Stable Diffusion often plateaus with limited data, and existing reinforcement learning methods require multiple images per prompt, limiting their applicability to common datasets. |
SPIN-Diffusion leverages a self-play mechanism where the diffusion model competes against its earlier versions, iteratively improving its performance through a decomposed training objective based on differentiating and deceiving the test function. |
SPIN-Diffusion outperforms supervised fine-tuning and existing Diffusion-DPO methods in human preference alignment and visual appeal.
The method surpasses baselines on the Pick-a-Pic dataset, achieving superior scores in metrics like PickScore and Aesthetic score.
Theoretical analysis shows SPIN-Diffusion's stationary point aligns with the target data distribution, outperforming traditional supervised fine-tuning. |
The paper mainly focuses on text-to-image generation, and further investigation is needed for other applications of diffusion models.
Future work could explore incorporating human feedback during the fine-tuning process to further enhance performance. |
diffusion models, self-play fine-tuning, text-to-image generation, generative ai, stable diffusion |
2402.10208
Report |
Recovering the Pre-Fine-Tuning Weights of Generative Models |
Eliahu Horwitz, Jonathan Kahana, Yedid Hoshen |
The dominant paradigm in generative modeling consists of two steps: i)
pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained
model with human values via fine-tuning. This practice is considered safe, as
no current method can recover the unsafe, pre-fine-tuning model weights. In
this paper, we demonstrate that this assumption is often false. Concretely, we
present Spectral DeTuning, a method that can recover the weights of the
pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In
contrast to previous attacks that attempt to recover pre-fine-tuning
capabilities, our method aims to recover the exact pre-fine-tuning weights. Our
approach exploits this new vulnerability against large-scale models such as a
personalized Stable Diffusion and an aligned Mistral. |
This paper identifies a new vulnerability in LoRA fine-tuned models, enabling the recovery of pre-fine-tuning weights using multiple models fine-tuned from the same source. |
This vulnerability poses significant security and safety risks, as it allows access to potentially unsafe pre-trained models even after alignment fine-tuning. |
The authors propose Spectral DeTuning, an iterative low-rank matrix factorization method that exploits the low-rank nature of LoRA updates to recover the original weights. |
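A rough sketch of what such an iterative low-rank factorization could look like, assuming each fine-tuned matrix equals a shared base matrix plus a rank-r update and that the rank is known. The alternating scheme, the function names, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def truncate_rank(mat, rank):
    """Best rank-`rank` approximation of `mat` via truncated SVD."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def recover_base(finetuned, rank, steps=100):
    """Alternate between estimating the low-rank updates and the shared base."""
    base = np.mean(finetuned, axis=0)  # initial guess: average of fine-tuned weights
    for _ in range(steps):
        # With the base fixed, each update is the best rank-r fit to its residual.
        updates = [truncate_rank(w - base, rank) for w in finetuned]
        # With the updates fixed, the base is the mean of the de-updated weights.
        base = np.mean([w - m for w, m in zip(finetuned, updates)], axis=0)
    return base

# Synthetic check (assumed setup): 8 rank-2 "LoRA" updates of one 32x32 matrix.
rng = np.random.default_rng(0)
true_base = rng.normal(size=(32, 32))
finetuned = [
    true_base + rng.normal(size=(32, 2)) @ rng.normal(size=(2, 32))
    for _ in range(8)
]
estimate = recover_base(finetuned, rank=2)
naive_error = np.abs(np.mean(finetuned, axis=0) - true_base).mean()
error = np.abs(estimate - true_base).mean()
```

Comparing `error` with `naive_error` shows how much the alternating procedure improves on simply averaging the fine-tuned weights.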
Spectral DeTuning successfully recovers pre-fine-tuning weights with high precision on various models, including ViT, Stable Diffusion, and Mistral-7B.
The method effectively reverses alignment training in Mistral-7B, restoring pre-fine-tuning generation capabilities.
Stable Diffusion LoRAs obtained from online marketplaces are also vulnerable, demonstrating the real-world applicability of this attack. |
Spectral DeTuning requires several LoRA models with the same rank to be effective.
The paper primarily focuses on LoRA and does not address other fine-tuning techniques. |
model security, lora, fine-tuning, weight recovery, alignment attack |
2402.10193
Report |
BitDelta: Your Fine-Tune May Only Be Worth One Bit |
James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai |
Large Language Models (LLMs) are typically trained in two phases:
pre-training on large internet-scale datasets, and fine-tuning for downstream
tasks. Given the higher computational demand of pre-training, it's intuitive to
assume that fine-tuning adds less new information to the model, and is thus
more compressible. We explore this assumption by decomposing the weights of
fine-tuned models into their pre-trained components and an additional delta. We
introduce a simple method, BitDelta, which successfully quantizes this delta
down to 1 bit without compromising performance. This interesting finding not
only highlights the potential redundancy of information added during
fine-tuning, but also has significant implications for the multi-tenant serving
and multi-tenant storage of fine-tuned models. By enabling the use of a single
high-precision base model accompanied by multiple 1-bit deltas, BitDelta
dramatically reduces GPU memory requirements by more than 10x, which can also
translate into lower generation latency in multi-tenant settings. We
validate BitDelta through experiments across Llama-2 and Mistral model
families, and on models up to 70B parameters, showcasing minimal performance
degradation over all tested settings. |
BitDelta quantizes the weight delta between fine-tuned and pre-trained LLMs down to 1 bit without hurting performance. |
Storing and serving numerous fine-tuned LLMs is expensive. BitDelta addresses this by compressing the fine-tuning information (the delta) significantly. |
BitDelta first quantizes the delta to 1-bit by taking the sign. It then calibrates per-matrix scaling factors via distillation on a small dataset. |
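The sign-plus-scale step can be sketched as follows for a single weight matrix flattened to a list. Initializing the scale as the mean absolute delta is an assumption made here for illustration, and the distillation-based calibration of the scale is omitted; the function names are hypothetical.

```python
def quantize_delta(finetuned, base):
    """Compress the fine-tuning delta to one sign bit per weight plus one scale."""
    delta = [f - b for f, b in zip(finetuned, base)]
    signs = [1 if d >= 0 else -1 for d in delta]        # 1 bit per weight
    scale = sum(abs(d) for d in delta) / len(delta)      # assumed init: mean |delta|
    return signs, scale

def reconstruct(base, signs, scale):
    """Approximate the fine-tuned weights from the base, sign bits, and scale."""
    return [b + scale * s for b, s in zip(base, signs)]

base = [0.5, -1.0, 2.0, 0.0]
finetuned = [0.75, -1.25, 2.5, -0.5]
signs, scale = quantize_delta(finetuned, base)
approx = reconstruct(base, signs, scale)
```

Serving many fine-tunes then needs one high-precision copy of `base` plus a `(signs, scale)` pair per fine-tuned matrix.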
Quantizing the delta to 1-bit leads to over 10x compression.
BitDelta achieves comparable performance to the original fine-tuned models across various tasks, model families (Llama-2, Mistral), and sizes (7B-70B).
Preliminary results with a custom Triton kernel show that BitDelta can lead to a 2x speedup in multi-tenant serving latency. |
The current implementation of the efficient inference kernel is not fully optimized.
Potential alignment degradation due to the lossy compression of fine-tuning information needs further investigation. |
model compression, quantization, large language models, multi-tenant serving, parameter-efficient fine-tuning |
2402.10093
Report |
MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations |
Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter |
We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning
boost for pre-trained MIM models. The motivation behind MIM-Refiner is rooted
in the insight that optimal representations within MIM models generally reside
in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive
heads that are connected to diverse intermediate layers. In each head, a
modified nearest neighbor objective helps to construct respective semantic
clusters.
The refinement process is short but effective. Within a few epochs, we refine
the features of MIM models from subpar to state-of-the-art, off-the-shelf
features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K,
achieves new state-of-the-art results in linear probing (84.7%) and low-shot
classification among models that are pre-trained on ImageNet-1K. In ImageNet-1K
1-shot classification, MIM-Refiner sets a new state-of-the-art of 64.2%,
outperforming larger models that were trained on up to 2000x more data such as
DINOv2-g, OpenCLIP-G and MAWS-6.5B. Project page:
https://ml-jku.github.io/MIM-Refiner |
Introduces MIM-Refiner, a method using contrastive learning to boost pre-trained Masked Image Modeling (MIM) models by leveraging representations in intermediate layers. |
MIM models often have subpar representations in later encoder blocks due to their lightweight decoders, limiting downstream task performance. |
MIM-Refiner attaches multiple contrastive heads to intermediate encoder blocks, including those with peak representation quality. It employs Nearest Neighbor Alignment (NNA), aligning each sample with its nearest neighbor while repelling others, to enforce semantic clusters. |
Refined MIM models achieve state-of-the-art linear probing (84.7%) and low-shot classification on ImageNet-1K among models pre-trained on the same dataset.
MIM-Refiner advances ImageNet-1K 1-shot classification to 64.2%, surpassing larger models trained on significantly more data.
Significantly improved clustering performance, as measured by metrics like ACC and NMI, indicating better-defined semantic clusters. |
Reliance on batch normalization in ID heads limits scalability to distributed setups.
Exploration of alternative solutions to the lightweight decoder issue, such as larger decoders or different training schemes. |
self-supervised learning, masked image modeling, contrastive learning, instance discrimination, vision transformer |
2402.09966
Report |
Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation |
Junjie Shentu, Matthew Watson, Noura Al Moubayed |
Subject-driven text-to-image diffusion models empower users to tailor the
model to new concepts absent in the pre-training dataset using a few sample
images. However, prevalent subject-driven models primarily rely on
single-concept input images, facing challenges in specifying the target concept
when dealing with multi-concept input images. To this end, we introduce a
textual localized text-to-image model (Textual Localization) to handle
multi-concept input images. During fine-tuning, our method incorporates a novel
cross-attention guidance to decompose multiple concepts, establishing distinct
connections between the visual representation of the target concept and the
identifier token in the text prompt. Experimental results reveal that our
method outperforms or performs comparably to the baseline models in terms of
image fidelity and image-text alignment on multi-concept input images. In
comparison to Custom Diffusion, our method with hard guidance achieves CLIP-I
scores that are 7.04%, 8.13% higher and CLIP-T scores that are 2.22%, 5.85%
higher in single-concept and multi-concept generation, respectively. Notably,
our method generates cross-attention maps consistent with the target concept in
the generated images, a capability absent in existing models. |
This paper introduces Textual Localization, a subject-driven text-to-image model designed to handle multi-concept input images for generating customized images containing new concepts. |
Existing subject-driven text-to-image models struggle to specify target concepts within multi-concept images, often generating all concepts present in the input. |
Textual Localization incorporates cross-attention guidance during fine-tuning to decompose multi-concept images, establishing distinct connections between the visual representation of the target concept and its identifier token in the text prompt. Two guidance strategies are explored: hard and soft guidance. |
The method outperforms or performs comparably to baseline models in terms of image fidelity and image-text alignment in both single-concept and multi-concept generation.
Hard guidance proves particularly effective for multi-concept generation, achieving superior image fidelity and accurately outlining target concepts in cross-attention maps.
Optimizing specific parameters (Wk and Wv matrices in cross-attention layers) is identified as crucial for balancing visual representation learning and semantic knowledge preservation. |
The model exhibits limitations in capturing intricate details of target concepts.
Future work will focus on enhancing detail capture, potentially by using more powerful feature extractors, and improving multi-concept generation success rates via guiding techniques during inference. |
text-to-image generation, subject-driven generation, diffusion models, cross-attention guidance, multi-concept images |
2402.09872
Report |
Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community |
Arman Isajanyan, Artur Shatveryan, David Kocharyan, Zhangyang Wang, Humphrey Shi |
Social reward as a form of community recognition provides a strong source of
motivation for users of online platforms to engage and contribute with content.
The recent progress of text-conditioned image synthesis has ushered in a
collaborative era where AI empowers users to craft original visual artworks
seeking community validation. Nevertheless, assessing these models in the
context of collective community preference introduces distinct challenges.
Existing evaluation methods predominantly center on limited size user studies
guided by image quality and prompt alignment. This work pioneers a paradigm
shift, unveiling Social Reward - an innovative reward modeling framework that
leverages implicit feedback from social network users engaged in creative
editing of generated images. We embark on an extensive journey of dataset
curation and refinement, drawing from Picsart: an online visual creation and
editing platform, yielding the first million-user-scale dataset of implicit human
preferences for user-generated visual art named Picsart Image-Social. Our
analysis exposes the shortcomings of current metrics in modeling community
creative preference of text-to-image models' outputs, compelling us to
introduce a novel predictive model explicitly tailored to address these
limitations. Rigorous quantitative experiments and a user study show that our
Social Reward model aligns better with social popularity than existing metrics.
Furthermore, we utilize Social Reward to fine-tune text-to-image models,
yielding images that are more favored by not only Social Reward, but also other
established metrics. These findings highlight the relevance and effectiveness
of Social Reward in assessing community appreciation for AI-generated artworks,
establishing a closer alignment with users' creative goals: creating popular
visual art. Codes can be accessed at
https://github.com/Picsart-AI-Research/Social-Reward |
This work introduces "Social Reward", a novel reward modeling framework for text-to-image synthesis that leverages implicit feedback from social network users engaged in creative editing of generated images. |
Assessing text-to-image models in the context of collective community preference is crucial, especially for creative editing, but existing methods are limited by small user studies focused on image quality and prompt alignment. |
The authors curate a million-user-scale dataset of implicit human preferences from Picsart, a visual creation platform, and develop a predictive model that leverages collective implicit feedback from users who employ generated images for creative purposes. |
Existing metrics fall short in capturing community-level creative preference for text-to-image model outputs.
The Social Reward model outperforms existing metrics in predicting social popularity of generated images for creative editing.
Fine-tuning text-to-image models with Social Reward improves alignment with both Social Reward and other established metrics. |
Social Reward is currently focused on creative editing and might not generalize to other domains.
Further research is needed to investigate the impact of specific editing actions on Social Reward. |
text-to-image synthesis, reward modeling, social network popularity, creative editing, human preference learning |
2402.09812
Report |
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization |
Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, Seunggyu Chang |
The objective of text-to-image (T2I) personalization is to customize a
diffusion model to a user-provided reference concept, generating diverse images
of the concept aligned with the target prompts. Conventional methods
representing the reference concepts using unique text embeddings often fail to
accurately mimic the appearance of the reference. To address this, one solution
may be explicitly conditioning the reference images into the target denoising
process, known as key-value replacement. However, prior works are constrained
to local editing since they disrupt the structure path of the pre-trained T2I
model. To overcome this, we propose a novel plug-in method, called
DreamMatcher, which reformulates T2I personalization as semantic matching.
Specifically, DreamMatcher replaces the target values with reference values
aligned by semantic matching, while leaving the structure path unchanged to
preserve the versatile capability of pre-trained T2I models for generating
diverse structures. We also introduce a semantic-consistent masking strategy to
isolate the personalized concept from irrelevant regions introduced by the
target prompts. Compatible with existing T2I models, DreamMatcher shows
significant improvements in complex scenarios. Intensive analyses demonstrate
the effectiveness of our approach. |
DreamMatcher is a novel plug-in method for text-to-image (T2I) personalization that enhances subject appearance by transferring reference appearance while preserving diverse structures guided by target prompts. |
Existing T2I personalization methods often fail to accurately mimic the appearance of subjects, especially in complex non-rigid scenarios, due to the limited expressivity of text embeddings or disruptions to the target structure path. |
DreamMatcher leverages semantic matching within a reference-target dual-branch framework. It utilizes appearance matching self-attention (AMA) to align reference appearance with the target structure while maintaining the pre-trained structure path. It also introduces semantic matching guidance to enhance fine-grained subject details and a semantic-consistent masking strategy to isolate the personalized concept from irrelevant regions. |
DreamMatcher significantly improves subject fidelity compared to existing T2I personalization methods, including Textual Inversion, DreamBooth, and CustomDiffusion, while effectively preserving prompt fidelity.
DreamMatcher outperforms previous tuning-free plug-in methods, such as FreeU and MagicFusion, and even a learnable method, ViCo, in both quantitative metrics and user studies.
Ablation studies confirm the effectiveness of each component, highlighting the importance of semantic matching, consistent masking, and matching guidance for achieving high-fidelity personalization. |
DreamMatcher may not handle stylization prompts that are not present in the reference images.
The personalization quality can be affected by the selection of reference images, with reference images containing richer visual attributes leading to better results. |
text-to-image personalization, diffusion models, semantic matching, appearance transfer, plug-in methods |
2402.09712
Report |
Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement |
Tao Yang, Cuiling Lan, Yan Lu, Nanning Zheng |
Disentangled representation learning strives to extract the intrinsic factors
within observed data. Factorizing these representations in an unsupervised
manner is notably challenging and usually requires tailored loss functions or
specific structural designs. In this paper, we introduce a new perspective and
framework, demonstrating that diffusion models with cross-attention can serve
as a powerful inductive bias to facilitate the learning of disentangled
representations. We propose to encode an image to a set of concept tokens and
treat them as the condition of the latent diffusion for image reconstruction,
where cross-attention over the concept tokens is used to bridge the interaction
between the encoder and diffusion. Without any additional regularization, this
framework achieves superior disentanglement performance on the benchmark
datasets, surpassing all previous methods with intricate designs. We have
conducted comprehensive ablation studies and visualization analysis, shedding
light on the functioning of this model. This is the first work to reveal the
potent disentanglement capability of diffusion models with cross-attention,
requiring no complex designs. We anticipate that our findings will inspire more
investigation on exploring diffusion for disentangled representation learning
towards more sophisticated data analysis and understanding. |
This paper, for the first time, shows that diffusion models with cross-attention can serve as a strong inductive bias for learning disentangled representations. |
Disentangled representation learning is crucial for enhancing interpretability, generalizability, and controllability of machine learning models but remains a challenging task. |
The paper proposes EncDiff, a simple framework where an image encoder transforms an image into concept tokens, which condition a latent diffusion model with cross-attention for image reconstruction. |
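To make the conditioning mechanism concrete, here is a minimal sketch of cross-attention in which spatial features act as queries and concept tokens act as both keys and values. The learned projection matrices of a real attention layer are omitted, and the toy vectors are assumptions for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, tokens):
    """Each query (spatial feature) attends over the concept tokens."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every concept token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted mix of the concept tokens.
        outputs.append([sum(wj * t[k] for wj, t in zip(w, tokens)) for k in range(dim)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0]]    # two hypothetical concept tokens
queries = [[5.0, 0.0], [0.0, 5.0]]   # two hypothetical spatial features
outputs, weights = cross_attention(queries, tokens)
```

Each row of `weights` shows which concept token a spatial location draws on, which is the kind of alignment the visualization analysis inspects.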
EncDiff achieves state-of-the-art disentanglement performance on benchmark datasets, outperforming previous methods with complex designs.
Ablation studies confirm that both diffusion modeling and cross-attention interaction are crucial for the disentanglement capability.
Visualization analysis provides insights into the alignment between learned concept tokens and spatial features, verifying the disentanglement effectiveness. |
While effective on simple datasets, achieving satisfactory disentanglement on complex data remains a challenge.
Although faster than some diffusion-based methods, the generation speed is still slower compared to VAE-based and GAN-based methods, requiring more efficient sampling strategies in the future. |
disentangled representation learning, diffusion models, cross-attention, inductive bias, unsupervised learning |
2402.09368
Report |
Magic-Me: Identity-Specific Video Customized Diffusion |
Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, Jiashi Feng |
Creating content with specified identities (ID) has attracted significant
interest in the field of generative models. In the field of text-to-image
generation (T2I), subject-driven creation has achieved great progress with the
identity controlled via reference images. However, its extension to video
generation is not well explored. In this work, we propose a simple yet
effective subject identity controllable video generation framework, termed
Video Custom Diffusion (VCD). With a specified identity defined by a few
images, VCD reinforces the identity characteristics and injects frame-wise
correlation at the initialization stage for stable video outputs. To achieve
this, we propose three novel components that are essential for high-quality
identity preservation and stable video generation: 1) a noise initialization
method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID
module based on extended Textual Inversion trained with the cropped identity to
disentangle the ID information from the background; 3) Face VCD and Tiled VCD
modules to reinforce faces and upscale the video to higher resolution while
preserving the identity's features. We conducted extensive experiments to
verify that VCD generates stable videos with better identity preservation than
the baselines. In addition, because the identity encoded in the ID module is
transferable, VCD also works well with publicly available personalized
text-to-image models. The code is available at
https://github.com/Zhen-Dong/Magic-Me. |
This paper proposes Video Custom Diffusion (VCD), a novel framework for generating high-quality, identity-specific videos. |
Creating videos with specific identities is challenging due to the difficulty of maintaining identity features and motion consistency across frames. |
VCD uses a three-stage approach: 1) T2V VCD generates initial low-resolution videos, 2) Face VCD enhances facial features, and 3) Tiled VCD upscales the video while preserving identity. A 3D Gaussian Noise Prior ensures stable motion, and an extended Textual Inversion-based ID module preserves identity. |
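One simple way to inject frame-wise correlation into the initial noise is to mix a noise map shared by all frames with independent per-frame noise. The sketch below illustrates that idea only; the mixing rule and the parameter `rho` are assumptions and may differ from the exact covariance used by the 3D Gaussian Noise Prior.

```python
import math
import random

def correlated_frame_noise(num_frames, size, rho, seed=0):
    """Sample unit-variance noise per frame with pairwise correlation `rho`."""
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(size)]   # common to all frames
    frames = []
    for _ in range(num_frames):
        own = [rng.gauss(0.0, 1.0) for _ in range(size)]  # frame-specific part
        frames.append([
            math.sqrt(rho) * s + math.sqrt(1.0 - rho) * o
            for s, o in zip(shared, own)
        ])
    return frames

# Hypothetical setting: 8 frames, 16 noise values each, strong correlation.
noise = correlated_frame_noise(num_frames=8, size=16, rho=0.8)
```

With `rho = 0` the frames are independent, and with `rho = 1` every frame starts from the same noise, so intermediate values trade motion diversity against inter-frame stability.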
VCD generates videos with better identity preservation and text alignment compared to baselines.
The 3D Gaussian Noise Prior significantly improves temporal consistency in generated videos.
The ID module effectively disentangles identity information while aligning with user prompts. |
VCD faces challenges in generating videos with multiple interacting identities.
The framework is currently limited to generating short videos due to the motion module's capacity. |
video generation, diffusion models, identity customization, text-to-video, motion consistency |
2402.09240
Report |
Switch EMA: A Free Lunch for Better Flatness and Sharpness |
Siyuan Li, Zicheng Liu, Juanxi Tian, Ge Wang, Zedong Wang, Weiyang Jin, Di Wu, Cheng Tan, Tao Lin, Yang Liu, Baigui Sun, Stan Z. Li |
Exponential Moving Average (EMA) is a widely used weight averaging (WA)
regularization to learn flat optima for better generalizations without extra
cost in deep neural network (DNN) optimization. Despite achieving better
flatness, existing WA methods might end up with worse final performance or
require extra test-time computations. This work unveils the full potential of
EMA with a single line of modification, i.e., switching the EMA parameters to
the original model after each epoch, dubbed as Switch EMA (SEMA). From both
theoretical and empirical aspects, we demonstrate that SEMA can help DNNs to
reach generalization optima that better trade off flatness and
sharpness. To verify the effectiveness of SEMA, we conduct comparison
experiments with discriminative, generative, and regression tasks on vision and
language datasets, including image classification, self-supervised learning,
object detection and segmentation, image generation, video prediction,
attribute regression, and language modeling. Comprehensive results with popular
optimizers and networks show that SEMA is a free lunch for DNN training by
improving performances and boosting convergence speeds. |
This paper introduces Switch Exponential Moving Average (SEMA), a novel weight averaging method that enhances deep neural network optimization by dynamically switching between a fast model and a slow, exponentially averaged model during training. |
The proposed SEMA method aims to overcome limitations of existing weight averaging techniques by combining the fast convergence of EMA with the ability to explore sharper, potentially better, local minima, thus improving generalization capabilities of DNNs. |
SEMA leverages the EMA algorithm but, crucially, switches the model parameters to the EMA-averaged weights after each training epoch, allowing the optimizer to further explore the loss landscape from this new starting point. |
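The switching step can be sketched on a toy problem. The example below uses plain gradient descent on a quadratic loss; the optimizer, learning rate, decay value, and function names are assumptions chosen for illustration rather than the paper's training setup.

```python
def train_with_switch_ema(w, grad_fn, epochs, steps_per_epoch, lr=0.1, decay=0.9):
    """Gradient descent with an EMA copy that replaces the model after each epoch."""
    ema = list(w)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            grad = grad_fn(w)
            # Fast model: ordinary gradient step.
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
            # Slow model: exponential moving average of the fast weights.
            ema = [decay * e + (1.0 - decay) * wi for e, wi in zip(ema, w)]
        # The switch: the next epoch resumes from the averaged weights.
        w = list(ema)
    return w

def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * sum(w_i ** 2)."""
    return list(w)

final = train_with_switch_ema([4.0, -2.0], quadratic_grad, epochs=5, steps_per_epoch=10)
```

Standard EMA would keep `w` and `ema` separate for the whole run and only use `ema` at evaluation time; the single assignment at the end of each epoch is the only difference.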
SEMA consistently outperforms baseline models and other weight averaging techniques, including EMA and SWA, across a diverse range of tasks, such as image classification, self-supervised learning, object detection, and language modeling.
SEMA demonstrates faster convergence speeds compared to traditional training setups and EMA.
The paper provides theoretical analysis demonstrating SEMA's ability to reduce low-frequency oscillations and maintain a gradient descent property, contributing to its stability and fast convergence. |
The paper primarily focuses on empirical validation of SEMA, leaving further theoretical exploration of its properties and behavior in different optimization landscapes for future work.
While the one-epoch switching interval proves effective across various tasks, exploring task-specific optimal intervals might further enhance SEMA's performance. |
deep neural networks, optimization, regularization, weight averaging, exponential moving average |
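The switching step described in the SEMA record above can be sketched in a few lines. This is a minimal toy on a plain list of floats, not the authors' implementation: the function name `train_sema`, the `grad_fn` callback, and the default hyperparameters are all hypothetical, and plain gradient descent stands in for whatever optimizer is actually used.

```python
def train_sema(params, grad_fn, lr=0.1, momentum=0.9, epochs=3, steps_per_epoch=10):
    """Toy Switch-EMA loop on a list of float parameters.

    grad_fn(params) -> list of gradients (hypothetical callback).
    Returns the final fast parameters and the EMA parameters.
    """
    ema = list(params)  # slow, exponentially averaged copy of the weights
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            grads = grad_fn(params)
            # Ordinary gradient step on the fast model.
            params = [p - lr * g for p, g in zip(params, grads)]
            # Standard EMA update of the slow model.
            ema = [momentum * e + (1 - momentum) * p for e, p in zip(ema, params)]
        # The one-line modification: after each epoch, the fast model
        # restarts from the EMA weights.
        params = list(ema)
    return params, ema


# Usage on the quadratic loss f(p) = p^2, whose gradient is 2p.
final_params, final_ema = train_sema([1.0], lambda p: [2 * x for x in p])
```

Without the last assignment inside the epoch loop, the sketch reduces to ordinary EMA, where the averaged weights are only read out at evaluation time.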
2402.09052
Report |
L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects |
Yutaro Yamada, Khyathi Chandu, Yuchen Lin, Jack Hessel, Ilker Yildirim, Yejin Choi |
Diffusion-based image generation models such as DALL-E 3 and Stable
Diffusion-XL demonstrate remarkable capabilities in generating images with
realistic and unique compositions. Yet, these models are not robust in
precisely reasoning about physical and spatial configurations of objects,
especially when instructed with unconventional, thereby out-of-distribution
descriptions, such as "a chair with five legs". In this paper, we propose a
language agent with chain-of-3D-thoughts (L3GO), an inference-time approach
that can reason about part-based 3D mesh generation of unconventional objects
that current data-driven diffusion models struggle with. More concretely, we
use large language models as agents to compose a desired object via
trial-and-error within the 3D simulation environment. To facilitate our
investigation, we develop a new benchmark, Unconventionally Feasible Objects
(UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender
where language agents can build and compose atomic building blocks via API
calls. Human and automatic GPT-4V evaluations show that our approach surpasses
the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D
mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our
approach outperforms other state-of-the-art text-to-2D image and text-to-3D
models based on human evaluation. |
This paper introduces L3GO, an inference-time approach that uses language agents with chain-of-3D-thoughts to generate unconventional objects in a 3D environment. |
Existing diffusion-based image generation models struggle to accurately generate objects from unconventional descriptions requiring precise 3D spatial understanding. L3GO leverages the reasoning capabilities of LLMs to address this. |
L3GO decomposes object creation into iterative part-by-part generation, utilizing LLMs as agents for part specification, spatial reasoning, coordinate calculation, action execution, and critique within SimpleBlenv, a custom wrapper environment built on top of Blender. |
L3GO outperforms baseline LLM agents (GPT-4, ReAct-B, Reflexion-B) and achieves higher accuracy in generating 3D meshes on ShapeNet based on human and GPT-4V evaluation.
Human evaluation on a new benchmark 'Unconventionally Feasible Objects (UFO)' shows L3GO surpasses state-of-the-art text-to-2D image (DALL-E 3, SDXL) and text-to-3D models (Shap-E).
Ablation studies show the importance of the spatial critic and program-based coordinate calculation modules in L3GO. |
The quality of LLM-generated 3D meshes is not yet on par with human-designed meshes or those from diffusion-based methods.
The object generation process can be time-consuming, particularly for complex objects, highlighting the need for efficiency improvements. |
3d object generation, language agents, chain-of-thought, blender, unconventional objects |
2402.08960
Report |
Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision |
Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu |
Contemporary cutting-edge open-vocabulary segmentation approaches commonly
rely on image-mask-text triplets, yet this restricted annotation is
labour-intensive and encounters scalability hurdles in complex real-world
scenarios. Although some methods are proposed to reduce the annotation cost
with only text supervision, the incompleteness of supervision severely limits
the versatility and performance. In this paper, we liberate the strict
correspondence between masks and texts by using independent image-mask and
image-text pairs, which can be easily collected respectively. With this
unpaired mask-text supervision, we propose a new weakly-supervised
open-vocabulary segmentation framework (Uni-OVSeg) that leverages confident
pairs of mask predictions and entities in text descriptions. Using the
independent image-mask and image-text pairs, we predict a set of binary masks
and associate them with entities by resorting to the CLIP embedding space.
However, the inherent noise in the correspondence between masks and entities
poses a significant challenge when obtaining reliable pairs. In light of this,
we advocate using the large vision-language model (LVLM) to refine text
descriptions and devise a multi-scale ensemble to stabilise the matching between
masks and entities. Compared to text-only weakly-supervised methods, our
Uni-OVSeg achieves substantial improvements of 15.5% mIoU on the ADE20K
dataset, and even surpasses fully-supervised methods on the challenging PASCAL
Context-459 dataset. |
This paper proposes Uni-OVSeg, a novel weakly-supervised open-vocabulary segmentation framework that utilizes unpaired image-mask and image-text pairs for training, significantly reducing the annotation cost associated with traditional image-mask-text triplets. |
Open-vocabulary segmentation, crucial for segmenting and categorizing objects from an extensive vocabulary, often relies on expensive and difficult-to-obtain image-mask-text triplets. Uni-OVSeg addresses this challenge by leveraging more readily available unpaired data sources. |
Uni-OVSeg consists of a mask generation branch (using a visual prompt encoder, pixel decoder, and mask decoder) to predict binary masks from images. Concurrently, it employs a large vision-language model (LLaVA) for text refinement and a ChatGPT-based parser for entity extraction from image captions. A mask-text bipartite matching aligns predicted masks with text entities, aided by a multi-scale feature adapter and ensemble strategy for robust correspondence. |
Uni-OVSeg achieves substantial improvements over weakly-supervised methods, with a 15.5% mIoU gain on ADE20K.
It even surpasses fully-supervised methods on the challenging PASCAL Context-459 dataset, demonstrating its strong open-vocabulary capability.
In promptable segmentation tasks, Uni-OVSeg consistently outperforms SAM, showcasing its efficacy in interactive segmentation with point and box prompts. |
The inherent noise in unpaired mask-text correspondence presents a challenge, although mitigated by the multi-scale ensemble strategy.
The use of multiple granularity masks in the image-mask training data impacts performance on panoptic segmentation tasks requiring specific instance differentiation. |
open-vocabulary segmentation, weakly-supervised learning, vision-language models, promptable segmentation, multi-scale ensemble |
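The mask-text bipartite matching mentioned in the Uni-OVSeg record above can be illustrated with a small sketch. It assumes each predicted mask and each text entity has already been embedded in a shared space (the record mentions the CLIP embedding space) and picks the one-to-one assignment with the highest total cosine similarity. The function names are hypothetical, the search is brute force and only suitable for tiny inputs, and the multi-scale ensemble is not modelled.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_masks_to_entities(mask_embs, entity_embs):
    """Return (mask_index, entity_index) pairs maximizing total similarity.

    Assumes len(entity_embs) <= len(mask_embs); every entity gets one mask.
    """
    sim = [[cosine(m, e) for e in entity_embs] for m in mask_embs]
    best_score, best_pairs = -math.inf, []
    # Try every way of assigning distinct masks to the entities.
    for mask_ids in permutations(range(len(mask_embs)), len(entity_embs)):
        score = sum(sim[m][e] for e, m in enumerate(mask_ids))
        if score > best_score:
            best_score = score
            best_pairs = [(m, e) for e, m in enumerate(mask_ids)]
    return best_pairs


# Three mask embeddings, two entity embeddings (toy 2-D vectors).
pairs = match_masks_to_entities([[1, 0], [0, 1], [1, 1]], [[0, 2], [3, 0]])
```

For realistic numbers of masks, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation loop.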
2402.08919
Report |
Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding |
Alessandro Achille, Greg Ver Steeg, Tian Yu Liu, Matthew Trager, Carson Klingenberg, Stefano Soatto |
Quantifying the degree of similarity between images is a key copyright issue
for image-based machine learning. In legal doctrine however, determining the
degree of similarity between works requires subjective analysis, and
fact-finders (judges and juries) can demonstrate considerable variability in
these subjective judgement calls. Images that are structurally similar can be
deemed dissimilar, whereas images of completely different scenes can be deemed
similar enough to support a claim of copying. We seek to define and compute a
notion of "conceptual similarity" among images that captures high-level
relations even among images that do not share repeated elements or visually
similar components. The idea is to use a base multi-modal model to generate
"explanations" (captions) of visual data at increasing levels of complexity.
Then, similarity can be measured by the length of the caption needed to
discriminate between the two images: Two highly dissimilar images can be
discriminated early in their description, whereas conceptually similar ones
will need more detail to be distinguished. We operationalize this definition
and show that it correlates with subjective (averaged human evaluation)
assessment, and beats existing baselines on both image-to-image and
text-to-text similarity benchmarks. Beyond just providing a number, our method
also offers interpretability by pointing to the specific level of granularity
of the description where the source data are differentiated. |
This paper introduces CC:DAE, a novel method for measuring "conceptual similarity" between data samples like images and text, focusing on shared high-level concepts rather than just pixel-level visual similarities. |
Defining objective similarity is crucial for copyright in machine learning, but existing methods struggle to capture human-like understanding of conceptual relationships. |
CC:DAE generates textual descriptions of increasing complexity for each sample using a pre-trained language model. It then quantifies similarity based on how well descriptions of one sample fit the other at varying complexity levels. A small distance at high complexity signifies high conceptual similarity. |
CC:DAE outperforms existing zero-shot methods on text similarity benchmarks, aligning better with human judgments.
It surpasses CLIP on image similarity tasks, demonstrating its ability to capture conceptual relations beyond visual features.
The method generalizes to cross-modal comparisons, effectively measuring similarity between text and images. |
The current implementation relies solely on text descriptions, limiting its ability to capture visual arrangement similarities.
Future work could explore incorporating visual features into the description space for a more comprehensive approach. |
conceptual similarity, copyright, multi-modal learning, language models, image similarity |
2402.08875
Report |
Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos |
Yang Qian, Yinan Sun, Ali Kargarandehkordi, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington |
The increasing variety and quantity of tagged multimedia content on platforms
such as TikTok provides an opportunity to advance computer vision modeling. We
have curated a distinctive dataset of 283,582 unique video clips categorized
under 386 hashtags relating to modern human actions. We release this dataset as
a valuable resource for building domain-specific foundation models for human
movement modeling tasks such as action recognition. To validate this dataset,
which we name TikTokActions, we perform two sets of experiments. First, we
pretrain the state-of-the-art VideoMAEv2 with a ViT-base backbone on
a TikTokActions subset, and then fine-tune and evaluate on popular datasets such
as UCF101 and HMDB51. We find that the performance of the model pre-trained
using our TikTok dataset is comparable to models trained on larger action
recognition datasets (95.3% on UCF101 and 53.24% on HMDB51). Furthermore, our
investigation into the relationship between pre-training dataset size and
fine-tuning performance reveals that beyond a certain threshold, the
incremental benefit of larger training sets diminishes. This work introduces a
useful TikTok video dataset that is available for public use and provides
insights into the marginal benefit of increasing pre-training dataset sizes for
video-based foundation models. |
This paper investigates the use of a large, unlabeled dataset of TikTok videos for pre-training a foundation model (VideoMAEv2) for human action recognition. |
This is important because it leverages the diverse and dynamic nature of TikTok videos to improve action recognition in real-world scenarios and challenges the assumption that larger datasets are always better for pre-training. |
The authors curated a dataset of over 280,000 TikTok videos, pre-trained VideoMAEv2 on this dataset, and fine-tuned it on established benchmarks (UCF101, HMDB51, Kinetics-400, Something-Something V2). |
The fine-tuned model achieves performance comparable to models pre-trained on larger action recognition datasets (95.3% on UCF101 and 53.24% on HMDB51), demonstrating the effectiveness of using TikTok videos for pre-training.
The study found that while increasing the pre-training dataset size initially improves performance, the benefits diminish with further increases.
This suggests that a well-curated, smaller dataset can sometimes be more effective than a larger, more general one. |
The study acknowledges the ethical considerations of using online video data, particularly regarding privacy and informed consent.
Future work could explore the use of weakly self-supervised learning to further improve the model's adaptability to dynamic content. |
action recognition, foundation models, self-supervised learning, tiktok, video understanding |
2402.08714
Report |
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models |
Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou |
Reward finetuning has emerged as a promising approach to aligning foundation
models with downstream objectives. Remarkable success has been achieved in the
language domain by using reinforcement learning (RL) to maximize rewards that
reflect human preference. However, in the vision domain, existing RL-based
reward finetuning methods are limited by their instability in large-scale
training, rendering them incapable of generalizing to complex, unseen prompts.
In this paper, we propose Proximal Reward Difference Prediction (PRDP),
enabling stable black-box reward finetuning for diffusion models for the first
time on large-scale prompt datasets with over 100K prompts. Our key innovation
is the Reward Difference Prediction (RDP) objective that has the same optimal
solution as the RL objective while enjoying better training stability.
Specifically, the RDP objective is a supervised regression objective that tasks
the diffusion model with predicting the reward difference of generated image
pairs from their denoising trajectories. We theoretically prove that the
diffusion model that obtains perfect reward difference prediction is exactly
the maximizer of the RL objective. We further develop an online algorithm with
proximal updates to stably optimize the RDP objective. In experiments, we
demonstrate that PRDP can match the reward maximization ability of
well-established RL-based methods in small-scale training. Furthermore, through
large-scale training on text prompts from the Human Preference Dataset v2 and
the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a
diverse set of complex, unseen prompts whereas RL-based methods completely
fail. |
This paper introduces PRDP, the first black-box reward finetuning method for diffusion models that remains stable even when trained on large-scale datasets (100K+ prompts). |
Existing reinforcement learning (RL) based methods for finetuning diffusion models with rewards struggle to scale to large datasets due to instability during training, limiting their ability to generalize to complex and unseen prompts. |
PRDP addresses instability by: 1. Converting the RL objective into a supervised regression objective called Reward Difference Prediction (RDP), where the model predicts the reward difference between generated image pairs from their denoising trajectories. 2. Employing proximal updates and online optimization to further enhance training stability and generation quality. |
PRDP achieves comparable reward maximization to established RL-based methods in small-scale training.
PRDP demonstrates superior stability in large-scale training where RL-based methods fail.
PRDP generates higher quality images and generalizes better to unseen prompts after large-scale training. |
The per-prompt reward normalization, crucial for DDPO's stability, is ineffective in large-scale settings due to limited prompt occurrences.
Future work could explore techniques to make reward normalization more effective in large-scale scenarios. |
diffusion models, reward finetuning, text-to-image synthesis, reinforcement learning, stable training |
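The Reward Difference Prediction objective in the PRDP record above can be sketched as a squared-error loss on one image pair. The record only states that the model predicts the reward difference from the denoising trajectories. The specific parameterization below is an assumption borrowed from the standard KL-regularized RL result that the optimal policy satisfies r(x) = beta * log(pi(x) / pi_ref(x)) + const. The function name, arguments, and `beta` are hypothetical, and the proximal update is not shown.

```python
def rdp_loss(logp_a, logp_ref_a, logp_b, logp_ref_b, reward_a, reward_b, beta=1.0):
    """Squared error between predicted and observed reward difference.

    logp_a, logp_b: per-step log-probabilities of the two denoising
        trajectories under the model being finetuned.
    logp_ref_a, logp_ref_b: the same quantities under a frozen reference model.
    reward_a, reward_b: black-box rewards of the two final images.
    """
    # Trajectory-level log-likelihood ratios against the reference model.
    ratio_a = sum(logp_a) - sum(logp_ref_a)
    ratio_b = sum(logp_b) - sum(logp_ref_b)
    # Assumed form of the model's implicit reward-difference prediction.
    predicted_diff = beta * (ratio_a - ratio_b)
    return (predicted_diff - (reward_a - reward_b)) ** 2


# A model identical to the reference predicts zero difference,
# so the loss equals the squared true reward gap.
loss = rdp_loss([-1.0, -2.0], [-1.0, -2.0], [-0.5], [-0.5], 1.0, 0.0)
```

Because the target is a regression value rather than a policy-gradient estimate, the loss is zero exactly when the predicted difference matches the observed one, which is the property the record attributes to the RDP objective.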
2402.08682
Report |
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation |
Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos |
Most text-to-3D generators build upon off-the-shelf text-to-image models
trained on billions of images. They use variants of Score Distillation Sampling
(SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation
is to fine-tune the 2D generator to be multi-view aware, which can help
distillation or can be combined with reconstruction networks to output 3D
objects directly. In this paper, we further explore the design space of
text-to-3D models. We significantly improve multi-view generation by
considering video instead of image generators. Combined with a 3D
reconstruction algorithm which, by using Gaussian splatting, can optimize a
robust image-based loss, we directly produce high-quality 3D outputs from the
generated views. Our new method, IM-3D, reduces the number of evaluations of
the 2D generator network 10-100x, resulting in a much more efficient pipeline,
better quality, fewer geometric inconsistencies, and higher yield of usable 3D
assets. |
Introduces IM-3D, a text/image-to-3D generation approach that leverages iterative multiview diffusion and reconstruction using a video generator network and direct 3D fitting with Gaussian splatting, eliminating the need for Score Distillation Sampling (SDS) and reconstruction networks. |
Addresses limitations of SDS-based methods (slow, unstable, artifact-prone) and direct reconstruction methods (limited quality) by improving multi-view generation quality and efficiency. |
1. Fine-tune a text-to-video generator (Emu Video) on synthetic 3D data to generate consistent multi-view sequences. 2. Directly fit a 3D Gaussian splatting model to the generated views using robust image-based losses (LPIPS, MS-SSIM). 3. Iteratively refine the 3D model by feeding back rendered views to the video generator. |
Reduces the number of 2D generator evaluations by 10-100x compared to SDS, resulting in a much more efficient pipeline.
Achieves state-of-the-art text/image-to-3D generation quality, outperforming existing methods in faithfulness and visual quality.
Enables fast and robust 3D reconstruction without requiring training of large reconstruction networks. |
Struggles with highly dynamic subjects, sometimes generating spurious animations.
Relies on a synthetic dataset for training the multi-view video generator. |
text-to-3d, image-to-3d, video generation, gaussian splatting, multi-view consistency |
2402.08680
Report |
Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance |
Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu |
The advancement of Large Vision-Language Models (LVLMs) has increasingly
highlighted the critical issue of their tendency to hallucinate non-existing
objects in the images. To address this issue, previous works focused on using
specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the
outputs of LVLMs. However, these approaches require either expensive
training/fine-tuning or API access to advanced LLMs to correct the model's
output post-generation. In this paper, we tackle this challenge by introducing
a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE
(MARINE), which is both training-free and API-free, and can effectively and
efficiently reduce object hallucinations during the generation process.
Specifically, MARINE enriches the visual context of LVLMs by integrating
existing open-source vision models, and employs classifier-free guidance to
incorporate the additional object grounding features to improve the precision
of LVLMs' generations. Through comprehensive evaluations across 6 popular
LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of
MARINE, which even outperforms existing fine-tuning-based methods. Remarkably,
it not only reduces hallucinations but also improves the detailedness of LVLMs'
generations, as assessed by GPT-4V. |
This paper introduces MARINE, a training-free and API-free framework that mitigates object hallucinations in Large Vision-Language Models (LVLMs) during text generation by integrating object grounding features. |
Object hallucination, a critical issue in LVLMs where non-existing objects are described, compromises the accuracy and reliability of these models, especially in safety-critical applications. |
MARINE enriches the visual context of LVLMs by incorporating object grounding features from a pre-trained object detection model (DETR) and employs classifier-free guidance to control text generation, placing more importance on the enriched visual features. |
MARINE significantly reduces object hallucinations as measured by CHAIR and POPE metrics, outperforming existing methods.
The framework enhances the detailedness of LVLMs' generations, as assessed by GPT-4V.
MARINE strikes a balance between reducing hallucinations, maintaining computational efficiency, and preserving LLM originality. |
While the paper demonstrates MARINE with DETR, exploring other advanced vision encoders could further enhance its performance.
Further evaluation of MARINE on a wider range of benchmarks would be beneficial. |
large vision-language models, object hallucination, classifier-free guidance, object grounding, multi-modal generation |
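The classifier-free guidance step in the MARINE record above can be sketched at the level of next-token logits. The sketch assumes two forward passes per decoding step: one with the extra object grounding features ("conditional") and one with the original visual context only ("unconditional"). The combination rule is the standard classifier-free guidance formula. The function names and the guidance strength `gamma` are hypothetical, and the exact formulation used in the paper may differ.

```python
import math


def guided_logits(cond_logits, uncond_logits, gamma=0.5):
    """Combine the two passes: uncond + gamma * (cond - uncond).

    gamma = 0 ignores the grounding features, gamma = 1 uses the
    conditional pass as-is, gamma > 1 extrapolates beyond it.
    """
    return [u + gamma * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary: the grounded pass prefers token 0.
probs = softmax(guided_logits([2.0, 0.0], [1.0, 1.0], gamma=2.0))
```

Since the guidance only reweights logits at decoding time, no parameters of the underlying model are updated, which is consistent with the training-free claim in the record.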
2402.08678
Report |
Graph Mamba: Towards Learning on Graphs with State Space Models |
Ali Behrouz, Farnoosh Hashemi |
Graph Neural Networks (GNNs) have shown promising potential in graph
representation learning. The majority of GNNs define a local message-passing
mechanism, propagating information over the graph by stacking multiple layers.
These methods, however, are known to suffer from two major limitations:
over-squashing and poor capturing of long-range dependencies. Recently, Graph
Transformers (GTs) emerged as a powerful alternative to Message-Passing Neural
Networks (MPNNs). GTs, however, have quadratic computational cost, lack
inductive biases on graph structures, and rely on complex Positional/Structural
Encodings (SE/PE). In this paper, we show that while Transformers, complex
message-passing, and SE/PE are sufficient for good performance in practice,
neither is necessary. Motivated by the recent success of State Space Models
(SSMs), such as Mamba, we present Graph Mamba Networks (GMNs), a general
framework for a new class of GNNs based on selective SSMs. We discuss and
categorize the new challenges when adapting SSMs to graph-structured data, and
present four required and one optional steps to design GMNs, where we choose
(1) Neighborhood Tokenization, (2) Token Ordering, (3) Architecture of
Bidirectional Selective SSM Encoder, (4) Local Encoding, and dispensable (5) PE
and SE. We further provide theoretical justification for the power of GMNs.
Experiments demonstrate that despite much less computational cost, GMNs attain
an outstanding performance in long-range, small-scale, large-scale, and
heterophilic benchmark datasets. |
Presents Graph Mamba Networks (GMNs), a novel graph learning framework based on selective State Space Models (SSMs) like Mamba, to address limitations of Graph Neural Networks (GNNs) and Graph Transformers (GTs) in capturing long-range dependencies and scalability. |
GNNs struggle with long-range dependencies and GTs have high computational cost. GMNs offer an efficient and effective alternative. |
Introduces a 5-step recipe: (1) Tokenization: mapping the graph into a sequence of node/subgraph tokens. (2) Optional PE/SE: incorporating positional/structural encodings. (3) Local Encoding: encoding local structures around each node. (4) Token Ordering: ordering the sequence of tokens. (5) (Stack of) Bidirectional Mamba: scanning and selectively incorporating relevant nodes/subgraphs into hidden states. |
GMNs outperform baselines on benchmarks for long-range, small-scale, large-scale, and heterophilic graph datasets.
A variant of GMNs without complex components like Transformers, message-passing, and PE/SE achieves competitive performance, challenging their perceived necessity.
GMNs demonstrate superior memory efficiency compared to GTs, particularly on large graphs. |
The search space of hyperparameters is not fully explored, relying on a subspace for preliminary results.
Future work can investigate the integration of more sophisticated token ordering techniques. |
graph neural networks, graph transformers, state space models, mamba, long-range dependencies |
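Two of the steps in the Graph Mamba recipe above, tokenization and token ordering, can be sketched on an adjacency-list graph. The concrete choices here are illustrative assumptions rather than the paper's exact procedure: each node's token is its k-hop neighborhood found by breadth-first search, and tokens are ordered by ascending node degree before being fed to a sequence model. The function names are hypothetical.

```python
from collections import deque


def k_hop_neighborhood(adj, node, k):
    """Sorted list of nodes within k hops of `node`, found by BFS."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past k hops
        for neighbor in adj[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return sorted(seen)


def tokenize_and_order(adj, k=1):
    """One subgraph token per node, ordered by (degree, node id)."""
    tokens = {node: k_hop_neighborhood(adj, node, k) for node in adj}
    order = sorted(adj, key=lambda node: (len(adj[node]), node))
    return [(node, tokens[node]) for node in order]


# Star graph: node 1 is connected to nodes 0, 2, and 3.
star = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
sequence = tokenize_and_order(star, k=1)
```

The ordering matters because a state space model scans the sequence causally, so tokens placed later can draw on everything before them; a bidirectional scan, as named in the recipe, reduces that sensitivity.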
2402.08657
Report |
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs |
Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano |
Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown
immense potential by integrating large language models with vision systems.
Nevertheless, these models face challenges in the fundamental computer vision
task of object localisation, due to their training on multimodal data
containing mostly captions without explicit spatial grounding. While it is
possible to construct custom, supervised training pipelines with bounding box
annotations that integrate with VLMs, these result in specialized and
hard-to-scale models. In this paper, we aim to explore the limits of
caption-based VLMs and instead propose to tackle the challenge in a simpler
manner by i) keeping the weights of a caption-based VLM frozen and ii) not
using any supervised detection data. To this end, we introduce an
input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing
a minimal set of parameters that are slid inside the frozen VLM, unlocking
object localisation capabilities. Our PIN module is trained with a simple
next-token prediction task on synthetic data without requiring the introduction
of new output heads. Our experiments demonstrate strong zero-shot localisation
performances on a variety of images, including Pascal VOC, COCO, LVIS, and
diverse images like paintings or cartoons. |
This paper introduces PIN (Positional Insert), a lightweight learnable spatial prompt, to unlock zero-shot object localisation abilities in frozen caption-based Vision Language Models (VLMs). |
Existing VLMs, primarily trained on image-caption pairs, struggle with object localisation due to a lack of explicit spatial grounding in their training data. This work aims to address this limitation and enhance VLMs' spatial understanding. |
PIN is a spatial prompt added to the vision encoder's output, trained on synthetic data with a next-token prediction task to generate bounding box coordinates. This eliminates the need for supervised detection data or architectural changes to the VLM. |
PIN significantly improves object localisation in OpenFlamingo and BLIP-2 VLMs, outperforming baselines like in-context learning and other PEFT methods.
The method generalizes well to diverse images, including paintings, cartoons, and photos from the COCO, Pascal VOC, and LVIS datasets.
PIN shows promising zero-shot grounding capabilities on RefCOCO, achieving decent performance without using any annotated training data for this dataset. |
The model struggles with tight bounding box generation, especially for small objects, due to the low input resolution and simplistic training.
Localising multiple instances of the same object remains a challenge. |
vision-language models, object localisation, zero-shot learning, spatial prompt, synthetic data |
2402.08654
Report |
Learning Continuous 3D Words for Text-to-Image Generation |
Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix, Matthew Fisher, Radomir Mech, Andrew Markham, Niki Trigoni |
Current controls over diffusion models (e.g., through text or ControlNet) for
image generation fall short in recognizing abstract, continuous attributes like
illumination direction or non-rigid shape change. In this paper, we present an
approach for allowing users of text-to-image models to have fine-grained
control of several attributes in an image. We do this by engineering special
sets of input tokens that can be transformed in a continuous manner -- we call
them Continuous 3D Words. These attributes can, for example, be represented as
sliders and applied jointly with text prompts for fine-grained control over
image generation. Given only a single mesh and a rendering engine, we show that
our approach can be adopted to provide continuous user control over several
3D-aware attributes, including time-of-day illumination, bird wing orientation,
dolly zoom effect, and object poses. Our method is capable of conditioning image
creation with multiple Continuous 3D Words and text descriptions simultaneously
while adding no overhead to the generative process. Project Page:
https://ttchengab.github.io/continuous_3d_words |
Introduces 'Continuous 3D Words', special tokens for text-to-image models enabling fine-grained control over continuous 3D attributes like illumination, non-rigid shape change, orientation, and camera parameters. |
Current text-based and ControlNet image generation methods struggle to recognize and manipulate abstract, continuous 3D attributes. This work aims to bridge this gap by integrating the precision of 3D control with the accessibility of text-to-image models. |
Trains a continuous vocabulary using a two-stage fine-tuning approach. First, Dreambooth learns the object identity from a single 3D mesh. Second, an MLP maps continuous attribute values to token embeddings, disentangling attributes from object identity. ControlNet augmentation with depth/lineart conditions enhances background and texture diversity. |
Quantitative user studies demonstrate superior performance of 'Continuous 3D Words' over ControlNet baselines in controlling various attributes.
The method generalizes well, enabling attribute control on objects semantically similar to the training mesh.
Enables simultaneous control of multiple attributes, enhancing the expressiveness of text-to-image generation. |
User study reveals a preference for condition accuracy over physical plausibility in some cases.
Current model faces challenges with style transfer from text prompts and occasional overfitting to training mesh attributes. |
text-to-image generation, continuous control, 3d attributes, diffusion models, controlnet |
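The second training stage in the Continuous 3D Words record above, an MLP that maps a continuous attribute value to a token embedding, can be sketched as follows. Everything concrete here is an assumption for illustration: the sinusoidal input features, the layer sizes, the random untrained weights, and the function names. In practice the weights would be learned and the output dimension would match the text encoder's embedding size.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialised two-layer MLP weights (untrained, illustrative)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def attribute_to_embedding(value, mlp, num_freqs=2):
    """Map a scalar attribute (e.g. a slider in [0, 1]) to an embedding.

    The input dimension of the MLP must equal 1 + 2 * num_freqs.
    """
    w1, w2 = mlp
    # Sinusoidal features so nearby values give smoothly varying inputs.
    feats = [value]
    for i in range(num_freqs):
        feats.append(math.sin(2 ** i * math.pi * value))
        feats.append(math.cos(2 ** i * math.pi * value))
    # Hidden layer with ReLU, then a linear output layer.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, feats))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


mlp = make_mlp(in_dim=5, hidden_dim=8, out_dim=4)
embedding = attribute_to_embedding(0.25, mlp)
```

The resulting vector would stand in for a placeholder token in the prompt, so moving the slider changes only that token's embedding while the rest of the text conditioning stays fixed.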
2402.08622
Report |
NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs |
Michael Fischer, Zhengqin Li, Thu Nguyen-Phuoc, Aljaz Bozic, Zhao Dong, Carl Marshall, Tobias Ritschel |
A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry
and appearance of a scene. We here ask the question whether we can transfer the
appearance from a source NeRF onto a target 3D geometry in a semantically
meaningful way, such that the resulting new NeRF retains the target geometry
but has an appearance that is an analogy to the source NeRF. To this end, we
generalize classic image analogies from 2D images to NeRFs. We leverage
correspondence transfer along semantic affinity that is driven by semantic
features from large, pre-trained 2D image models to achieve multi-view
consistent appearance transfer. Our method allows exploring the mix-and-match
product space of 3D geometry and appearance. We show that our method
outperforms traditional stylization-based methods and that a large majority of
users prefer our method over several typical baselines. |
Introduces "NeRF analogies", a method for transferring visual appearance between NeRFs based on semantic affinity derived from ViT features. |
Addresses limitations of existing NeRF editing techniques by enabling combined, multi-view consistent, and semantically meaningful appearance transfer onto arbitrary 3D geometry. |
Leverages DiNO-ViT features to establish dense correspondences between source and target NeRF renderings, then trains a new NeRF to combine the target geometry with the transferred source appearance. |
Outperforms traditional stylization and image-analogy methods in transferring appearance while preserving semantic consistency.
Demonstrates superior multi-view consistency compared to 2D-based approaches, resulting in fewer artifacts and floaters.
Exhibits strong performance in user studies, with participants significantly preferring the method's output for its quality and semantic coherence. |
Reliance on accurate feature correspondences limits applicability to objects with rotational ambiguities or complex textures.
Inability to transfer texture due to the point-based appearance transfer approach. |
nerf, appearance transfer, semantic editing, vision transformer, 3d deep learning |
2402.08601
Report |
Latent Inversion with Timestep-aware Sampling for Training-free Non-rigid Editing |
Yunji Jung, Seokju Lee, Tair Djanibekov, Hyunjung Shim, Jong Chul Ye |
Text-guided non-rigid editing involves complex edits for input images, such
as changing motion or compositions within their surroundings. Since it requires
manipulating the input structure, existing methods often struggle with
preserving object identity and background, particularly when combined with
Stable Diffusion. In this work, we propose a training-free approach for
non-rigid editing with Stable Diffusion, aimed at improving the identity
preservation quality without compromising editability. Our approach comprises
three stages: text optimization, latent inversion, and timestep-aware text
injection sampling. Inspired by the recent success of Imagic, we employ their
text optimization for smooth editing. Then, we introduce latent inversion to
preserve the input image's identity without additional model fine-tuning. To
fully utilize the input reconstruction ability of latent inversion, we suggest
timestep-aware text injection sampling. This effectively retains the structure of
the input image by injecting the source text prompt in early sampling steps and
then transitioning to the target prompt in subsequent sampling steps. This
strategic approach seamlessly harmonizes with text optimization, facilitating
complex non-rigid edits to the input without losing the original identity. We
demonstrate the effectiveness of our method in terms of identity preservation,
editability, and aesthetic quality through extensive experiments. |
This paper proposes a training-free method for text-guided non-rigid image editing with Stable Diffusion, improving identity preservation without compromising editability. |
Existing non-rigid editing methods often struggle to preserve object identity and background, especially with Stable Diffusion, which limits practical applications. |
The method utilizes text optimization for smooth editing, latent inversion for identity preservation, and timestep-aware text injection sampling for balancing identity and editability. |
Outperforms baselines in qualitative comparisons, demonstrating superior identity preservation and edit fidelity.
Quantitative evaluation shows higher CLIP and Aesthetic scores, indicating better alignment with target text and improved aesthetics.
Ablation studies confirm the effectiveness of each component, particularly latent inversion and timestep-aware sampling. |
Limitations exist in handling compositions with multiple objects and preserving high-frequency details.
Future work includes exploring faster inversion methods and improving compositional editing capabilities. |
image editing, non-rigid editing, stable diffusion, latent inversion, text optimization |
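The timestep-aware text injection sampling described in the entry above (source prompt in the early sampling steps, target prompt afterwards) can be sketched as a simple prompt schedule. This is a minimal illustration, not the authors' implementation: the function name `prompt_schedule` and the `switch_fraction` parameter are hypothetical, and the hard switch between prompts is an assumption, since the summary does not say how the transition is made.

```python
def prompt_schedule(num_steps, switch_fraction=0.3):
    """Return, for each sampling step, which prompt conditions the denoiser.

    Steps are indexed in sampling order (0 = first, noisiest step).
    `switch_fraction` is a hypothetical knob: the share of early steps
    that use the source prompt before switching to the target prompt.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")
    num_source_steps = int(round(num_steps * switch_fraction))
    return ["source" if step < num_source_steps else "target"
            for step in range(num_steps)]


schedule = prompt_schedule(num_steps=10, switch_fraction=0.3)
```

In a real sampler, each label would select the corresponding text embedding passed to the denoising network at that step.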
2402.08577
Report |
Test-Time Backdoor Attacks on Multimodal Large Language Models |
Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin |
Backdoor attacks are commonly executed by contaminating training data, such
that a trigger can activate predetermined harmful effects during the test
phase. In this work, we present AnyDoor, a test-time backdoor attack against
multimodal large language models (MLLMs), which involves injecting the backdoor
into the textual modality using adversarial test images (sharing the same
universal perturbation), without requiring access to or modification of the
training data. AnyDoor employs similar techniques used in universal adversarial
attacks, but distinguishes itself by its ability to decouple the timing of
setup and activation of harmful effects. In our experiments, we validate the
effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4,
InstructBLIP, and BLIP-2, as well as provide comprehensive ablation studies.
Notably, because the backdoor is injected by a universal perturbation, AnyDoor
can dynamically change its backdoor trigger prompts/harmful effects, exposing a
new challenge for defending against backdoor attacks. Our project page is
available at https://sail-sg.github.io/AnyDoor/. |
The paper introduces "AnyDoor," a novel test-time backdoor attack against multimodal large language models (MLLMs) that injects backdoors during the test phase by leveraging adversarial test images, eliminating the need for training data manipulation. |
This work exposes a significant security vulnerability in MLLMs, demonstrating that their multimodal capabilities can be exploited for malicious purposes even without access to training data. |
AnyDoor employs techniques similar to universal adversarial attacks, generating a universal perturbation applied to input images that triggers harmful effects when a specific textual prompt is provided to the MLLM. |
AnyDoor successfully attacks popular MLLMs like LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2 across various datasets.
The attack remains effective with variations in trigger prompts and harmful outputs, posing challenges for defense mechanisms.
The authors demonstrate AnyDoor's robustness under common corruptions and its applicability in dynamic video scenarios. |
The current work mainly focuses on vision-language MLLMs. Investigating other modalities like audio/speech is left for future work.
While the physical demonstrations are currently conceptual, future research should explore robust defense mechanisms to mitigate potential real-world threats. |
multimodal large language models, test-time backdoor attacks, adversarial attacks, universal perturbations, model security |
2402.08265
Report |
A Dense Reward View on Aligning Text-to-Image Diffusion with Preference |
Shentao Yang, Tianqi Chen, Mingyuan Zhou |
Aligning text-to-image diffusion model (T2I) with preference has been gaining
increasing research attention. While prior works exist on directly optimizing
T2I by preference data, these methods are developed under the bandit assumption
of a latent reward on the entire diffusion reverse chain, while ignoring the
sequential nature of the generation process. This may harm the efficacy and
efficiency of preference alignment. In this paper, we take on a finer dense
reward perspective and derive a tractable alignment objective that emphasizes
the initial steps of the T2I reverse chain. In particular, we introduce
temporal discounting into DPO-style explicit-reward-free objectives, to break
the temporal symmetry therein and suit the T2I generation hierarchy. In
experiments on single and multiple prompt generation, our method is competitive
with strong relevant baselines, both quantitatively and qualitatively. Further
investigations are conducted to illustrate the insight of our approach. |
This paper introduces a novel method for aligning text-to-image diffusion models with preference by adopting a dense reward perspective and incorporating temporal discounting. |
The traditional trajectory-level reward assumption used in DPO-style methods for text-to-image diffusion models leads to a large decision space and the sparse reward problem, hampering training effectiveness and efficiency. This paper addresses this issue by considering a finer, dense reward structure. |
The authors derive a tractable alignment objective by assuming a latent reward function for each step of the diffusion reverse chain and introducing a temporal discount factor. This approach breaks the temporal symmetry in DPO-style losses and emphasizes the initial steps of the generation process, which are crucial for establishing image outlines and high-level attributes. The resulting objective is a lower bound of a Bradley-Terry preference model, leading to a tractable loss for training the model in an explicit-reward-free manner. |
The method achieves competitive quantitative and qualitative results on single and multiple prompt generation tasks, surpassing strong baselines in terms of preference-generating metrics (ImageReward and HPSv2) and unseen Aesthetic scores.
Further investigation reveals that the method effectively generates desired image shapes earlier in the reverse chain, supporting the hypothesis that emphasizing initial steps leads to improved final image quality.
Ablation studies demonstrate the impact of the temporal discount factor and the robustness of the method to the choice of the KL coefficient. |
The iterative data collection and model training procedure inherent to the off-policy learning routine introduces additional complexity and costs compared to purely offline methods.
Storing the entire generation reverse chains, as opposed to only the final images, increases CPU memory and storage requirements. |
text-to-image diffusion model, preference alignment, dense reward, direct preference optimization (dpo), sequential generation |
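The idea of adding temporal discounting to a DPO-style loss, as summarized in the entry above, can be sketched as follows. The summary does not give the exact objective, so this is an assumed form: per-step log-probability ratios (policy versus reference) are weighted by `gamma ** t`, with step 0 being the first step of the reverse chain, and the discounted margin between the preferred and rejected trajectories is passed through a Bradley-Terry style logistic loss. All names and default values here are hypothetical.

```python
import math


def discounted_preference_loss(logratio_win, logratio_lose, gamma=0.9, beta=1.0):
    """DPO-style loss with temporal discounting (illustrative sketch).

    logratio_win / logratio_lose: per-step values of
    log pi_theta(step) - log pi_ref(step) for the preferred and rejected
    trajectories, listed in generation order (index 0 = first reverse step).
    gamma < 1 gives more weight to the initial steps.
    """
    if len(logratio_win) != len(logratio_lose):
        raise ValueError("trajectories must have the same number of steps")
    margin = sum(
        (gamma ** t) * (w - l)
        for t, (w, l) in enumerate(zip(logratio_win, logratio_lose))
    )
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


early = discounted_preference_loss([1.0, 0.0], [0.0, 0.0], gamma=0.5)
late = discounted_preference_loss([0.0, 1.0], [0.0, 0.0], gamma=0.5)
```

With `gamma = 0.5`, the same advantage lowers the loss more when it occurs at the first step (`early`) than at the last step (`late`), which is the asymmetry the discount is meant to introduce.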
2402.08018
Report |
Nearest Neighbour Score Estimators for Diffusion Generative Models |
Matthew Niedoba, Dylan Green, Saeid Naderiparizi, Vasileios Lioutas, Jonathan Wilder Lavington, Xiaoxuan Liang, Yunpeng Liu, Ke Zhang, Setareh Dabiri, Adam Ścibior, Berend Zwartsenberg, Frank Wood |
Score function estimation is the cornerstone of both training and sampling
from diffusion generative models. Despite this fact, the most commonly used
estimators are either biased neural network approximations or high variance
Monte Carlo estimators based on the conditional score. We introduce a novel
nearest neighbour score function estimator which utilizes multiple samples from
the training set to dramatically decrease estimator variance. We leverage our
low variance estimator in two compelling applications. Training consistency
models with our estimator, we report a significant increase in both convergence
speed and sample quality. In diffusion models, we show that our estimator can
replace a learned network for probability-flow ODE integration, opening
promising new avenues of future research. |
This paper introduces a novel nearest neighbour score function estimator for diffusion generative models, leveraging multiple training samples to reduce variance. |
Score function estimation is crucial for training and sampling in diffusion models, but existing methods suffer from bias (neural networks) or high variance (Monte Carlo). |
The method uses self-normalized importance sampling with a proposal distribution based on k-nearest neighbors in the training set, exploiting the Gaussian nature of diffusion processes. |
The proposed estimator exhibits near-zero variance and bias, outperforming existing estimators and even a near-SoTA diffusion model on CIFAR-10.
Using the estimator in consistency models leads to faster convergence and better sample quality compared to single-sample baselines.
The estimator enables general probability flow ODE traversal and highlights the role of network bias in diffusion model generalization. |
The paper primarily focuses on the EDM diffusion process, with generalization to other processes requiring further investigation.
While the l2 distance used for nearest neighbour search is computationally efficient, exploring alternative metric spaces might further improve performance. |
diffusion models, score function estimation, nearest neighbours, importance sampling, generative models |
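To make the nearest-neighbour idea in the entry above concrete, here is a simplified sketch. It is not the paper's estimator: instead of self-normalized importance sampling with a proposal distribution, it deterministically truncates the posterior over training points to the `k` nearest neighbours under the l2 distance. It assumes a variance-exploding noising process, x = x0 + sigma * noise, so that the score equals (E[x0 | x] - x) / sigma^2. The function name and arguments are hypothetical.

```python
import math


def knn_score_estimate(x, train_set, sigma, k):
    """Approximate the score of the noised data distribution at point x.

    x: list of floats; train_set: list of equally sized lists of floats.
    Posterior weights are proportional to exp(-||x - x_i||^2 / (2 sigma^2)),
    restricted to the k training points closest to x.
    """
    if not 1 <= k <= len(train_set):
        raise ValueError("k must be between 1 and the training set size")
    # Squared l2 distance to every training point, keeping the k smallest.
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, xi)), idx)
        for idx, xi in enumerate(train_set)
    )[:k]
    # Subtract the smallest distance before exponentiating for stability.
    d_min = nearest[0][0]
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d, _ in nearest]
    total = sum(weights)
    posterior_mean = [
        sum(w * train_set[idx][j] for w, (_, idx) in zip(weights, nearest)) / total
        for j in range(len(x))
    ]
    return [(m - xj) / sigma ** 2 for m, xj in zip(posterior_mean, x)]


score = knn_score_estimate([1.0, 2.0], [[0.0, 0.0]], sigma=2.0, k=1)
```

With a single training point at the origin, the estimate points from `x` back toward that point, scaled by 1 / sigma^2.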
2402.07562
Report |
Discovering Universal Semantic Triggers for Text-to-Image Synthesis |
Shengfang Zhai, Weilong Wang, Jiajun Li, Yinpeng Dong, Hang Su, Qingni Shen |
Recently text-to-image models have gained widespread attention in the
community due to their controllable and high-quality generation ability.
However, the robustness of such models and their potential ethical issues have
not been fully explored. In this paper, we introduce Universal Semantic
Trigger, a meaningless token sequence that can be added at any location within
the input text yet can induce generated images towards a preset semantic
target. To thoroughly investigate it, we propose Semantic Gradient-based Search
(SGS) framework. SGS automatically discovers the potential universal semantic
triggers based on the given semantic targets. Furthermore, we design evaluation
metrics to comprehensively evaluate semantic shift of images caused by these
triggers. Our empirical analyses reveal that the mainstream open-source
text-to-image models are vulnerable to our triggers, which could pose
significant ethical threats. Our work contributes to a further understanding of
text-to-image synthesis and helps users automatically audit their models
before deployment. |
This paper introduces 'Universal Semantic Triggers,' meaningless token sequences that can be inserted into text prompts for text-to-image models, causing the generated images to exhibit specific, pre-determined semantic features. |
This is important because it reveals a vulnerability in text-to-image models that could be exploited to generate harmful or sensitive content, bypassing existing safety measures like text filters. |
The authors propose a 'Semantic Gradient-based Search (SGS)' framework. SGS uses a gradient-based approach to automatically discover these trigger sequences by minimizing the distance in the text encoder's embedding space between trigger-inserted text and text explicitly describing the target semantic. |
Experiments demonstrate the effectiveness of these triggers across various text-to-image models (Stable Diffusion versions, Latent Diffusion) and even online platforms like Midjourney.
The triggers exhibit a degree of position insensitivity, remaining effective even when inserted at different locations within the text prompt.
The authors demonstrate the potential for increased harm through 'ensemble triggers,' where multiple trigger sequences are combined to imbue images with multiple semantic features simultaneously. |
While the paper demonstrates the existence and potential dangers of these triggers, it doesn't offer concrete mitigation strategies.
The evaluation of 'harmful' or 'sensitive' content relies heavily on user studies and subjective judgment, which can be inherently biased. |
text-to-image synthesis, adversarial attacks, semantic triggers, ethical ai, clip |
2402.07384
Report |
Exploring Perceptual Limitation of Multimodal Large Language Models |
Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun |
Multimodal Large Language Models (MLLMs) have recently shown remarkable
perceptual capability in answering visual questions; however, little is known
about the limits of their perception. In particular, while prior works have
provided anecdotal evidence of MLLMs' sensitivity to object size, this
phenomenon and its underlying causes have not been explored comprehensively. In
this work, we quantitatively study the perception of small visual objects in
several state-of-the-art MLLMs and reveal a pervasive limitation in answering
questions about small objects in images. Next, we identify four independent
factors that can contribute to this limitation -- object quality, size,
distractors, and location -- and conduct controlled intervention studies to
measure the effect of each factor on MLLMs' perception. In particular, we find
that lower object quality and smaller object size can both independently reduce
MLLMs' ability to answer visual questions. More surprisingly, we find that the
location of the object in the image and the presence of visual distractors can
also significantly reduce MLLMs' question answering accuracy. Our study
provides a better understanding of the perceptual limitation of MLLMs and
contributes new evaluation protocols for analyzing the perception of future
MLLMs. To facilitate further investigations, we release our code and data. |
This paper reveals a perceptual limitation in Multimodal Large Language Models (MLLMs) when perceiving small objects and investigates the impact of object quality, size, distractors, and location on this limitation. |
This work provides a deeper understanding of the perceptual limitations of MLLMs, which is crucial for both practical applications and future model development. It also introduces a new evaluation protocol for analyzing the perception of future MLLMs. |
The authors conduct controlled experiments on five open-source MLLMs using synthetic images of digital texts with varying quality, size, distractor presence, and location. The evaluation focuses on text-reading ability using Gestalt Pattern Matching. |
Object quality (sampling rate) significantly impacts performance up to a threshold, beyond which performance stabilizes. This threshold aligns with human perception.
Smaller object size, even with sufficient quality, reduces performance in most MLLMs. Models trained on datasets with smaller objects show less sensitivity to size variations.
The presence of visual distractors and the object's location within the image significantly affect the performance of MLLMs. |
The study primarily uses synthetic digital texts for evaluation, potentially limiting the generalizability of findings to other visual tasks.
Further investigation is needed to understand the specific mechanisms within MLLMs that contribute to the observed limitations, particularly the impact of object location. |
multimodal large language models, perception, small objects, visual question answering, robustness analysis |
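The entry above scores text-reading ability with Gestalt Pattern Matching. Python's standard-library `difflib.SequenceMatcher` is closely based on that family of algorithms (Ratcliff/Obershelp), so a similarity score can be sketched as below. How the paper normalizes or preprocesses model outputs is not stated in the summary, so the helper `reading_score` is a hypothetical illustration rather than the authors' evaluation code.

```python
from difflib import SequenceMatcher


def reading_score(predicted, reference):
    """Similarity in [0, 1] between a model's answer and the ground-truth text.

    Uses SequenceMatcher.ratio(), i.e. 2 * M / T, where M is the number of
    matched characters and T is the total length of both strings.
    """
    return SequenceMatcher(None, predicted, reference).ratio()


exact = reading_score("hello", "hello")
partial = reading_score("abc", "abd")
```

Here `exact` is 1.0, while `partial` is 2/3 because two of the three characters in each string match.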
2402.07370
Report |
SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder |
Jaeseong Lee, Junha Hyung, Sohyun Jeong, Jaegul Choo |
Face swapping has gained significant attention for its varied applications.
The majority of previous face swapping approaches have relied on the seesaw
game training scheme, which often leads to the instability of the model
training and results in undesired samples with blended identities due to the
target identity leakage problem. This paper introduces the Shape Agnostic
Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach
designed to enhance face swapping model training. Our training scheme addresses
the limitations of traditional training methods by circumventing the
conventional seesaw game and introducing clear ground truth through its
self-reconstruction training regime. It effectively mitigates identity leakage
by masking facial regions of the input images and utilizing learned
disentangled identity and non-identity features. Additionally, we tackle the
shape misalignment problem with new techniques including perforation confusion
and random mesh scaling, and establish a new state of the art, surpassing
other baseline methods, preserving both identity and non-identity attributes,
without sacrificing on either aspect. |
This paper proposes Shape Agnostic Masked AutoEncoder (SAMAE), a novel self-supervised training scheme for face swapping that mitigates identity leakage and enhances training stability. |
Existing face swapping methods rely on an unstable seesaw game training scheme, leading to identity blending and the need for extensive hyperparameter tuning. |
SAMAE uses self-reconstruction with face-masked images, disentangled identity and non-identity features, and introduces perforation confusion and random mesh scaling to improve cross-identity swapping. |
SAMAE outperforms state-of-the-art methods in identity preservation, attribute fidelity, and overall image realism.
Perforation confusion and random mesh scaling are crucial for handling shape misalignment and volume discrepancies between source and target faces.
Disentangling skin color from identity embeddings improves the realism of the swapped faces. |
The model's performance is limited by the accuracy of the 3DMM estimator, particularly for exaggerated expressions.
Future work could explore incorporating stronger generative priors like StyleGAN or diffusion models. |
face swapping, self-supervised learning, identity leakage, 3d morphable model, generative adversarial networks |
2402.07207
Report |
GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting |
Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang |
We present GALA3D, generative 3D GAussians with LAyout-guided control, for
effective compositional text-to-3D generation. We first utilize large language
models (LLMs) to generate the initial layout and introduce a layout-guided 3D
Gaussian representation for 3D content generation with adaptive geometric
constraints. We then propose an object-scene compositional optimization
mechanism with conditioned diffusion to collaboratively generate realistic 3D
scenes with consistent geometry, texture, scale, and accurate interactions
among multiple objects while simultaneously adjusting the coarse layout priors
extracted from the LLMs to align with the generated scene. Experiments show
that GALA3D is a user-friendly, end-to-end framework for state-of-the-art
scene-level 3D content generation and controllable editing while ensuring the
high fidelity of object-level entities within the scene. Source codes and
models will be available at https://gala3d.github.io/. |
This paper introduces GALA3D, a novel layout-guided generative Gaussian splatting framework for generating complex 3D scenes from text descriptions. |
Existing text-to-3D methods struggle to generate complex scenes with multiple objects and their interactions. GALA3D addresses this by leveraging layout priors and compositional optimization for enhanced control and fidelity. |
GALA3D first uses LLMs to interpret text into coarse layouts. Then, it introduces a layout-guided Gaussian representation and utilizes adaptive geometry control to optimize the shape and distribution of Gaussians. A compositional optimization strategy with diffusion priors is employed to generate the final 3D scene, while a layout refinement module iteratively improves the LLM-generated layouts. |
GALA3D outperforms existing NeRF-based, voxel-based, and 3DGS-based methods in text-to-3D scene generation, achieving higher CLIP scores and better visual quality.
User studies confirm that GALA3D generates higher-quality 3D scenes with better geometry, text alignment, and consistency compared to other SOTA approaches.
GALA3D supports interactive editing of generated scenes through textual conversations, enabling users to easily modify object placement, add/remove objects, and adjust styles. |
The reliance on LLMs for layout interpretation can introduce errors due to the LLMs' limited 3D scene understanding.
Further research can explore incorporating more detailed semantic information and object relationships into the layout representation for enhanced control. |
text-to-3d generation, generative gaussian splatting, layout-guided generation, compositional 3d generation, large language models |
2402.07181
Report |
3D Gaussian as a New Vision Era: A Survey |
Ben Fei, Jingyi Xu, Rui Zhang, Qingyuan Zhou, Weidong Yang, Ying He |
3D Gaussian Splatting (3D-GS) has emerged as a significant advancement in the
field of Computer Graphics, offering explicit scene representation and novel
view synthesis without the reliance on neural networks, such as Neural Radiance
Fields (NeRF). This technique has found diverse applications in areas such as
robotics, urban mapping, autonomous navigation, and virtual reality/augmented
reality, to name a few. Given the growing popularity and expanding research
in 3D Gaussian Splatting, this paper presents a comprehensive survey of
relevant papers from the past year. We organize the survey into taxonomies
based on characteristics and applications, providing an introduction to the
theoretical underpinnings of 3D Gaussian Splatting. Our goal through this
survey is to acquaint new researchers with 3D Gaussian Splatting, serve as a
valuable reference for seminal works in the field, and inspire future research
directions, as discussed in our concluding section. |
This paper presents a comprehensive survey of 3D Gaussian Splatting (3D-GS) research from the past year, categorizing advancements and applications to guide new researchers and inspire future research directions. |
3D-GS has emerged as a powerful technique in computer graphics for efficiently rendering complex scenes, offering explicit scene representation and novel view synthesis without relying on neural networks like NeRF. |
The paper organizes research into taxonomies based on characteristics (efficiency, realism, cost, physics) and applications (reconstruction, manipulation, generation, perception, and virtual humans). |
Various methods have been proposed to compress 3D Gaussian representations, improve rendering realism by addressing aliasing and incorporating physics-based rendering, and reduce the number of images needed for novel view synthesis.
3D-GS has shown promise in tasks like mesh reconstruction, text-guided scene manipulation, single/multi-view 3D generation, semantic object detection, dynamic scene tracking, and virtual human avatar creation.
Researchers are actively exploring real-time rendering of dynamic scenes, incorporating accurate physics simulations, and expanding 3D-GS capabilities by integrating with large foundation models. |
Current 3D-GS methods face challenges in handling floating elements, balancing rendering and reconstruction quality, and achieving realistic generation with accurate textures and geometry.
Future work could focus on addressing these challenges, improving performance in few-shot scenarios, and exploring applications in areas like robotics and autonomous vehicles. |
3d gaussian splatting, 3d-gs, computer graphics, novel view synthesis, 3d scene reconstruction |
2402.06149
Report |
HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting |
Zhenglin Zhou, Fan Ma, Hehe Fan, Yi Yang |
Creating digital avatars from textual prompts has long been a desirable yet
challenging task. Despite the promising outcomes obtained through 2D diffusion
priors in recent works, current methods face challenges in achieving
high-quality and animated avatars effectively. In this paper, we present
HeadStudio, a novel framework that utilizes 3D Gaussian splatting to
generate realistic and animated avatars from text prompts. Our method drives 3D
Gaussians semantically to create a flexible and achievable appearance through
the intermediate FLAME representation. Specifically, we incorporate the FLAME
into both 3D representation and score distillation: 1) FLAME-based 3D Gaussian
splatting, driving 3D Gaussian points by rigging each point to a FLAME mesh. 2)
FLAME-based score distillation sampling, utilizing FLAME-based fine-grained
control signal to guide score distillation from the text prompt. Extensive
experiments demonstrate the efficacy of HeadStudio in generating animatable
avatars from textual prompts, exhibiting visually appealing appearances. The
avatars are capable of rendering high-quality real-time ($\geq 40$ fps) novel
views at a resolution of 1024. They can be smoothly controlled by real-world
speech and video. We hope that HeadStudio can advance digital avatar creation
and that the present method can widely be applied across various domains. |
This paper presents HeadStudio, a novel framework that leverages 3D Gaussian splatting to generate realistic and animatable head avatars from text prompts. |
Current text-based avatar generation methods struggle to effectively combine high-fidelity appearance with smooth animation. |
HeadStudio incorporates FLAME, a statistical head model, to semantically align 3D Gaussian points and guide score distillation from text prompts using: 1) FLAME-based 3D Gaussian Splatting (F-3DGS) for deformation, and 2) FLAME-based Score Distillation Sampling (F-SDS) for knowledge distillation. |
Generates high-fidelity head avatars surpassing state-of-the-art methods in visual quality.
Achieves effective semantic alignment for smooth and accurate animation of facial expressions.
Enables real-time rendering at ≥ 40 fps, suitable for augmented and virtual reality applications. |
Limited to head avatar generation; further research is needed for full-body avatars.
Relies on pre-trained diffusion models, inheriting potential biases and limitations. |
text-to-3d, avatar generation, 3d gaussian splatting, score distillation, flame |
2402.06117
Report |
Spatially-Attentive Patch-Hierarchical Network with Adaptive Sampling for Motion Deblurring |
Maitreya Suin, Kuldeep Purohit, A. N. Rajagopalan |
This paper tackles the problem of motion deblurring of dynamic scenes.
Although end-to-end fully convolutional designs have recently advanced the
state-of-the-art in non-uniform motion deblurring, their performance-complexity
trade-off is still sub-optimal. Most existing approaches achieve a large
receptive field by increasing the number of generic convolution layers and
kernel size. In this work, we propose a pixel adaptive and feature attentive
design for handling large blur variations across different spatial locations
and process each test image adaptively. We design a content-aware global-local
filtering module that significantly improves performance by considering not
only global dependencies but also by dynamically exploiting neighboring pixel
information. We further introduce a pixel-adaptive non-uniform sampling
strategy that implicitly discovers the difficult-to-restore regions present in
the image and, in turn, performs fine-grained refinement in a progressive
manner. Extensive qualitative and quantitative comparisons with prior art on
deblurring benchmarks demonstrate that our approach performs favorably against
the state-of-the-art deblurring algorithms. |
This paper proposes a Spatially-Attentive Patch-Hierarchical Network with Adaptive Sampling for more efficient and effective motion deblurring of dynamic scenes. |
Existing end-to-end fully convolutional designs for motion deblurring have sub-optimal performance-complexity trade-offs, struggling to handle large blur variations efficiently. This paper addresses this by introducing spatially adaptive and content-aware mechanisms within a CNN. |
The paper utilizes a multi-patch hierarchical network with content-aware processing modules that combine global attention and adaptive local filters. It introduces non-uniform pixel-adaptive sampling to prioritize the processing of heavily blurred regions and incorporates progressive image restoration with ground truth supervision at each stage. |
The approach achieves state-of-the-art deblurring performance on benchmark datasets like GoPro, HIDE, and RealBlur.
It offers a better accuracy-speed trade-off than methods relying solely on increasing network depth or filter size.
The adaptive sampling strategy is shown to significantly improve performance by efficiently distributing computation based on blur severity. |
The current implementation relies on custom operations that are less optimized in standard deep learning libraries, leading to slower runtime than some purely convolutional networks despite lower theoretical complexity (GFLOPs).
Future work includes exploring the application of the proposed adaptive sampling strategy to other image restoration tasks. |
image deblurring, spatially adaptive, attention mechanism, adaptive sampling, deep learning |
2402.05947
Report |
Separable Multi-Concept Erasure from Diffusion Models |
Mengnan Zhao, Lihe Zhang, Tianhang Zheng, Yuqiu Kong, Baocai Yin |
Large-scale diffusion models, known for their impressive image generation
capabilities, have raised concerns among researchers regarding social impacts,
such as the imitation of copyrighted artistic styles. In response, existing
approaches turn to machine unlearning techniques to eliminate unsafe concepts
from pre-trained models. However, these methods compromise the generative
performance and neglect the coupling among multi-concept erasures, as well as
the concept restoration problem. To address these issues, we propose a
Separable Multi-concept Eraser (SepME), which mainly includes two parts: the
generation of concept-irrelevant representations and the weight decoupling. The
former aims to avoid unlearning substantial information that is irrelevant to
forgotten concepts. The latter separates optimizable model weights, making each
weight increment correspond to a specific concept erasure without affecting
generative performance on other concepts. Specifically, the weight increment
for erasing a specified concept is formulated as a linear combination of
solutions calculated based on other known undesirable concepts. Extensive
experiments indicate the efficacy of our approach in eliminating concepts,
preserving model performance, and offering flexibility in the erasure or
recovery of various concepts. |
This paper introduces SepME, a novel machine unlearning technique for diffusion models that enables the flexible erasure and recovery of multiple concepts while preserving overall model performance. |
Existing methods for removing unsafe or undesirable concepts from pre-trained diffusion models often lead to performance degradation and struggle with multi-concept erasure and restoration. |
SepME consists of two key components: G-CiRs generates concept-irrelevant representations to preserve model performance, and WD decouples weight increments for individual concept erasure, allowing for flexible concept manipulation. |
SepME effectively removes targeted concepts while maintaining high generation quality for other concepts.
It outperforms baseline methods in terms of both concept erasure and overall model performance.
SepME enables flexible manipulation of concepts, including simultaneous multi-concept erasure, iterative concept erasure, and concept restoration. |
The cosine function as an alternative to the correlation term in SepME did not yield satisfactory results.
Future work will focus on exploring different architectures and optimization strategies for further improving SepME's efficiency and effectiveness. |
machine unlearning, diffusion models, concept erasure, concept restoration, stable diffusion |
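The weight-decoupling idea in the SepME record above can be illustrated with a toy sketch: if each concept has its own weight increment, erasing a concept means adding its increment to the base weights, and restoring it means leaving that increment out. The function name `apply_increments`, the concept labels, and all numbers are hypothetical; how SepME actually computes the increments is not shown here.

```python
from typing import Dict, Iterable, List


def apply_increments(base: List[float],
                     increments: Dict[str, List[float]],
                     erased: Iterable[str]) -> List[float]:
    """Return the base weights plus the increment of every concept in `erased`."""
    weights = list(base)  # copy so the base weights stay untouched
    for concept in erased:
        delta = increments[concept]
        weights = [w + d for w, d in zip(weights, delta)]
    return weights


# Hypothetical base weights and per-concept increments (illustration only).
base = [1.0, 2.0, 3.0]
increments = {"style_a": [0.5, 0.0, 0.0], "style_b": [0.0, -1.0, 0.0]}

both = apply_increments(base, increments, ["style_a", "style_b"])  # erase both
only_b = apply_increments(base, increments, ["style_b"])           # style_a restored
```

Because the increments are kept separate, any subset of concepts can be erased or recovered without retraining the others, which is the flexibility the record describes.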
2402.05937
Report |
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset |
Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, Lin Ma |
In this paper, we present a novel paradigm to enhance the ability of an object
detector, e.g., expanding categories or improving detection performance, by
training on a synthetic dataset generated from diffusion models. Specifically, we
integrate an instance-level grounding head into a pre-trained, generative
diffusion model, to augment it with the ability of localising instances in the
generated images. The grounding head is trained to align the text embedding of
category names with the regional visual feature of the diffusion model, using
supervision from an off-the-shelf object detector, and a novel self-training
scheme on (novel) categories not covered by the detector. We conduct thorough
experiments to show that this enhanced version of the diffusion model, termed
InstaGen, can serve as a data synthesizer to enhance object detectors by
training on its generated samples, demonstrating superior performance over
existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse
(+1.2 to 5.2 AP) scenarios. Project page with code:
https://fcjian.github.io/InstaGen. |
This paper presents InstaGen, a novel framework that enhances object detection by training on synthetic datasets generated from diffusion models. InstaGen incorporates an instance-level grounding head into a pre-trained diffusion model, enabling the generation of photo-realistic images with bounding boxes for object instances. |
Building large-scale object detection datasets is labor-intensive and time-consuming. InstaGen offers a solution by synthesizing high-quality, annotated images, facilitating object detection model development and enhancing their capabilities. |
The methodology involves (1) fine-tuning a pre-trained Stable Diffusion Model (SDM) on an existing object detection dataset to create an image synthesizer, and (2) training an instance-level grounding head. The grounding head aligns text embeddings of category names with regional visual features from the image synthesizer to predict bounding boxes for objects in synthetic images. |
InstaGen outperforms state-of-the-art CLIP-based open-vocabulary object detection methods, achieving a +4.5 AP improvement on the COCO benchmark.
The synthetic datasets generated by InstaGen are particularly beneficial in data-sparse scenarios, showing significant performance improvement (+1.2 to +5.2 AP) when real training data is limited.
InstaGen effectively generalizes to unseen datasets, achieving superior performance in cross-dataset object detection when transferring from COCO to Objects365 and LVIS. |
The synthetic datasets generated by InstaGen may lack the complexity and contextual diversity of real-world scenes, limiting the robustness of trained object detectors.
Current diffusion-based generative models, including those used in InstaGen, face challenges in representing and generating images for rare object categories, leading to potential class imbalance during training. |
object detection, synthetic dataset, diffusion model, open-vocabulary detection, data-sparse detection |
2402.05892
Report |
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data |
Shufan Li, Harkanwar Singh, Aditya Grover |
In recent years, Transformers have become the de-facto architecture for
sequence modeling on text and a variety of multi-dimensional data, such as
images and video. However, the use of self-attention layers in a Transformer
incurs prohibitive compute and memory complexity that scales quadratically
w.r.t. the sequence length. A recent architecture, Mamba, based on state space
models has been shown to achieve comparable performance for modeling text
sequences, while scaling linearly with the sequence length. In this work, we
present Mamba-ND, a generalized design extending the Mamba architecture to
arbitrary multi-dimensional data. Our design alternately unravels the input
data across different dimensions following row-major orderings. We provide a
systematic comparison of Mamba-ND with several other alternatives, based on
prior multi-dimensional extensions such as Bi-directional LSTMs and S4ND.
Empirically, we show that Mamba-ND demonstrates performance competitive with
the state-of-the-art on a variety of multi-dimensional benchmarks, including
ImageNet-1K classification, HMDB-51 action recognition, and ERA5 weather
forecasting. |
The paper introduces Mamba-ND, a simple yet effective method for extending state space models (specifically, the Mamba architecture) to multi-dimensional data like images and videos. |
Transformers, while dominant in sequence modeling, have quadratic complexity, making them hard to scale. Mamba, based on state space models, offers linear complexity and competitive performance but lacked multi-dimensional extension, which Mamba-ND addresses. |
Mamba-ND leverages the efficient 1D Mamba layers and achieves multi-dimensionality by simply alternating the sequence ordering (e.g., height, width, time) across layers. |
Mamba-ND outperforms Transformers (ViT, Swin) on ImageNet-1K classification, HMDB-51 and UCF-101 action recognition, ERA5 weather forecasting, and BTCV 3D segmentation, often with fewer parameters.
Extensive ablations show the alternating-directional design surpasses more complex layer arrangements and scan factorization techniques.
Alternating ordering leads to better effective receptive fields compared to uni-directional or bi-directional baselines. |
The vast design space of possible orderings is not fully explored, with only row-major variations tested.
While offering linear complexity, the current implementation of scan factorization leads to memory and runtime overhead, needing future optimization. |
state space models, multi-dimensional modeling, vision transformers, sequence modeling, mamba |
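The alternating row-major orderings described in the Mamba-ND record above can be sketched as index permutations over a flattened grid. This is a minimal illustration, not the authors' implementation: `scan_order`, `layer_orders`, and the specific cycle (each axis ordering forward, then reversed) are assumptions made for the example.

```python
from itertools import product
from typing import List, Sequence


def scan_order(shape: Sequence[int], axis_order: Sequence[int],
               reverse: bool = False) -> List[int]:
    """Flat row-major indices of a grid, visited with `axis_order` as loop nesting."""
    # Row-major strides of the original layout, e.g. (H, W) -> (W, 1).
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    order = []
    # The first axis in `axis_order` is the outermost (slowest-changing) loop.
    for coords in product(*(range(shape[a]) for a in axis_order)):
        order.append(sum(c * strides[a] for c, a in zip(coords, axis_order)))
    return order[::-1] if reverse else order


def layer_orders(shape: Sequence[int], num_layers: int) -> List[List[int]]:
    """Hypothetical schedule: cycle through each axis ordering, forward then reversed."""
    ndim = len(shape)
    schedule = []
    for lead in range(ndim):
        axes = [lead] + [a for a in range(ndim) if a != lead]
        schedule.append((axes, False))
        schedule.append((axes, True))
    return [scan_order(shape, *schedule[i % len(schedule)])
            for i in range(num_layers)]


orders = layer_orders((2, 3), num_layers=5)
```

Each layer would permute the token sequence with its ordering before a 1D sequence layer and undo the permutation afterwards, so the layer itself stays one-dimensional.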
2402.05889
Report |
CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion |
Shoubin Yu, Jaehong Yoon, Mohit Bansal |
Despite impressive advancements in multimodal compositional reasoning
approaches, they are still limited in their flexibility and efficiency by
processing fixed modality inputs while updating a lot of model parameters. This
paper tackles these critical challenges and proposes CREMA, an efficient and
modular modality-fusion framework for injecting any new modality into video
reasoning. We first augment multiple informative modalities (such as optical
flow, 3D point cloud, audio) from given videos without extra human annotation
by leveraging existing pre-trained models. Next, we introduce a query
transformer with multiple parameter-efficient modules associated with each
accessible modality. It projects diverse modality features to the LLM token
embedding space, allowing the model to integrate different data types for
response generation. Furthermore, we propose a fusion module designed to
compress multimodal queries, maintaining computational efficiency in the LLM
while combining additional modalities. We validate our method on video-3D,
video-audio, and video-language reasoning tasks and achieve better/equivalent
performance against strong multimodal LLMs, including BLIP-2, 3D-LLM, and
SeViLA while using 96% fewer trainable parameters. We provide extensive
analyses of CREMA, including the impact of each modality on reasoning domains,
the design of the fusion module, and example visualizations. |
This paper proposes CREMA, an efficient and modular modality-fusion framework for video reasoning that can integrate diverse modalities (e.g., video, audio, 3D point cloud) using parameter-efficient adapters and a novel self-gated fusion module (CREMA-Espresso). |
Current Multimodal Large Language Models (MLLMs) are computationally expensive and lack flexibility when adapting to new modalities, especially for video reasoning tasks that can benefit from diverse sensory inputs. |
CREMA leverages a frozen pre-trained vision-language model and introduces lightweight Modality-specific Multi-Query Adapters (MMQAs) with LoRA, learnable queries, and linear projections for each modality. CREMA-Espresso further fuses multimodal queries efficiently using self-gated attention. |
CREMA outperforms modality-specific baselines on SQA3D, MUSIC-AVQA, and NeXT-QA, showing improvements of +3.3%, +1.9%, and +0.9% respectively with significantly fewer parameters (2-4% of baselines).
It also achieves performance better than or comparable to general-purpose MLLMs like BLIP-2 and 3D-LLM in both fine-tuning and zero-shot settings.
Analysis demonstrates the efficiency of the self-gated fusion module, the impact of additional modalities on answering hard questions, and provides qualitative visualizations of model reasoning. |
The reliance on pre-trained vision-language models may introduce potential biases present in the training data.
Future work includes exploring the impact of varying LoRA ranks, optimizing the MMQA pre-training process, and evaluating on more diverse video reasoning benchmarks. |
multimodal learning, video reasoning, large language models, modality fusion, parameter efficiency |
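The CREMA record above attributes its small trainable-parameter budget partly to LoRA adapters. As a rough sketch of why a low-rank update is cheap, the arithmetic below compares a full weight matrix with a rank-r factorization. The hidden size 768 and rank 8 are hypothetical values chosen for illustration, not figures from the paper, and this is not CREMA's own parameter accounting.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a rank-`rank` update B @ A, with B: d_out x rank, A: rank x d_in."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank (assumptions for illustration).
dense = full_params(768, 768)
low_rank = lora_params(768, 768, rank=8)
fraction = low_rank / dense  # share of the dense parameter count that is trained
```

With these assumed sizes the low-rank update trains roughly 2% of the parameters of the dense matrix it adapts.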
2402.05803
Report |
AvatarMMC: 3D Head Avatar Generation and Editing with Multi-Modal Conditioning |
Wamiq Reyaz Para, Abdelrahman Eldesokey, Zhenyu Li, Pradyumna Reddy, Jiankang Deng, Peter Wonka |
We introduce an approach for 3D head avatar generation and editing with
multi-modal conditioning based on a 3D Generative Adversarial Network (GAN) and
a Latent Diffusion Model (LDM). 3D GANs can generate high-quality head avatars
given a single or no condition. However, it is challenging to generate samples
that adhere to multiple conditions of different modalities. On the other hand,
LDMs excel at learning complex conditional distributions. To this end, we
propose to exploit the conditioning capabilities of LDMs to enable multi-modal
control over the latent space of a pre-trained 3D GAN. Our method can generate
and edit 3D head avatars given a mixture of control signals such as RGB input,
segmentation masks, and global attributes. This provides better control over
the generation and editing of synthetic avatars both globally and locally.
Experiments show that our proposed approach outperforms a solely GAN-based
approach both qualitatively and quantitatively on generation and editing tasks.
To the best of our knowledge, our approach is the first to introduce
multi-modal conditioning to 3D avatar generation and editing.
Project page: avatarmmc-sig24.github.io |
This paper proposes AvatarMMC, a novel framework for 3D head avatar generation and editing with multi-modal conditioning, using a 1D Latent Diffusion Model (LDM) to control the latent space of a pre-trained 3D GAN (Next3D). |
Existing methods for 3D avatar generation often struggle to incorporate multiple conditions simultaneously, limiting their controllability. AvatarMMC addresses this by enabling multi-modal control over avatar generation and editing, combining the quality of 3D GANs with the controllability of diffusion models. |
The method utilizes a pre-trained Next3D GAN for avatar generation and a 1D LDM to learn the mapping between multi-modal conditions (RGB input, segmentation masks, and attributes) and the GAN's latent space. Different encoders embed these conditions into a common space, and cross-attention layers in the LDM incorporate the conditions during the denoising process. |
AvatarMMC generates high-quality, diverse avatars adhering to various multi-modal conditions (e.g., RGB images, segmentation masks, and attributes).
It enables high-fidelity avatar editing while preserving identity compared to a GAN-based baseline.
The method is lightweight and fast for training and sampling, as it doesn't require retraining the GAN. |
The method inherits biases present in the training data and methods of the pre-trained 3D GAN.
Future work could explore incorporating more conditioning modalities (e.g., sketches, landmarks) and joint control over animation. |
3d avatar generation, multi-modal conditioning, latent diffusion models, generative adversarial networks, avatar editing |
2402.05608
Report |
Scalable Diffusion Models with State Space Backbone |
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang |
This paper presents a new exploration into a category of diffusion models
built upon state space architecture. We endeavor to train diffusion models for
image data, wherein the traditional U-Net backbone is supplanted by a state
space backbone, functioning on raw patches or latent space. Given its notable
efficacy in accommodating long-range dependencies, Diffusion State Space Models
(DiS) are distinguished by treating all inputs including time, condition, and
noisy image patches as tokens. Our assessment of DiS encompasses both
unconditional and class-conditional image generation scenarios, revealing that
DiS exhibits comparable, if not superior, performance to CNN-based or
Transformer-based U-Net architectures of commensurate size. Furthermore, we
analyze the scalability of DiS, gauged by the forward pass complexity
quantified in Gflops. DiS models with higher Gflops, achieved through
augmentation of depth/width or augmentation of input tokens, consistently
demonstrate lower FID. In addition to demonstrating commendable scalability
characteristics, DiS-H/2 models in latent space achieve performance levels akin
to prior diffusion models on class-conditional ImageNet benchmarks at the
resolution of 256$\times$256 and 512$\times$512, while significantly reducing
the computational burden. The code and models are available at:
https://github.com/feizc/DiS. |
This paper introduces DiS, a novel diffusion model architecture employing a state space backbone instead of the traditional U-Net structure for image generation. |
DiS aims to leverage the state space model's strength in handling long-range dependencies for efficient and scalable image generation, potentially surpassing CNN-based and Transformer-based U-Net models. |
DiS treats all inputs (time, condition, noisy image patches) as tokens processed by a bidirectional Mamba architecture, incorporating skip connections and a linear decoder for noise prediction. |
DiS achieves comparable performance to U-Net and Transformer-based models on CIFAR10 and CelebA 64x64 with fewer parameters.
Scaling DiS by increasing depth/width consistently improves FID scores on ImageNet 256x256.
DiS-H/2 achieves state-of-the-art FID on ImageNet 256x256 and outperforms ADM-G on ImageNet 512x512 in latent space. |
The model's performance has not fully converged, suggesting potential for further improvement.
Exploration of larger models and token counts is left for future work. |
diffusion models, state space models, image generation, scalability, mamba architecture |
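Since the DiS record above treats time, condition, and noisy image patches all as tokens, the sequence length follows directly from the patch size. The sketch below assumes that the "/2" in DiS-H/2 denotes a patch size of 2 and that a 256x256 image maps to a 32x32 latent (an 8x downsampling autoencoder); neither assumption is stated in the record, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(height: int, width: int, patch_size: int,
               extra_tokens: int = 2) -> int:
    """Patch tokens for a height x width input plus extra tokens (time, condition)."""
    if height % patch_size or width % patch_size:
        raise ValueError("input size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size) + extra_tokens


# Assumed latent sizes: 32x32 for 256x256 images, 64x64 for 512x512 images.
tokens_256 = num_tokens(32, 32, patch_size=2)
tokens_512 = num_tokens(64, 64, patch_size=2)
```

Halving the patch size quadruples the number of patch tokens, which is one of the two ways the record says Gflops can be increased.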
2402.05472
Report |
Question Aware Vision Transformer for Multimodal Reasoning |
Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman |
Vision-Language (VL) models have gained significant research focus, enabling
remarkable advances in multimodal reasoning. These architectures typically
comprise a vision encoder, a Large Language Model (LLM), and a projection
module that aligns visual features with the LLM's representation space. Despite
their success, a critical limitation persists: the vision encoding process
remains decoupled from user queries, often in the form of image-related
questions. Consequently, the resulting visual features may not be optimally
attuned to the query-specific elements of the image. To address this, we
introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal
reasoning, which embeds question awareness directly within the vision encoder.
This integration results in dynamic visual features focusing on the image
aspects relevant to the posed question. QA-ViT is model-agnostic and can be incorporated
efficiently into any VL architecture. Extensive experiments demonstrate the
effectiveness of applying our method to various multimodal architectures,
leading to consistent improvement across diverse tasks and showcasing its
potential for enhancing visual and scene-text understanding. |
This paper introduces QA-ViT, a question-aware vision transformer approach for multimodal reasoning. QA-ViT embeds question awareness directly into the vision encoder, resulting in dynamic visual features focused on relevant image aspects. |
Existing Vision-Language models suffer from a decoupling of vision encoding and user queries. This leads to suboptimal visual features that may not be attuned to the specific elements of an image relevant to the query. |
QA-ViT uses a two-stage process: 1) a question encoding module processes the textual prompt into representations, 2) a question fusing module integrates the representations into the vision model via the self-attention mechanism. This approach allows the model to extract text-aware visual features. |
QA-ViT leads to substantial and consistent improvements across diverse VL tasks and architectures, including ViT+T5, BLIP2, InstructBLIP, and LLaVA-1.5.
QA-ViT shows significant benefits in scenarios requiring reasoning over nuanced, low-level image details, which are often overlooked by standard vision encoders.
The method exhibits consistent performance gains across various LLM scales, demonstrating its compatibility with different model sizes. |
While QA-ViT demonstrates strong performance in natural image domains, its effectiveness is limited in dense-text scenarios like document understanding.
Future work could explore designated pretraining techniques tailored for QA-ViT to further enhance its capabilities. |
vision-language models, multimodal reasoning, question answering, image captioning, vision transformer |
2402.05408
Report |
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis |
Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, Yi Yang |
We present a Multi-Instance Generation (MIG) task, simultaneously generating
multiple instances with diverse controls in one image. Given a set of
predefined coordinates and their corresponding descriptions, the task is to
ensure that generated instances are accurately placed at the designated locations and
that all instances' attributes adhere to their corresponding description. This
broadens the scope of current research on single-instance generation, elevating
it to a more versatile and practical dimension. Inspired by the idea of divide
and conquer, we introduce an innovative approach named Multi-Instance
Generation Controller (MIGC) to address the challenges of the MIG task.
Initially, we break down the MIG task into several subtasks, each involving the
shading of a single instance. To ensure precise shading for each instance, we
introduce an instance enhancement attention mechanism. Lastly, we aggregate all
the shaded instances to provide the necessary information for accurately
generating multiple instances in stable diffusion (SD). To evaluate how well
generation models perform on the MIG task, we provide a COCO-MIG benchmark
along with an evaluation pipeline. Extensive experiments were conducted on the
proposed COCO-MIG benchmark, as well as on various commonly used benchmarks.
The evaluation results illustrate the exceptional control capabilities of our
model in terms of quantity, position, attribute, and interaction. Code and
demos will be released at https://migcproject.github.io/. |
This paper presents Multi-Instance Generation (MIG), a task focused on generating multiple instances with diverse user controls within a single image, along with a novel approach named Multi-Instance Generation Controller (MIGC) to address this task. |
MIG tackles limitations of single-instance generation, offering more versatile and practical applications in image synthesis by enabling control over quantity, position, attributes, and interaction of multiple instances. |
MIGC leverages a divide and conquer strategy: dividing the task into single-instance shading subtasks, conquering them using an Enhancement Attention Layer, and combining the results via Layout Attention and a Shading Aggregation Controller. |
On the COCO-MIG benchmark, MIGC significantly improved Instance Success Rate from 32.39% to 58.43%.
On the COCO benchmark, MIGC demonstrated notable improvements in Average Precision (AP), increasing it from 40.68/68.26/42.85 to 54.69/84.17/61.71.
On DrawBench, MIGC achieved advancements across position, attribute, and count control, especially raising the attribute success rate from 48.20% to 97.50%. |
MIGC relies on the single-instance generation capabilities of the pre-trained stable diffusion model. If stable diffusion struggles to generate a specific instance, MIGC will also face difficulties.
While MIGC exhibits strong control over instance attributes and positions, further research is needed to enhance the control of interactive relationships between instances. |
multi-instance generation, text-to-image synthesis, layout control, stable diffusion, attention mechanisms |
2402.05382
Report |
Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts |
Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, James T. Kwok |
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that
achieves promising results in model pre-training. However, when the various
downstream tasks have data distributions different from the pre-training data,
the semantically irrelevant pre-training information might result in negative
transfer, impeding MAE's scalability. To address this issue, we propose a novel
MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE),
which can be trained once but provides customized pre-training models for
diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE
trains each expert only with semantically relevant images by using
cluster-conditional gates. Thus, each downstream task can be allocated to its
customized model pre-trained with data most similar to the downstream data.
Experiments on a collection of 11 downstream tasks show that MoCE outperforms
the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art
self-supervised learning results on detection and segmentation. |
This paper proposes MoCE (Mixture of Cluster-conditional Experts), a novel MAE-based pre-training paradigm that addresses the negative transfer problem in MAE by providing customized pre-trained models for diverse downstream tasks. |
MAE, while effective for model pre-training, can suffer from negative transfer when applied to downstream tasks with data distributions different from the pre-training data. This limits MAE’s scalability and transferability. |
MoCE first clusters the dataset using a pre-trained MAE. Then, it trains a multi-expert architecture where each expert focuses on images from specific clusters with similar semantics, guided by cluster-conditional gates. For deployment, MoCE selects the most relevant expert for each downstream task based on its data distribution. |
MoCE outperforms vanilla MAE by 2.45% on average across 11 downstream tasks.
MoCE achieves state-of-the-art self-supervised learning results on detection and segmentation.
MoCE demonstrates superior performance compared to TokenMoE and SDR, showcasing the effectiveness of its cluster-conditional expert routing. |
The number of experts and clusters might be a bottleneck for further performance improvement.
Exploration on larger datasets and more diverse downstream tasks is needed. |
self-supervised learning, masked autoencoder, mixture of experts, negative transfer, task-customized pre-training |
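The routing described in the MoCE record above (cluster the data, tie experts to clusters, then pick the expert that best matches a downstream task) can be sketched as nearest-centroid assignment followed by a vote. The helper names, the majority-vote rule, and the toy centroids are assumptions for illustration; the record does not specify how the most relevant expert is chosen.

```python
from collections import Counter
from typing import Dict, List, Sequence


def nearest_cluster(feature: Sequence[float],
                    centroids: List[Sequence[float]]) -> int:
    """Index of the centroid closest to `feature` (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(feature, centroids[k]))


def expert_for_task(task_features: List[Sequence[float]],
                    centroids: List[Sequence[float]],
                    cluster_to_expert: Dict[int, int]) -> int:
    """Pick the expert whose clusters receive the most downstream samples."""
    votes = Counter(cluster_to_expert[nearest_cluster(f, centroids)]
                    for f in task_features)
    return votes.most_common(1)[0][0]


# Toy setup (hypothetical): three clusters mapped onto two experts.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
cluster_to_expert = {0: 0, 1: 1, 2: 1}
task = [[9.0, 9.0], [1.0, 9.0], [0.5, 0.2]]
chosen = expert_for_task(task, centroids, cluster_to_expert)
```

During pre-training, the same cluster-to-expert mapping would act as the gate, so each expert only sees images from its own clusters.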
2402.05375
Report |
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models |
Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang |
The success of recent text-to-image diffusion models is largely due to their
capacity to be guided by a complex text prompt, which enables users to
precisely describe the desired content. However, these models struggle to
effectively suppress the generation of undesired content, which is explicitly
requested to be omitted from the generated image in the prompt. In this paper,
we analyze how to manipulate the text embeddings and remove unwanted content
from them. We introduce two contributions, which we refer to as
$\textit{soft-weighted regularization}$ and $\textit{inference-time text
embedding optimization}$. The first regularizes the text embedding matrix and
effectively suppresses the undesired content. The second method aims to further
suppress the unwanted content generation of the prompt, and encourages the
generation of desired content. We evaluate our method quantitatively and
qualitatively on extensive experiments, validating its effectiveness.
Furthermore, our method generalizes to both pixel-space diffusion
models (e.g., DeepFloyd-IF) and latent-space diffusion models (e.g., Stable
Diffusion). |
This paper introduces a novel method for suppressing the generation of undesired content (negative targets) in text-to-image diffusion models by manipulating text embeddings, enabling more precise control over image generation. |
Current text-to-image models struggle to effectively omit content explicitly requested to be excluded in the prompt, limiting precise image generation control. |
The method utilizes two steps: 1) **Soft-weighted regularization**: Applying SVD to a negative target embedding matrix, then regularizing singular values to suppress negative target information in the [EOT] embeddings. 2) **Inference-time text embedding optimization**: Optimizing the whole text embeddings with two losses - negative target prompt suppression (weakens negative target attention) and positive target prompt preservation (strengthens desired content attention). |
The proposed method effectively suppresses negative target generation without needing to fine-tune the image generator or collect paired images.
Quantitative and qualitative evaluations on various datasets demonstrate superior performance compared to existing baselines, achieving the best scores in Clipscore and DetScore and a comparable IFID.
The method proves versatile, applicable to both pixel-space and latent-space diffusion models, and adaptable for tasks like image restoration and content strengthening. |
The current implementation requires around 30 seconds for inference-time optimization, limiting its practicality in real-time applications.
The method relies on concise prompts primarily describing objects, struggling with lengthy and abstract descriptions. |
text-to-image generation, diffusion models, negative content suppression, text embedding manipulation, image editing |
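The soft-weighted regularization step in the record above (SVD of the negative-target embedding matrix, then regularized singular values) can be written out as follows. The symbols and the exponential weighting are assumptions made for illustration, since the record only says the singular values are regularized; here $\chi$ stands for the embedding matrix that carries the negative-target information.

```latex
\chi = U \Sigma V^{\top}, \qquad
\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)

\hat{\sigma}_i = e^{-\sigma_i}\, \sigma_i, \qquad
\hat{\Sigma} = \mathrm{diag}(\hat{\sigma}_1, \hat{\sigma}_2, \dots, \hat{\sigma}_n)

\hat{\chi} = U \hat{\Sigma} V^{\top}
```

Under this assumed weighting, a dominant component with $\sigma = 10$ shrinks to about $4.5 \times 10^{-4}$, while a minor one with $\sigma = 0.5$ only drops to about $0.30$, so the strongest directions (taken to carry the unwanted content) are suppressed the most.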
2402.05235
Report |
SPAD : Spatially Aware Multiview Diffusers |
Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin |
We present SPAD, a novel approach for creating consistent multi-view images
from text prompts or single images. To enable multi-view generation, we
repurpose a pretrained 2D diffusion model by extending its self-attention
layers with cross-view interactions, and fine-tune it on a high quality subset
of Objaverse. We find that a naive extension of the self-attention proposed in
prior work (e.g. MVDream) leads to content copying between views. Therefore, we
explicitly constrain the cross-view attention based on epipolar geometry. To
further enhance 3D consistency, we utilize Plücker coordinates derived from
camera rays and inject them as positional encoding. This enables SPAD to reason
over spatial proximity in 3D well. In contrast to recent works that can only
generate views at fixed azimuth and elevation, SPAD offers full camera control
and achieves state-of-the-art results in novel view synthesis on unseen objects
from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate
that text-to-3D generation using SPAD prevents the multi-face Janus issue. See
more details at our webpage: https://yashkant.github.io/spad |
This paper introduces SPAD, a novel framework that leverages pre-trained text-to-image diffusion models to generate consistent multi-view images from text prompts or single images. |
Generating high-quality 3D content is crucial for various applications. SPAD addresses limitations in existing methods by incorporating 3D understanding into 2D diffusion models, enabling consistent multi-view generation with precise camera control. |
The authors extend a pre-trained 2D diffusion model with cross-view interactions using Epipolar Attention and Plücker Ray Embeddings. Epipolar Attention restricts attention to epipolar lines, enhancing 3D consistency. Plücker Embeddings provide positional encoding based on camera rays, preventing object flipping artifacts. |
SPAD achieves state-of-the-art results in novel view synthesis on unseen objects from Objaverse and Google Scanned Objects datasets.
The method demonstrates better camera control and generates consistent multi-view images from diverse viewpoints.
SPAD effectively prevents the multi-face Janus issue in text-to-3D generation using multi-view Score Distillation Sampling. |
Limitations: The method currently relies on a two-view training setup and could benefit from exploring monocular depth estimation for improved correspondences.
Future Work: Extending SPAD to generate dynamic 4D assets and multi-object scenes, as well as leveraging larger diffusion models like SDXL for enhanced performance. |
multi-view generation, text-to-3d synthesis, diffusion models, epipolar geometry, plücker coordinates |
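The Plücker ray encoding mentioned in the SPAD record can be illustrated with a minimal sketch. It assumes the common six-component form (unit direction, moment = origin x direction); the component ordering and the helper names `cross` and `plucker_ray` are illustrative choices, not taken from the paper.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d.

    Assumed convention; some works store the moment first.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)
    m = cross(origin, d)
    return d + m


# The encoding depends on the line, not on which point along the ray is used.
ray_a = plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 2.0))
ray_b = plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 1.0))
```

Because the moment is unchanged when the origin slides along the ray, two cameras observing the same 3D line produce the same six numbers, which is one reason this encoding is convenient as a positional signal.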
2402.05195
Report |
$λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space |
Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang |
Despite the recent advances in personalized text-to-image (P-T2I) generative
models, it remains challenging to perform finetuning-free multi-subject-driven
T2I in a resource-efficient manner. Predominantly, contemporary approaches,
involving the training of Hypernetworks and Multimodal Large Language Models
(MLLMs), require heavy computing resources that range from 600 to 12300 GPU
hours of training. These subject-driven T2I methods hinge on Latent Diffusion
Models (LDMs), which facilitate T2I mapping through cross-attention layers.
While LDMs offer distinct advantages, P-T2I methods' reliance on the latent
space of these diffusion models significantly escalates resource demands,
leading to inconsistent results and necessitating numerous iterations for a
single desired image. In this paper, we present $\lambda$-ECLIPSE, an
alternative prior-training strategy that works in the latent space of a
pre-trained CLIP model without relying on the diffusion UNet models.
$\lambda$-ECLIPSE leverages the image-text interleaved pre-training for fast
and effective multi-subject-driven P-T2I. Through extensive experiments, we
establish that $\lambda$-ECLIPSE surpasses existing baselines in composition
alignment while preserving concept alignment performance, even with
significantly lower resource utilization. $\lambda$-ECLIPSE performs
multi-subject driven P-T2I with just 34M parameters and is trained on a mere 74
GPU hours. Additionally, $\lambda$-ECLIPSE demonstrates the unique ability to
perform multi-concept interpolations. |
$\lambda$-ECLIPSE is a resource-efficient, diffusion-independent prior learning strategy for enabling fast multi-subject customization in personalized text-to-image generation. |
Existing personalized text-to-image generation methods, especially those involving multi-subject customization, are computationally expensive, requiring significant GPU hours and large models. $\lambda$-ECLIPSE addresses this resource efficiency issue. |
$\lambda$-ECLIPSE leverages a contrastive text-to-image strategy within the latent space of a pre-trained CLIP model, eliminating the dependence on diffusion models during training. It employs an image-text interleaved pre-training approach, substituting subject-specific text embeddings with corresponding image embeddings, and incorporates Canny edge maps for enhanced control over image generation. |
$\lambda$-ECLIPSE achieves competitive performance in composition alignment while maintaining concept alignment, even with significantly lower resource utilization (34M parameters and 74 GPU hours).
It outperforms baseline methods on the Multibench dataset for multi-subject generation, particularly in text-composition alignment.
The method effectively incorporates Canny edge maps as conditional guidance, balancing subject details with edge map adherence, unlike other methods that overemphasize edge maps. |
CLIP's limitations in capturing hierarchical representations can lead to less-than-ideal results, particularly for complex subjects.
While significantly more efficient, there is still a performance gap between $\lambda$-ECLIPSE and fine-tuning-based methods, suggesting room for improvement potentially through larger datasets and models. |
personalized text-to-image generation, multi-subject customization, resource-efficient, diffusion-free, clip latent space |
2402.05054
Report |
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation |
Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, Ziwei Liu |
3D content creation has achieved significant progress in terms of both
quality and speed. Although current feed-forward models can produce 3D objects
in seconds, their resolution is constrained by the intensive computation
required during training. In this paper, we introduce Large Multi-View Gaussian
Model (LGM), a novel framework designed to generate high-resolution 3D models
from text prompts or single-view images. Our key insights are two-fold: 1) 3D
Representation: We propose multi-view Gaussian features as an efficient yet
powerful representation, which can then be fused together for differentiable
rendering. 2) 3D Backbone: We present an asymmetric U-Net as a high-throughput
backbone operating on multi-view images, which can be produced from text or
single-view image input by leveraging multi-view diffusion models. Extensive
experiments demonstrate the high fidelity and efficiency of our approach.
Notably, we maintain the fast speed to generate 3D objects within 5 seconds
while boosting the training resolution to 512, thereby achieving
high-resolution 3D content generation. |
This paper introduces LGM, a novel framework that generates high-resolution 3D models from text prompts or single-view images using multi-view Gaussian features and an asymmetric U-Net backbone. |
Current 3D generation methods are either slow (optimization-based) or limited in resolution (feed-forward). LGM aims to achieve both high fidelity and speed in 3D content creation. |
LGM utilizes an asymmetric U-Net to predict and fuse 3D Gaussian features from multi-view images. It leverages existing multi-view diffusion models for image/text-to-multi-view image generation and employs data augmentation for robust training. A mesh extraction algorithm converts 3D Gaussians to polygonal meshes. |
LGM generates high-quality 3D Gaussians and meshes, outperforming previous methods in both image-to-3D and text-to-3D tasks.
The method maintains fast generation speed (around 5 seconds) while significantly increasing training resolution (up to 512).
LGM demonstrates good diversity in generating various plausible 3D objects from a single input. |
The quality of LGM's output depends on the accuracy and resolution of the multi-view images generated by diffusion models.
Current multi-view diffusion models struggle with high elevation angles and may produce inconsistent 3D information, affecting the final 3D model quality. |
3d generation, gaussian splatting, high resolution, multi-view diffusion models, u-net |
2402.05008
Report |
EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss |
Zhuoyang Zhang, Han Cai, Song Han |
We present EfficientViT-SAM, a new family of accelerated segment anything
models. We retain SAM's lightweight prompt encoder and mask decoder while
replacing the heavy image encoder with EfficientViT. For the training, we begin
with the knowledge distillation from the SAM-ViT-H image encoder to
EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B
dataset. Benefiting from EfficientViT's efficiency and capacity,
EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over
SAM-ViT-H without sacrificing performance. Our code and pre-trained models are
released at https://github.com/mit-han-lab/efficientvit. |
Presents EfficientViT-SAM, an accelerated version of Segment Anything Model (SAM) using EfficientViT for improved efficiency in image segmentation. |
Addresses the high computational cost of SAM, making it more practical for time-sensitive applications while maintaining performance. |
Replaces SAM's image encoder with EfficientViT. The model is trained in two phases: knowledge distillation from SAM-ViT-H to EfficientViT and end-to-end training on the SA-1B dataset. |
EfficientViT-SAM achieves a 17x to 69x speedup compared to SAM.
It outperforms other accelerated SAM models in terms of both efficiency and accuracy on zero-shot segmentation benchmarks.
EfficientViT-SAM demonstrates strong performance in point-prompted, box-prompted, and segment-everything segmentation modes. |
The model's performance might be further enhanced by exploring advanced knowledge distillation techniques.
Future work can investigate the application of EfficientViT-SAM in real-world scenarios such as video segmentation. |
image segmentation, segment anything model, efficientvit, zero-shot learning, model acceleration |
2402.04930
Report |
Blue noise for diffusion models |
Xingchang Huang, Corentin Salaün, Cristina Vasconcelos, Christian Theobalt, Cengiz Öztireli, Gurprit Singh |
Most of the existing diffusion models use Gaussian noise for training and
sampling across all time steps, which may not optimally account for the
frequency contents reconstructed by the denoising network. Despite the diverse
applications of correlated noise in computer graphics, its potential for
improving the training process has been underexplored. In this paper, we
introduce a novel and general class of diffusion models taking correlated noise
within and across images into account. More specifically, we propose a
time-varying noise model to incorporate correlated noise into the training
process, as well as a method for fast generation of correlated noise mask. Our
model is built upon deterministic diffusion models and utilizes blue noise to
help improve the generation quality compared to using Gaussian white (random)
noise only. Further, our framework allows introducing correlation across images
within a single mini-batch to improve gradient flow. We perform both
qualitative and quantitative evaluations on a variety of datasets using our
method, achieving improvements on different tasks over existing deterministic
diffusion models in terms of FID metric. |
This paper introduces a novel diffusion model framework that leverages correlated noise, particularly blue noise, to enhance the quality of generated images. |
Most diffusion models rely solely on Gaussian noise, which may not be optimal for capturing the frequency content during the denoising process. Correlated noise, with its frequency-specific properties, offers a potential solution for improving generation quality. |
The authors propose a time-varying noise model that interpolates between Gaussian noise and blue noise throughout the diffusion process. They also introduce a method for fast generation of correlated noise masks using padding, ensuring efficient training. The model is evaluated on various image generation tasks using deterministic diffusion models like IADB. |
The proposed method consistently outperforms existing deterministic models, such as IADB and DDIM, on several datasets in terms of FID scores, particularly for resolutions of 64x64.
Visual comparisons highlight the superior quality of generated images, particularly in detail-rich areas like hair, eyes, and mouths.
Analysis reveals that incorporating blue noise from the middle or later stages of the diffusion process, when low-frequency components are established, yields the best results. |
The optimal parameters for the noise scheduler currently depend on the image resolution, requiring further investigation for a more general approach.
Extending the model to higher resolutions presents computational challenges for generating correlated noise masks, demanding more efficient solutions. |
blue noise, diffusion models, generative modeling, image generation, time-varying noise |
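The time-varying noise idea in the record above can be sketched in one dimension. This is only an illustration under stated assumptions: the first difference of white noise is used as a crude high-frequency stand-in for a true blue-noise mask, the linear schedule `blend_weight` is hypothetical, and which end of the schedule corresponds to which stage of the diffusion process is not taken from the paper.

```python
import math
import random


def white_noise(n, rng):
    """n independent standard Gaussian samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def high_pass_noise(n, rng):
    """Stand-in for blue noise: differencing white noise suppresses low
    frequencies. Dividing by sqrt(2) keeps unit variance per sample."""
    w = white_noise(n + 1, rng)
    return [(w[i + 1] - w[i]) / math.sqrt(2.0) for i in range(n)]


def blend_weight(t, num_steps):
    """Hypothetical linear schedule: 1.0 at t = 0, 0.0 at t = num_steps - 1."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return 1.0 - t / (num_steps - 1)


def time_varying_noise(t, num_steps, n, rng):
    """Blend white and high-pass noise with a step-dependent weight."""
    g = blend_weight(t, num_steps)
    white = white_noise(n, rng)
    blue = high_pass_noise(n, rng)
    return [g * w + (1.0 - g) * b for w, b in zip(white, blue)]


rng = random.Random(0)
sample = time_varying_noise(t=5, num_steps=10, n=8, rng=rng)
```

Blending two independent unit-variance samples this way does not preserve unit variance at intermediate weights; a faithful implementation would need to handle that, and would use a proper correlated-noise generator rather than this stand-in.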
2402.04648
Report |
OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding |
Guibiao Liao, Kaichen Zhou, Zhenyu Bao, Kanglin Liu, Qing Li |
The development of Neural Radiance Fields (NeRFs) has provided a potent
representation for encapsulating the geometric and appearance characteristics
of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D
semantic perception tasks has been a recent focus. However, current methods
that extract semantics directly from Contrastive Language-Image Pretraining
(CLIP) for semantic field learning encounter difficulties due to noisy and
view-inconsistent semantics provided by CLIP. To tackle these limitations, we
propose OV-NeRF, which exploits the potential of pre-trained vision and
language foundation models to enhance semantic field learning through proposed
single-view and cross-view strategies. First, from the single-view perspective,
we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask
proposals derived from SAM to rectify the noisy semantics of each training
view, facilitating accurate semantic field learning. Second, from the
cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy
to address the challenge raised by view-inconsistent semantics. Rather than
invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the
3D consistent semantics generated from the well-trained semantic field itself
for semantic field training, aiming to reduce ambiguity and enhance overall
semantic consistency across different views. Extensive experiments validate that our
OV-NeRF outperforms current state-of-the-art methods, achieving a significant
improvement of 20.31% and 18.42% in mIoU metric on Replica and Scannet,
respectively. Furthermore, our approach exhibits consistent superior results
across various CLIP configurations, further verifying its robustness. |
This paper introduces OV-NeRF, a novel approach for accurate open-vocabulary 3D semantic understanding of Neural Radiance Fields (NeRFs) leveraging the capabilities of pre-trained vision and language foundation models, such as CLIP and Segment Anything (SAM). |
Existing methods for extracting semantics from CLIP for NeRF semantic field learning face challenges due to the noisy and view-inconsistent nature of CLIP-derived semantics, hindering accurate 3D semantic understanding. |
OV-NeRF tackles these limitations through two key strategies: 1) **Region Semantic Ranking (RSR) regularization**: employs region proposals from SAM to rectify noisy semantics in each training view, improving the accuracy of single-view relevancy maps. 2) **Cross-view Self-enhancement (CSE)**: addresses view inconsistency by leveraging the 3D consistency of NeRFs, utilizing rendered outputs from the trained semantic field to refine and enhance the consistency of semantic maps across multiple views. |
OV-NeRF significantly outperforms state-of-the-art methods, achieving a remarkable improvement of 20.31% and 18.42% in mIoU on Replica and Scannet datasets, respectively.
The method exhibits consistent superior performance across various CLIP configurations, indicating its robustness and generalizability.
Ablation studies confirm the effectiveness of both RSR and CSE strategies in enhancing the accuracy and view consistency of semantic understanding in NeRFs. |
The reliance on pre-computed CLIP features and SAM proposals could introduce limitations in scenarios with significant domain shifts.
Future work could explore extending OV-NeRF to handle dynamic scenes and incorporate temporal consistency. |
neural radiance fields, 3d semantic segmentation, open-vocabulary learning, vision and language models, clip |
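The mIoU metric quoted in the record above can be computed as follows. This is a generic sketch over flat label lists; the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions, and benchmark-specific protocols may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU 1/2; class 1: IoU 2/3; class 2 never appears and is skipped.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=3)
```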
2402.04630
Report |
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors |
Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu |
Inspired by the outstanding zero-shot capability of vision language models
(VLMs) in image classification tasks, open-vocabulary object detection has
attracted increasing interest by distilling the broad VLM knowledge into
detector training. However, most existing open-vocabulary detectors learn by
aligning region embeddings with categorical labels (e.g., bicycle) only,
disregarding the capability of VLMs on aligning visual embeddings with
fine-grained text description of object parts (e.g., pedals and bells). This
paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that
introduces conditional context prompts and hierarchical textual descriptors
that enable precise region-text alignment as well as open-vocabulary detection
training in general. Specifically, the conditional context prompt transforms
regional embeddings into image-like representations that can be directly
integrated into general open vocabulary detection training. In addition, we
introduce large language models as an interactive and implicit knowledge
repository which enables iterative mining and refining visually oriented
textual descriptors for precise region-text alignment. Extensive experiments
over multiple large-scale benchmarks show that DVDet outperforms the
state-of-the-art consistently by large margins. |
This paper introduces DVDet, a novel open-vocabulary object detection method that leverages fine-grained textual descriptors to improve region-text alignment. |
Existing open-vocabulary detectors underutilize the knowledge in VLMs by focusing solely on category-level alignment and neglecting the fine-grained descriptor-level alignment where VLMs excel. |
DVDet utilizes a Conditional Context regional Prompt (CCP) to transform region embeddings into image-like representations for improved integration with existing detectors. It also employs a hierarchical descriptor generation mechanism that iteratively interacts with LLMs to refine fine-grained descriptors for precise region-text alignment. |
DVDet consistently outperforms state-of-the-art open-vocabulary detectors on COCO and LVIS benchmarks.
The iterative interaction with LLMs for descriptor generation proves superior to using LLMs as a static knowledge base.
DVDet demonstrates strong generalization ability, showing improvements when transferred to PASCAL VOC and LVIS datasets even without re-training. |
The descriptor generation process relies on the performance of LLMs, which can be a bottleneck.
The method primarily focuses on improving classification accuracy, and future work could explore incorporating fine-grained descriptors into the localization branch. |
open-vocabulary object detection, vision language models, large language models, prompt learning, fine-grained descriptors |
2402.04625
Report |
Noise Map Guidance: Inversion with Spatial Context for Real Image Editing |
Hansam Cho, Jonghyun Lee, Seoung Bum Kim, Tae-Hyun Oh, Yonghyun Jeong |
Text-guided diffusion models have become a popular tool in image synthesis,
known for producing high-quality and diverse images. However, their application
to editing real images often encounters hurdles primarily due to the text
condition deteriorating the reconstruction quality and subsequently affecting
editing fidelity. Null-text Inversion (NTI) has made strides in this area, but
it fails to capture spatial context and requires computationally intensive
per-timestep optimization. Addressing these challenges, we present Noise Map
Guidance (NMG), an inversion method rich in a spatial context, tailored for
real-image editing. Significantly, NMG achieves this without necessitating
optimization, yet preserves the editing quality. Our empirical investigations
highlight NMG's adaptability across various editing techniques and its
robustness to variants of DDIM inversions. |
This paper introduces Noise Map Guidance (NMG), an inversion method for real-image editing with text-guided diffusion models that preserves spatial context without requiring optimization. |
Existing text-guided diffusion models struggle to edit real images due to the deterioration of reconstruction quality stemming from text conditions. While Null-text Inversion (NTI) addresses this, it requires computationally intensive per-timestep optimization and can fail to capture spatial context. |
NMG leverages latent variables from DDIM inversion, referred to as 'noise maps,' which inherently capture spatial context. By conditioning the reverse process on noise maps and reformulating it using energy guidance, NMG guides the reconstruction path to align with the DDIM inversion trajectory. |
NMG effectively preserves spatial context during real-image editing, surpassing DDIM, NTI, NPI, and ProxNPI in qualitative and quantitative evaluations.
NMG shows consistent robustness across variations of DDIM inversion, as demonstrated by its integration with pix2pix-zero for image-to-image translation.
Evaluations using CLIPScore, TIFA, and a user study confirm that NMG achieves high editing quality, aligning with human perception of image fidelity. |
NMG faces challenges integrating with methods that deviate from the inversion-based editing paradigm, such as SGC-Net for relationship change tasks.
NMG's reliance on text for image editing limits its ability to perform precise spatial changes, such as removing specific individuals or adding objects at exact locations. |
image editing, diffusion models, spatial context, inversion, noise map guidance |
2402.04618
Report |
Multi-Scale Semantic Segmentation with Modified MBConv Blocks |
Xi Chen, Yang Cai, Yuan Wu, Bo Xiong, Taesung Park |
MBConv blocks, initially designed for efficiency in resource-limited
settings and later adapted for state-of-the-art image classification, have
demonstrated significant potential in classification tasks. Despite their
success, their application in semantic
segmentation has remained relatively unexplored. This paper introduces a novel
adaptation of MBConv blocks specifically tailored for semantic segmentation.
Our modification stems from the insight that semantic segmentation requires the
extraction of more detailed spatial information than image classification. We
argue that to effectively perform multi-scale semantic segmentation, each
branch of a U-Net architecture, regardless of its resolution, should possess
equivalent segmentation capabilities. By implementing these changes, our
approach achieves impressive mean Intersection over Union (IoU) scores of 84.5%
and 84.0% on the Cityscapes test and validation datasets, respectively,
demonstrating the efficacy of our proposed modifications in enhancing semantic
segmentation performance. |
This paper proposes a novel adaptation of MBConv blocks, incorporating modifications in multi-scale segmentation and block structure, to enhance their efficacy in semantic segmentation tasks. |
Existing MBConv blocks, despite their success in image classification, remain largely unexplored for semantic segmentation, which requires detailed spatial information extraction, unlike classification. |
The study modifies the U-Net architecture by maintaining uniform feature maps and architectural blocks across all scales. It also replaces 1x1 convolutions within MBConv blocks with 3x3 convolutions to capture more spatial context. |
Achieved mean Intersection over Union (IoU) scores of 84.5% and 84.0% on Cityscapes test and validation datasets, respectively, outperforming existing methods.
Demonstrated that maintaining consistent learning power across scales improves segmentation accuracy.
Showed that replacing 1x1 with 3x3 convolutions in MBConv blocks enhances spatial detail capture, despite increasing memory and processing demands. |
The modification to MBConv blocks increases memory usage by 10% and processing time by 30%.
The switch to 3x3 convolutions increases the number of parameters significantly. |
semantic segmentation, mbconv blocks, u-net, multi-scale segmentation, spatial context |
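The parameter growth noted in the limitations above follows from simple arithmetic, sketched here. The block layout (1x1 expand, 3x3 depthwise, 1x1 project), the channel width 64, the expansion ratio 4, and the assumption that both pointwise convolutions are swapped for 3x3 are illustrative choices, not figures from the paper; biases, normalization, and squeeze-excitation layers are ignored.

```python
def conv_params(c_in, c_out, kernel):
    """Weights in a dense 2D convolution without bias."""
    return kernel * kernel * c_in * c_out


def depthwise_params(channels, kernel):
    """Weights in a depthwise convolution: one kernel per channel."""
    return kernel * kernel * channels


def mbconv_params(channels, expansion, pointwise_kernel):
    """Expand -> depthwise 3x3 -> project, with a configurable kernel size
    for the two convolutions that are 1x1 in a standard MBConv block."""
    hidden = channels * expansion
    expand = conv_params(channels, hidden, pointwise_kernel)
    depthwise = depthwise_params(hidden, 3)
    project = conv_params(hidden, channels, pointwise_kernel)
    return expand + depthwise + project


standard = mbconv_params(channels=64, expansion=4, pointwise_kernel=1)
modified = mbconv_params(channels=64, expansion=4, pointwise_kernel=3)
ratio = modified / standard
```

Under these assumptions the two dense convolutions dominate the count, so moving them from 1x1 to 3x3 multiplies most of the block's weights by nine.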
2402.04563
Report |
Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention |
Saebom Leem, Hyunseok Seo |
Vision Transformer (ViT) is one of the most widely used models in the computer
vision field with its great performance on various tasks. In order to fully
utilize the ViT-based architecture in various applications, proper
visualization methods with a decent localization performance are necessary, but
these methods employed in CNN-based models are still not available in ViT due
to its unique structure. In this work, we propose an attention-guided
visualization method applied to ViT that provides a high-level semantic
explanation for its decision. Our method selectively aggregates the gradients
directly propagated from the classification output to each self-attention,
collecting the contribution of image features extracted from each location of
the input image. These gradients are additionally guided by the normalized
self-attention scores, which are the pairwise patch correlation scores. They
are used to supplement the gradients on the patch-level context information
efficiently detected by the self-attention mechanism. This approach of our
method provides elaborate high-level semantic explanations with great
localization performance only with the class labels. As a result, our method
outperforms the previous leading explainability methods of ViT in the
weakly-supervised localization task and presents great capability in capturing
the full instances of the target class object. Meanwhile, our method provides a
visualization that faithfully explains the model, which is demonstrated in the
perturbation comparison test. |
This paper presents an attention-guided gradient analysis method for Vision Transformer (ViT) to enhance weakly-supervised localization performance. |
Proper visualization methods with good localization ability are crucial for utilizing ViT models in various applications, and existing methods often fall short due to ViT's unique architecture. |
The method aggregates gradients from the classification output to each self-attention block, guided by self-attention scores normalized with sigmoid. This approach combines high-level semantic information from gradients with patch correlation information from self-attention. |
The method outperforms previous ViT visualization techniques in weakly-supervised localization on ImageNet, PASCAL VOC, and CUB200 datasets.
It effectively mitigates peak intensities that hinder accurate localization in other methods.
The method excels at capturing full object areas, including multiple instances of the target class. |
The method exhibits a slight trade-off between precision and recall compared to some existing methods.
Future work can explore incorporating information from non-target classes for further localization improvement. |
vision transformer, explainable ai, weakly-supervised localization, class activation map, self-attention |
2402.04504
Report |
Text2Street: Controllable Text-to-image Generation for Street Views |
Jinming Su, Songen Gu, Yiting Duan, Xingyue Chen, Junfeng Luo |
Text-to-image generation has made remarkable progress with the emergence of
diffusion models. However, it is still a difficult task to generate images for
street views based on text, mainly because the road topology of street scenes
is complex, the traffic status is diverse, and the weather conditions vary,
which makes the task difficult for conventional text-to-image models. To
address these challenges, we propose a novel controllable text-to-image
framework, named Text2Street. In the framework, we first introduce the
lane-aware road topology generator, which achieves text-to-map generation with
the accurate road structure and lane lines armed with the counting adapter,
realizing the controllable road topology generation. Then, the position-based
object layout generator is proposed to obtain text-to-layout generation through
an object-level bounding box diffusion strategy, realizing the controllable
traffic object layout generation. Finally, the multiple control image generator
is designed to integrate the road topology, object layout and weather
description to realize controllable street-view image generation. Extensive
experiments show that the proposed approach achieves controllable street-view
text-to-image generation and validates the effectiveness of the Text2Street
framework for street views. |
Proposes Text2Street, a controllable text-to-image generation framework for street views that controls road topology, traffic status, and weather conditions using text descriptions. |
Street-view image generation is valuable for autonomous driving perception and map construction, but existing methods struggle with complex road topology, diverse traffic status, and various weather conditions. |
Utilizes three main components: (1) Lane-aware road topology generator (LRTG) creates a local semantic map with lane lines conforming to traffic regulations. (2) Position-based object layout generator (POLG) generates traffic object layout based on text descriptions of object quantity, adhering to traffic rules. (3) Multiple control image generator (MCIG) integrates road topology, object layout, and weather descriptions to produce the final street-view image. |
Outperforms state-of-the-art methods in both image fidelity and attribute-level accuracy on nuScenes dataset.
Demonstrates superior controllability in generating images with varying road structures, lane lines, traffic objects, and weather conditions.
Generated images improve the performance of downstream tasks like object detection. |
Relies on fixed camera parameters for image projection, limiting viewpoint diversity.
Further exploration of using generated images for other autonomous driving tasks is needed. |
text-to-image generation, street view synthesis, controllable image generation, autonomous driving, diffusion models |
2402.04492
Report |
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation |
Jirayu Burapacheep, Ishan Gaur, Agam Bhatia, Tristan Thrush |
This paper introduces the ColorSwap dataset, designed to assess and improve
the proficiency of multimodal models in matching objects with their colors. The
dataset is comprised of 2,000 unique image-caption pairs, grouped into 1,000
examples. Each example includes a caption-image pair, along with a
"color-swapped" pair. We follow the Winoground schema: the two captions in an
example have the same words, but the color words have been rearranged to modify
different objects. The dataset was created through a novel blend of automated
caption and image generation with humans in the loop. We evaluate image-text
matching (ITM) and visual language models (VLMs) and find that even the latest
ones are still not robust at this task. GPT-4V and LLaVA score 72% and 42% on
our main VLM metric, although they may improve with more advanced prompting
techniques. On the main ITM metric, contrastive models such as CLIP and SigLIP
perform close to chance (at 12% and 30%, respectively), although the
non-contrastive BLIP ITM model is stronger (87%). We also find that finetuning
on fewer than 2,000 examples yields significant performance gains on this
out-of-distribution word-order understanding task. The dataset is here:
https://github.com/Top34051/colorswap. |
The paper introduces ColorSwap, a dataset of 2,000 image-caption pairs designed to assess the ability of multimodal models to match objects with their colors, focusing on understanding word order in captions. |
Despite advances in multimodal models, they still struggle with compositional understanding, particularly word order, which matters for applications such as AI-generated art. |
The dataset was created using a combination of automated caption and image generation (using GPT-4, Claude-2, Stable Diffusion, Midjourney, and DALL-E 3) and human review for quality control and caption refinement. |
Even the latest models like GPT-4V make significant errors on the ColorSwap dataset, highlighting their limitations in color composition understanding.
Contrastive models (CLIP, SigLIP) struggle significantly compared to non-contrastive models (BLIP) on this task.
Fine-tuning on the ColorSwap dataset significantly improves the performance of CLIP and BLIP, demonstrating their capacity to learn word order understanding from a small, focused dataset. |
The dataset focuses on color-object associations, which is a simplification of the broader word order understanding problem.
The study primarily focuses on evaluating existing models, and future work could explore novel architectures or training methods specifically designed to address the limitations highlighted by ColorSwap. |
multimodal models, word order understanding, compositional reasoning, image-text matching, dataset |
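The ColorSwap abstract above says the dataset follows the Winoground schema. A minimal sketch of Winoground-style scoring for one example (two captions, two images) is shown here. The function names and the score-matrix layout are hypothetical, and whether the paper's "main" ITM and VLM metrics correspond exactly to these text, image, and group scores is an assumption.

```python
def winoground_scores(s):
    """Score one example. s[i][j] is the model's match score for
    caption i paired with image j, with i, j in {0, 1} (assumed layout)."""
    # Text score: for each image, the matching caption must score higher.
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image score: for each caption, the matching image must score higher.
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Group score: both conditions must hold for the same example.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


def aggregate(examples):
    """Fraction of examples passing each criterion."""
    totals = {"text": 0, "image": 0, "group": 0}
    for s in examples:
        for name, ok in winoground_scores(s).items():
            totals[name] += ok
    return {name: count / len(examples) for name, count in totals.items()}


# Illustrative scores only (not from the paper).
examples = [
    [[0.9, 0.1], [0.2, 0.8]],  # both captions and both images matched correctly
    [[0.6, 0.7], [0.5, 0.4]],  # the model confuses the swapped pair
]
print(aggregate(examples))
```

Because the two captions in an example contain the same words, a model that ignores word order would tend to assign near-identical scores to both pairings and fail these strict inequalities.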
2402.04324
Report |
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation |
Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, Wenhu Chen |
Image-to-video (I2V) generation aims to use the initial frame (alongside a
text prompt) to create a video sequence. A grand challenge in I2V generation is
to maintain visual consistency throughout the video: existing methods often
struggle to preserve the integrity of the subject, background, and style from
the first frame, as well as ensure a fluid and logical progression within the
video narrative. To mitigate these issues, we propose ConsistI2V, a
diffusion-based method to enhance visual consistency for I2V generation.
Specifically, we introduce (1) spatiotemporal attention over the first frame to
maintain spatial and motion consistency, (2) noise initialization from the
low-frequency band of the first frame to enhance layout consistency. These two
approaches enable ConsistI2V to generate highly consistent videos. We also
extend the proposed approaches to show their potential to improve consistency
in auto-regressive long video generation and camera motion control. To verify
the effectiveness of our method, we propose I2V-Bench, a comprehensive
evaluation benchmark for I2V generation. Our automatic and human evaluation
results demonstrate the superiority of ConsistI2V over existing methods. |
This paper introduces a novel approach for image-to-video (I2V) generation that enhances video quality and consistency by leveraging spatiotemporal first frame conditioning mechanisms and FrameInit. |
Existing I2V generation methods struggle to maintain appearance and motion consistency in generated video sequences. This paper addresses those challenges. |
The authors propose spatiotemporal first frame conditioning to leverage both spatial and temporal information from the first frame. They further stabilize the generated video and reduce abrupt changes by integrating FrameInit during inference. |
The proposed method significantly outperforms existing open-source I2V generation models on the UCF-101 and MSR-VTT benchmarks on quantitative metrics including FVD, IS, FID and CLIPSIM.
The method achieves state-of-the-art performance on the I2V-Bench, demonstrating its capability in generating high-quality and consistent videos.
Human evaluation confirms that the proposed method generates videos with superior appearance and motion consistency compared to other baselines. |
The model is primarily trained on WebVid-10M, which may limit its generalization ability to videos with unseen domains or styles.
Future work can explore incorporating large language models to enable more complex and controllable video generation. |
image-to-video generation, video consistency, diffusion models, frameinit, spatiotemporal conditioning |
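The ConsistI2V entry above describes initializing noise from the low-frequency band of the first frame. The idea of mixing frequency bands can be sketched on a 1D signal: keep the low frequencies of a reference and take the remaining frequencies from random noise. This is a toy illustration with hypothetical function names and a naive DFT. The actual method presumably operates on noised video latents with a spatiotemporal filter, which is an assumption not confirmed by the text here.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def is_low(k, n, cutoff):
    # Index k and its mirror n - k carry the same absolute frequency,
    # so the mask is symmetric and the mixed signal stays real.
    return min(k, n - k) <= cutoff


def frame_init_1d(reference, noise, cutoff):
    """Low frequencies from `reference`, all other frequencies from `noise`."""
    n = len(reference)
    ref_f, noise_f = dft(reference), dft(noise)
    mixed = [ref_f[k] if is_low(k, n, cutoff) else noise_f[k] for k in range(n)]
    return idft(mixed)


# Illustrative values only.
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1, -0.4, 1.1]
mixed = frame_init_1d(reference, noise, cutoff=1)
```

With `cutoff=0` only the mean of the reference survives, and with a cutoff covering every frequency the reference is returned unchanged, so the cutoff controls how much coarse layout is carried over.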
2402.04252
Report |
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters |
Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang |
Scaling up contrastive language-image pretraining (CLIP) is critical for
empowering both vision and multimodal models. We present EVA-CLIP-18B, the
largest and most powerful open-source CLIP model to date, with 18-billion
parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an
exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized
image classification benchmarks, outperforming its forerunner EVA-CLIP
(5-billion parameters) and other open-source CLIP models by a large margin.
Remarkably, we observe a consistent performance improvement with the model size
scaling of EVA-CLIP, despite maintaining a constant training dataset of
2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly
available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B)
employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the
potential of EVA-style weak-to-strong visual model scaling. With our model
weights made publicly available, we hope to facilitate future research in
vision and multimodal foundation models. |
This paper introduces EVA-CLIP-18B, the largest open-source CLIP model to date, with 18 billion parameters, achieving state-of-the-art zero-shot performance on various image and video classification benchmarks. |
Scaling up CLIP models is crucial for enhancing visual and multimodal understanding, bridging the gap between vision models and large language models. |
The authors leverage a weak-to-strong vision scaling approach, pre-training a large EVA model as the vision encoder initialization for EVA-CLIP and scaling up the model size progressively. |
EVA-CLIP-18B achieves 80.7% average zero-shot top-1 accuracy on 27 image classification benchmarks, outperforming previous open-source CLIP models.
The model demonstrates significant improvements in zero-shot video classification, surpassing other models by a large margin.
Scaling up EVA-CLIP consistently enhances performance with no sign of saturation, suggesting potential for further vision model scaling. |
The training dataset, while large, is smaller than some used in other state-of-the-art CLIP models.
Future work can explore larger and more diverse datasets to further improve performance and generalization. |
clip, multimodal learning, vision scaling, zero-shot learning, image classification |
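The EVA-CLIP-18B entry above reports zero-shot top-1 accuracy averaged across benchmarks. A minimal sketch of how CLIP-style zero-shot classification and that average are commonly computed follows. The embeddings are toy vectors, the function names are hypothetical, and the unweighted mean over benchmarks is an assumption about the reported average.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    sims = [cosine(image_emb, t) for t in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose best-matching prompt is the true class."""
    hits = sum(zero_shot_predict(e, class_text_embs) == y
               for e, y in zip(image_embs, labels))
    return hits / len(labels)


def mean_accuracy(per_benchmark):
    """Unweighted mean over benchmarks (assumed aggregation)."""
    return sum(per_benchmark) / len(per_benchmark)


# Toy 2D embeddings standing in for text and image encoder outputs.
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 1, 1]
acc = top1_accuracy(image_embs, labels, class_text_embs)
```

No classifier is trained on the target benchmark in this setup. The class names are turned into text embeddings, and prediction reduces to a nearest-neighbour lookup in the shared embedding space.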
2402.04236
Report |
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations |
Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang |
Vision-Language Models (VLMs) have demonstrated their broad effectiveness
thanks to extensive training in aligning visual instructions to responses.
However, such training of conclusive alignment leads models to ignore essential
visual reasoning, further resulting in failures in meticulous visual problems
and unfaithful responses. Drawing inspiration from human cognition in solving
visual problems (e.g., marking, zoom in), this paper introduces Chain of
Manipulations, a mechanism that enables VLMs to solve problems step-by-step
with evidence. After training, models can solve various visual problems by
eliciting intrinsic manipulations (e.g., grounding, zoom in) with results
(e.g., boxes, image) actively without involving external tools, while also
allowing users to trace error causes. We study the roadmap to implement this
mechanism, including (1) a flexible design of manipulations upon extensive
analysis, (2) an efficient automated data generation pipeline, (3) a compatible
VLM architecture capable of multi-turn multi-image, and (4) a model training
process for versatile capabilities. With the design, we also manually annotate
6K high-quality samples for the challenging graphical mathematical problems.
Our trained model, CogCoM, equipped with this mechanism with 17B
parameters achieves state-of-the-art performance across 9 benchmarks from 4
categories, demonstrating the effectiveness while preserving the
interpretability. Our code, model weights, and collected data are publicly
available at https://github.com/THUDM/CogCoM. |
This paper introduces Chain of Manipulations (CoM), a mechanism that enables Vision-Language Models (VLMs) to solve problems step-by-step with evidence by actively manipulating visual inputs. |
Existing VLMs often ignore essential visual reasoning steps, leading to failures in meticulous visual problems and unfaithful responses. CoM addresses this by mimicking human-like problem-solving with visual evidence. |
The paper proposes (1) a flexible CoM data structure, (2) an automated data generation pipeline using LLMs and VFMs, (3) a memory-based multi-turn multi-image VLM architecture, and (4) a training process incorporating CoM data. |
CogCoM achieves state-of-the-art performance on 9 benchmarks across 4 categories (detailed VQA, visual grounding, general multimodal capabilities, hallucination).
Significant accuracy improvements are observed on detailed VQA and grounding benchmarks (up to 9.0 and 1.09 points, respectively).
CogCoM produces informative reasoning content without significant time overhead compared to baseline models. |
The diversity of linguistic solving steps and accuracy of visual tools are limited, leading to negative reasoning paths.
Re-inputting manipulated images with hard prompts causes speed losses, which can be improved by implementing manipulations in vector space. |
vision-language models, visual reasoning, chain of manipulations, multimodal understanding, data augmentation |
2402.04009
Report |
Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning |
Ningyuan Tang, Minghao Fu, Ke Zhu, Jianxin Wu |
In finetuning a large pretrained model to downstream tasks,
parameter-efficient fine-tuning (PEFT) methods can effectively finetune
pretrained models with few trainable parameters, but suffer from high GPU
memory consumption and slow training speed. Because learnable parameters from
these methods are entangled with the pretrained model, gradients related to the
frozen pretrained model's parameters have to be computed and stored during
finetuning. We propose Low-rank Attention Side-Tuning (LAST), which
disentangles the trainable module from the pretrained model by freezing not
only parameters but also outputs of the pretrained network. LAST trains a
side-network composed of only low-rank self-attention modules. By viewing the
pretrained model as a frozen feature extractor, the side-network takes
intermediate output from the pretrained model and focuses on learning
task-specific knowledge. We also show that LAST can be highly parallel across
multiple optimization objectives, making it very efficient in downstream task
adaptation, for example, in finding optimal hyperparameters. LAST outperforms
previous state-of-the-art methods on VTAB-1K and other visual adaptation tasks,
achieving significantly higher accuracy with roughly 30% of the GPU memory
footprint and 60% of the training time of existing PEFT methods. |
This paper proposes LAST (Low-rank Attention Side-Tuning), a novel parameter-efficient fine-tuning (PEFT) method that disentangles trainable parameters from the pretrained model by freezing both parameters and outputs of the pretrained network, leading to lower GPU memory consumption and faster training speed. |
Existing PEFT methods, though effective in fine-tuning pretrained models with few trainable parameters, suffer from high GPU memory consumption and slow training speed due to entanglement of trainable parameters and the frozen model. |
LAST introduces a side-network comprised of low-rank self-attention modules that operate on intermediate outputs of the frozen pretrained model, focusing on learning task-specific knowledge without modifying the pretrained model's parameters or computation graph. |
LAST outperforms state-of-the-art PEFT methods on the VTAB-1K benchmark, achieving higher accuracy with significantly reduced GPU memory footprint (around 30% of other methods) and training time (around 60%).
The study demonstrates the surprising effectiveness of low-rank self-attention with very low dimensionality for downstream vision tasks, challenging the necessity of large feed-forward networks in side-tuning.
LAST's architecture enables highly efficient parallel training, facilitating hyperparameter search by allowing simultaneous fine-tuning of multiple models with different hyperparameter sets. |
One limitation is the lack of convenient transferability of LAST to other backbone networks beyond Transformers.
Future work includes extending LAST to other model architectures like ResNet and DenseNet, as well as to different visual adaptation tasks like object detection and image generation. |
parameter-efficient fine-tuning, side-tuning, vision transformers, low-rank attention, parallel training |
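The LAST entry above describes a side-network of low-rank self-attention modules that consume frozen intermediate outputs. A minimal sketch of one such module in plain Python is given here. The projection shapes (d x r for query, key, and value, and r x d for the output) and all names are assumptions for illustration, not the authors' exact design.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def low_rank_attention(x, wq, wk, wv, wo):
    """x: n x d frozen features; wq, wk, wv: d x r; wo: r x d (assumed shapes)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    r = len(wq[0])
    k_t = [list(col) for col in zip(*k)]           # r x n
    scores = matmul(q, k_t)                        # n x n
    attn = [softmax([s / math.sqrt(r) for s in row]) for row in scores]
    return matmul(matmul(attn, v), wo)             # back to n x d


def param_count(d, r):
    """Trainable weights in the four projections of one module."""
    return 4 * d * r


# Toy input: two tokens with d = 2 and rank r = 1.
x = [[1.0, 0.0], [0.0, 1.0]]
out = low_rank_attention(x, wq=[[0.0], [0.0]], wk=[[0.0], [0.0]],
                         wv=[[1.0], [3.0]], wo=[[1.0, 2.0]])
```

Under these assumed shapes, a module with d = 768 and r = 16 has 4 * 768 * 16 = 49,152 weights, versus 4 * 768 * 768 = 2,359,296 for full-rank projections. Since `x` is treated as a constant, no gradients would need to flow back through the frozen backbone.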
2402.03908
Report |
EscherNet: A Generative Model for Scalable View Synthesis |
Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison |
We introduce EscherNet, a multi-view conditioned diffusion model for view
synthesis. EscherNet learns implicit and generative 3D representations coupled
with a specialised camera positional encoding, allowing precise and continuous
relative control of the camera transformation between an arbitrary number of
reference and target views. EscherNet offers exceptional generality,
flexibility, and scalability in view synthesis -- it can generate more than 100
consistent target views simultaneously on a single consumer-grade GPU, despite
being trained with a fixed number of 3 reference views to 3 target views. As a
result, EscherNet not only addresses zero-shot novel view synthesis, but also
naturally unifies single- and multi-image 3D reconstruction, combining these
diverse tasks into a single, cohesive framework. Our extensive experiments
demonstrate that EscherNet achieves state-of-the-art performance in multiple
benchmarks, even when compared to methods specifically tailored for each
individual problem. This remarkable versatility opens up new directions for
designing scalable neural architectures for 3D vision. Project page:
https://kxhit.github.io/EscherNet. |
Introduces EscherNet, a multi-view conditioned diffusion model for view synthesis that allows precise camera control and generalization across synthetic and real-world images. |
Existing view synthesis methods are often scene-specific or limited in handling varying input information. EscherNet addresses these limitations by learning implicit 3D representations and accommodating varying levels of input information. |
EscherNet leverages a transformer architecture with a specialized camera positional encoding (CaPE) to capture relationships between reference and target views, enabling consistent and scalable view synthesis. |
Significantly outperforms existing 3D diffusion models in view synthesis quality on GSO and RTMV datasets.
Generates plausible novel views in a zero-shot manner on NeRF Synthetic dataset, outperforming scene-specific methods with limited reference views.
Achieves superior 3D reconstruction quality compared to other image-to-3D generative models on GSO dataset. |
Current implementation is limited to a 3 DoF setting due to training dataset constraints.
Autoregressive generation, while faster, leads to degraded quality due to content drifting. |
view synthesis, diffusion models, 3d reconstruction, camera positional encoding, generative modeling |
2402.03766
Report |
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model |
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen |
We introduce MobileVLM V2, a family of significantly improved vision language
models upon MobileVLM, which proves that a delicate orchestration of novel
architectural design, an improved training scheme tailored for mobile VLMs, and
rich high-quality dataset curation can substantially benefit VLMs' performance.
Specifically, MobileVLM V2 1.7B achieves better or on-par performance on
standard VLM benchmarks compared with much larger VLMs at the 3B scale.
Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our
models will be released at https://github.com/Meituan-AutoML/MobileVLM . |
Introduces MobileVLM V2, a family of significantly improved vision language models for mobile scenarios, achieving state-of-the-art performance with faster inference speed. |
Capable vision language models are needed in real-world settings such as mobile devices, self-driving cars, and embodied AI systems. |
Leverages novel architectural design (lightweight downsample projector LDPv2), improved training scheme (training projector and language model throughout), and high-quality dataset curation (ShareGPT4V, ScienceQA, TextVQA, SBU, etc.). |
MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale.
MobileVLM V2 3B model outperforms a large variety of VLMs at the 7B+ scale.
Demonstrates lower inference latency than counterparts on NVIDIA AGX Jetson Orin platform. |
Future work could explore even more powerful small language models based on open-source datasets.
It could also investigate methods for effectively utilizing high-resolution input for tasks involving small objects. |
vision language models, mobile ai, efficient deep learning, multimodal learning, computer vision |
2402.03723
Report |
Rig3DGS: Creating Controllable Portraits from Casual Monocular Videos |
Alfredo Rivero, ShahRukh Athar, Zhixin Shu, Dimitris Samaras |
Creating controllable 3D human portraits from casual smartphone videos is
highly desirable due to their immense value in AR/VR applications. The recent
development of 3D Gaussian Splatting (3DGS) has shown improvements in rendering
quality and training efficiency. However, it still remains a challenge to
accurately model and disentangle head movements and facial expressions from a
single-view capture to achieve high-quality renderings. In this paper, we
introduce Rig3DGS to address this challenge. We represent the entire scene,
including the dynamic subject, using a set of 3D Gaussians in a canonical
space. Using a set of control signals, such as head pose and expressions, we
transform them to the 3D space with learned deformations to generate the
desired rendering. Our key innovation is a carefully designed deformation
method which is guided by a learnable prior derived from a 3D morphable model.
This approach is highly efficient in training and effective in controlling
facial expressions, head positions, and view synthesis across various captures.
We demonstrate the effectiveness of our learned deformation through extensive
quantitative and qualitative experiments. The project page can be found at
http://shahrukhathar.github.io/2024/02/05/Rig3DGS.html |
Introduces Rig3DGS, a method for creating reanimatable 3D human portraits with controllable facial expressions and head pose from monocular phone videos. |
Creating such controllable portraits from casual videos is highly desirable for AR/VR applications but challenging due to the need to accurately disentangle facial deformations from head movements in single-view captures. |
Represents the scene using 3D Gaussians in a canonical space, deformed by a learned prior based on a 3D morphable model (FLAME). This deformation is guided by predicted weights for each Gaussian, determined by its proximity to vertices on the FLAME mesh. |
Achieves higher-quality renderings than prior work (RigNeRF, INSTA, PointAvatar) with greater fidelity to facial expressions and head poses.
Demonstrates successful novel view synthesis of the entire scene while maintaining high fidelity to the target expression and head pose.
Shows that the learnable deformation prior is crucial for generalization to novel expressions and head poses compared to fixed priors or no prior. |
Limitations include an inability to model strong non-uniform illumination.
Requires the subject to remain relatively still during capture.
Future work will address these limitations. |
3d human reconstruction, neural rendering, 3d gaussian splatting, facial expression control, novel view synthesis |
2402.03445
Report |
Denoising Diffusion via Image-Based Rendering |
Titas Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson |
Generating 3D scenes is a challenging open problem, which requires
synthesizing plausible content that is fully consistent in 3D space. While
recent methods such as neural radiance fields excel at view synthesis and 3D
reconstruction, they cannot synthesize plausible details in unobserved regions
since they lack a generative capability. Conversely, existing generative
methods are typically not capable of reconstructing detailed, large-scale
scenes in the wild, as they use limited-capacity 3D scene representations,
require aligned camera poses, or rely on additional regularizers. In this work,
we introduce the first diffusion model able to perform fast, detailed
reconstruction and generation of real-world 3D scenes. To achieve this, we make
three contributions. First, we introduce a new neural scene representation,
IB-planes, that can efficiently and accurately represent large 3D scenes,
dynamically allocating more capacity as needed to capture details visible in
each image. Second, we propose a denoising-diffusion framework to learn a prior
over this novel 3D scene representation, using only 2D images without the need
for any additional supervision signal such as masks or depths. This supports 3D
reconstruction and generation in a unified architecture. Third, we develop a
principled approach to avoid trivial 3D solutions when integrating the
image-based rendering with the diffusion model, by dropping out representations
of some images. We evaluate the model on several challenging datasets of real
and synthetic images, and demonstrate superior results on generation, novel
view synthesis and 3D reconstruction. |
The paper introduces GIBR, the first denoising diffusion model capable of generating and reconstructing large-scale, detailed 3D scenes from 2D images. |
Existing 3D scene generation methods struggle with real-world scenes due to limitations in representing large, detailed scenes, reliance on scarce 3D datasets, and difficulty in sampling from complex scene distributions. |
GIBR uses a novel image-based 3D scene representation (IB-planes) that adapts its capacity based on image details. It employs a multi-view denoising diffusion framework with a 3D-consistent denoising mechanism and dropout of neural representations during training to prevent trivial solutions. |
GIBR outperforms baselines in 3D reconstruction from single and multiple images, generating plausible details in unobserved regions.
GIBR successfully generates coherent and detailed 3D scenes unconditionally, demonstrating its ability to learn a strong prior over 3D scenes from 2D images.
Ablation studies confirm the importance of key design choices such as IB-planes, representation dropout, and cross-view attention. |
The model currently assumes static scenes and does not handle dynamic elements.
Despite approximations, training GIBR remains computationally demanding compared to 2D diffusion models due to volumetric rendering. |
3d scene generation, denoising diffusion models, image-based rendering, multi-view reconstruction, neural scene representation |
2402.03328
Report |
Visual Enumeration is Challenging for Large-scale Generative AI |
Alberto Testolin, Kuinan Hou, Marco Zorzi |
Humans can readily judge the number of objects in a visual scene, even
without counting, and such a skill has been documented in many animal species
and babies prior to language development and formal schooling. Numerical
judgments are error-free for small sets, while for larger collections responses
become approximate, with variability increasing proportionally to the target
number. This response pattern is observed for items of all kinds, despite
variation in object features (such as color or shape), suggesting that our
visual number sense relies on abstract representations of numerosity. Here, we
investigate whether large-scale generative Artificial Intelligence (AI) systems
have a human-like number sense, which should allow them to reliably name the
number of objects in simple visual stimuli or generate images containing a
target number of items in the 1-10 range. Surprisingly, most of the foundation
models considered have a poor number sense: They make striking errors even with
small numbers, the response variability does not increase in a systematic way,
and the pattern of errors depends on object category. Only the most recent
proprietary systems exhibit signatures of a visual number sense. Our findings
demonstrate that having an intuitive visual understanding of number remains
challenging for foundation models, which in turn might be detrimental to the
perceptual grounding of numeracy that in humans is crucial for mathematical
learning. |
The paper investigates whether large-scale generative AI systems possess a human-like number sense, enabling them to accurately judge and generate images with specific numbers of objects. |
Understanding visual numerosity is crucial for mathematical learning in humans, and this study aims to assess if advanced AI models exhibit similar capabilities. |
The researchers tested several foundation models, including ViLT, BLIP-2, GPT-4V, Gemini, Stable Diffusion, and DALL-E, using numerosity naming (identifying the number of objects in an image) and numerosity production (generating images with a target number of objects) tasks. |
Most foundation models struggle with visual numerosity, even for small numbers, indicating a limited number sense compared to humans.
Only the most recent models, GPT-4V and DALL-E 3, show signs of human-like number sense, exhibiting subitizing for small numbers and sometimes following Weber's law for larger numbers.
Many models exhibit response variability that does not align with the psychophysics of human numerosity perception, suggesting a lack of abstract numerical understanding. |
The study primarily focused on a limited numerical range (1-10) and specific object categories.
The closed-source nature of proprietary models (GPT-4V, DALL-E, Gemini) limits insights into the underlying mechanisms of their numerosity processing. |
foundation models, machine vision, numerical cognition, deep learning, generative ai |
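The Visual Enumeration entry above notes that human response variability grows in proportion to the target number, the pattern referred to as Weber's law. One simple way to check this is to divide the spread of responses by the target and see whether the ratio stays roughly constant. The sketch below uses made-up responses and hypothetical function names, and it is not the paper's analysis code.

```python
import statistics


def weber_fraction(target, responses):
    """Standard deviation of the responses relative to the target number."""
    return statistics.pstdev(responses) / target


def weber_fractions(responses_by_target):
    """Map each target numerosity to its Weber fraction."""
    return {n: weber_fraction(n, r) for n, r in responses_by_target.items()}


# Illustrative data: the spread doubles when the target doubles.
responses_by_target = {
    4: [3, 4, 5],
    8: [6, 8, 10],
}
fractions = weber_fractions(responses_by_target)
print(fractions)
```

A roughly flat ratio across targets is consistent with scalar variability. A ratio that changes unsystematically with the target would match the abstract's observation that response variability in most models does not increase in a systematic way.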
2402.03327
Report |
Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models |
Dingning Liu, Xiaoshui Huang, Yuenan Hou, Zhihui Wang, Zhenfei Yin, Yongshun Gong, Peng Gao, Wanli Ouyang |
In this paper, we introduce Uni3D-LLM, a unified framework that leverages a
Large Language Model (LLM) to integrate tasks of 3D perception, generation, and
editing within point cloud scenes. This framework empowers users to
effortlessly generate and modify objects at specified locations within a scene,
guided by the versatility of natural language descriptions. Uni3D-LLM harnesses
the expressive power of natural language to allow for precise command over the
generation and editing of 3D objects, thereby significantly enhancing
operational flexibility and controllability. By mapping point cloud into the
unified representation space, Uni3D-LLM achieves cross-application
functionality, enabling the seamless execution of a wide array of tasks,
ranging from the accurate instantiation of 3D objects to the diverse
requirements of interactive design. Through a comprehensive suite of rigorous
experiments, the efficacy of Uni3D-LLM in the comprehension, generation, and
editing of point cloud has been validated. Additionally, we have assessed the
impact of integrating a point cloud perception module on the generation and
editing processes, confirming the substantial potential of our approach for
practical applications. |
This paper introduces Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate 3D perception, generation, and editing within point cloud scenes, allowing precise, language-guided manipulation of 3D objects. |
Existing methods for integrating LLMs into 3D scene processing suffer from limitations such as inaccurate spatial understanding, occlusion issues, and lack of scene-level alignment, highlighting the importance of a unified framework for enhanced efficiency and collaborative work. |
The framework aligns point cloud and image data with text using modality-specific projectors, maps LLM semantic features to a generation model (DreamGaussian), and enables iterative 3D model editing using InstructPix2Pix. |
Uni3D-LLM effectively performs grounding tasks by incorporating image features as spatial assistance, overcoming limitations of using point cloud data alone.
Adding object-level image information significantly improves object classification accuracy, emphasizing the importance of multi-modal data.
Introducing a perception module (LoRA) does not negatively impact generation, showcasing the synergistic effects of combining multiple 3D tasks. |
Future work could enhance the positioning capability of point clouds for improved accuracy.
It could also address limitations inherited from DreamGaussian and InstructPix2Pix, such as generating large-scale scenes and performing freeform editing. |
3d perception, point cloud generation, 3d object editing, large language models (llms), multimodal learning |
2402.03310
Report |
V-IRL: Grounding Virtual Intelligence in Real Life |
Jihan Yang, Runyu Ding, Ellis Brown, Xiaojuan Qi, Saining Xie |
There is a sensory gulf between the Earth that humans inhabit and the digital
realms in which modern AI agents are created. To develop AI agents that can
sense, think, and act as flexibly as humans in real-world settings, it is
imperative to bridge the realism gap between the digital and physical worlds.
How can we embody agents in an environment as rich and diverse as the one we
inhabit, without the constraints imposed by real hardware and control? Towards
this end, we introduce V-IRL: a platform that enables agents to scalably
interact with the real world in a virtual yet realistic environment. Our
platform serves as a playground for developing agents that can accomplish
various practical tasks and as a vast testbed for measuring progress in
capabilities spanning perception, decision-making, and interaction with
real-world data across the entire globe. |
This paper introduces V-IRL, a platform that enables AI agents to interact with the real world using a virtual replica built from real-world geospatial and street-view data. |
Existing AI agents often lack grounding in the sensory richness of the real world. V-IRL bridges this realism gap, paving the way for agents that can effectively sense, think, and act in real-world scenarios. |
V-IRL leverages the Google Maps Platform to create a navigable virtual environment. Agents interact with this environment through various components, including modules for geolocation, street-view imagery, movement, mapping, place information retrieval, vision (perception), and language (reasoning & collaboration). |
Open-world vision models exhibit significant biases towards frequently observed place types, highlighting the need for more diverse and representative data.
Scaling model size significantly improves performance on both place recognition and visual question answering tasks, emphasizing the importance of model capacity.
Vision models are particularly challenged in non-English speaking regions, indicating potential linguistic biases in existing models. |
The current version of V-IRL primarily focuses on navigation and place recognition tasks, and could be extended to encompass a broader range of real-world interactions, such as object manipulation.
Further research is required to mitigate model biases and enhance robustness to noisy visual observations, particularly for complex tasks like vision-language navigation. |
ai agents, embodied ai, vision-language navigation, open-world vision, global benchmarks |
2402.03307
Report |
4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes |
Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, Baoquan Chen |
We consider the problem of novel view synthesis (NVS) for dynamic scenes.
Recent neural approaches have accomplished exceptional NVS results for static
3D scenes, but extensions to 4D time-varying scenes remain non-trivial. Prior
efforts often encode dynamics by learning a canonical space plus implicit or
explicit deformation fields, which struggle in challenging scenarios like
sudden movements or capturing high-fidelity renderings. In this paper, we
introduce 4D Gaussian Splatting (4DGS), a novel method that represents dynamic
scenes with anisotropic 4D XYZT Gaussians, inspired by the success of 3D
Gaussian Splatting in static scenes. We model dynamics at each timestamp by
temporally slicing the 4D Gaussians, which naturally compose dynamic 3D
Gaussians and can be seamlessly projected into images. As an explicit
spatial-temporal representation, 4DGS demonstrates powerful capabilities for
modeling complicated dynamics and fine details, especially for scenes with
abrupt motions. We further implement our temporal slicing and splatting
techniques in a highly optimized CUDA acceleration framework, achieving
real-time inference rendering speeds of up to 277 FPS on an RTX 3090 GPU and
583 FPS on an RTX 4090 GPU. Rigorous evaluations on scenes with diverse motions
showcase the superior efficiency and effectiveness of 4DGS, which consistently
outperforms existing methods both quantitatively and qualitatively. |
This paper presents 4D Gaussian Splatting, a novel approach for novel view synthesis of dynamic scenes, by representing them using anisotropic 4D Gaussians. |
Efficient and accurate novel view synthesis for dynamic scenes is crucial for various applications but remains challenging due to the complexities of the temporal dimension and diverse motion patterns. |
The method models dynamics by temporally slicing 4D Gaussians, which naturally compose dynamic 3D Gaussians, and utilizes a highly optimized CUDA acceleration framework for real-time rendering speeds. |
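The temporal slicing step can be illustrated with standard multivariate Gaussian conditioning. The sketch below is a minimal pure-Python illustration, not the authors' CUDA implementation: the function name `slice_4d_gaussian`, the (x, y, z, t) ordering, and the example values are assumptions made here, and the paper's actual parameterization may differ. Fixing a timestamp t turns a 4D Gaussian into a 3D Gaussian whose center drifts with t, plus a scalar weight that fades the Gaussian out away from its temporal center.

```python
import math


def slice_4d_gaussian(mean, cov, t):
    """Condition a 4D Gaussian over (x, y, z, t) on a fixed time t.

    mean: 4 floats ordered (x, y, z, t).
    cov:  4x4 symmetric covariance (nested lists), same ordering.
    Returns (mean_3d, cov_3d, weight). The weight is the unnormalized
    temporal marginal exp(-0.5 * (t - mu_t)^2 / var_t).
    """
    mu_t = mean[3]
    var_t = cov[3][3]
    dt = t - mu_t
    # Covariance between each spatial axis and time.
    cross = [cov[i][3] for i in range(3)]
    # Conditional mean: the spatial center shifts linearly with t.
    mean_3d = [mean[i] + cross[i] * dt / var_t for i in range(3)]
    # Conditional covariance: Schur complement of the time variance.
    cov_3d = [
        [cov[i][j] - cross[i] * cross[j] / var_t for j in range(3)]
        for i in range(3)
    ]
    weight = math.exp(-0.5 * dt * dt / var_t)
    return mean_3d, cov_3d, weight


# Hypothetical Gaussian whose x position is correlated with time.
mean = [0.0, 0.0, 0.0, 0.5]
cov = [
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.1, 0.0, 0.0, 0.04],
]
mean_3d, cov_3d, weight = slice_4d_gaussian(mean, cov, t=0.7)
```

Because time is a single scalar axis, conditioning needs only a division by the time variance rather than a matrix inverse, which is one reason a per-Gaussian slice is cheap to evaluate at render time.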
The method achieves state-of-the-art rendering quality, outperforming prior work in PSNR and SSIM on both the Plenoptic Video and D-NeRF datasets.
It achieves unprecedented rendering speed of up to 277 FPS on an RTX 3090 GPU and 583 FPS on an RTX 4090 GPU, significantly surpassing previous methods.
The proposed entropy and 4D consistency losses are shown to effectively improve rendering quality by reducing floaters and enhancing motion consistency. |
While the method effectively reduces artifacts like floaters and inconsistent motions, challenges remain in constraining 4D Gaussians due to increased dimensions.
Future work includes exploring the use of 4D Gaussians for downstream tasks such as tracking and dynamic scene generation. |
novel view synthesis, dynamic scenes, 4d gaussian splatting, real-time rendering, cuda acceleration |
2402.03302
Report |
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining |
Jiarun Liu, Hao Yang, Hong-Yu Zhou, Yan Xi, Lequan Yu, Yizhou Yu, Yong Liang, Guangming Shi, Shaoting Zhang, Hairong Zheng, Shanshan Wang |
Accurate medical image segmentation demands the integration of multi-scale
information, spanning from local features to global dependencies. However, it
is challenging for existing methods to model long-range global information,
where convolutional neural networks (CNNs) are constrained by their local
receptive fields, and vision transformers (ViTs) suffer from high quadratic
complexity of their attention mechanism. Recently, Mamba-based models have
gained great attention for their impressive ability in long sequence modeling.
Several studies have demonstrated that these models can outperform popular
vision models in various tasks, offering higher accuracy, lower memory
consumption, and less computational burden. However, existing Mamba-based
models are mostly trained from scratch and do not explore the power of
pretraining, which has been proven to be quite effective for data-efficient
medical image analysis. This paper introduces a novel Mamba-based model,
Swin-UMamba, designed specifically for medical image segmentation tasks,
leveraging the advantages of ImageNet-based pretraining. Our experimental
results reveal the vital role of ImageNet-based training in enhancing the
performance of Mamba-based models. Swin-UMamba demonstrates superior
performance by a large margin compared to CNNs, ViTs, and the latest Mamba-based
models. Notably, on AbdomenMRI, Endoscopy, and Microscopy datasets, Swin-UMamba
outperforms its closest counterpart U-Mamba_Enc by an average score of 2.72%. |
This paper introduces Swin-UMamba, a novel Mamba-based UNet model for medical image segmentation that leverages ImageNet-based pretraining. |
Accurate medical image segmentation requires efficient modeling of long-range dependencies, which remains a challenge for existing CNN and ViT models. Mamba-based models offer a promising solution, but their potential with pretraining in medical image segmentation is underexplored. |
The authors designed Swin-UMamba to integrate a Mamba-based encoder pretrained on ImageNet with a UNet-like decoder. They also proposed a variant, Swin-UMamba†, with a Mamba-based decoder for efficiency. Experiments were conducted on AbdomenMRI, Endoscopy, and Microscopy datasets. |
Swin-UMamba significantly outperformed CNN, ViT, and existing Mamba-based models on all datasets.
ImageNet-based pretraining substantially improved performance, especially on smaller datasets, highlighting its importance for Mamba-based models.
Swin-UMamba† achieved competitive results with fewer parameters and lower FLOPs, demonstrating the potential of Mamba in resource-constrained settings. |
The study focused on 2D segmentation and may not generalize directly to 3D medical images.
Further hyperparameter tuning and exploration of different pretraining strategies could potentially improve performance. |
medical image segmentation, imagenet pretraining, mamba, long-range dependency modeling, unet |
2402.03290
Report |
InstanceDiffusion: Instance-level Control for Image Generation |
Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra |
Text-to-image diffusion models produce high quality images but do not offer
control over individual instances in the image. We introduce InstanceDiffusion
that adds precise instance-level control to text-to-image diffusion models.
InstanceDiffusion supports free-form language conditions per instance and
allows flexible ways to specify instance locations such as simple single
points, scribbles, bounding boxes or intricate instance segmentation masks, and
combinations thereof. We propose three major changes to text-to-image models
that enable precise instance-level control. Our UniFusion block enables
instance-level conditions for text-to-image models, the ScaleU block improves
image fidelity, and our Multi-instance Sampler improves generations for
multiple instances. InstanceDiffusion significantly surpasses specialized
state-of-the-art models for each location condition. Notably, on the COCO
dataset, we outperform previous state-of-the-art by 20.4% AP$_{50}^\text{box}$
for box inputs, and 25.4% IoU for mask inputs. |
InstanceDiffusion enables precise instance-level control for text-to-image generation by allowing users to specify the location and textual description of each instance in an image. |
Existing text-to-image generation models lack control over individual instances within an image. This limits their use in applications that require fine-grained control over image composition, such as design or data generation. |
The authors introduce InstanceDiffusion, a model built on top of a frozen pretrained text-to-image diffusion model. It leverages three key components:
- **UniFusion block:** Projects various instance location formats (points, scribbles, boxes, masks) into a unified feature space and fuses them with the visual features of the diffusion model.
- **ScaleU block:** Improves the model's ability to adhere to specified instance locations by dynamically rescaling the skip connection and main features in the UNet.
- **Multi-instance Sampler (MIS):** Reduces information leakage and confusion between multiple instance conditions during inference. |
InstanceDiffusion significantly outperforms previous state-of-the-art methods specialized for specific instance conditions on COCO and LVIS datasets.
The model exhibits superior attribute binding capability, accurately reflecting instance colors and textures specified in the input prompts.
Using multiple location formats simultaneously improves the model's fidelity to instance locations, leading to better image generation results. |
The generation quality of smaller objects shows a noticeable gap compared to larger objects.
Texture binding remains challenging for all tested methods, including InstanceDiffusion. |
text-to-image generation, instance-level control, diffusion models, location conditioning, attribute binding |
2402.03286
Report |
Training-Free Consistent Text-to-Image Generation |
Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon |
Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However,
using these models to consistently portray the same subject across diverse
prompts remains challenging. Existing approaches fine-tune the model to teach
it new words that describe specific user-provided subjects or add image
conditioning to the model. These methods require lengthy per-subject
optimization or large-scale pre-training. Moreover, they struggle to align
generated images with text prompts and face difficulties in portraying multiple
subjects. Here, we present ConsiStory, a training-free approach that enables
consistent subject generation by sharing the internal activations of the
pretrained model. We introduce a subject-driven shared attention block and
correspondence-based feature injection to promote subject consistency between
images. Additionally, we develop strategies to encourage layout diversity while
maintaining subject consistency. We compare ConsiStory to a range of baselines,
and demonstrate state-of-the-art performance on subject consistency and text
alignment, without requiring a single optimization step. Finally, ConsiStory
can naturally extend to multi-subject scenarios, and even enable training-free
personalization for common objects. |
This paper proposes ConsiStory, a training-free method for generating consistent subjects across multiple images with diverse prompts using pre-trained text-to-image diffusion models. |
Maintaining visual consistency of subjects across different images generated from varying text prompts is crucial for various applications like storytelling, virtual asset design, and synthetic data creation. Existing methods have limitations such as requiring per-subject training, struggling with multi-subject consistency, or compromising prompt alignment. |
ConsiStory leverages internal feature representations of diffusion models to align generated images during denoising. It employs subject-driven self-attention to share subject-specific information, incorporates vanilla query features and attention dropout for layout diversity, and utilizes feature injection for fine-grained consistency. |
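The subject-driven self-attention idea can be sketched as an attention mask over a batch of images: each image's queries see all of its own tokens, plus only the subject tokens of the other images. The code below is a simplified single-head illustration under that reading, not ConsiStory's implementation; the function name `shared_attention`, the boolean `subject_masks` (which the method derives from cross-attention maps), and the toy one-dimensional tokens are assumptions made here, and the query blending, attention dropout, and feature injection steps are omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(queries, keys, values, subject_masks, image_index):
    """Attention for one image in a batch with cross-image subject sharing.

    queries/keys/values: per-image lists of token vectors.
    subject_masks: per-image lists of bools (True marks a subject token).
    A key/value pair is visible to image `image_index` if it belongs to
    that image, or if it is a subject token of another image.
    """
    dim = len(queries[image_index][0])
    visible_k, visible_v = [], []
    for img, (ks, vs, mask) in enumerate(zip(keys, values, subject_masks)):
        for k, v, is_subject in zip(ks, vs, mask):
            if img == image_index or is_subject:
                visible_k.append(k)
                visible_v.append(v)
    out = []
    for q in queries[image_index]:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in visible_k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, visible_v))
            for d in range(len(visible_v[0]))
        ])
    return out


# Toy batch of two images with two 1-D tokens each. Zero queries give
# uniform attention, so each output is the mean of the visible values.
queries = [[[0.0], [0.0]], [[0.0], [0.0]]]
keys = [[[1.0], [2.0]], [[3.0], [4.0]]]
values = [[[1.0], [3.0]], [[10.0], [20.0]]]
subject_masks = [[True, False], [False, True]]
out0 = shared_attention(queries, keys, values, subject_masks, image_index=0)
out1 = shared_attention(queries, keys, values, subject_masks, image_index=1)
```

In the toy batch, image 0 averages its own values (1, 3) with image 1's subject value (20), while image 1's background token never leaks into image 0.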
ConsiStory achieves state-of-the-art performance on subject consistency and text alignment without requiring any training.
It significantly outperforms existing methods in terms of speed, being approximately 20 times faster.
The method can be extended to multi-subject scenarios and training-free personalization for common objects. |
ConsiStory's performance depends on the accuracy of object localization through cross-attention maps, which can be imperfect for unusual styles.
The method struggles to disentangle appearance and style, limiting consistent generation to images sharing the same style. |
text-to-image synthesis, consistent image generation, diffusion models, self-attention, training-free methods |
2402.03251
Report |
CLIP Can Understand Depth |
Dunam Kim, Seokju Lee |
Recent studies on generalizing CLIP for monocular depth estimation reveal
that CLIP pre-trained on web-crawled data is inefficient for deriving proper
similarities between image patches and depth-related prompts. In this paper, we
adapt CLIP for meaningful quality of monocular depth estimation with dense
prediction, without fine-tuning its original vision-language alignment. By
jointly training a compact deconvolutional decoder with a tiny learnable
embedding matrix named mirror, as a static prompt for its text encoder, CLIP is
enabled to understand depth. With this approach, our model exhibits impressive
performance matching several previous state-of-the-art vision-only models on
the NYU Depth v2 and KITTI datasets, outperforming every CLIP-based depth
estimation model with a large margin. Experiments on temporal depth consistency
and spatial continuity demonstrate that the prior knowledge of CLIP can be
effectively refined by our proposed framework. Furthermore, an ablation study
on mirror proves that the resulting model estimates depth utilizing knowledge
not only from the image encoder but also text encoder despite not being given
any prompt written in a human way. This research demonstrates that through
minimal adjustments, the prior knowledge of vision-language foundation models,
such as CLIP, can be generalized even to domains where learning during
pretraining is challenging. We facilitate future works focused on methods to
adjust suboptimal prior knowledge of vision-language models using non-human
language prompts, achieving performance on par with task-specific
state-of-the-art methodologies. |
This paper introduces CLIP2Depth, a framework that adapts a pretrained and frozen CLIP model for monocular dense depth estimation using non-human language supervision. |
This research is important because it demonstrates that pretrained vision-language models like CLIP can be effectively generalized to complex domains, such as depth estimation, without requiring direct fine-tuning. |
The authors jointly train a compact deconvolutional decoder with a learnable embedding matrix named *mirror*. *Mirror* acts as a non-human language prompt, conditioning the CLIP text encoder to understand depth. |
CLIP2Depth outperforms all previous CLIP-based depth estimation models on NYU Depth v2 and KITTI datasets.
The model achieves performance comparable to state-of-the-art vision-only models while preserving CLIP's task-agnostic characteristics.
Ablation studies validate the effectiveness of *mirror* and the overall design choices of the framework. |
The model exhibits a performance gap between NYU Depth v2 and KITTI, suggesting room for improvement in generalizing to unseen domains.
Further exploration is needed to better understand and leverage the correlation between human and AI knowledge systems through non-human language prompts. |
depth estimation, clip, vision-language models, non-human language prompts, prompt learning |
2402.03246
Report |
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM |
Mingrui Li, Shuhong Liu, Heng Zhou, Guohao Zhu, Na Cheng, Tianchen Deng, Hongyu Wang |
We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian
Splatting. It incorporates appearance, geometry, and semantic features through
multi-channel optimization, addressing the oversmoothing limitations of neural
implicit SLAM systems in high-quality rendering, scene understanding, and
object-level geometry. We introduce a unique semantic feature loss that
effectively compensates for the shortcomings of traditional depth and color
losses in object optimization. Through a semantic-guided keyframe selection
strategy, we prevent erroneous reconstructions caused by cumulative errors.
Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art
performance in camera pose estimation, map reconstruction, precise semantic
segmentation, and object-level geometric accuracy, while ensuring real-time
rendering capabilities. |
SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting, incorporates appearance, geometry, and semantic features for enhanced scene understanding and object-level geometry. |
Addresses limitations of neural implicit SLAM systems, such as oversmoothing, by leveraging the speed and direct gradient flow of Gaussian Splatting for high-quality rendering, scene understanding, and object-level geometry. |
Utilizes multi-channel optimization with appearance, depth, and semantic information. Employs semantic-guided keyframe selection to improve map reconstruction accuracy. |
Achieves state-of-the-art performance in camera pose estimation, surpassing baselines in ATE RMSE by up to 34%.
Delivers high-fidelity dense map reconstruction, outperforming baselines in PSNR by a margin of 10dB.
Provides highly accurate 3D semantic segmentation, exceeding NeRF-based methods by over 10% in mIoU. |
Relies on depth and 2D semantic input, limiting performance in environments where this data is scarce.
Faces challenges with high memory consumption in large-scale scenes. |
slam, 3d reconstruction, semantic segmentation, gaussian splatting, scene understanding |
2402.03241
Report |
FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition |
Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han |
In this paper, we introduce FROSTER, an effective framework for
open-vocabulary action recognition. The CLIP model has achieved remarkable
success in a range of image-based tasks, benefiting from its strong
generalization capability stemming from pretraining on massive image-text pairs.
However, applying CLIP directly to the open-vocabulary action recognition task
is challenging due to the absence of temporal information in CLIP's
pretraining. Further, fine-tuning CLIP on action recognition datasets may lead
to overfitting and hinder its generalizability, resulting in unsatisfactory
results when dealing with unseen actions.
To address these issues, FROSTER employs a residual feature distillation
approach to ensure that CLIP retains its generalization capability while
effectively adapting to the action recognition task. Specifically, the residual
feature distillation treats the frozen CLIP model as a teacher to maintain the
generalizability exhibited by the original CLIP and supervises the feature
learning for the extraction of video-specific features to bridge the gap
between images and videos. Meanwhile, it uses a residual sub-network for
feature distillation to reach a balance between the two distinct objectives of
learning generalizable and video-specific features.
We extensively evaluate FROSTER on open-vocabulary action recognition
benchmarks under both base-to-novel and cross-dataset settings. FROSTER
consistently achieves state-of-the-art performance on all datasets across the
board. Project page: https://visual-ai.github.io/froster. |
This paper introduces FROSTER, a novel framework for open-vocabulary action recognition that enhances the adaptation of the CLIP model to video data while preserving its generalization capabilities. |
Applying CLIP to open-vocabulary action recognition is challenging due to CLIP's training on image-text pairs, which lacks temporal information, leading to suboptimal performance on unseen actions. |
FROSTER employs residual feature distillation using a frozen CLIP model as a teacher to guide a student model. This approach balances video-specific learning with the retention of CLIP's generalizability. |
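A rough sketch of the residual feature distillation objective follows. It assumes the student feature is passed through a small projection with an identity shortcut before being compared to the frozen teacher's feature with a mean squared error. The single linear map `weight`, the scale `alpha`, and the function name are simplifications chosen here for brevity; the sub-network used in the paper may be structured differently.

```python
def residual_distill_loss(student_feat, teacher_feat, weight, alpha=0.1):
    """MSE between a frozen teacher feature and a residually projected student feature.

    projected = student + alpha * (weight @ student)
    The identity shortcut lets the student keep video-specific content,
    while the projected branch is pulled toward the teacher.
    """
    delta = [sum(w * s for w, s in zip(row, student_feat)) for row in weight]
    projected = [s + alpha * d for s, d in zip(student_feat, delta)]
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_feat)]
    return sum(squared) / len(squared)


# Hypothetical 2-D features. With a zero projection the loss reduces to
# a plain MSE between student and teacher features.
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
plain = residual_distill_loss([1.0, 2.0], [2.0, 4.0], zero)
matched = residual_distill_loss([2.0, 4.0], [3.0, 6.0], identity, alpha=0.5)
```

In the second call the student feature differs from the teacher's, yet the loss is zero because the residual branch absorbs the gap. That is the intended balance: the student is not forced to copy the teacher exactly.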
FROSTER consistently outperforms previous state-of-the-art methods on both base-to-novel and cross-dataset action recognition benchmarks.
The residual feature distillation approach effectively balances the learning of video-specific features while preserving generalization abilities.
FROSTER demonstrates compatibility with various network architectures, highlighting its adaptability and effectiveness. |
The model's performance on fine-grained action datasets like SSv2 suggests room for improvement in capturing temporal dynamics.
Exploring more sophisticated text augmentation techniques to further enhance action understanding is an area for future work. |
action recognition, open vocabulary, clip, knowledge distillation, generalizability |
2402.03214
Report |
Organic or Diffused: Can We Distinguish Human Art from AI-generated Images? |
Anna Yoo Jeong Ha, Josephine Passananti, Ronik Bhaskar, Shawn Shan, Reid Southen, Haitao Zheng, Ben Y. Zhao |
The advent of generative AI images has completely disrupted the art world.
Distinguishing AI generated images from human art is a challenging problem
whose impact is growing over time. A failure to address this problem allows bad
actors to defraud individuals paying a premium for human art and companies
whose stated policies forbid AI imagery. It is also critical for content owners
to establish copyright, and for model trainers interested in curating training
data in order to avoid potential model collapse.
There are several different approaches to distinguishing human art from AI
images, including classifiers trained by supervised learning, research tools
targeting diffusion models, and identification by professional artists using
their knowledge of artistic techniques. In this paper, we seek to understand
how well these approaches can perform against today's modern generative models
in both benign and adversarial settings. We curate real human art across 7
styles, generate matching images from 5 generative models, and apply 8
detectors (5 automated detectors and 3 different human groups including 180
crowdworkers, 4000+ professional artists, and 13 expert artists experienced at
detecting AI). Both Hive and expert artists do very well, but make mistakes in
different ways (Hive is weaker against adversarial perturbations while expert
artists produce higher false positives). We believe these weaknesses will
remain as models continue to evolve, and use our data to demonstrate why a
combined team of human and automated detectors provides the best combination of
accuracy and robustness. |
This paper explores the effectiveness of different methods for distinguishing between human-created art and AI-generated images. |
Identifying AI-generated images is crucial for art authenticity, copyright, preventing fraud, and ensuring the quality of AI model training data. |
The study uses a dataset of human and AI art across 7 styles, tested with 5 automated detectors (Hive, Optic, Illuminarty, DIRE, DE-FAKE), and 3 human groups (crowdworkers, artists, experts) under various adversarial conditions. |
Hive exhibits the highest accuracy (98%) among all detectors but struggles with adversarially perturbed images, especially those processed with Glaze.
Human experts outperform machines in judging Glazed images, leveraging domain knowledge and detecting subtle artistic inconsistencies missed by AI models.
Combining human experts and automated detectors, such as Hive, yields a more robust and accurate detection approach, particularly against adversarial examples. |
Limited number of expert artists and artworks used in the study, potentially impacting the generalizability of results.
Rapid evolution of AI image generation models necessitates constant updates to detectors and the evaluation framework. |
ai-generated art, image detection, human perception, adversarial machine learning, art authentication |
2402.03161
Report |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu |
In light of recent advances in multimodal Large Language Models (LLMs), there
is increasing attention to scaling them from image-text data to more
informative real-world videos. Compared to static images, video poses unique
challenges for effective large-scale pre-training due to the modeling of its
spatiotemporal dynamics. In this paper, we address such limitations in
video-language pre-training with an efficient video decomposition that
represents each video as keyframes and temporal motions. These are then adapted
to an LLM using well-designed tokenizers that discretize visual and temporal
information as a few tokens, thus enabling unified generative pre-training of
videos, images, and text. At inference, the generated tokens from the LLM are
carefully recovered to the original continuous pixel space to create various
video content. Our proposed framework is both capable of comprehending and
generating image and video content, as demonstrated by its competitive
performance across 13 multimodal benchmarks in image and video understanding
and generation. Our code and models will be available at
https://video-lavit.github.io. |
This paper introduces Video-LaVIT, a multimodal pre-training method for unified comprehension and generation of videos, images, and language using Large Language Models (LLMs). |
Existing multimodal LLMs struggle to effectively encode video data due to the computational cost of capturing complex spatiotemporal dynamics. |
Video-LaVIT decomposes videos into keyframes and motion vectors. It employs a novel video tokenizer to represent these components as discrete tokens, enabling unified pre-training with LLMs. A video detokenizer then maps generated tokens back into the continuous pixel space for video generation. |
Video-LaVIT achieves state-of-the-art results on various image and video understanding benchmarks.
It demonstrates competitive performance on text-to-video and image-to-video generation tasks.
The method supports long video generation by progressively decoding multiple short clips while maintaining temporal consistency. |
The model's limited context window restricts direct processing of very long videos.
Training cost remains high, limiting scalability to massive video datasets. |
multimodal learning, large language models, video understanding, video generation, motion tokenization |
2402.03119
Report |
Good Teachers Explain: Explanation-Enhanced Knowledge Distillation |
Amin Parchami-Araghi, Moritz Böhle, Sukrut Rao, Bernt Schiele |
Knowledge Distillation (KD) has proven effective for compressing large
teacher models into smaller student models. While it is well known that student
models can achieve similar accuracies as the teachers, it has also been shown
that they nonetheless often do not learn the same function. It is, however,
often highly desirable that the student's and teacher's functions share similar
properties such as basing the prediction on the same input features, as this
ensures that students learn the 'right features' from the teachers. In this
work, we explore whether this can be achieved by not only optimizing the
classic KD loss but also the similarity of the explanations generated by the
teacher and the student. Despite the idea being simple and intuitive, we find
that our proposed 'explanation-enhanced' KD (e$^2$KD) (1) consistently provides
large gains in terms of accuracy and student-teacher agreement, (2) ensures
that the student learns from the teacher to be right for the right reasons and
to give similar explanations, and (3) is robust with respect to the model
architectures, the amount of training data, and even works with 'approximate',
pre-computed explanations. |
The paper proposes explanation-enhanced knowledge distillation (e^2KD), which encourages student models to not only match teacher model logits but also their explanations (e.g., GradCAM, B-cos), thereby improving distillation fidelity. |
Faithful knowledge distillation is crucial for ensuring student models learn the same function and reasoning process as their teachers, leading to improved generalization, robustness to distribution shifts, and interpretability. |
e^2KD introduces an explanation similarity loss term alongside the traditional KD loss. This term minimizes the difference between teacher and student explanations, typically using cosine similarity on GradCAM or B-cos explanations. |
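As a minimal sketch, the combined objective can be written as a KD term plus a weighted explanation dissimilarity term. The code assumes explanations are attribution maps flattened into vectors and that dissimilarity is one minus cosine similarity; the function names, the weight `lam`, and the toy inputs are illustrative choices made here, and details such as temperature scaling of the KD term are simplified.

```python
import math


def log_softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) between temperature-softened distributions."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def cosine_similarity(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def e2kd_loss(student_logits, teacher_logits,
              student_expl, teacher_expl, lam=1.0, temperature=1.0):
    """KD loss plus lam * (1 - cosine similarity of the explanations)."""
    kd = kd_loss(student_logits, teacher_logits, temperature)
    expl = 1.0 - cosine_similarity(student_expl, teacher_expl)
    return kd + lam * expl


# Identical logits and explanations give (numerically) zero loss;
# identical logits with orthogonal explanations cost exactly lam.
loss_same = e2kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], [1.0, 2.0], [1.0, 2.0])
loss_orth = e2kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0],
                      [1.0, 0.0], [0.0, 1.0], lam=0.5)
```

The second case shows why the extra term matters: a student can match the teacher's logits perfectly and still be penalized if its explanation points at different input features.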
e^2KD significantly boosts student accuracy and agreement with teachers, especially with limited training data (e.g., ImageNet with 50 shots).
Students trained with e^2KD learn to rely on the 'right' features, exhibiting improved robustness to distribution shifts (demonstrated on the Waterbirds dataset).
e^2KD effectively transfers desirable explanation properties from teachers to students, including architectural priors (shown by distilling CNN to ViT, resulting in shift-invariant explanations). |
The computational cost of e^2KD is higher than vanilla KD due to the additional explanation computations (mitigated by using 'frozen' pre-computed explanations).
The effectiveness of e^2KD relies on the quality and faithfulness of the chosen explanation method. |
knowledge distillation, explainable ai (xai), model compression, distribution shift, model interpretability |
2402.03040
Report |
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions |
Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao, Xiangyu Yue |
We introduce $\textit{InteractiveVideo}$, a user-centric framework for video
generation. Different from traditional generative approaches that operate based
on user-provided images or text, our framework is designed for dynamic
interaction, allowing users to instruct the generative model through various
intuitive mechanisms during the whole generation process, e.g. text and image
prompts, painting, drag-and-drop, etc. We propose a Synergistic Multimodal
Instruction mechanism, designed to seamlessly integrate users' multimodal
instructions into generative models, thus facilitating a cooperative and
responsive interaction between user inputs and the generative process. This
approach enables iterative and fine-grained refinement of the generation result
through precise and effective user instructions. With
$\textit{InteractiveVideo}$, users are given the flexibility to meticulously
tailor key aspects of a video. They can paint the reference image, edit
semantics, and adjust video motions until their requirements are fully met.
Code, models, and demo are available at
https://github.com/invictus717/InteractiveVideo |
Presents "InteractiveVideo", a user-centric framework for video generation that allows users to iteratively control and refine the generation process using multimodal instructions, such as text prompts, image editing, and motion trajectories. |
Existing video generation models, relying on image and text inputs, often fail to fully capture user intentions and offer limited control over the generated content, particularly in terms of complex motion and dynamic scenes. |
The framework utilizes two generative pipelines (text-to-image and image-to-video) based on latent diffusion models. It incorporates user interactions (e.g., painting, dragging) as denoising residuals to influence the video denoising process, enabling fine-grained control over video elements. |
Allows for personalization of video content by adding or animating objects absent in the original reference image.
Enables fine-grained video editing, including regional semantic changes like color and appearance modifications.
Exhibits precise motion control, demonstrated through large motion control, precise gesture control, and multi-object motion control. |
Ensuring accessibility and intuitive usability across diverse user groups.
Maintaining computational efficiency amidst dynamic and diverse user inputs. |
video generation, interactive ai, multimodal instructions, user-centric design, diffusion models |
2402.02972
Report |
Retrieval-Augmented Score Distillation for Text-to-3D Generation |
Junyoung Seo, Susung Hong, Wooseok Jang, Inès Hyeonsu Kim, Minseop Kwak, Doyup Lee, Seungryong Kim |
Text-to-3D generation has achieved significant success by incorporating
powerful 2D diffusion models, but insufficient 3D prior knowledge also leads to
the inconsistency of 3D geometry. Recently, since large-scale multi-view
datasets have been released, fine-tuning the diffusion model on the multi-view
datasets has become a mainstream way to solve the 3D inconsistency problem. However,
this approach is confronted with fundamental difficulties regarding the limited quality
and diversity of 3D data, compared with 2D data. To sidestep these trade-offs,
we explore a retrieval-augmented approach tailored for score distillation,
dubbed ReDream. We postulate that both expressiveness of 2D diffusion models
and geometric consistency of 3D assets can be fully leveraged by employing the
semantically relevant assets directly within the optimization process. To this
end, we introduce a novel framework for retrieval-based quality enhancement in
text-to-3D generation. We leverage the retrieved asset to incorporate its
geometric prior in the variational objective and adapt the diffusion model's 2D
prior toward view consistency, achieving drastic improvements in both geometry
and fidelity of generated scenes. We conduct extensive experiments to
demonstrate that ReDream exhibits superior quality with increased geometric
consistency. Project page is available at https://ku-cvlab.github.io/ReDream/. |
This paper proposes ReDream, a retrieval-augmented score distillation framework for text-to-3D generation that enhances the quality and geometric consistency of generated 3D scenes. |
Existing text-to-3D methods struggle with geometric inconsistencies or limited fidelity due to insufficient 3D training data. ReDream addresses this by leveraging the strengths of both 2D diffusion models and 3D assets. |
ReDream retrieves semantically relevant 3D assets and uses them in two ways: 1) Initializing the variational distribution of the 3D scene to incorporate geometric priors. 2) Lightweight adaptation of the 2D diffusion model for improved view consistency. |
ReDream generates high-quality 3D scenes with improved geometric consistency compared to previous text-to-3D methods.
The method allows for flexible control over the generation process, influenced by both text prompts and retrieved assets.
Quantitative and qualitative evaluations, including a user study, demonstrate the effectiveness of ReDream over existing approaches. |
The generation process, while faster than the baseline, is still time-consuming compared to methods focused on fast inference.
The ability to handle complex text prompts is limited by the capabilities of the underlying 2D diffusion model. |
text-to-3d generation, score distillation sampling, retrieval-augmented generation, 3d consistency, variational inference |
2402.02906
Report |
ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis |
Bernard Spiegl, Andrea Perin, Stéphane Deny, Alexander Ilin |
Deep learning is providing a wealth of new approaches to the old problem of
novel view synthesis, from Neural Radiance Field (NeRF) based approaches to
end-to-end style architectures. Each approach offers specific strengths but
also comes with specific limitations in their applicability. This work
introduces ViewFusion, a state-of-the-art end-to-end generative approach to
novel view synthesis with unparalleled flexibility. ViewFusion consists in
simultaneously applying a diffusion denoising step to any number of input views
of a scene, then combining the noise gradients obtained for each view with an
(inferred) pixel-weighting mask, ensuring that for each region of the target
scene only the most informative input views are taken into account. Our
approach resolves several limitations of previous approaches by (1) being
trainable and generalizing across multiple scenes and object classes, (2)
adaptively taking in a variable number of pose-free views at both train and
test time, (3) generating plausible views even in severely underdetermined
conditions (thanks to its generative nature) -- all while generating views of
quality on par with or even better than state-of-the-art methods. Limitations
include not generating a 3D embedding of the scene, resulting in a relatively
slow inference speed, and our method only being tested on the relatively small
dataset NMR. Code is available. |
Introduces ViewFusion, a flexible and pose-free generative approach for novel view synthesis using composable diffusion models with a novel weighting scheme to adaptively handle an arbitrary number of input views. |
Existing novel view synthesis methods often require expensive per-scene retraining, struggle with pose-free inputs, or cannot adapt to a variable number of input views at test time. This work aims to address these limitations. |
The method utilizes a composable diffusion probabilistic framework where each input view is processed through identical U-Net streams. A learned weighting mechanism infers the importance of each view at each denoising step, composing the final prediction. The approach is trained end-to-end on a dataset of multiple object classes and input view poses. |
Achieves state-of-the-art or near state-of-the-art performance on NMR dataset across different metrics.
Demonstrates flexibility by handling variable input view counts, generalizing across classes, producing plausible views for occluded regions, and maintaining 3D consistency in autoregressive generation.
Shows the model's ability to adaptively shift weighting based on the informativeness of input views for a given target view. |
Currently lacks explicit incorporation of 3D semantics, potentially limiting its performance on out-of-distribution scenes.
Inference speed scales linearly with the number of input views, which can be computationally expensive for a large number of views or high-resolution images. |
novel view synthesis, diffusion models, generative models, pose-free, composable |
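The weighting step described in the ViewFusion record above can be illustrated with a small sketch. It assumes that the per-view weights are normalised with a softmax across views at each pixel; the summary only says a learned weighting composes the per-view predictions, so the softmax and the names `compose_noise`, `noise_preds`, and `weight_logits` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def compose_noise(noise_preds, weight_logits):
    """Blend per-view noise predictions pixel by pixel.

    noise_preds[v][p]   : noise predicted by view stream v at pixel p
    weight_logits[v][p] : unnormalised importance of view v at pixel p
    Both layouts are assumptions made for this illustration.
    """
    num_views = len(noise_preds)
    num_pixels = len(noise_preds[0])
    blended = []
    for p in range(num_pixels):
        # Normalise the weights across views for this pixel only.
        w = softmax([weight_logits[v][p] for v in range(num_views)])
        blended.append(sum(w[v] * noise_preds[v][p] for v in range(num_views)))
    return blended

# Two views, three pixels: view 0 dominates pixel 0, view 1 dominates pixel 2,
# and both views contribute equally at pixel 1.
preds = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]
logits = [[10.0, 0.0, -10.0], [-10.0, 0.0, 10.0]]
blended = compose_noise(preds, logits)
```

Because the weights are normalised over however many views are supplied, the same function accepts one view or many, which mirrors the variable-view-count property the record describes.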
2402.02887
Report |
Time-, Memory- and Parameter-Efficient Visual Adaptation |
Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab |
As foundation models become more popular, there is a growing need to
efficiently finetune them for downstream tasks. Although numerous adaptation
methods have been proposed, they are designed to be efficient only in terms of
how many parameters are trained. They, however, typically still require
backpropagating gradients throughout the model, meaning that their
training-time and -memory cost does not reduce as significantly. We propose an
adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on
features from the frozen, pretrained backbone. As a result, our method is
efficient not only in terms of parameters, but also in training-time and memory
usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on
the popular VTAB benchmark, and we further show how we outperform prior works
with respect to training-time and -memory usage too. We further demonstrate the
training efficiency and scalability of our method by adapting a vision
transformer backbone of 4 billion parameters for the computationally demanding
task of video classification, without any intricate model parallelism. Here, we
outperform a prior adaptor-based method which could only scale to a 1 billion
parameter backbone, or fully-finetuning a smaller backbone, with the same GPU
and less training time. |
This paper proposes Low-Rank Side Adaptation (LoSA), an efficient adaptation method for large pre-trained models focusing on reducing training time and memory usage without backpropagating gradients through the entire model. |
Existing parameter-efficient adaptation methods mainly focus on reducing the number of trained parameters but still require significant training time and memory due to backpropagation through the entire model. |
LoSA introduces a lightweight parallel network operating on frozen activations from the pre-trained backbone model, refining features for the target task without backpropagating gradients through the backbone. |
LoSA achieves state-of-the-art accuracy-parameter trade-offs on the VTAB benchmark.
LoSA demonstrates scalability by adapting a 4-billion parameter vision transformer for video classification, outperforming previous methods while using less memory and training time.
Ablation studies validate the design choices of LoSA, such as the use of a low-rank mixer and the selection of backbone activations. |
The current work focuses on vision tasks. Further exploration is needed to extend its applicability to other domains.
Future research can investigate the extension of LoSA to more complex vision tasks beyond image and video classification. |
parameter-efficient finetuning, vision transformers, adaptation methods, training efficiency, memory efficiency |
2402.02800
Report |
Extreme Two-View Geometry From Object Poses with Diffusion Models |
Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu |
Humans have an incredible ability to effortlessly perceive the viewpoint
difference between two images containing the same object, even when the
viewpoint change is astonishingly vast with no co-visible regions in the
images. This remarkable skill, however, has proven to be a challenge for
existing camera pose estimation methods, which often fail when faced with large
viewpoint differences due to the lack of overlapping local features for
matching. In this paper, we aim to effectively harness the power of object
priors to accurately determine two-view geometry in the face of extreme
viewpoint changes. In our method, we first mathematically transform the
relative camera pose estimation problem to an object pose estimation problem.
Then, to estimate the object pose, we utilize the object priors learned from a
diffusion model Zero123 to synthesize novel-view images of the object. The
novel-view images are matched to determine the object pose and thus the
two-view camera pose. In experiments, our method has demonstrated extraordinary
robustness and resilience to large viewpoint changes, consistently estimating
two-view poses with exceptional generalization ability across both synthetic
and real-world datasets. Code will be available at
https://github.com/scy639/Extreme-Two-View-Geometry-From-Object-Poses-with-Diffusion-Models. |
This paper introduces a novel algorithm leveraging object priors from diffusion models to estimate the relative camera pose of two images with extreme viewpoint changes, where traditional feature matching methods struggle due to minimal overlapping regions. |
Estimating relative camera poses for images with extreme viewpoint differences is crucial for applications like 3D reconstruction and augmented reality, but remains a challenge for existing feature-matching methods. |
The algorithm transforms the camera pose estimation into an object pose estimation problem. It utilizes a diffusion model (Zero123) to generate novel-view images of the co-visible object and estimates object poses for input and generated images. Finally, it matches an input image against the generated images to determine the relative camera pose. |
The method demonstrates superior accuracy in estimating relative camera poses compared to baseline feature matching and regression-based methods on GSO and Navi datasets.
It shows robustness to in-plane rotations by incorporating in-plane rotation estimation in the pipeline.
The method shows potential in improving visual odometry (VO) accuracy as demonstrated by an application example. |
The method may face challenges in accurately predicting poses for symmetrical objects.
Future work could focus on improving the runtime of the algorithm. |
camera pose estimation, object pose estimation, diffusion models, extreme viewpoints, two-view geometry |
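The reduction from relative camera pose to object pose in the record above rests on standard rigid-body algebra. If the object-to-camera poses are (R1, t1) and (R2, t2), then the relative pose from camera 1 to camera 2 is R_rel = R2 R1^T and t_rel = t2 - R_rel t1. The sketch below checks this identity numerically; the function names are hypothetical and it does not reproduce the paper's diffusion-based pose estimation.

```python
import math

def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_camera_pose(r1, t1, r2, t2):
    """Relative pose camera 1 -> camera 2 from two object-to-camera poses.

    Convention assumed here: x_cam = R * x_obj + t for each camera.
    """
    r_rel = matmul(r2, transpose(r1))
    rotated_t1 = matvec(r_rel, t1)
    t_rel = [t2[i] - rotated_t1[i] for i in range(3)]
    return r_rel, t_rel

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Two views of the same object that differ by a 150 degree rotation.
r1, t1 = rot_z(20), [0.0, 0.0, 2.0]
r2, t2 = rot_z(170), [0.5, 0.0, 2.0]
r_rel, t_rel = relative_camera_pose(r1, t1, r2, t2)
```

A useful sanity check is that mapping any object point into camera 1 and then applying (R_rel, t_rel) gives the same result as mapping it directly into camera 2.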
2402.02705
Report |
Representation Surgery for Multi-Task Model Merging |
Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, Dacheng Tao |
Multi-task learning (MTL) compresses the information from multiple tasks into
a unified backbone to improve computational efficiency and generalization.
Recent work directly merges multiple independently trained models to perform
MTL instead of collecting their raw data for joint training, greatly expanding
the application scenarios of MTL. However, by visualizing the representation
distribution of existing model merging schemes, we find that the merged model
often suffers from the dilemma of representation bias. That is, there is a
significant discrepancy in the representation distribution between the merged
and individual models, resulting in poor performance of merged MTL. In this
paper, we propose a representation surgery solution called "Surgery" to reduce
representation bias in the merged model. Specifically, Surgery is a lightweight
task-specific module that takes the representation of the merged model as input
and attempts to output the biases contained in the representation from the
merged model. We then designed an unsupervised optimization objective that
updates the Surgery module by minimizing the distance between the merged
model's representation and the individual model's representation. Extensive
experiments demonstrate significant MTL performance improvements when our
Surgery module is applied to state-of-the-art (SOTA) model merging schemes. |
This paper identifies and addresses the "representation bias" problem in multi-task model merging, where the merged model's representations differ from individually trained models, leading to performance degradation. It proposes a novel "representation surgery" approach to alleviate this issue. |
Model merging for multi-task learning (MTL) offers advantages over traditional MTL by enabling the combination of independently trained models without requiring access to raw training data. However, existing model merging methods often suffer from a performance gap compared to traditional MTL, hindering their effectiveness. |
The paper introduces "Surgery," a lightweight, task-specific module added after model merging. Surgery takes the merged model's representation as input and aims to minimize the distance between its output and the corresponding individual model's representation using an unsupervised objective. |
Representation bias is shown to exist across tasks, architectures, and merging methods, hindering performance.
Surgery effectively reduces representation bias, leading to significant performance improvements across various model merging baselines.
The proposed method is lightweight, requiring minimal additional parameters and training iterations. |
The current study focuses on ViT architectures. Exploring the effectiveness of representation surgery on other architectures is left for future work.
Future work will investigate model merging from different architectures or initializations. |
model merging, multi-task learning, representation bias, representation surgery, unsupervised learning |
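The idea in the Surgery record above, estimating the bias in the merged representation and removing it so that the result moves closer to the individual model's representation, can be illustrated in a heavily simplified form. This sketch assumes the bias is a constant per-dimension offset estimated as the mean gap; the paper's module is a learned task-specific network, so `fit_offset` and `apply_surgery` are hypothetical stand-ins rather than the authors' method.

```python
def l1_distance(a, b):
    """Mean absolute difference between two sets of feature vectors."""
    total = sum(abs(x - y) for va, vb in zip(a, b) for x, y in zip(va, vb))
    return total / (len(a) * len(a[0]))

def fit_offset(merged, individual):
    """Per-dimension mean gap between merged and individual features.

    No labels are used, only the two models' representations of the same
    inputs, which is what makes the objective unsupervised.
    """
    n, d = len(merged), len(merged[0])
    return [sum(merged[i][j] - individual[i][j] for i in range(n)) / n
            for j in range(d)]

def apply_surgery(merged, offset):
    """Subtract the estimated bias from every merged feature vector."""
    return [[x - b for x, b in zip(row, offset)] for row in merged]

# Toy features: the merged model is shifted by (+0.5, -1.0) on every sample.
individual = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merged = [[1.5, 1.0], [3.5, 3.0], [5.5, 5.0]]

offset = fit_offset(merged, individual)
corrected = apply_surgery(merged, offset)
before = l1_distance(merged, individual)
after = l1_distance(corrected, individual)
```

In this toy case the bias is exactly constant, so the correction removes it completely; with real features the gap depends on the input, which is why a learned module is needed.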
2402.02474
Report |
Deep Spectral Improvement for Unsupervised Image Instance Segmentation |
Farnoosh Arefi, Amir M. Mansourian, Shohreh Kasaei |
Deep spectral methods reframe the image decomposition process as a graph
partitioning task by extracting features using self-supervised learning and
utilizing the Laplacian of the affinity matrix to obtain eigensegments.
However, instance segmentation has received less attention compared to other
tasks within the context of deep spectral methods. This paper addresses the
fact that not all channels of the feature map extracted from a self-supervised
backbone contain sufficient information for instance segmentation purposes. In
fact, some channels are noisy and hinder the accuracy of the task. To overcome
this issue, this paper proposes two channel reduction modules: Noise Channel
Reduction (NCR) and Deviation-based Channel Reduction (DCR). The NCR retains
channels with lower entropy, as they are less likely to be noisy, while DCR
prunes channels with low standard deviation, as they lack sufficient
information for effective instance segmentation. Furthermore, the paper
demonstrates that the dot product, commonly used in deep spectral methods, is
not suitable for instance segmentation due to its sensitivity to feature map
values, potentially leading to incorrect instance segments. A new similarity
metric called Bray-Curtis over Chebyshev (BoC) is proposed to address this
issue. It takes into account the distribution of features in addition to their
values, providing a more robust similarity measure for instance segmentation.
Quantitative and qualitative results on the YouTube-VIS2019 dataset highlight
the improvements achieved by the proposed channel reduction methods and the use
of BoC instead of the conventional dot product for creating the affinity
matrix. These improvements are observed in terms of mean Intersection over
Union and extracted instance segments, demonstrating enhanced instance
segmentation performance. The code is available on:
https://github.com/farnooshar/SpecUnIIS |
This paper proposes two channel reduction modules, Noise Channel Reduction (NCR) and Deviation-based Channel Reduction (DCR), and a new similarity metric, Bray-Curtis over Chebyshev (BoC), to improve deep spectral methods for unsupervised image instance segmentation. |
Existing deep spectral methods for instance segmentation struggle because not all feature map channels from self-supervised backbones are informative, and the commonly used dot product for affinity matrix creation is sensitive to feature values and ignores feature distribution. |
NCR removes noisy channels based on entropy. DCR further reduces channels based on standard deviation to prioritize informative features for instance segmentation. BoC leverages Bray-Curtis and Chebyshev distances to consider both feature distribution and values in affinity matrix creation. |
NCR improved Fg-Bg segmentation F-score by up to 3% on YouTube-VIS2019 and PascalVOC 2012 datasets.
BoC outperformed other similarity metrics, achieving 2% higher mIoU than the dot product on instance segmentation.
The proposed method, combining NCR, DCR, and BoC, demonstrated robustness in handling occlusions and variations in object sizes. |
Exploring alternative channel reduction techniques beyond entropy and standard deviation.
Investigating the integration of the proposed method within a supervised learning framework for potentially improved channel selection. |
deep spectral methods, image instance segmentation, self-supervised learning, unsupervised learning, transformer models |
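The channel statistics and distances named in the record above can be sketched as follows. The histogram-based entropy, the bin count, the thresholds, and the function names are assumptions made for illustration. The summary does not say how the Bray-Curtis and Chebyshev distances are combined into BoC, so the sketch stops at the two standard distances.

```python
import math

def channel_entropy(values, bins=8):
    """Shannon entropy (bits) of a histogram of one channel's values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant channel has a single occupied bin
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def channel_std(values):
    """Population standard deviation of one channel's values."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def select_channels(channels, keep_low_entropy, min_std):
    """Keep the lowest-entropy channels, then drop near-constant ones."""
    by_entropy = sorted(range(len(channels)),
                        key=lambda i: channel_entropy(channels[i]))
    kept = sorted(by_entropy[:keep_low_entropy])      # NCR-style step
    return [i for i in kept
            if channel_std(channels[i]) >= min_std]   # DCR-style step

def bray_curtis(u, v):
    """Bray-Curtis distance: sum|u - v| / sum|u + v|."""
    denom = sum(abs(a + b) for a, b in zip(u, v))
    return sum(abs(a - b) for a, b in zip(u, v)) / denom if denom else 0.0

def chebyshev(u, v):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

channels = [
    [0, 0, 0, 0, 1, 1, 1, 1],   # two-level channel: low entropy, useful spread
    [0, 1, 2, 3, 4, 5, 6, 7],   # spread over every bin: highest entropy
    [5, 5, 5, 5, 5, 5, 5, 5],   # constant channel: zero standard deviation
]
selected = select_channels(channels, keep_low_entropy=2, min_std=0.1)
```

In this toy input the high-entropy channel is removed by the first step and the constant channel by the second, leaving only channel 0.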
2402.02453
Report |
AI Art Neural Constellation: Revealing the Collective and Contrastive State of AI-Generated and Human Art |
Faizan Farooq Khan, Diana Kim, Divyansh Jha, Youssef Mohamed, Hanna H Chang, Ahmed Elgammal, Luba Elliott, Mohamed Elhoseiny |
Discovering the creative potentials of a random signal to various artistic
expressions in aesthetic and conceptual richness is a ground for the recent
success of generative machine learning as a way of art creation. To understand
the new artistic medium better, we conduct a comprehensive analysis to position
AI-generated art within the context of human art heritage. Our comparative
analysis is based on an extensive dataset, dubbed "ArtConstellation,"
consisting of annotations about art principles, likability, and emotions for
6,000 WikiArt and 3,200 AI-generated artworks. After training various
state-of-the-art generative models, art samples are produced and compared with
WikiArt data on the last hidden layer of a deep-CNN trained for style
classification. We actively examined the various art principles to interpret
the neural representations and used them to drive the comparative knowledge
about human and AI-generated art. A key finding in the semantic analysis is
that AI-generated artworks are visually related to the principle concepts for
modern period art made in 1800-2000. In addition, through Out-Of-Distribution
(OOD) and In-Distribution (ID) detection in CLIP space, we find that
AI-generated artworks are ID to human art when they depict landscapes and
geometric abstract figures, while detected as OOD when the machine art consists
of deformed and twisted figures. We observe that machine-generated art is
uniquely characterized by incomplete and reduced figuration. Lastly, we
conducted a human survey about emotional experience. Color composition and
familiar subjects are the key factors of likability and emotions in art
appreciation. We propose our full methodology and collected dataset as an
analytical framework to contrast human and AI-generated art, which we refer to
as "ArtNeuralConstellation". Code is available at:
https://github.com/faixan-khan/ArtNeuralConstellation |
This paper presents "ArtNeuralConstellation," an analytical framework to contrast AI-generated and human art using art principles, time analysis, and emotional responses. |
With the rise of AI-generated art, understanding its differences and similarities to human-created art is crucial for appreciating this new medium and its place within art history. |
The authors analyze a dataset of 6,000 WikiArt and 3,200 AI-generated artworks, evaluating them based on Wölfflin's art principles, general art principles, out-of-distribution detection in CLIP space, time period similarity, and emotional responses from human surveys. |
AI-generated art leans towards visual concepts associated with modern art (1800-2000) and Baroque styles.
Landscapes and geometric abstractions in AI art are often indistinguishable from human art, while deformed figures are identified as distinctly AI-generated.
AI art evokes a diverse range of emotions comparable to human art, with likability linked to successful depictions of familiar subjects like landscapes and portraits. |
The study primarily focuses on AI art generated without human intervention, limiting insights into collaborative human-AI art creation.
Future work can expand the analysis to non-Western art and explore additional art principles beyond those considered in this study. |
ai-generated art, art history, computational aesthetics, deep learning, emotional analysis |
2402.02369
Report |
M$^3$Face: A Unified Multi-Modal Multilingual Framework for Human Face Generation and Editing |
Mohammadreza Mofayezi, Reza Alipour, Mohammad Ali Kakavand, Ehsaneddin Asgari |
Human face generation and editing represent an essential task in the era of
computer vision and the digital world. Recent studies have shown remarkable
progress in multi-modal face generation and editing, for instance, using face
segmentation to guide image generation. However, it may be challenging for some
users to create these conditioning modalities manually. Thus, we introduce
M3Face, a unified multi-modal multilingual framework for controllable face
generation and editing. This framework enables users to utilize only text input
to generate controlling modalities automatically, for instance, semantic
segmentation or facial landmarks, and subsequently generate face images. We
conduct extensive qualitative and quantitative experiments to showcase our
framework's face generation and editing capabilities. Additionally, we propose
the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset
containing high-quality images, semantic segmentations, facial landmarks, and
different captions for each image in multiple languages. The code and the
dataset will be released upon publication. |
M$^3$Face, a unified multi-modal multilingual framework for controllable face generation and editing, simplifies multi-modal generation by automatically creating conditioning modalities (e.g., semantic segmentation, facial landmarks) from text input. |
Existing multi-modal methods, while powerful, require manual creation of conditioning modalities, which is complex for users. M$^3$Face addresses this by automatically generating these modalities from text, enhancing user experience and accessibility. |
The framework uses a masked transformer model (Muse) to generate conditioning modalities from text. Then, ControlNet generates face images from these modalities. For editing, inpainting edits the modalities, followed by Imagic manipulation using the trained ControlNet models. The M$^3$CelebA Dataset, a large-scale multi-modal and multilingual face dataset, is introduced to train and evaluate the framework. |
Generates realistic face images from both multi-modal conditions and text prompts, capturing intricate details like hair style, glasses, and emotions.
Enables consistent and controllable face editing using text, masks, landmarks, or a combination thereof, surpassing baselines in preserving identity and adhering to target prompts.
Outperforms existing methods in quantitative metrics like FID, CLIP Score, and directional CLIP similarity for both face generation and editing tasks. |
The quality of Muse-generated segmentation and landmarks can affect the final image quality.
The performance relies heavily on the Stable Diffusion backbone used in ControlNet, and exploring more robust backbones could further enhance results. |
face generation, face editing, multi-modal generation, diffusion models, multilingual |
2402.02352
Report |
Region-Based Representations Revisited |
Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, Sethuraman T V, Heyi Tao, Jae Yong Lee, Wilfredo Torres, Yu-Xiong Wang, Derek Hoiem |
We investigate whether region-based representations are effective for
recognition. Regions were once a mainstay in recognition approaches, but pixel
and patch-based features are now used almost exclusively. We show that recent
class-agnostic segmenters like SAM can be effectively combined with strong
unsupervised representations like DINOv2 and used for a wide variety of tasks,
including semantic segmentation, object-based image retrieval, and multi-image
analysis. Once the masks and features are extracted, these representations,
even with linear decoders, enable competitive performance, making them well
suited to applications that require custom queries. The compactness of the
representation also makes it well-suited to video analysis and other problems
requiring inference across many images. |
This paper investigates the effectiveness of region-based representations for various recognition tasks, leveraging class-agnostic segmenters like SAM and strong self-supervised representations like DINOv2. |
Region-based representations offer advantages such as scalability, flexibility, and interpretability, enabling applications like custom image retrieval, interactive learning, and multi-image inference. |
The methodology involves generating regions using SAM and SLIC, extracting image features using DINOv2, pooling features within masks to create region representations, and employing these representations for semantic segmentation, object retrieval, and activity classification. |
Region-based representations with simple linear decoders achieve competitive performance on semantic segmentation, outperforming patch-based approaches.
One-shot object-based image retrieval using region representations significantly surpasses single-token representations like DINOv2 and CLIP.
Region-based representations prove beneficial for multi-frame activity classification, allowing for efficient processing of multiple frames and capturing temporal dynamics. |
The current speed of SAM for region generation can be a bottleneck for real-time applications.
Future work could explore incorporating additional information into region features, such as human pose or optical flow, to further enhance their representational power for tasks like activity recognition. |
region-based representation, semantic segmentation, object retrieval, activity classification, self-supervised learning |
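The pooling step in the region-based record above (averaging dense features inside each mask, then applying a linear decoder) can be sketched with plain lists. The grid layout, the toy feature values, and the names `pool_region` and `linear_scores` are assumptions; in the described pipeline the masks would come from a segmenter such as SAM and the features from a backbone such as DINOv2.

```python
def pool_region(features, mask):
    """Average the feature vectors of all grid cells where mask is 1.

    features[r][c] is a feature vector; mask[r][c] is 0 or 1.
    Returns None for an empty mask.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for k in range(dim):
                    total[k] += vec[k]
    if count == 0:
        return None
    return [t / count for t in total]

def linear_scores(region_vec, weights, bias):
    """Linear decoder: one score per class for a pooled region vector."""
    return [sum(w * x for w, x in zip(row, region_vec)) + b
            for row, b in zip(weights, bias)]

# A 2x2 feature grid with 2-dimensional features and two region masks.
features = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 2.0], [0.0, 4.0]]]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]

top_vec = pool_region(features, top_mask)
bottom_vec = pool_region(features, bottom_mask)

# Hypothetical decoder weights: class 0 reads dimension 0, class 1 reads dimension 1.
scores = linear_scores(top_vec, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Each region collapses to a single vector regardless of its size, which is the source of the compactness the record mentions for multi-image and video analysis.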
2402.02209
Report |
On the Exploitation of DCT-Traces in the Generative-AI Domain |
Orazio Pontorno, Luca Guarnera, Sebastiano Battiato |
Deepfakes represent one of the toughest challenges in the world of
Cybersecurity and Digital Forensics, especially considering the high-quality
results obtained with recent generative AI-based solutions. Almost all
generative models leave unique traces in synthetic data that, if analyzed and
identified in detail, can be exploited to overcome the generalization
limitations of existing deepfake detectors. In this paper we analyze, in the
frequency domain, deepfake images generated by both GAN and Diffusion Model
engines, examining in detail the underlying statistical distribution of
Discrete Cosine Transform (DCT) coefficients. Recognizing that not all
coefficients contribute equally to image detection, we hypothesize the
existence of a unique "discriminative fingerprint", embedded in specific
combinations of coefficients. To identify them, Machine Learning classifiers
were trained on various combinations of coefficients. In addition, the
Explainable AI (XAI) LIME algorithm was used to search for intrinsic
discriminative combinations of coefficients. Finally, we performed a robustness
test to analyze the persistence of traces by applying JPEG compression. The
experimental results reveal the existence of traces left by the generative
models that are more discriminative and persistent under JPEG attacks. |
This paper presents an analysis of Discrete Cosine Transform (DCT) coefficients to identify unique traces left by GAN and Diffusion Model based deepfakes. |
Identifying these traces is crucial for improving deepfake detection methods and overcoming their generalization limitations. |
The authors analyze the statistical distribution of DCT coefficients, particularly the AC statistics (β^AC), from real, GAN-generated, and Diffusion Model-generated images. They use machine learning classifiers trained on different β^AC subsets and the XAI algorithm LIME to pinpoint the most discriminative coefficients. The persistence of these traces is further evaluated under JPEG compression. |
β^AC coefficients effectively distinguish between real, GAN, and Diffusion Model generated images.
Specific subsets of β^AC coefficients, particularly those identified by LIME, exhibit high discriminative power.
The discriminative power of high-frequency β^AC coefficients diminishes with JPEG compression, while low-frequency coefficients retain some discriminative ability. |
The study primarily focuses on low-resolution images.
Further investigation is needed to explore the persistence of low-frequency β^AC traces under stronger compression and other image manipulations. |
deepfakes, multimedia forensics, synthetic traces, discrete cosine transform, explainable ai |
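The per-coefficient AC statistics in the record above can be sketched as follows. The summary does not define β^AC, so this sketch assumes a common modelling choice: each AC coefficient of an 8x8 block DCT is treated as Laplacian-distributed, and its scale is estimated as σ/√2, which follows from the Laplacian variance being 2b². The function names are hypothetical.

```python
import math

N = 8  # block size assumed here, as in JPEG-style block DCTs

def dct2(block):
    """Orthonormal 2D DCT-II of an NxN block (list of N rows of N numbers)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def ac_scales(blocks):
    """Estimated Laplacian scale (sigma / sqrt(2)) for each AC position."""
    coeffs = [dct2(b) for b in blocks]
    scales = {}
    for u in range(N):
        for v in range(N):
            if u == 0 and v == 0:
                continue  # skip the DC term
            vals = [c[u][v] for c in coeffs]
            mean = sum(vals) / len(vals)
            var = sum((x - mean) ** 2 for x in vals) / len(vals)
            scales[(u, v)] = math.sqrt(var / 2.0)
    return scales

# Toy blocks: a flat block and a horizontal ramp.
flat = [[1.0] * N for _ in range(N)]
ramp = [[float(y) for y in range(N)] for _ in range(N)]
scales = ac_scales([flat, ramp])
```

The resulting 63 values form a feature vector per image set; a classifier trained on subsets of these values is one way to probe which coefficients are most discriminative.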
2402.01950
Report |
ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields |
Xingyu Miao, Yang Bai, Haoran Duan, Fan Wan, Yawen Huang, Yang Long, Yefeng Zheng |
Most of the existing works on arbitrary 3D NeRF style transfer require
retraining on each single style condition. This work aims to achieve zero-shot
controlled stylization in 3D scenes utilizing text or visual input as
conditioning factors. We introduce ConRF, a novel method of zero-shot
stylization. Specifically, due to the ambiguity of CLIP features, we employ a
conversion process that maps the CLIP feature space to the style space of a
pre-trained VGG network and then refine the CLIP multi-modal knowledge into a
style transfer neural radiation field. Additionally, we use a 3D volumetric
representation to perform local style transfer. By combining these operations,
ConRF offers the capability to utilize either text or images as references,
resulting in the generation of sequences with novel views enhanced by global or
local stylization. Our experiment demonstrates that ConRF outperforms other
existing methods for 3D scene and single-text stylization in terms of visual
quality. |
ConRF: a novel NeRF-based method for zero-shot 3D scene stylization using text or image as a single reference. |
Existing 3D scene stylization methods are limited to known styles or require retraining for new styles. ConRF offers flexibility and control by enabling zero-shot transfer with either text or image references. |
ConRF leverages a pre-trained CLIP encoder to extract features and maps them to the style space of a pre-trained VGG network using a mapping network. It employs a 3D selection volume for localized style manipulation based on text prompts. |
ConRF achieves zero-shot 3D style transfer using single-text or single-image references.
It outperforms existing methods in terms of visual quality and consistency across multiple views.
ConRF allows for localized style transfer based on text prompts, enabling fine-grained control over stylization. |
The method's performance is limited by the capabilities of the pre-trained CLIP model, which may not always accurately capture subtle style nuances.
The local style transfer is currently limited to forward-facing scenes and may require further development for broader applicability. Future work will explore incorporating generative models for enhanced creative capabilities. |
nerf, style transfer, zero-shot learning, clip, 3d scene stylization |
2402.01832
Report |
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? |
Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem |
We present SynthCLIP, a novel framework for training CLIP models with
entirely synthetic text-image pairs, significantly departing from previous
methods relying on real data. Leveraging recent text-to-image (TTI) generative
networks and large language models (LLM), we are able to generate synthetic
datasets of images and corresponding captions at any scale, with no human
intervention. With training at scale, SynthCLIP achieves performance comparable
to CLIP models trained on real datasets. We also introduce SynthCI-30M, a
purely synthetic dataset comprising 30 million captioned images. Our code,
trained models, and generated data are released at
https://github.com/hammoudhasan/SynthCLIP |
This paper presents SynthCLIP, a novel framework for training CLIP models using entirely synthetic text-image pairs generated through a pipeline leveraging text-to-image networks and large language models. |
Training CLIP traditionally relies on large, web-scraped datasets that suffer from noise, imbalanced representation, and potential safety concerns. SynthCLIP addresses these limitations by offering a scalable, controlled, and safe data generation process. |
SynthCLIP utilizes a four-step process: (1) Concept-based caption generation using an LLM, (2) Caption filtering for balanced concept distribution, (3) Image generation from captions using a text-to-image model (Stable Diffusion), and (4) Standard CLIP training on the synthetic pairs. |
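Step (2) of the process above can be pictured as capping how many captions each concept contributes. The following is a minimal sketch under stated assumptions, not the authors' filtering code: the function name `balance_captions`, the `(concept, caption)` pair layout, and the fixed per-concept cap are all hypothetical.

```python
import random
from collections import defaultdict


def balance_captions(pairs, max_per_concept, seed=0):
    """Keep at most `max_per_concept` captions for each concept.

    `pairs` is a list of (concept, caption) tuples. The per-concept cap is an
    assumed stand-in for the paper's balancing step, not its actual procedure.
    """
    rng = random.Random(seed)
    by_concept = defaultdict(list)
    for concept, caption in pairs:
        by_concept[concept].append(caption)

    balanced = []
    for concept in sorted(by_concept):
        captions = by_concept[concept]
        rng.shuffle(captions)  # keep a random subset rather than the first few
        balanced.extend((concept, c) for c in captions[:max_per_concept])
    return balanced


# Toy input: one over-represented concept and one rare concept.
pairs = [("dog", f"a dog photo {i}") for i in range(5)]
pairs.append(("violin", "a violin on a chair"))

balanced = balance_captions(pairs, max_per_concept=2)
counts = defaultdict(int)
for concept, _ in balanced:
    counts[concept] += 1
```

In this toy input the over-represented concept is trimmed to the cap while the rare concept keeps its single caption, which is the balancing effect the filtering step is meant to have.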
SynthCLIP, trained on a large-scale synthetic dataset (SynthCI-30M), achieves performance comparable to CLIP models trained on real datasets like CC12M.
Scaling the size of the synthetic dataset significantly improves performance across various vision and vision-language tasks.
The quality of captions significantly impacts performance, with synthetic captions generated through a multi-step process, potentially augmented with captioning models, showing the most promise. |
The current generation pipeline, while scalable, requires significant computational resources.
Future work will explore optimizing resource usage and further improving caption quality and alignment with generated images. |
clip, diffusion models, vision-language models, synthetic data, generative networks |
2402.01590
Report |
NeuroCine: Decoding Vivid Video Sequences from Human Brain Activities |
Jingyuan Sun, Mingxiao Li, Zijiao Chen, Marie-Francine Moens |
In the pursuit to understand the intricacies of the human brain's visual
processing, reconstructing dynamic visual experiences from brain activities
emerges as a challenging yet fascinating endeavor. While recent advancements
have achieved success in reconstructing static images from non-invasive brain
recordings, the domain of translating continuous brain activities into video
format remains underexplored. In this work, we introduce NeuroCine, a novel
dual-phase framework targeting the inherent challenges of decoding fMRI
data, such as noise, spatial redundancy and temporal lags. This framework
proposes spatial masking and temporal interpolation-based augmentation for
contrastive learning of fMRI representations and a diffusion model enhanced by
dependent prior noise for video generation. Tested on a publicly available fMRI
dataset, our method shows promising results, outperforming the previous
state-of-the-art models by a notable margin of 20.97%, 31.00% and
12.30% respectively on decoding the brain activities of three subjects in
the fMRI dataset, as measured by SSIM. Additionally, our attention analysis
suggests that the model aligns with existing brain structures and functions,
indicating its biological plausibility and interpretability. |
Introduces NeuroCine, a novel dual-phase framework for reconstructing high-resolution videos from fMRI data, addressing challenges like noise, spatial redundancy, and temporal lags. |
Decoding dynamic visual experiences from brain activity is crucial for understanding visual processing and developing technologies for sensory impairments. |
Employs spatial masking and temporal interpolation for contrastive learning of fMRI representations, and a diffusion model with dependent prior noise for generating videos. |
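The two augmentations named above can be illustrated on a toy fMRI sequence. This is a hedged sketch rather than the paper's implementation: representing a frame as a flat list of voxel values, the helper names `spatial_mask` and `temporal_interpolate`, zero-filling of masked voxels, and linear blending are all assumptions made for illustration.

```python
import random


def spatial_mask(frame, mask_ratio, rng):
    """Zero out a random subset of voxels in one fMRI frame (assumed zero-fill)."""
    n_mask = int(len(frame) * mask_ratio)
    masked_idx = set(rng.sample(range(len(frame)), n_mask))
    return [0.0 if i in masked_idx else v for i, v in enumerate(frame)]


def temporal_interpolate(frame_a, frame_b, t):
    """Linearly blend two consecutive frames; t=0 gives frame_a, t=1 gives frame_b."""
    return [(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


rng = random.Random(0)
frame_a = [1.0, 2.0, 3.0, 4.0]
frame_b = [3.0, 4.0, 5.0, 6.0]

masked_view = spatial_mask(frame_a, mask_ratio=0.5, rng=rng)
between_view = temporal_interpolate(frame_a, frame_b, t=0.5)
```

In a contrastive setup, views such as `masked_view` and `between_view` would be treated as positives for the original frame, which pushes the encoder to tolerate missing voxels and small temporal shifts.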
Significantly outperforms previous state-of-the-art models in decoding brain activities, as measured by SSIM (20.97%, 31.00%, and 12.30% improvements on three subjects).
Generates videos with higher semantic accuracy compared to previous methods.
Attention analysis suggests biological plausibility with alignments to visual cortex and higher cognitive networks. |
Limited fMRI-video paired datasets restrict training data size.
Further research is needed to improve temporal coherence and clarity in generated videos. |
fmri decoding, video reconstruction, diffusion models, contrastive learning, brain-computer interface |
2402.01566
Report |
Boximator: Generating Rich and Controllable Motions for Video Synthesis |
Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li |
Generating rich and controllable motion is a pivotal challenge in video
synthesis. We propose Boximator, a new approach for fine-grained motion
control. Boximator introduces two constraint types: hard box and soft box.
Users select objects in the conditional frame using hard boxes and then use
either type of boxes to roughly or rigorously define the object's position,
shape, or motion path in future frames. Boximator functions as a plug-in for
existing video diffusion models. Its training process preserves the base
model's knowledge by freezing the original weights and training only the
control module. To address training challenges, we introduce a novel
self-tracking technique that greatly simplifies the learning of box-object
correlations. Empirically, Boximator achieves state-of-the-art video quality
(FVD) scores, improving on two base models, and further enhanced after
incorporating box constraints. Its robust motion controllability is validated
by drastic increases in the bounding box alignment metric. Human evaluation
also shows that users favor Boximator generation results over the base model. |
Introduces Boximator, a novel approach for fine-grained video motion control using hard and soft box constraints, functioning as a plug-in for existing video diffusion models. |
Addresses the limitations of existing video synthesis methods by enabling precise control over object motion, pose, and interactions using intuitive box-based constraints. |
Leverages a novel self-tracking technique during training to learn box-object correlations by generating colored bounding boxes, guiding object generation and motion. |
Achieves state-of-the-art video quality (FVD) scores, outperforming base models with and without box constraints.
Demonstrates robust motion controllability with significant improvements in bounding box alignment metrics (AP) on MSR-VTT and ActivityNet.
Receives strong preference in human evaluation for both video quality and motion control compared to base models. |
Reliance on automated bounding box annotations may introduce noise and limit control accuracy.
Current implementation focuses on single-object control per box, requiring further exploration for multi-object interactions within a single box. |
video synthesis, motion control, diffusion models, self-tracking, box constraints |
2402.01524
Report |
HyperPlanes: Hypernetwork Approach to Rapid NeRF Adaptation |
Paweł Batorski, Dawid Malarz, Marcin Przewięźlikowski, Marcin Mazur, Sławomir Tadeja, Przemysław Spurek |
Neural radiance fields (NeRFs) are a widely accepted standard for
synthesizing new 3D object views from a small number of base images. However,
NeRFs have limited generalization properties, which means that we need to use
significant computational resources to train individual architectures for each
item we want to represent. To address this issue, we propose a few-shot
learning approach based on the hypernetwork paradigm that does not require
gradient optimization during inference. The hypernetwork gathers information
from the training data and generates an update for universal weights. As a
result, we have developed an efficient method for generating a high-quality 3D
object representation from a small number of images in a single step. This has
been confirmed by direct comparison with the state-of-the-art solutions and a
comprehensive ablation study. |
Presents HyperPlanes, a novel few-shot learning approach for NeRF-based 3D object representation, leveraging the hypernetwork paradigm to generate updates for universal weights in a single step, eliminating the need for gradient optimization during inference. |
Addresses limitations of traditional NeRF models, such as their inability to generalize to new data and the need for extensive training times, by enabling rapid adaptation to new objects with limited data. |
Employs a hypernetwork that takes a few support ImagePlanes (HyperPlanes) and the target network weights to generate updates for the target PointMultiPlaneNeRF model, enabling efficient adaptation to new object representations. |
Achieves superior results in reconstructing unseen objects compared to gradient-based few-shot learning methods like REPTILE, even without fine-tuning.
Exhibits strong generalization capabilities across different object types, outperforming MultiPlaneNeRF in cross-class object rendering.
Demonstrates significantly faster object reconstruction (up to 380 times) than vanilla NeRF trained for a large number of epochs. |
Potential limitation in achieving the same rendering quality as a vanilla NeRF with extensive training.
Future work will focus on exploring techniques to further enhance the rendering quality of the generated 3D objects. |
neural radiance fields, nerf, few-shot learning, hypernetworks, 3d object representation |
2402.01472
Report |
Synthetic Data for the Mitigation of Demographic Biases in Face Recognition |
Pietro Melzi, Christian Rathgeb, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Dominik Lawatsch, Florian Domin, Maxim Schaubert |
This study investigates the possibility of mitigating the demographic biases
that affect face recognition technologies through the use of synthetic data.
Demographic biases have the potential to impact individuals from specific
demographic groups, and can be identified by observing disparate performance of
face recognition systems across demographic groups. They primarily arise from
the unequal representations of demographic groups in the training data. In
recent times, synthetic data have emerged as a solution to some problems that
affect face recognition systems. In particular, during the generation process
it is possible to specify the desired demographic and facial attributes of
images, in order to control the demographic distribution of the synthesized
dataset, and fairly represent the different demographic groups. We propose to
fine-tune existing face recognition systems that exhibit demographic biases
with synthetic data. We use synthetic datasets generated with GANDiffFace,
a novel framework able to synthesize datasets for face recognition with
controllable demographic distribution and realistic intra-class variations. We
consider multiple datasets representing different demographic groups for
training and evaluation. Also, we fine-tune different face recognition systems,
and evaluate their demographic fairness with different metrics. Our results
support the proposed approach and the use of synthetic data to mitigate
demographic biases in face recognition. |
This paper investigates the use of synthetic data, generated by their novel GANDiffFace framework, to mitigate demographic bias in face recognition systems. |
Face recognition systems often exhibit biases against certain demographic groups due to unequal representation in training datasets. This study explores a solution using synthetic data to improve fairness. |
The authors fine-tuned two popular face recognition systems (ArcFace and CosFace) with synthetic datasets generated by GANDiffFace. They used two different fine-tuning datasets: one specifically representing the biased demographic (Asian), and another with balanced demographic representation. The effectiveness of bias mitigation was assessed using fairness metrics (FDR, IR, and GARBE) on two real-world datasets (DiveFace and RFW). |
Fine-tuning ArcFace with the balanced synthetic dataset mitigated bias effectively, leading to improved fairness metrics.
Fine-tuning ArcFace with the Asian-specific synthetic dataset negatively impacted fairness by reducing FMR for Asians to levels significantly lower than other groups.
Fine-tuning CosFace with the Asian-specific synthetic dataset showed minor fairness improvements, while using the balanced synthetic dataset did not yield consistent positive results. |
The study is limited to addressing bias against one specific demographic (Asian).
Further investigation with different synthetic datasets and a wider range of demographic groups is needed. |
face recognition, demographic bias, fairness, synthetic data, gandiffface |
2402.01459
Report |
GaMeS: Mesh-Based Adapting and Modification of Gaussian Splatting |
Joanna Waczyńska, Piotr Borycki, Sławomir Tadeja, Jacek Tabor, Przemysław Spurek |
Recently, a range of neural network-based methods for image rendering have
been introduced. One such widely researched method, the neural radiance field (NeRF), relies
on a neural network to represent 3D scenes, allowing for realistic view
synthesis from a small number of 2D images. However, most NeRF models are
constrained by long training and inference times. In comparison, Gaussian
Splatting (GS) is a novel, state-of-the-art technique for rendering points in a
3D scene by approximating their contribution to image pixels through Gaussian
distributions, warranting fast training and swift, real-time rendering. A
drawback of GS is the absence of a well-defined approach for its conditioning
due to the necessity to condition several hundred thousand Gaussian components.
To solve this, we introduce the Gaussian Mesh Splatting (GaMeS) model, which
allows modification of Gaussian components in a similar way as meshes. We
parameterize each Gaussian component by the vertices of the mesh face.
Furthermore, our model requires either a mesh supplied as input for initialization
or a mesh estimated during training. We also define Gaussian splats solely based on their location
on the mesh, allowing for automatic adjustments in position, scale, and
rotation during animation. As a result, we obtain a real-time rendering of
editable GS. |
This paper introduces Gaussian Mesh Splatting (GaMeS), a novel method for representing and rendering editable Gaussian Splatting (GS) models using meshes. GaMeS parameterizes Gaussian components on mesh faces, enabling automatic adaptation to mesh modifications and facilitating real-time animation. |
Efficiently conditioning GS models, which consist of hundreds of thousands of Gaussian components, is challenging. GaMeS addresses this by directly coupling Gaussian components with mesh structures, enabling real-time editing and animation while maintaining rendering quality comparable to GS. |
GaMeS represents 3D scenes using Gaussian components positioned on mesh faces. It either utilizes existing meshes or generates a simplified mesh (pseudo-mesh) directly from Gaussian components. Gaussian parameters (mean, covariance) are parameterized by mesh vertices, ensuring automatic adaptation to mesh transformations. |
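To make the idea of parameterizing a Gaussian by mesh vertices concrete, the sketch below ties a Gaussian's mean to a weighted combination of a triangle's vertices and derives one orientation axis from the face normal. It is an illustration under assumptions, not the GaMeS code: the function names, the default barycentric weights, and the use of the face normal as a rotation axis are hypothetical choices, and the covariance is left out.

```python
def gaussian_mean(v1, v2, v3, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Place a Gaussian's mean at a weighted combination of the face vertices."""
    a1, a2, a3 = weights
    return tuple(a1 * p + a2 * q + a3 * r for p, q, r in zip(v1, v2, v3))


def face_normal(v1, v2, v3):
    """Unit normal of the triangle, usable as one axis of the Gaussian's rotation."""
    e1 = [b - a for a, b in zip(v1, v2)]
    e2 = [b - a for a, b in zip(v1, v3)]
    n = (
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    )
    length = sum(c * c for c in n) ** 0.5
    return tuple(c / length for c in n)


triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mean_before = gaussian_mean(*triangle)

# Editing the mesh (here: lifting the face by 2 along z) moves the Gaussian with it.
moved = [(x, y, z + 2.0) for x, y, z in triangle]
mean_after = gaussian_mean(*moved)
normal_after = face_normal(*moved)
```

Because the mean is a function of the vertices only, no per-Gaussian bookkeeping is needed when the mesh is edited: recomputing from the new vertex positions is enough.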
GaMeS achieves comparable rendering quality to state-of-the-art methods on the NeRF-Synthetic and Mip-NeRF360 datasets.
GaMeS allows for real-time editing and animation of 3D scenes by manipulating the underlying mesh.
The method effectively handles scenarios with and without pre-existing meshes, demonstrating flexibility in diverse applications. |
GaMeS may exhibit artifacts during significant mesh modifications, particularly with large mesh faces.
Future work involves exploring strategies to handle Gaussian component adaptation when mesh faces are split during modification. |
gaussian splatting, mesh representation, 3d scene editing, real-time rendering, neural rendering |
2402.01368
Report |
LIR: A Lightweight Baseline for Image Restoration |
Dongqi Fan, Ting Yue, Xin Zhao, Liang Chang |
Recently, there have been significant advancements in Image Restoration based
on CNN and transformer. However, the inherent characteristics of the Image
Restoration task are often overlooked. Many works, instead, only focus on the
basic block design and stack numerous such blocks in the model, leading to
redundant parameters and unnecessary computations. Thus, the efficiency of the
image restoration is hindered. In this paper, we propose a Lightweight Baseline
for Image Restoration called LIR to efficiently reconstruct the image and
remove degradations (blur, rain, noise, haze). First of all, LIR addresses the
degradations existing in the local and global residual connections that are
ignored by modern networks, through a simple structural design. Then, to
keep the model lightweight, a Lightweight Adaptive Attention (LAA) Block is introduced
based on the inherent characteristics of Image Restoration, which is
mainly composed of proposed Adaptive Filters and Attention Blocks. LAA is
capable of adaptively sharpening contours, removing degradation, and capturing
global information in various Image Restoration scenes in a
computation-friendly manner. Extensive experiments demonstrate that our LIR
achieves comparable performance to state-of-the-art models with fewer
parameters and computations in certain tasks. In addition, it is worth noting
that our LIR produces visual results that are more in line with human aesthetics
than those of state-of-the-art networks. |
This paper introduces LIR, a lightweight image restoration network that effectively removes degradations (blur, rain, noise, haze) from images. |
Many existing image restoration networks prioritize complex block designs and stacking, leading to excessive parameters and computations. LIR aims to address this by offering a more efficient and lightweight solution. |
LIR leverages a novel Lightweight Adaptive Attention (LAA) block composed of Adaptive Filters and Attention Blocks. This design enables adaptive sharpening of contours, degradation removal, and efficient global information capture. |
LIR achieves comparable performance to state-of-the-art models on Rain100L for deraining while using fewer parameters and computations.
LIR surpasses MLP and Transformer-based methods on SOTS outdoor for dehazing, demonstrating strong performance with fewer computations.
LIR shows comparable performance to state-of-the-art on denoising (CBSD68, Urban100) and deblurring (GoPro, HIDE) tasks with a lightweight design. |
LIR's performance on deblurring, while strong, is less prominent compared to deraining and denoising, possibly due to limitations in handling dynamic contours of fast-moving objects.
Future work could explore enhancements to the Adaptive Filter to better address the challenges posed by dynamic contours in deblurring tasks. |
image restoration, lightweight, attention, cnn, adaptive filter |
2402.01355
Report |
FindingEmo: An Image Dataset for Emotion Recognition in the Wild |
Laurent Mertens, Elahe' Yargholi, Hans Op de Beeck, Jan Van den Stock, Joost Vennekens |
We introduce FindingEmo, a new image dataset containing annotations for 25k
images, specifically tailored to Emotion Recognition. Contrary to existing
datasets, it focuses on complex scenes depicting multiple people in various
naturalistic, social settings, with images being annotated as a whole, thereby
going beyond the traditional focus on faces or single individuals. Annotated
dimensions include Valence, Arousal and Emotion label, with annotations
gathered using Prolific. Together with the annotations, we release the list of
URLs pointing to the original images, as well as all associated source code. |
Introduces FindingEmo, a new image dataset for Emotion Recognition in the Wild, focusing on complex scenes with multiple people in naturalistic social settings, annotated for Valence, Arousal, and Emotion label. |
Addresses the lack of datasets for emotion recognition beyond facial expressions, emphasizing the importance of context and social dynamics in understanding emotions. |
Collected 25k images from the internet using a custom scraper and filtering process, followed by annotation using Plutchik's Wheel of Emotions via the Prolific platform. |
Annotations show expected correlations between Valence, Arousal, and Emotion labels.
Baseline models trained on ImageNet outperform Places365-trained models, suggesting natural object features are more salient for emotion recognition.
Late fusion with facial emotion recognition features significantly improves performance, highlighting the importance of facial expressions in complex social scenes. |
Dataset exhibits imbalance in emotion label distribution, reflecting real-world prevalence but potentially impacting model training.
Limited annotator diversity (primarily young adults) may introduce bias in annotations. |
computer vision, dataset, emotion recognition, affective computing, social cognition |
2402.01345
Report |
Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models |
Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou |
Recent advancements in large vision-language models (LVLMs) have demonstrated
impressive capability in visual information understanding with human language.
Despite these advances, LVLMs still face challenges with multimodal
hallucination, such as generating text descriptions of objects that are not
present in the visual information. However, the underlying fundamental reasons
of multimodal hallucinations remain poorly explored. In this paper, we propose
a new perspective, suggesting that the inherent biases in LVLMs might be a key
factor in hallucinations. Specifically, we systematically identify a semantic
shift bias related to paragraph breaks (\n\n), where the content before and
after '\n\n' in the training data frequently exhibit significant semantic
changes. This pattern leads the model to infer that the contents following
'\n\n' should be obviously different from the preceding contents with less
hallucinatory descriptions, thereby increasing the probability of hallucinatory
descriptions subsequent to the '\n\n'. We have validated this hypothesis on
multiple publicly available LVLMs. Besides, we find that deliberately inserting
'\n\n' into the generated description can induce more hallucinations. A simple
method is proposed to effectively mitigate the hallucination of LVLMs by
skipping the output of '\n'. |
This paper identifies a semantic shift bias in LVLMs triggered by paragraph breaks ('\n\n') that can induce hallucinations. |
Hallucinations in LVLMs, where models generate descriptions of objects not present in the visual input, limit their deployment in safety-critical applications. |
The authors analyze the impact of paragraph breaks on hallucination severity in six LVLMs using the CHAIR evaluation framework and propose two mitigation methods: modifying prompts (MiHI) and adjusting decoding strategies (MiHO) to avoid '\n\n'. |
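The decoding-side mitigation can be pictured as lowering the score of the newline token before each greedy choice. The sketch below is illustrative only: the toy logits dictionary, the function name `greedy_step`, and the default infinite penalty are assumptions, and the paper's actual adjustment may differ.

```python
import math


def greedy_step(logits, newline_token="\n", penalty=math.inf):
    """Pick the highest-scoring token after penalising the newline token.

    `logits` maps token strings to scores. With the default infinite penalty
    the newline token can never be selected.
    """
    adjusted = dict(logits)
    if newline_token in adjusted:
        adjusted[newline_token] -= penalty
    return max(adjusted, key=adjusted.get)


# Toy scores for one decoding step, where the newline token would normally win.
step_logits = {"\n": 3.2, " The": 2.9, ".": 1.0}

plain_choice = greedy_step(step_logits, penalty=0.0)  # unmodified greedy decoding
skipped_choice = greedy_step(step_logits)             # newline suppressed
```

A finite penalty would discourage paragraph breaks without forbidding them, which is one way to trade formatting freedom against hallucination risk.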
Descriptions generated after '\n\n' exhibit significantly more hallucinations.
Manually inserting '\n\n' in generated descriptions increases hallucination probability.
Both MiHI and MiHO effectively reduce hallucinations across most LVLMs, especially with greedy decoding. |
MiHI effectiveness depends on the LVLMs' instruction fine-tuning, showing less improvement in models like Fuyu-8B.
The influence of model scale on the '\n\n'-induced hallucination problem requires further investigation. |
multimodal hallucination, large vision-language models, semantic shift bias, hallucination mitigation, paragraph breaks |
2402.01239
Report |
PRIME: Protect Your Videos From Malicious Editing |
Guanlin Li, Shuai Yang, Jie Zhang, Tianwei Zhang |
With the development of generative models, the quality of generated content
keeps increasing. Recently, open-source models have made it surprisingly easy
to manipulate and edit photos and videos, with just a few simple prompts. While
these cutting-edge technologies have gained popularity, they have also given
rise to concerns regarding the privacy and portrait rights of individuals.
Malicious users can exploit these tools for deceptive or illegal purposes.
Although some previous works focus on protecting photos against generative
models, we find there are still gaps between protecting videos and images in
the aspects of efficiency and effectiveness. Therefore, we introduce our
protection method, PRIME, to significantly reduce the time cost and improve the
protection performance. Moreover, to evaluate our proposed protection method,
we consider both objective metrics and human subjective metrics. Our evaluation
results indicate that PRIME requires only 8.3% of the GPU hours of the
previous state-of-the-art method and achieves better protection results on both
human evaluation and objective metrics. Code can be found in
https://github.com/GuanlinLee/prime. |
This paper introduces PRIME, a novel black-box video protection method designed to safeguard videos against malicious editing techniques that exploit Latent Diffusion Models (LDMs). |
The rise of advanced video editing tools powered by LDMs poses a significant threat to individuals' privacy and portrait rights, enabling malicious actors to create and spread harmful content. |
PRIME leverages the transferability of adversarial perturbations, incorporating them into every frame of the video while employing mechanisms like 'fast convergence searching' and 'early stage stopping' to reduce computation time. Additionally, it utilizes an 'anti-dynamic compression' method to maintain perturbation effectiveness even after video compression. |
PRIME significantly reduces the cost of video protection, requiring only 8.3% of the GPU hours used by the baseline method PhotoGuard.
It effectively disrupts malicious editing attempts, resulting in lower quality edited videos with reduced prompt matching as per human evaluations.
PRIME demonstrates better transferability across different LDM models and editing pipelines compared to previous methods. |
The evaluation of malicious video editing is limited by the absence of a standardized benchmark dataset and the reliance on subjective human evaluation.
Future work could explore the development of a robust, publicly available dataset for malicious video editing and protection research. |
video protection, malicious editing, latent diffusion models, adversarial perturbations, privacy protection |
2402.01217
Report |
Taming Uncertainty in Sparse-view Generalizable NeRF via Indirect Diffusion Guidance |
Yaokun Li, Chao Gou, Guang Tan |
Neural Radiance Fields (NeRF) have demonstrated effectiveness in synthesizing
novel views. However, their reliance on dense inputs and scene-specific
optimization has limited their broader applicability. Generalizable NeRFs
(Gen-NeRF), while intended to address this, often produce blurring artifacts in
unobserved regions with sparse inputs, which are full of uncertainty. In this
paper, we aim to diminish the uncertainty in Gen-NeRF for plausible renderings.
We assume that NeRF's inability to effectively mitigate this uncertainty stems
from its inherent lack of generative capacity. Therefore, we innovatively
propose an Indirect Diffusion-guided NeRF framework, termed ID-NeRF, to address
this uncertainty from a generative perspective by leveraging a distilled
diffusion prior as guidance. Specifically, to avoid model confusion caused by
directly regularizing with inconsistent samplings as in previous methods, our
approach introduces a strategy to indirectly inject the inherently missing
imagination into the learned implicit function through a diffusion-guided
latent space. Empirical evaluation across various benchmarks demonstrates the
superior performance of our approach in handling uncertainty with sparse
inputs. |
Presents ID-NeRF, a novel Gen-NeRF framework that addresses uncertainty in unobserved regions of sparse-view scenarios through indirect guidance from a pre-trained diffusion model. |
Existing Gen-NeRFs struggle with blurry artifacts in unobserved regions due to a lack of generative capacity to handle the uncertainty associated with sparse inputs. |
ID-NeRF leverages score-based distillation to inject generative knowledge into a latent space, which then guides the refinement of reprojected visual features extracted from sparse views. This indirect guidance avoids model confusion caused by inconsistent direct supervision. |
ID-NeRF outperforms SOTA Gen-NeRFs on DTU, Blender, and RFF datasets, especially in challenging 2-input view settings.
The method demonstrates superior performance in handling uncertainty, achieving better results as input sparsity increases.
Ablation studies confirm the effectiveness of the latent space, attention-based guidance, and indirect supervision strategy. |
There's room for improvement in image fidelity, particularly in terms of SSIM and LPIPS metrics.
Future work could explore faster inference and reduced model size for practical deployment. |
generalizable neural radiance fields, sparse-view reconstruction, uncertainty mitigation, indirect diffusion guidance, score-based distillation |
2402.01162
Report |
2AFC Prompting of Large Multimodal Models for Image Quality Assessment |
Hanwei Zhu, Xiangjie Sui, Baoliang Chen, Xuelin Liu, Peilin Chen, Yuming Fang, Shiqi Wang |
While abundant research has been conducted on improving high-level visual
understanding and reasoning capabilities of large multimodal models (LMMs),
their visual quality assessment (IQA) ability has been relatively
under-explored. Here we take initial steps towards this goal by employing the
two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as
the most reliable way of collecting human opinions of visual quality.
Subsequently, the global quality score of each image estimated by a particular
LMM can be efficiently aggregated using maximum a posteriori estimation.
Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and
correlation, to provide comprehensive quantifications and deeper insights into
the IQA capability of five LMMs. Extensive experiments show that existing LMMs
exhibit remarkable IQA ability on coarse-grained quality comparison, but there
is room for improvement on fine-grained quality discrimination. The proposed
dataset sheds light on the future development of IQA models based on LMMs. The
codes will be made publicly available at https://github.com/h4nwei/2AFC-LMMs. |
This paper proposes a framework to evaluate the Image Quality Assessment (IQA) capability of Large Multimodal Models (LMMs) using a 2AFC prompting approach and three evaluation metrics. |
This is important because while LMMs' high-level visual understanding has been studied, their low-level visual processing abilities, such as IQA, remain largely unexplored. |
The study uses coarse-to-fine pairing rules for image comparison and employs maximum a posteriori (MAP) estimation to aggregate pairwise preferences into global quality rankings. Three evaluation criteria (consistency, accuracy, and correlation) are introduced to quantify LMMs' IQA performance. |
Open-source LMMs show poor consistency and potential biases in IQA tasks.
GPT-4V exhibits superior IQA ability compared to other LMMs, especially on realistically distorted images.
Existing LMMs, including GPT-4V, struggle with fine-grained IQA, indicating areas for improvement. |
Only a limited number of open-source LMMs are evaluated, because the evaluated models must accept multiple images as input.
Future work includes extending the evaluation to more LMMs and exploring advanced prompting techniques to further enhance their IQA capabilities. |
large multimodal models, image quality assessment, two-alternative forced choice, map estimation, fine-grained iqa |
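The aggregation step in the 2AFC entry above (2402.01162), where pairwise preferences are turned into global quality scores by MAP estimation, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not say which preference model is used, so a Bradley-Terry likelihood with a zero-mean Gaussian prior is swapped in here, and the function name `map_quality_scores`, its parameters, and the toy comparison data are all hypothetical.

```python
import math

def map_quality_scores(n_items, comparisons, prior_var=1.0, lr=0.1, steps=2000):
    """Estimate one quality score per image from 2AFC outcomes.

    comparisons: list of (winner, loser) index pairs.
    Assumed model (not from the source): Bradley-Terry likelihood
    plus a zero-mean Gaussian prior on each score, maximized by
    plain gradient ascent on the log-posterior.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        # Gradient of the Gaussian log-prior.
        grad = [-s / prior_var for s in scores]
        for w, l in comparisons:
            # Probability that the recorded winner beats the loser.
            p_win = 1.0 / (1.0 + math.exp(-(scores[w] - scores[l])))
            grad[w] += 1.0 - p_win
            grad[l] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return scores

# Toy data: image 0 usually wins, image 2 always loses.
comparisons = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2), (0, 2), (1, 0)]
scores = map_quality_scores(3, comparisons)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

The prior keeps the scores finite even when one image wins every comparison, which is the practical reason to prefer MAP over plain maximum likelihood in this setting.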
2402.01123
Report |
A Single Simple Patch is All You Need for AI-generated Image Detection |
Jiaxuan Chen, Jieteng Yao, Li Niu |
The recent development of generative models unleashes the potential of
generating hyper-realistic fake images. To prevent the malicious usage of fake
images, AI-generated image detection aims to distinguish fake images from real
images. However, existing methods suffer from a severe performance drop when
detecting images generated by unseen generators. We find that generative models
tend to focus on generating the patches with rich textures to make the images
more realistic while neglecting the hidden noise caused by camera capture
present in simple patches. In this paper, we propose to exploit the noise
pattern of a single simple patch to identify fake images. Furthermore, due to
the performance decline when handling low-quality generated images, we
introduce an enhancement module and a perception module to remove the
interfering information. Extensive experiments demonstrate that our method can
achieve state-of-the-art performance on public benchmarks. |
This paper proposes a novel AI-generated image detection method called Single Simple Patch (SSP) network that leverages noise patterns in simple image patches to distinguish between real and fake images. |
AI-generated image detection is crucial to prevent the malicious use of hyper-realistic fake images, but existing methods struggle with generalization across different generators and image quality degradation. |
The method extracts the simplest patch from an image, analyzes its noise fingerprints using SRM filters, and employs a ResNet50 classifier. An enhanced version incorporates an enhancement module and a perception module to mitigate blur and compression artifacts. |
SSP network effectively distinguishes real and fake images by focusing on noise patterns in simple patches, outperforming existing methods on cross-generator settings.
The enhanced SSP network demonstrates robustness to image quality degradation like blur and compression, achieving improved accuracy on low-quality images.
Experimental results on GenImage and ForenSynths datasets show superior performance compared to state-of-the-art methods, highlighting the effectiveness of the proposed approach. |
The method's performance may be limited when dealing with extremely low-quality images, particularly those with compression quality lower than 90.
Future work could explore integrating information from simple patches with other aspects of the original image to further enhance robustness. |
ai-generated image detection, generative models, noise pattern analysis, simple patch, image forensics |
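The first two steps of the SSP entry above (2402.01123), selecting the simplest patch and extracting a noise residual from it, can be sketched as below. This is a simplified illustration under assumptions: the texture score (sum of absolute differences between adjacent pixels), the non-overlapping patch grid, and the single first-order horizontal high-pass residual standing in for the SRM filter bank are choices made here, and the ResNet50 classifier is omitted. All function names are hypothetical.

```python
def texture_score(img, top, left, size):
    """Sum of absolute differences between adjacent pixels inside a patch."""
    total = 0
    for r in range(top, top + size):
        for c in range(left, left + size):
            if c + 1 < left + size:
                total += abs(img[r][c] - img[r][c + 1])
            if r + 1 < top + size:
                total += abs(img[r][c] - img[r + 1][c])
    return total

def simplest_patch(img, size):
    """Return (top, left) of the non-overlapping patch with the lowest texture score."""
    h, w = len(img), len(img[0])
    best = None
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            s = texture_score(img, top, left, size)
            if best is None or s < best[0]:
                best = (s, top, left)
    return best[1], best[2]

def highpass_residual(img, top, left, size):
    """First-order horizontal residual x[c+1] - x[c] (a stand-in for SRM filters)."""
    return [[img[r][c + 1] - img[r][c] for c in range(left, left + size - 1)]
            for r in range(top, top + size)]

# Toy grayscale image: the top-left 2x2 patch is nearly flat.
img = [
    [10, 11, 0, 50],
    [10, 10, 90, 5],
    [3, 70, 20, 20],
    [80, 1, 20, 25],
]
top, left = simplest_patch(img, 2)
residual = highpass_residual(img, top, left, 2)
```

In a flat patch the image content contributes almost nothing to the residual, so what remains is mostly noise, which is the signal the entry describes the classifier as using.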
2402.00909
Report |
Generalizing GradCAM for Embedding Networks |
Mudit Bachhawat |
Visualizing CNNs is an important part of building trust and explaining a model's
predictions. Methods like CAM and GradCAM have been very successful in
localizing the area of the image responsible for the output but are limited to
classification models. In this paper, we present a new method, EmbeddingCAM,
which generalizes Grad-CAM to embedding networks. We show that for
classification networks, EmbeddingCAM reduces to GradCAM. We show the
effectiveness of our method on the CUB-200-2011 dataset and also present
quantitative and qualitative analysis on the dataset. |
This paper presents EmbeddingCAM, a novel method for generating GradCAM-style heatmaps to explain predictions from any visual embedding network. |
Visualizing and explaining predictions of embedding networks, increasingly used in applications like open-set classification, is crucial for building trust and understanding model behavior. |
EmbeddingCAM uses class proxies as substitutes for class labels and defines a custom loss based on the agreement between the model output and the proxy. This loss is then backpropagated to generate a heatmap, similar to GradCAM for classification networks. |
EmbeddingCAM successfully generates heatmaps highlighting relevant image regions for embedding networks.
It outperforms or achieves comparable performance to previous methods on the CUB-200-2011 dataset for mean heatmap ratio and weakly supervised localization accuracy.
Unlike prior methods, EmbeddingCAM does not require multiple image sampling or test-time indexing, making it more efficient and generalizable. |
The paper primarily evaluates EmbeddingCAM on the CUB-200-2011 dataset, focusing on fine-grained classification; further exploration on diverse datasets and tasks is needed.
Future work could explore generating heatmaps at the input image scale, potentially revealing finer-grained insights. |
explainable ai, visual embedding networks, gradcam, heatmap visualization, metric learning |
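The heatmap computation described in the EmbeddingCAM entry above (2402.00909) can be sketched for a deliberately tiny network. The assumptions here are not from the source: the embedding is a linear head applied to globally average-pooled feature maps, and the agreement between the embedding and the class proxy is a plain dot product, so the gradient can be written in closed form instead of relying on autograd. The function name `embedding_cam` and the toy inputs are hypothetical.

```python
def embedding_cam(feature_maps, head_weights, proxy):
    """GradCAM-style heatmap for an embedding network.

    feature_maps: K maps of shape H x W (last conv layer activations).
    head_weights: D x K matrix mapping pooled features to the embedding.
    proxy:        D-dimensional class proxy.
    Assumed score: dot(embedding, proxy), embedding = W @ mean_pool(A).
    """
    k = len(feature_maps)
    h = len(feature_maps[0])
    w = len(feature_maps[0][0])
    # d(score)/d(pooled feature c) = sum_d proxy[d] * W[d][c]
    pooled_grad = [sum(proxy[d] * head_weights[d][c] for d in range(len(proxy)))
                   for c in range(k)]
    # Each pixel of map c receives pooled_grad[c] / (H*W); the GradCAM
    # channel weight is the spatial mean of that gradient.
    alphas = [g / (h * w) for g in pooled_grad]
    # Weighted sum of feature maps followed by ReLU.
    return [[max(0.0, sum(alphas[c] * feature_maps[c][r][q] for c in range(k)))
             for q in range(w)] for r in range(h)]

feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires top-left
    [[0.0, 0.0], [0.0, 2.0]],  # channel 1 fires bottom-right
]
head_weights = [[1.0, 0.0], [0.0, 1.0]]  # identity head, D = K = 2
proxy = [1.0, -1.0]                      # proxy agrees with channel 0 only
heatmap = embedding_cam(feature_maps, head_weights, proxy)
```

If the proxy is replaced by a one-hot class vector and the head by a classifier's final layer, the same computation is ordinary GradCAM on a class logit, which matches the entry's remark that the method reduces to GradCAM for classification networks.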
2402.00867
Report |
AToM: Amortized Text-to-Mesh using 2D Diffusion |
Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov |
We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh
framework optimized across multiple text prompts simultaneously. In contrast to
existing text-to-3D methods that often entail time-consuming per-prompt
optimization and commonly output representations other than polygonal meshes,
AToM directly generates high-quality textured meshes in less than 1 second with
around 10 times reduction in the training cost, and generalizes to unseen
prompts. Our key idea is a novel triplane-based text-to-mesh architecture with
a two-stage amortized optimization strategy that ensures stable training and
enables scalability. Through extensive experiments on various prompt
benchmarks, AToM significantly outperforms state-of-the-art amortized
approaches with over 4 times higher accuracy (on the DF415 dataset) and produces
more distinguishable and higher-quality 3D outputs. AToM demonstrates strong
generalizability, offering fine-grained 3D assets for unseen interpolated
prompts without further optimization during inference, unlike per-prompt
solutions. |
This paper introduces AToM, the first amortized text-to-mesh model that directly generates textured meshes from text prompts. |
AToM addresses the limitations of existing text-to-3D methods that are either time-consuming per-prompt optimizations or limited to representations other than polygonal meshes. |
AToM employs a novel triplane-based text-to-mesh architecture and a two-stage amortized optimization strategy. It first trains with low-resolution volumetric rendering and then refines with high-resolution mesh rasterization. |
AToM generates high-quality textured meshes in under one second from a text prompt.
AToM generalizes to unseen text prompts without requiring further optimization, unlike per-prompt methods.
AToM outperforms state-of-the-art amortized approaches like ATT3D with significantly higher accuracy and quality, especially in large-scale datasets. |
The quality of AToM is currently limited by the resolution of the text-to-image diffusion prior used.
The DMTet mesh representation used in AToM cannot model surfaces with nonzero genus. |
text-to-mesh, amortized optimization, 3d generation, generative ai, diffusion models |
2402.00864
Report |
ViCA-NeRF: View-Consistency-Aware 3D Editing of Neural Radiance Fields |
Jiahua Dong, Yu-Xiong Wang |
We introduce ViCA-NeRF, the first view-consistency-aware method for 3D
editing with text instructions. In addition to the implicit neural radiance
field (NeRF) modeling, our key insight is to exploit two sources of
regularization that explicitly propagate the editing information across
different views, thus ensuring multi-view consistency. For geometric
regularization, we leverage the depth information derived from NeRF to
establish image correspondences between different views. For learned
regularization, we align the latent codes in the 2D diffusion model between
edited and unedited images, enabling us to edit key views and propagate the
update throughout the entire scene. Incorporating these two strategies, our
ViCA-NeRF operates in two stages. In the initial stage, we blend edits from
different views to create a preliminary 3D edit. This is followed by a second
stage of NeRF training, dedicated to further refining the scene's appearance.
Experimental results demonstrate that ViCA-NeRF provides more flexible,
efficient (3 times faster) editing with higher levels of consistency and
details, compared with the state of the art. Our code is publicly available. |
Introduces ViCA-NeRF, the first view-consistency-aware method for editing 3D scenes via text instructions, improving upon existing methods by enhancing multi-view consistency and editing efficiency. |
Addresses the limitations of existing NeRF editing methods which lack explicit 3D structure and are computationally expensive, leading to inconsistencies and inefficiencies in editing. |
Leverages two sources of regularization: geometric regularization through depth-guided image correspondence for preliminary edits and learned regularization via a blending refinement model (modified Instruct-Pix2Pix) to align latent codes across views, ensuring consistency. |
Achieves multi-view consistent 3D editing with text instructions across diverse scenes.
Offers controllability by allowing edits in key views to propagate throughout the 3D scene.
Significantly faster than previous methods (3 times faster than Instruct-NeRF2NeRF). |
Effectiveness depends on the accuracy of depth maps generated by NeRF.
Edited outputs may exhibit increased blurriness compared to the original NeRF. |
nerf, 3d editing, text-guided editing, view consistency, diffusion models |
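The geometric regularization in the ViCA-NeRF entry above (2402.00864) relies on depth to establish correspondences between views. A minimal sketch of that idea, assuming a pinhole camera with known intrinsics and a known source-to-target rigid transform (neither is specified in the record), is to unproject a pixel with its depth, move the point into the target camera frame, and project it again. All function and parameter names are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the source camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def transform(p, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation)."""
    return tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                 for i in range(3))

def project(p, fx, fy, cx, cy):
    """3D point in a camera frame -> pixel coordinates."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)

def correspond(u, v, depth, intrinsics, rotation, translation):
    """Find where a source pixel lands in the target view."""
    fx, fy, cx, cy = intrinsics
    point_src = unproject(u, v, depth, fx, fy, cx, cy)
    point_tgt = transform(point_src, rotation, translation)
    return project(point_tgt, fx, fy, cx, cy)

intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (toy values)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Source-to-target transform: a 0.1 shift along -x, no rotation.
u_tgt, v_tgt = correspond(60.0, 50.0, 2.0, intrinsics, identity, (-0.1, 0.0, 0.0))
```

Because the mapping depends directly on the depth value, an error in the NeRF depth map shifts the correspondence, which is consistent with the limitation listed in the entry.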
2402.00863
Report |
Geometry Transfer for Stylizing Radiance Fields |
Hyunyoung Jung, Seonghyeon Nam, Nikolaos Sarafianos, Sungjoo Yoo, Alexander Sorkine-Hornung, Rakesh Ranjan |
Shape and geometric patterns are essential in defining stylistic identity.
However, current 3D style transfer methods predominantly focus on transferring
colors and textures, often overlooking geometric aspects. In this paper, we
introduce Geometry Transfer, a novel method that leverages geometric
deformation for 3D style transfer. This technique employs depth maps to extract
a style guide, subsequently applied to stylize the geometry of radiance fields.
Moreover, we propose new techniques that utilize geometric cues from the 3D
scene, thereby enhancing aesthetic expressiveness and more accurately
reflecting intended styles. Our extensive experiments show that Geometry
Transfer enables a broader and more expressive range of stylizations, thereby
significantly expanding the scope of 3D style transfer. |
Introduces "Geometry Transfer," a novel method that leverages geometric deformation for 3D style transfer using depth maps to extract style guides, thereby stylizing the geometry of radiance fields. |
Existing 3D style transfer techniques primarily focus on transferring color and texture while neglecting geometry, which is crucial for defining stylistic identity. |
Utilizes depth maps as style guides, introduces a deformation network for synchronized shape and appearance modification, and proposes RGB-D stylization techniques like geometry-aware matching and perspective style augmentation. |
Enables coherent stylization of both shape and appearance in 3D scenes.
Demonstrates superior performance compared to existing 3D style transfer methods in quantitative metrics and user studies.
Seamlessly integrates with Panoptic Lifting for partial stylization of 3D scenes. |
Limited to non-360° scenes due to the reliance on TensoRF representation.
Stylizing 360° scenes with a single style image is ill-posed; exploring multi-view or 3D style guides could be beneficial. |
3d style transfer, geometry stylization, radiance fields, depth maps, deformation fields |
2402.00769
Report |
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning |
Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi, Weikang Bian, Guanglu Song, Yu Liu, Hongsheng Li |
Video diffusion models have been gaining increasing attention for their ability
to produce videos that are both coherent and of high fidelity. However, the
iterative denoising process makes it computationally intensive and
time-consuming, thus limiting its applications. Inspired by the Consistency
Model (CM) that distills pretrained image diffusion models to accelerate the
sampling with minimal steps and its successful extension Latent Consistency
Model (LCM) on conditional image generation, we propose AnimateLCM, allowing
for high-fidelity video generation within minimal steps. Instead of directly
conducting consistency learning on the raw video dataset, we propose a
decoupled consistency learning strategy that decouples the distillation of
image generation priors and motion generation priors, which improves the
training efficiency and enhances the visual quality of generation. Additionally, to
enable the combination of plug-and-play adapters in the Stable Diffusion community
to achieve various functions (e.g., ControlNet for controllable generation), we
propose an efficient strategy to adapt existing adapters to our distilled
text-conditioned video consistency model or train adapters from scratch without
harming the sampling speed. We validate the proposed strategy in
image-conditioned video generation and layout-conditioned video generation, all
achieving top-performing results. Experimental results validate the
effectiveness of our proposed method. Code and weights will be made public.
More details are available at https://github.com/G-U-N/AnimateLCM. |
Presents AnimateLCM, a novel approach for fast and high-fidelity video generation within minimal steps by adapting Stable Diffusion-based video models to follow the self-consistency property. |
Addresses the computational intensity and slow generation of video diffusion models, aiming for high-quality video generation with significantly reduced steps. |
Employs a decoupled consistency learning strategy, separating image generation and motion priors distillation, and a teacher-free adaptation strategy for integrating or training adapters without sacrificing speed. |
Achieves state-of-the-art results on UCF-101 in terms of FVD and CLIPSIM metrics, outperforming baseline diffusion models, especially in low step regimes.
Demonstrates good compatibility with personalized image diffusion models, allowing diverse and high-quality video generation in various styles.
Enables fast and high-quality image-to-video and controllable video generation with minimal steps through the teacher-free adaptation strategy. |
Limited performance for one-step sample generation, potentially resulting in blurry or artifact-ridden outputs.
Future work includes exploring more sophisticated ODE solvers and alternative score estimation strategies for improved one-step generation quality. |
video generation, diffusion models, consistency models, stable diffusion, video acceleration |
2402.00752
Report |
On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy |
Letian Huang, Jiayang Bai, Jie Guo, Yuanqi Li, Yanwen Guo |
3D Gaussian Splatting has garnered extensive attention and application in
real-time neural rendering. Concurrently, concerns have been raised about the
limitations of this technology in aspects such as point cloud storage,
performance, and robustness in sparse viewpoints, leading to various
improvements. However, there has been a notable lack of attention to the
fundamental problem of projection errors introduced by the local affine
approximation inherent in the splatting itself, and the consequential impact of
these errors on the quality of photo-realistic rendering. This paper addresses
the projection error function of 3D Gaussian Splatting, commencing with the
residual error from the first-order Taylor expansion of the projection
function. The analysis establishes a correlation between the error and the
Gaussian mean position. Subsequently, leveraging function optimization theory,
this paper analyzes the function's minima to provide an optimal projection
strategy for Gaussian Splatting referred to as Optimal Gaussian Splatting, which
can accommodate a variety of camera models. Experimental validation further
confirms that this projection methodology reduces artifacts, resulting in a
more convincingly realistic rendering. |
This paper presents Optimal Gaussian Splatting, a novel projection method for 3D Gaussian Splatting (3D-GS) that minimizes projection errors to enhance rendering quality. |
Existing 3D-GS techniques suffer from projection errors due to local affine approximations, leading to artifacts in rendered images, especially with wide-angle lenses. This work addresses this by minimizing these errors to improve rendering realism. |
The authors analyze the error function of the 3D-GS projection, identifying its correlation with Gaussian mean position. They then derive an optimal projection strategy that minimizes this error by projecting each Gaussian onto a tangent plane based on its mean and the camera center. |
Optimal Gaussian Splatting reduces artifacts and enhances rendering quality compared to the original 3D-GS.
The method demonstrates robustness against increasing field of view and decreasing focal length, outperforming 3D-GS in wide-angle settings.
It is easily adaptable to various camera models like fisheye and panorama with simple modifications. |
Training time slightly increases due to the additional transformation from tangent plane to image plane.
Future work could explore optimizing Gaussian covariance's influence on projection and further enhance Gaussian Splatting as a scene representation technique. |
3d gaussian splatting, novel view synthesis, error analysis, optimal projection, real-time rendering |
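The link between projection error and Gaussian mean position in the entry above (2402.00752) can be checked numerically with a small sketch. It compares the exact perspective projection with its first-order Taylor (local affine) approximation around a mean, for a mean on the optical axis and one far off it. The normalized pinhole model, the perturbation grid, and the RMS error measure are assumptions made for this illustration; the paper's own error function and tangent-plane projection are not reproduced here.

```python
import itertools
import math

def project(p):
    """Exact perspective projection onto the plane z = 1."""
    x, y, z = p
    return (x / z, y / z)

def affine_project(mu, delta):
    """First-order Taylor expansion of project() around mu."""
    x, y, z = mu
    dx, dy, dz = delta
    u0, v0 = project(mu)
    # Rows of the Jacobian: [1/z, 0, -x/z^2] and [0, 1/z, -y/z^2].
    u = u0 + dx / z - x * dz / (z * z)
    v = v0 + dy / z - y * dz / (z * z)
    return (u, v)

def rms_error(mu, d=0.1):
    """RMS gap between exact and affine projection over a 3x3x3 perturbation grid."""
    squared = []
    for delta in itertools.product((-d, 0.0, d), repeat=3):
        p = tuple(m + e for m, e in zip(mu, delta))
        u, v = project(p)
        ua, va = affine_project(mu, delta)
        squared.append((u - ua) ** 2 + (v - va) ** 2)
    return math.sqrt(sum(squared) / len(squared))

on_axis = rms_error((0.0, 0.0, 2.0))   # mean at the image center
off_axis = rms_error((2.0, 0.0, 2.0))  # mean 45 degrees off the optical axis
```

In this toy setup the off-axis mean gives the larger error, which agrees with the entry's claim that artifacts grow with wide fields of view, where many Gaussians lie far from the image center.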
2402.00631
Report |
Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation |
Yang Li, Songlin Yang, Wei Wang, Jing Dong |
Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable
Diffusion Model, have made significant progress in generating diverse and
high-quality images using text prompts alone. However, when non-famous users
require personalized image generation for their identities (IDs), the T2I
models fail to accurately generate their ID-related images. The main problem is
that pre-trained T2I models do not learn the mapping between the new ID prompts
and their corresponding visual content. The previous methods either failed to
accurately fit the face region or lost the interactive generative ability with
other existing concepts in T2I models. In other words, they are unable to
generate T2I-aligned and semantic-fidelity images for the given prompts with
other concepts such as scenes ("Eiffel Tower"), actions ("holding a
basketball"), and facial attributes ("eyes closed"). In this paper, we focus
on inserting accurate and interactive ID embedding into the Stable Diffusion
Model for semantic-fidelity personalized generation. We address this challenge
from two perspectives: face-wise region fitting and semantic-fidelity token
optimization. Specifically, we first visualize the attention overfit problem
and propose a face-wise attention loss to fit the face region instead of
entangling ID-unrelated information, such as face layout and background. This
key trick significantly enhances the ID accuracy and interactive generative
ability with other existing concepts. Then, we optimize one ID representation
as multiple per-stage tokens where each token contains two disentangled
features. This expansion of the textual conditioning space improves
semantic-fidelity control. Extensive experiments validate that our results
exhibit superior ID accuracy, text-based manipulation ability, and
generalization compared to previous methods. |
This paper introduces a novel method for inserting accurate and interactive identity embeddings into pre-trained Text-to-Image diffusion models for personalized image generation. |
Existing methods for personalized generation often struggle with attention overfitting (embedding ID-unrelated information) and limited semantic fidelity, leading to inaccurate and inflexible image generation. |
The proposed method utilizes a face-wise attention loss to focus on ID-related face regions and neglect background information, and optimizes ID representation as multiple per-stage tokens with disentangled features for enhanced semantic control. |
The method achieves higher accuracy in identity embedding compared to previous approaches.
It exhibits superior interactive generative ability, enabling control over scenes, facial attributes, and actions.
The approach requires minimal training time and introduces fewer parameters compared to some existing techniques. |
The manipulation capacity for diverse and high-fidelity image generation can be further improved.
The study primarily focuses on face embedding and can be extended to encompass a wider range of object categories. |
text-to-image generation, diffusion models, personalized generation, semantic fidelity, attention overfitting |
2402.00627
Report |
CapHuman: Capture Your Moments in Parallel Universes |
Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang |
We concentrate on a novel human-centric image synthesis task, that is, given
only one reference facial photograph, it is expected to generate specific
individual images with diverse head positions, poses, facial expressions, and
illuminations in different contexts. To accomplish this goal, we argue that our
generative model should be capable of the following favorable characteristics:
(1) a strong visual and semantic understanding of our world and human society
for basic object and human image generation. (2) generalizable identity
preservation ability. (3) flexible and fine-grained head control. Recently,
large pre-trained text-to-image diffusion models have shown remarkable results,
serving as a powerful generative foundation. As a basis, we aim to unleash the
above two capabilities of the pre-trained model. In this work, we present a new
framework named CapHuman. We embrace the "encode then learn to align" paradigm,
which enables generalizable identity preservation for new individuals without
cumbersome tuning at inference. CapHuman encodes identity features and then
learns to align them into the latent space. Moreover, we introduce the 3D
facial prior to equip our model with control over the human head in a flexible
and 3D-consistent manner. Extensive qualitative and quantitative analyses
demonstrate our CapHuman can produce well-identity-preserved, photo-realistic,
and high-fidelity portraits with content-rich representations and various head
renditions, superior to established baselines. Code and checkpoint will be
released at https://github.com/VamosC/CapHuman. |
This paper presents CapHuman, a novel framework for human-centric image synthesis that allows for the generation of photorealistic portraits of specific individuals with controllable head positions, poses, facial expressions, and illuminations in different contexts. |
This work addresses the limitations of existing text-to-image models that struggle with identity preservation and fine-grained control over human head features, particularly in one-shot settings. |
CapHuman leverages the pretrained Stable Diffusion model and incorporates two key components: 1) an "encode then learn to align" paradigm for identity preservation using global and local features and 2) a 3D facial prior (FLAME) for flexible and 3D-consistent head control. |
CapHuman generates high-quality, identity-preserved portraits with diverse head renditions, outperforming previous state-of-the-art methods.
The model demonstrates generalizable identity preservation capabilities, eliminating the need for cumbersome fine-tuning for each new individual.
Quantitative and qualitative analyses on the proposed HumanIPHC benchmark confirm the effectiveness of CapHuman in identity preservation, text-to-image alignment, and head control precision. |
The model's generative capabilities are limited by the pre-training dataset and may not generalize well to scenarios outside its distribution.
The accuracy of 3D facial reconstruction relies on the estimation accuracy of DECA, which can be limited for extreme poses and expressions, leading to potential misalignment. |
image synthesis, identity preservation, head control, diffusion models, 3d facial prior |
2402.00626
Report |
Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks |
Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer |
Typographic Attacks, which involve pasting misleading text onto an image,
were noted to harm the performance of Vision-Language Models like CLIP.
However, the susceptibility of recent Large Vision-Language Models to these
attacks remains understudied. Furthermore, prior work's Typographic attacks
against CLIP randomly sample a misleading class from a predefined set of
categories. However, this simple strategy misses more effective attacks that
exploit LVLMs' stronger language skills. To address these issues, we first
introduce a benchmark for testing Typographic attacks against LVLMs.
Moreover, we introduce two novel and more effective Self-Generated
attacks which prompt the LVLM to generate an attack against itself: 1) Class
Based Attack where the LVLM (e.g. LLaVA) is asked which deceiving class is most
similar to the target class and 2) Descriptive Attacks where a more advanced
LVLM (e.g. GPT4-V) is asked to recommend a Typographic attack that includes
both a deceiving class and description. Using our benchmark, we uncover that
Self-Generated attacks pose a significant threat, reducing LVLMs'
classification performance by up to 33%. We also uncover that attacks
generated by one model (e.g. GPT-4V or LLaVA) are effective against the model
itself and other models like InstructBLIP and MiniGPT4. Code:
https://github.com/mqraitem/Self-Gen-Typo-Attack |
The paper introduces a new benchmark for evaluating typographic attacks against Large Vision Language Models (LVLMs) and proposes novel self-generated attacks that leverage the LVLMs themselves to devise more effective attacks. |
This work addresses the urgent threat of typographic attacks that can mislead LVLMs by exploiting their reliance on textual cues for image interpretation, especially given the increasing sophistication and accessibility of these models. |
The authors develop a benchmark using five diverse classification datasets and evaluate four recent LVLMs (GPT-4V, LLaVA 1.5, MiniGPT4, and InstructBLIP) against three types of attacks: random class, class-based (using LVLMs to identify similar classes), and descriptive (using LVLMs to generate deceiving descriptions). |
Self-generated attacks, particularly descriptive attacks, significantly reduce LVLMs' classification accuracy (up to 33%).
Descriptive attacks with relevant descriptions are more effective than those with random or no descriptions, highlighting LVLMs' language understanding capabilities.
While prompting LVLMs to ignore text shows some improvement, it doesn't fully mitigate the impact of typographic attacks. |
The study is limited by the computational cost of evaluating GPT-4V, restricting the number of test samples.
Future work should explore defenses against these attacks and investigate their generalization to other LVLM tasks beyond classification. |
typographic attacks, large vision language models, self-generated attacks, benchmarking, vision and language |
2402.00606
Report |
Dynamic Texture Transfer using PatchMatch and Transformers |
Guo Pu, Shiyao Xu, Xixin Cao, Zhouhui Lian |
How to automatically transfer the dynamic texture of a given video to the
target still image is a challenging and ongoing problem. In this paper, we
propose to handle this task via a simple yet effective model that utilizes both
PatchMatch and Transformers. The key idea is to decompose the task of dynamic
texture transfer into two stages, where the start frame of the target video
with the desired dynamic texture is synthesized in the first stage via a
distance map guided texture transfer module based on the PatchMatch algorithm.
Then, in the second stage, the synthesized image is decomposed into
structure-agnostic patches, according to which their corresponding subsequent
patches can be predicted by exploiting the powerful capability of Transformers
equipped with VQ-VAE for processing long discrete sequences. After getting all
those patches, we apply a Gaussian weighted average merging strategy to
smoothly assemble them into each frame of the target stylized video.
Experimental results demonstrate the effectiveness and superiority of the
proposed method in dynamic texture transfer compared to the state of the art. |
Proposes DynTexture, a novel neural-based approach to automatically transfer dynamic texture effects from a source video to a target image, enabling one-shot dynamic texture transfer. |
Automates the laborious process of designing dynamic textures, such as those used in films, digital posters, and online media, thereby improving efficiency and enabling more creative applications. |
Uses a two-stage architecture: 1) a distance map guided texture transfer module (based on PatchMatch) to synthesize the initial stylized frame, and 2) a deep sequence forecasting module (based on Transformers and VQ-VAE) to predict and synthesize subsequent stylized frames. |
Achieves superior performance in dynamic text effects transfer with various font styles and glyphs, accurately transferring complex dynamic effects like burning flames and flowing water.
Outperforms state-of-the-art methods in qualitative and quantitative comparisons, demonstrating better texture quality, spatial and temporal consistency, and the ability to handle moving dynamic effects.
Demonstrates versatility in other applications like image animation, modifying image layouts, and animating them according to driving videos. |
The choice of patch size is crucial and requires careful consideration for optimal performance.
Quantitative evaluation of one-shot learning tasks remains challenging due to the lack of ground truth data. |
texture transfer, video synthesis, image generation, patchmatch, transformers |
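The final assembly step in the dynamic texture entry above (2402.00606), merging predicted patches into a frame with a Gaussian weighted average, can be sketched as follows. This is an illustration under assumptions: the window is an isotropic Gaussian centered on each square patch, the value of `sigma` is arbitrary, and the names `gaussian_window` and `merge_patches` are hypothetical rather than taken from the paper.

```python
import math

def gaussian_window(size, sigma):
    """size x size weights peaking at the patch center."""
    c = (size - 1) / 2.0
    return [[math.exp(-((r - c) ** 2 + (q - c) ** 2) / (2.0 * sigma ** 2))
             for q in range(size)] for r in range(size)]

def merge_patches(patches, positions, height, width, sigma=1.0):
    """Blend overlapping square patches into one frame.

    patches:   list of size x size grids of pixel values.
    positions: list of (top, left) placements on the canvas.
    Each output pixel is the weight-normalized sum of all patches covering it.
    """
    num = [[0.0] * width for _ in range(height)]
    den = [[0.0] * width for _ in range(height)]
    for patch, (top, left) in zip(patches, positions):
        size = len(patch)
        win = gaussian_window(size, sigma)
        for r in range(size):
            for q in range(size):
                num[top + r][left + q] += win[r][q] * patch[r][q]
                den[top + r][left + q] += win[r][q]
    return [[num[r][q] / den[r][q] if den[r][q] > 0 else 0.0
             for q in range(width)] for r in range(height)]

# Two 3x3 patches (all 0.0 and all 1.0) overlapping in one column.
dark = [[0.0] * 3 for _ in range(3)]
bright = [[1.0] * 3 for _ in range(3)]
frame = merge_patches([dark, bright], [(0, 0), (0, 2)], height=3, width=5)
```

Down-weighting patch borders means each pixel is dominated by the patch whose center is closest, so seams between neighboring patches are smoothed rather than hard-cut.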
2402.00525
Report |
StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering |
Lukas Radl, Michael Steiner, Mathias Parger, Alexander Weinrauch, Bernhard Kerbl, Markus Steinberger |
Gaussian Splatting has emerged as a prominent model for constructing 3D
representations from images across diverse domains. However, the efficiency of
the 3D Gaussian Splatting rendering pipeline relies on several simplifications.
Notably, reducing Gaussians to 2D splats with a single view-space depth
introduces popping and blending artifacts during view rotation. Addressing this
issue requires accurate per-pixel depth computation, yet a full per-pixel sort
proves excessively costly compared to a global sort operation. In this paper,
we present a novel hierarchical rasterization approach that systematically
resorts and culls splats with minimal processing overhead. Our software
rasterizer effectively eliminates popping artifacts and view inconsistencies,
as demonstrated through both quantitative and qualitative measurements.
Simultaneously, our method mitigates the potential for cheating view-dependent
effects with popping, ensuring a more authentic representation. Despite the
elimination of cheating, our approach achieves comparable quantitative results
for test images, while increasing the consistency for novel view synthesis in
motion. Due to its design, our hierarchical approach is only 4% slower on
average than the original Gaussian Splatting. Notably, enforcing consistency
enables a reduction in the number of Gaussians by approximately half with
nearly identical quality and view-consistency. Consequently, rendering
performance is nearly doubled, making our approach 1.6x faster than the
original Gaussian Splatting, with a 50% reduction in memory requirements. |
This paper presents a novel hierarchical rasterization approach for 3D Gaussian Splatting that removes popping artifacts by hierarchically resorting and culling splats to approximate a per-pixel depth order, while maintaining real-time performance. |
Popping artifacts are a common issue in Gaussian Splatting, particularly noticeable during camera rotations, which detracts from the realism and quality of rendered scenes. |
The proposed method utilizes a hierarchical rendering pipeline that exploits coherence among neighboring view rays on multiple hierarchy levels, interleaving culling, depth evaluation, and resorting operations. |
The hierarchical renderer effectively eliminates popping artifacts and view inconsistencies.
It achieves comparable quantitative image quality metrics to the original Gaussian Splatting method.
The hierarchical approach adds an overhead of only 4% compared to the original Gaussian Splatting, while enabling a 2x reduction in memory and 1.6x faster rendering by using Opacity Decay during training. |
The hierarchical resorting may not guarantee perfect blend order in all cases, potentially leading to residual artifacts.
The method still approximates true 3D Gaussian rendering, ignoring potential overlaps between Gaussians along a view ray.
Future work could investigate fully correct volume rendering of Gaussians for further quality improvements. |
gaussian splatting, neural rendering, view consistency, hierarchical rasterization, real-time rendering |
2402.00351
Report |
Machine Unlearning for Image-to-Image Generative Models |
Guihong Li, Hsiang Hsu, Chun-Fu Chen, Radu Marculescu |
Machine unlearning has emerged as a new paradigm to deliberately forget data
samples from a given model in order to adhere to stringent regulations.
However, existing machine unlearning methods have been primarily focused on
classification models, leaving the landscape of unlearning for generative
models relatively unexplored. This paper serves as a bridge, addressing the gap
by providing a unifying framework of machine unlearning for image-to-image
generative models. Within this framework, we propose a
computationally-efficient algorithm, underpinned by rigorous theoretical
analysis, that demonstrates negligible performance degradation on the retain
samples, while effectively removing the information from the forget samples.
Empirical studies on two large-scale datasets, ImageNet-1K and Places-365,
further show that our algorithm does not rely on the availability of the retain
samples, which further complies with data retention policy. To the best of our
knowledge, this work is the first to present systematic, theoretical, and
empirical explorations of machine unlearning specifically tailored for
image-to-image generative models. Our code is available at
https://github.com/jpmorganchase/l2l-generator-unlearning. |
This paper presents the first systematic exploration of machine unlearning for image-to-image (I2I) generative models, proposing a novel framework and an efficient algorithm. |
Machine unlearning for generative models is crucial due to their superior data memorization capability and the increasing demand for data privacy and copyright protection. |
The authors formulate the problem as maximizing the KL-divergence between the distributions of generated images from the forget set and a Gaussian distribution. They then derive a tractable lower bound based on mutual information and minimize the L2 distance between encoder outputs. |
The proposed approach effectively eliminates information from the forget set while maintaining near-identical performance on the retain set, as demonstrated on ImageNet-1K and Places-365.
The framework is generally applicable to various I2I models, including diffusion models, VQ-GAN, and MAE.
The method exhibits robustness to limited or unavailable retain samples, offering flexibility in practical applications. |
The method is primarily evaluated on I2I generative models and requires access to original forget samples.
Future work includes extending the approach to other modalities (text, text-to-image) and exploring practical scenarios for content control and privacy protection. |
machine unlearning, generative models, image-to-image, data privacy, copyright protection |
2402.00240
Report |
Spectral Norm of Convolutional Layers with Circular and Zero Paddings |
Blaise Delattre, Quentin Barthélemy, Alexandre Allauzen |
This paper leverages \emph{Gram iteration}, an efficient,
deterministic, and differentiable method for computing the spectral norm with an
upper bound guarantee. Although Gram iteration was designed for circular
convolutional layers, we generalize its use to zero padding convolutional layers
and prove its quadratic convergence. We also provide theorems for bridging the
gap between the spectral norms of circular and zero padding convolutions. We design a
\emph{spectral rescaling} that can be used as a competitive $1$-Lipschitz layer
that enhances network robustness. Demonstrated through experiments, our method
outperforms state-of-the-art techniques in precision, computational cost, and
scalability. The code of experiments is available at
https://github.com/blaisedelattre/lip4conv. |
This paper extends Gram iteration, an efficient method for computing the spectral norm with an upper bound guarantee, from circular to zero padding convolutional layers, and reports better precision, computational cost, and scalability than state-of-the-art techniques. |
Spectral norm regularization in CNNs is crucial for enhancing generalization, stabilizing training, and bolstering robustness against adversarial attacks. |
The authors leverage Gelfand's formula to generalize Gram iteration for any matrix norm, proving its quadratic convergence. They extend its application to zero padding convolutions and establish theoretical bounds bridging circular and zero padding spectral norms, as well as linking input size to the bound. |
Gram iteration provides a guaranteed upper bound on the spectral norm of convolutional layers, unlike power iteration methods.
The proposed method achieves superior accuracy in spectral norm estimation compared to existing techniques while maintaining computational efficiency.
Spectral Rescaling (SR), a novel 1-Lipschitz layer derived from Gram iteration, demonstrably enhances the robustness of CNNs against adversarial attacks. |
Future work includes exploring the adaptability of Gram iteration for computing multiple singular values.
Further investigation into the trade-off between the tightness of the spectral norm bound and computational cost is warranted. |
convolutional neural networks, spectral norm, adversarial robustness, gram iteration, lipschitz layers |
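The upper bound guarantee described in the record above can be illustrated with a minimal sketch of Gram iteration on a plain dense real matrix. This is not the authors' convolutional implementation (their code is at the repository linked in the abstract); the function name `gram_iteration_bound` and the log-space rescaling bookkeeping are assumptions made for illustration. The idea is that each Gram step G ← GᵀG squares the singular values, so the Frobenius norm after t steps, raised to the power 2^(-t), bounds the largest singular value from above.

```python
import numpy as np


def gram_iteration_bound(G, n_iter=6):
    """Upper bound on the spectral norm of a dense real matrix G.

    Hypothetical helper for illustration: repeats the Gram step G <- G^T G,
    rescaling by the Frobenius norm each time to avoid overflow and tracking
    the removed factor in log space.
    """
    G = np.asarray(G, dtype=np.float64)
    log_scale = 0.0  # log of the factor removed by rescaling so far
    for _ in range(n_iter):
        fro = np.linalg.norm(G)  # Frobenius norm for 2-D input
        if fro == 0.0:
            return 0.0
        # The unscaled iterate equals exp(log_scale) * G; one Gram step
        # squares that factor, hence the doubling.
        log_scale = 2.0 * (log_scale + np.log(fro))
        G = G / fro
        G = G.T @ G  # Gram step: singular values are squared
    fro = np.linalg.norm(G)
    # sigma_1(G_0) <= ||G_t||_F ** (2 ** -t), with the rescaling undone.
    return float(np.exp((np.log(fro) + log_scale) / 2.0 ** n_iter))


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
bound = gram_iteration_bound(A, n_iter=6)
exact = np.linalg.norm(A, 2)  # largest singular value, for comparison
```

In this sketch `bound` is never below `exact` up to rounding, and the gap shrinks rapidly with `n_iter`, in line with the quadratic convergence that the abstract says the authors prove.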
2402.00225
Report |
Geometry aware 3D generation from in-the-wild images in ImageNet |
Qijia Shen, Guangrun Wang |
Generating accurate 3D models is a challenging problem that traditionally
requires explicit learning from 3D datasets using supervised learning. Although
recent advances have shown promise in learning 3D models from 2D images, these
methods often rely on well-structured datasets with multi-view images of each
instance or camera pose information. Furthermore, these datasets usually
contain clean backgrounds with simple shapes, making them expensive to acquire
and hard to generalize, which limits the applicability of these methods. To
overcome these limitations, we propose a method for reconstructing 3D geometry
from the diverse and unstructured ImageNet dataset without camera pose
information. We use an efficient triplane representation to learn 3D models
from 2D images and modify the architecture of the generator backbone based on
StyleGAN2 to adapt to the highly diverse dataset. To prevent mode collapse and
improve the training stability on diverse data, we propose to use multi-view
discrimination. The trained generator can produce class-conditional 3D models
as well as renderings from arbitrary viewpoints. The class-conditional
generation results demonstrate significant improvement over the current
state-of-the-art method. Additionally, using PTI, we can efficiently
reconstruct the whole 3D geometry from single-view images. |
This paper introduces a novel method for generating 3D models from diverse and unstructured 2D image datasets, like ImageNet, without relying on camera pose information. |
This approach is significant because it allows for learning 3D representations from widely available 2D data, overcoming limitations of previous methods reliant on structured datasets with multi-view or camera pose data. |
The method leverages a triplane representation for 3D modeling and modifies the StyleGAN2 generator architecture for improved learning from diverse datasets. Additionally, it employs a multi-view discrimination technique to enhance training stability and prevent mode collapse. |
The model successfully generates class-conditional 3D models and renderings from arbitrary viewpoints, demonstrating significant improvement over existing methods.
Training on diverse datasets enables the model to infer plausible 3D shapes even for objects with limited viewpoints in the training data.
The method allows for efficient single-view 3D reconstruction using pivotal tuning inversion. |
The issue of unknown camera poses for in-the-wild images is not fully addressed, posing a limitation to be explored in future work.
Future work could investigate incorporating depth information during training as additional supervision for geometry and explore alternative architectures beyond StyleGAN2. |
3d generation, imagenet, triplane representation, multi-view discrimination, single-view reconstruction |
2402.00033
Report |
LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition |
Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu, Zhijun Li |
The Vision Transformer (ViT) excels in accuracy when handling high-resolution
images, yet it confronts the challenge of significant spatial redundancy,
leading to increased computational and memory requirements. To address this, we
present the Localization and Focus Vision Transformer (LF-ViT). This model
operates by strategically curtailing computational demands without impinging on
performance. In the Localization phase, a reduced-resolution image is
processed; if a definitive prediction remains elusive, our pioneering
Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively
identifying and spotlighting class-discriminative regions based on initial
findings. Subsequently, in the Focus phase, this designated region is used from
the original image to enhance recognition. Uniquely, LF-ViT employs consistent
parameters across both phases, ensuring seamless end-to-end optimization. Our
empirical tests affirm LF-ViT's prowess: it remarkably decreases DeiT-S's FLOPs
by 63% and concurrently amplifies throughput twofold. Code of this project is
at https://github.com/edgeai1/LF-ViT.git. |
This paper presents LF-ViT, a novel two-stage Vision Transformer framework designed to optimize computational efficiency for high-resolution image recognition by minimizing spatial redundancy. |
ViT excels in accuracy but suffers from high computational costs, especially with increasing image resolutions, hindering deployment on resource-limited devices. This paper addresses this by focusing computation on minimal class-discriminative image regions. |
LF-ViT employs a two-stage approach: (1) Localization: a down-sampled image is processed, and if a confident prediction isn't reached, a novel Neighborhood Global Class Attention (NGCA) mechanism identifies class-discriminative regions. (2) Focus: these regions are processed from the original image for enhanced recognition, employing feature reuse and fusion mechanisms for further optimization. |
LF-ViT significantly reduces DeiT-S's FLOPs by 63% while maintaining accuracy, resulting in a 2.03x throughput improvement on an A100 GPU.
The NGCA mechanism effectively identifies class-discriminative regions, outperforming other region selection alternatives.
LF-ViT consistently surpasses state-of-the-art ViT optimization models in both accuracy and computational efficiency, demonstrating a superior balance between performance and resource usage. |
The current implementation of LF-ViT is limited to image classification tasks.
Further research is needed to explore the integration of LF-ViT with token pruning methods for enhanced efficiency. |
vision transformer, adaptive inference, spatial redundancy, class-discriminative regions, computational efficiency |
2401.17992
Report |
Multilinear Operator Networks |
Yixin Cheng, Grigorios G. Chrysos, Markos Georgopoulos, Volkan Cevher |
Despite the remarkable capabilities of deep neural networks in image
recognition, the dependence on activation functions remains a largely
unexplored area and has yet to be eliminated. On the other hand, Polynomial
Networks are a class of models that do not require activation functions, but
have yet to perform on par with modern architectures. In this work, we aim
to close this gap and propose MONet, which relies solely on multilinear operators.
The core layer of MONet, called Mu-Layer, captures multiplicative interactions
of the elements of the input token. MONet captures high-degree interactions of
the input elements and we demonstrate the efficacy of our approach on a series
of image recognition and scientific computing benchmarks. The proposed model
outperforms prior polynomial networks and performs on par with modern
architectures. We believe that MONet can inspire further research on models
that use entirely multilinear operations. |
Introduces Multilinear Operator Network (MONet), a Polynomial Network (PN) based solely on multilinear operations, avoiding activation functions while achieving performance competitive with modern architectures. |
Addresses the limitations of deep neural networks' reliance on activation functions, which hinders their application in privacy-preserving settings like Fully Homomorphic Encryption. |
Proposes a core layer, Mu-Layer, utilizing multilinear operations to capture multiplicative interactions within input tokens. The architecture stacks these layers to capture high-degree interactions, enabling polynomial expansion of input data. |
MONet significantly outperforms prior PNs on ImageNet, achieving over 10% improvement over the previous state-of-the-art.
Achieves competitive performance with modern architectures like transformers and MLP-based models on ImageNet and other image recognition benchmarks.
Demonstrates strong robustness to image corruptions on ImageNet-C and shows promise in scientific computing by accurately recovering formulas in a polynomial neural ODE solver experiment. |
Theoretical characterization of polynomial expansions achievable with MONet remains to be explored.
Future work includes further theoretical analysis of the model's inductive bias and exploration of its potential beyond image recognition. |
polynomial networks, activation functions, multilinear operations, image recognition, privacy-preserving machine learning |
2401.17948
Report |
HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction |
Harvie Zhang |
The self-attention mechanism utilizes large implicit weight matrices,
programmed through dot product-based activations with very few trainable
parameters, to enable long sequence modeling. In this paper, we investigate the
possibility of discarding residual learning by employing large implicit kernels
to achieve full context interaction at each layer of the network. To accomplish
this, we introduce coordinate-based implicit MLPs as a slow network to generate
hyper-kernels for another fast convolutional network. To get context-varying
weights for fast dynamic encoding, we propose a
$\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects
hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through
simple elementwise multiplication, followed by convolution of $\mathcal{Z}$
using the context-dependent $\mathcal{W}$. Based on this design, we present a
novel Terminator architecture that integrates hyper-kernels of different sizes
to produce multi-branch hidden representations for enhancing the feature
extraction capability of each layer. Additionally, a bottleneck layer is
employed to compress the concatenated channels, allowing only valuable
information to propagate to the subsequent layers. Notably, our model
incorporates several innovative components and exhibits excellent properties,
such as introducing local feedback error for updating the slow network, stable
zero-mean features, faster training convergence, and fewer model parameters.
Extensive experimental results on pixel-level 1D and 2D image classification
benchmarks demonstrate the superior performance of our architecture. |
This paper introduces Terminator, a novel neural network architecture that eliminates the need for residual learning by employing large implicit convolution kernels generated by coordinate-based implicit MLPs. |
Residual learning, while effective, poses challenges for interpretability and efficient training. Terminator addresses these limitations by enabling full context interaction at each layer through large kernels, enhancing feature extraction capabilities. |
The Terminator architecture leverages a novel Slow-Fast Neural Encoding (SFNE) block. This block uses a slow network (coordinate-based MLP) to generate hyper-kernels, and a fast network that interacts with the context via a proposed HyperZZW operator, which efficiently creates context-dependent weights using elementwise multiplication. |
Terminator achieves state-of-the-art performance on pixel-level 1D and 2D image classification benchmarks, surpassing residual networks and transformers.
The architecture exhibits faster training convergence due to stable zero-mean features.
Terminator requires fewer model parameters compared to other state-of-the-art architectures. |
The paper acknowledges limitations in evaluating Terminator on larger datasets like ImageNet due to computational constraints.
Future work will focus on exploring more effective slow neural loss functions to further improve the accuracy of pixel-level scores. |
residual learning, implicit kernels, slow-fast networks, context-dependent weights, image classification |
2401.17895
Report |
ReplaceAnything3D: Text-Guided 3D Scene Editing with Compositional Neural Radiance Fields |
Edward Bartrum, Thu Nguyen-Phuoc, Chris Xie, Zhengqin Li, Numair Khan, Armen Avetisyan, Douglas Lanman, Lei Xiao |
We introduce ReplaceAnything3D model (RAM3D), a novel text-guided 3D scene
editing method that enables the replacement of specific objects within a scene.
Given multi-view images of a scene, a text prompt describing the object to
replace, and a text prompt describing the new object, our Erase-and-Replace
approach can effectively swap objects in the scene with newly generated content
while maintaining 3D consistency across multiple viewpoints. We demonstrate the
versatility of ReplaceAnything3D by applying it to various realistic 3D scenes,
showcasing results of modified foreground objects that are well-integrated with
the rest of the scene without affecting its overall integrity. |
Introduces the ReplaceAnything3D model (RAM3D), a text-guided 3D scene editing method using an Erase-and-Replace approach for multi-view consistent object replacement in neural radiance fields. |
Addresses the growing demand for efficient 3D content creation and editing tools, particularly for tasks like object replacement in VR/MR, gaming, and film production. |
Employs a two-stage Erase-and-Replace approach: 1) Erases target objects and inpaints the background using a text-guided 3D inpainting technique and a Bubble-NeRF representation. 2) Replaces the erased object with a new object generated using a text-guided 3D inpainting technique, ensuring seamless blending and multi-view consistency. |
Achieves high-quality object replacement in various 3D scenes, including forward-facing and 360° scenes.
Demonstrates superior performance compared to existing methods like Instruct-NeRF2NeRF and Blended-NeRF, particularly in preserving scene structure and generating realistic object details.
Extends beyond object replacement to enable object removal and addition with realistic lighting and multi-view consistency. |
May remove important structural information from the original objects due to the Erase-and-Replace approach, making it unsuitable for editing tasks requiring preserving original geometry.
Suffers from artifacts common to text-to-image model distillation techniques, such as the Janus multi-face problem. |
3d scene editing, neural radiance fields, text-guided image inpainting, object replacement, multi-view consistency |
2401.17879
Report |
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error |
Jonas Ricker, Denis Lukovnikov, Asja Fischer |
With recent text-to-image models, anyone can generate deceptively realistic
images with arbitrary contents, fueling the growing threat of visual
disinformation. A key enabler for generating high-resolution images with low
computational cost has been the development of latent diffusion models (LDMs).
In contrast to conventional diffusion models, LDMs perform the denoising
process in the low-dimensional latent space of a pre-trained autoencoder (AE)
instead of the high-dimensional image space. Despite their relevance, the
forensic analysis of LDMs is still in its infancy. In this work we propose
AEROBLADE, a novel detection method which exploits an inherent component of
LDMs: the AE used to transform images between image and latent space. We find
that generated images can be more accurately reconstructed by the AE than real
images, allowing for a simple detection approach based on the reconstruction
error. Most importantly, our method is easy to implement and does not require
any training, yet nearly matches the performance of detectors that rely on
extensive training. We empirically demonstrate that AEROBLADE is effective
against state-of-the-art LDMs, including Stable Diffusion and Midjourney.
Beyond detection, our approach allows for the qualitative analysis of images,
which can be leveraged for identifying inpainted regions. We release our code
and data at https://github.com/jonasricker/aeroblade. |
Proposes AEROBLADE, a training-free method for detecting images generated by Latent Diffusion Models (LDMs) that exploits the reconstruction error of the model's autoencoder (AE). |
The proliferation of LDMs enables easy creation of hyperrealistic images, posing a significant threat of visual disinformation and necessitating effective detection methods. |
AEROBLADE leverages the observation that LDM AEs reconstruct generated images more accurately than real images. It computes the reconstruction error using LPIPS distance between an image and its reconstruction by the AE. |
AEROBLADE effectively distinguishes real images from images generated by seven state-of-the-art LDMs, achieving a mean Average Precision (AP) of 0.992.
The method doesn't require training, yet performs comparably to extensively trained classifiers.
AEROBLADE provides qualitative information about image regions and their reconstructability, enabling identification of inpainted areas. |
Achieving optimal performance requires access to the AE of the specific LDM used for generation.
Generated images with low complexity (e.g., logos) are more challenging to detect. |
image forensics, disinformation detection, latent diffusion models, autoencoder reconstruction error, generative ai |
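The detection rule in the record above (generated images are reconstructed more accurately by the AE than real ones) can be sketched as a simple threshold on reconstruction error. Everything concrete below is a stand-in chosen for illustration and is not taken from the AEROBLADE code: the "autoencoder" is a toy linear projection, mean squared error replaces the LPIPS distance the method is described as using, and the names `reconstruction_error`, `looks_generated`, and the threshold value are hypothetical.

```python
import numpy as np


def reconstruction_error(image, encode, decode, distance):
    """Distance between an image and its autoencoder reconstruction."""
    return distance(image, decode(encode(image)))


def looks_generated(image, encode, decode, distance, threshold):
    """Flag an image as generated when the AE reconstructs it unusually well."""
    return reconstruction_error(image, encode, decode, distance) < threshold


# Toy stand-ins (assumptions): a linear "autoencoder" that keeps 4 of 16
# dimensions, and mean squared error in place of LPIPS.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((16, 4)))  # orthonormal columns


def encode(x):
    return basis.T @ x


def decode(z):
    return basis @ z


def mse(a, b):
    return float(np.mean((a - b) ** 2))


generated = decode(rng.standard_normal(4))  # lies in the decoder's range
real = rng.standard_normal(16)              # generic vector, mostly outside it

flag_generated = looks_generated(generated, encode, decode, mse, threshold=1e-6)
flag_real = looks_generated(real, encode, decode, mse, threshold=1e-6)
```

In the toy setup, the sample produced by the decoder is reconstructed almost exactly and is flagged, while the generic vector is not. With a real LDM autoencoder the two error distributions overlap, so the threshold would need to be chosen from data.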
2401.17868
Report |
Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model |
Zihan Zhong, Zhiqiang Tang, Tong He, Haoyang Fang, Chun Yuan |
The Segment Anything Model (SAM) stands as a foundational framework for image
segmentation. While it exhibits remarkable zero-shot generalization in typical
scenarios, its advantage diminishes when applied to specialized domains like
medical imagery and remote sensing. To address this limitation, this paper
introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning
approach. By integrating ultra-lightweight convolutional parameters into
Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases
into the plain ViT encoder, further reinforcing SAM's local prior assumption.
Notably, Conv-LoRA not only preserves SAM's extensive segmentation knowledge
but also revives its capacity of learning high-level image semantics, which is
constrained by SAM's foreground-background segmentation pretraining.
Comprehensive experimentation across diverse benchmarks spanning multiple
domains underscores Conv-LoRA's superiority in adapting SAM to real-world
semantic segmentation tasks. |
The paper proposes Conv-LoRA, a novel parameter-efficient fine-tuning (PEFT) approach for adapting the Segment Anything Model (SAM) to downstream semantic segmentation tasks. |
While SAM excels in zero-shot generalization for generic object segmentation, its performance degrades in specialized domains like medical imaging and remote sensing. This work addresses these limitations by improving SAM's ability to capture local image priors and high-level semantic information. |
Conv-LoRA integrates lightweight convolutional layers within the Low-Rank Adaptation (LoRA) framework. It uses a Mixture-of-Experts (MoE) approach to dynamically inject local priors at appropriate feature scales. Additionally, the authors modify SAM's decoder to enable end-to-end multi-class segmentation. |
Conv-LoRA consistently outperforms other PEFT methods across diverse datasets spanning multiple domains (natural images, medical, agriculture, remote sensing).
Analysis reveals that SAM's pretraining, focused on foreground-background separation, hinders its ability to learn high-level semantics crucial for multi-class segmentation. LoRA helps recover this capability.
MoE proves effective in dynamically selecting the proper scale for local prior injection, leading to both performance gains and reduced computational cost compared to a multi-scale approach. |
While demonstrating strong general performance, Conv-LoRA may not consistently surpass domain-specific state-of-the-art models. Further tailoring of the mask decoder and prompt encoder might be needed for specific domains.
Conv-LoRA introduces a slight computational overhead compared to other PEFT methods due to the upscaling/downscaling operations within MoE. Exploring alternative local prior injection methods without explicit scaling could be beneficial. |
semantic segmentation, segment anything model (sam), parameter-efficient fine-tuning (peft), low-rank adaptation (lora), mixture-of-experts (moe) |
2401.17857
Report |
SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition |
Xu Hu, Yuxi Wang, Lue Fan, Junsong Fan, Junran Peng, Zhen Lei, Qing Li, Zhaoxiang Zhang |
3D Gaussian Splatting has emerged as an alternative 3D representation for
novel view synthesis, benefiting from its high-quality rendering results and
real-time rendering speed. However, the 3D Gaussians learned by 3D-GS have
ambiguous structures without any geometry constraints. This inherent issue in
3D-GS leads to a rough boundary when segmenting individual objects. To remedy
these problems, we propose SAGD, a conceptually simple yet effective
boundary-enhanced segmentation pipeline for 3D-GS to improve segmentation
accuracy while preserving segmentation speed. Specifically, we introduce a
Gaussian Decomposition scheme, which ingeniously utilizes the special structure
of 3D Gaussians to identify and then decompose the boundary Gaussians.
Moreover, to achieve fast interactive 3D segmentation, we introduce a novel
training-free pipeline by lifting a 2D foundation model to 3D-GS. Extensive
experiments demonstrate that our approach achieves high-quality 3D segmentation
without rough boundary issues, which can be easily applied to other scene
editing tasks. |
This paper proposes SAGD, a training-free pipeline for interactive and effective segmentation of 3D Gaussian Splatting (3D-GS), addressing the rough boundary issue inherent in existing methods. |
Accurate and efficient 3D segmentation in 3D-GS is crucial for scene understanding and editing applications, but existing methods suffer from rough boundaries due to the ambiguous nature of learned Gaussians. |
The method leverages a 2D foundation model (SAM) to generate multi-view masks from user prompts and introduces a Gaussian Decomposition scheme to decompose boundary Gaussians, thus refining segmentation boundaries. A voting strategy then determines the final 3D segmentation. |
Achieves high-quality 3D segmentation with smoother boundaries compared to previous methods (SA3D, SAGA).
Demonstrates efficiency with significantly less or no training time compared to learning-based approaches.
Shows strong performance on various datasets (SPIn-NeRF, LERF, Mip-NeRF 360) and applicability to scene editing and collision detection tasks. |
Performance degrades with sparse 3D Gaussian distribution, suggesting future work on structured 3D-GS representation.
The confidence score threshold requires manual adjustment depending on scene complexity and view quality. |
3d gaussian splatting, 3d segmentation, boundary enhancement, gaussian decomposition, scene editing |
2401.17807
Report |
Advances in 3D Generation: A Survey |
Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan |
Generating 3D models lies at the core of computer graphics and has been the
focus of decades of research. With the emergence of advanced neural
representations and generative models, the field of 3D content generation is
developing rapidly, enabling the creation of increasingly high-quality and
diverse 3D models. The rapid growth of this field makes it difficult to stay
abreast of all recent developments. In this survey, we aim to introduce the
fundamental methodologies of 3D generation and establish a structured
roadmap, encompassing 3D representation, generation methods, datasets, and
corresponding applications. Specifically, we introduce the 3D representations
that serve as the backbone for 3D generation. Furthermore, we provide a
comprehensive overview of the rapidly growing literature on generation methods,
categorized by the type of algorithmic paradigms, including feedforward
generation, optimization-based generation, procedural generation, and
generative novel view synthesis. Lastly, we discuss available datasets,
applications, and open challenges. We hope this survey will help readers
explore this exciting topic and foster further advancements in the field of 3D
content generation. |
This paper presents a comprehensive survey of 3D generation methods, encompassing 3D representations, generation techniques, datasets, and applications. |
3D content generation is crucial for various applications like video games, movies, and immersive experiences, and has seen rapid advancements due to neural representations and generative models. |
The paper categorizes generation methods into four paradigms: feedforward, optimization-based, procedural, and generative novel view synthesis, analyzing each with representative examples. |
The survey provides a structured roadmap of 3D generation methodologies, highlighting advancements in generative models, 3D representations, and algorithmic paradigms.
It discusses commonly used datasets for 3D generation, categorized by 3D data, multi-view images, and single-view images.
The paper explores applications like 3D human, face, and general scene generation, and discusses 3D editing techniques. |
A key limitation is the lack of objective metrics to comprehensively evaluate the quality and diversity of generated 3D models.
The field still needs large-scale, high-quality 3D datasets, and better utilization of existing 2D data for 3D generation. |
3d generation, neural rendering, generative models, scene representations, 3d deep learning |
2401.17629
Report |
Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models |
Kyungsung Lee, Donggyu Lee, Myungjoo Kang |
Diffusion models have recently emerged as a promising framework for Image
Restoration (IR), owing to their ability to produce high-quality
reconstructions and their compatibility with established methods. Existing
methods for solving noisy inverse problems in IR consider the pixel-wise
data-fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware
diffusion model for IR with Gaussian noise. Our model encourages images to
preserve data-fidelity in both the spatial and frequency domains, resulting in
enhanced reconstruction quality. We comprehensively evaluate the performance of
our model on a variety of noisy inverse problems, including inpainting,
denoising, and super-resolution. Our thorough evaluation demonstrates that
SaFaRI achieves state-of-the-art performance on both the ImageNet datasets and
FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS
and FID metrics. |
This paper proposes SaFaRI, a novel diffusion model-based image restoration approach that incorporates spatial and frequency information into the data fidelity term for enhanced restoration performance. |
Existing methods for solving noisy inverse problems in image restoration typically rely on pixel-wise data fidelity, which does not fully capture perceptual features important for high-quality image reconstruction. |
SaFaRI modifies the data fidelity term using bicubic upsampling for spatial context and Fourier transformation for frequency domain representation, allowing for a more comprehensive representation of perceptual attributes. The method iteratively refines the generated image by minimizing the weighted sum of spatial and frequency-aware data fidelity terms. |
SaFaRI achieves state-of-the-art performance on ImageNet and FFHQ datasets, outperforming existing zero-shot image restoration methods in terms of LPIPS and FID metrics.
The method effectively restores images across various tasks, including inpainting, denoising, and super-resolution.
The use of either spatial or frequency information alone in SaFaRI is sufficient to outperform existing methods, demonstrating the effectiveness of the proposed approach. |
The transformation applied to the data fidelity term may introduce perturbations to the feasible solutions due to the influence of the prior term.
Future work could involve a comprehensive analysis of these solution perturbations to strengthen the theoretical foundation of the methodology. |
image restoration, diffusion models, data fidelity, perceptual quality, spatial and frequency information |
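The weighted sum of fidelity terms described in the entry above can be written schematically as follows. All symbols are assumptions introduced here for illustration: $y$ is the noisy measurement, $A$ the degradation operator, $\hat{x}_0$ the current clean-image estimate, $U$ the bicubic upsampling operator, $\mathcal{F}$ the Fourier transform, and $\lambda_s, \lambda_f$ the weights of the two terms. The paper's exact formulation may differ.

```latex
\mathcal{L}(\hat{x}_0)
  = \lambda_s \,\bigl\| U(y) - U(A\hat{x}_0) \bigr\|_2^2
  + \lambda_f \,\bigl\| \mathcal{F}(y) - \mathcal{F}(A\hat{x}_0) \bigr\|_2^2
```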
2401.17509
Report |
Anything in Any Scene: Photorealistic Video Object Insertion |
Chen Bai, Zeman Shao, Guoxiang Zhang, Di Liang, Jie Yang, Zhuorui Zhang, Yujian Guo, Chengzhang Zhong, Yiqiao Qiu, Zhendong Wang, Yichen Guan, Xiaoyin Zheng, Tao Wang, Cheng Lu |
Realistic video simulation has shown significant potential across diverse
applications, from virtual reality to film production. This is particularly
true for scenarios where capturing videos in real-world settings is either
impractical or expensive. Existing approaches in video simulation often fail to
accurately model the lighting environment, represent the object geometry, or
achieve high levels of photorealism. In this paper, we propose Anything in Any
Scene, a novel and generic framework for realistic video simulation that
seamlessly inserts any object into an existing dynamic video with a strong
emphasis on physical realism. Our proposed general framework encompasses three
key processes: 1) integrating a realistic object into a given scene video with
proper placement to ensure geometric realism; 2) estimating the sky and
environmental lighting distribution and simulating realistic shadows to enhance
the light realism; 3) employing a style transfer network that refines the final
video output to maximize photorealism. We experimentally demonstrate that
the Anything in Any Scene framework produces simulated videos of great geometric
realism, lighting realism, and photorealism. By significantly mitigating the
challenges associated with video data generation, our framework offers an
efficient and cost-effective solution for acquiring high-quality videos.
Furthermore, its applications extend well beyond video data augmentation,
showing promising potential in virtual reality, video editing, and various
other video-centric applications. Please check our project website
https://anythinginanyscene.github.io for access to our project code and more
high-resolution video results. |
This paper introduces "Anything in Any Scene", a novel framework for realistic video simulation that seamlessly inserts any object into existing dynamic videos with a focus on physical realism. |
Existing video simulation methods often struggle to accurately model lighting, object geometry, and photorealism, limiting their application in fields like autonomous driving and robotics. |
The framework employs a three-step process: 1) object integration into the scene video with proper placement, 2) sky and environmental lighting estimation and realistic shadow simulation, 3) style transfer network refinement for enhanced photorealism. |
The proposed framework generates simulated videos with high geometric realism, lighting realism, and photorealism, outperforming other methods.
Human studies and Frechet Inception Distance (FID) scores demonstrate the effectiveness of the framework.
The framework proves valuable for data augmentation in perception algorithms, improving object detection performance on rare classes. |
The placement of objects in constrained indoor scenes can be challenging due to limited space.
Future work includes incorporating improved 3D mesh reconstruction methods and exploring new applications beyond data augmentation. |
video simulation, photorealism, object insertion, lighting estimation, style transfer |
2401.17270
Report |
YOLO-World: Real-Time Open-Vocabulary Object Detection |
Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan |
The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and
trained object categories limits their applicability in open scenarios.
Addressing this limitation, we introduce YOLO-World, an innovative approach
that enhances YOLO with open-vocabulary detection capabilities through
vision-language modeling and pre-training on large-scale datasets.
Specifically, we propose a new Re-parameterizable Vision-Language Path
Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate
the interaction between visual and linguistic information. Our method excels in
detecting a wide range of objects in a zero-shot manner with high efficiency.
On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on
V100, which outperforms many state-of-the-art methods in terms of both accuracy
and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable
performance on several downstream tasks, including object detection and
open-vocabulary instance segmentation. |
Introduces YOLO-World, an efficient open-vocabulary object detector that enhances traditional YOLO with open-vocabulary capabilities via vision-language modeling and large-scale pre-training. |
Addresses the limitation of traditional object detectors, like YOLO, being restricted to a fixed set of object categories. |
Leverages a pre-trained CLIP text encoder and introduces a novel Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Pre-trains the model on large-scale detection, grounding, and image-text datasets using a region-text contrastive learning scheme. |
Achieves state-of-the-art zero-shot performance on the LVIS dataset with 35.4 AP at 52.0 FPS on a V100 GPU.
Demonstrates strong generalization capabilities, effectively transferring to downstream tasks like open-vocabulary instance segmentation and referring object detection.
Proves the effectiveness of vision-language pre-training for smaller models, allowing for efficient deployment. |
Fine-tuning on limited datasets can degrade the generalization ability gained from pre-training.
Using excessive amounts of pseudo-labeled data for pre-training can negatively impact smaller models. |
open-vocabulary object detection, vision-language pre-training, yolo, region-text contrastive learning, real-time object detection |
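To make the open-vocabulary idea in the entry above concrete, the sketch below assigns each detected region the vocabulary entry whose text embedding is most similar to the region embedding. The 2D toy vectors stand in for CLIP text embeddings and detector region features, and the function names are hypothetical. It shows only similarity-based matching, not RepVL-PAN or the region-text contrastive loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_labels(region_embeddings, text_embeddings):
    """Pick, for each region, the vocabulary entry with the highest similarity.

    text_embeddings maps a category name to its embedding, so the vocabulary
    can be changed at inference time without retraining (toy stand-in).
    """
    results = []
    for region in region_embeddings:
        scores = {name: cosine(region, vec) for name, vec in text_embeddings.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results


vocabulary = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}  # toy text embeddings
regions = [[0.9, 0.1], [0.2, 0.8]]                    # toy region embeddings
matches = assign_labels(regions, vocabulary)
```

Because the vocabulary is just a dictionary of embeddings, adding a new category only requires adding one more entry.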
2401.17258
Report |
You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation |
Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos |
In this paper, we introduce YONOS-SR, a novel stable diffusion-based approach
for image super-resolution that yields state-of-the-art results using only a
single DDIM step. We propose a novel scale distillation approach to train our
SR model. Instead of directly training our SR model on the scale factor of
interest, we start by training a teacher model on a smaller magnification
scale, thereby making the SR problem simpler for the teacher. We then train a
student model for a higher magnification scale, using the predictions of the
teacher as a target during the training. This process is repeated iteratively
until we reach the target scale factor of the final model. The rationale behind
our scale distillation is that the teacher aids the student diffusion model
training by i) providing a target adapted to the current noise level rather
than using the same target coming from ground truth data for all noise levels
and ii) providing an accurate target as the teacher has a simpler task to
solve. We empirically show that the distilled model significantly outperforms
the model trained for high scales directly, specifically with few steps during
inference. Having a strong diffusion model that requires only one step allows
us to freeze the U-Net and fine-tune the decoder on top of it. We show that the
combination of the spatially distilled U-Net and the fine-tuned decoder, using only
a single step, outperforms state-of-the-art methods that require 200 steps. |
This paper presents YONOS-SR, a novel stable diffusion-based image super-resolution approach that achieves state-of-the-art results using only a single DDIM step. |
Diffusion models are computationally expensive for image super-resolution due to the large number of denoising steps required. YONOS-SR addresses this by enabling high-quality super-resolution with just one step, making it significantly faster. |
The paper introduces 'scale distillation', a novel training strategy where a 'student' model learns from a 'teacher' model trained on a smaller magnification scale. This simplifies the super-resolution task, allowing the student to achieve good results with fewer steps. Additionally, the decoder is fine-tuned on top of the frozen one-step diffusion model to further improve quality. |
Using only one step, YONOS-SR outperforms state-of-the-art diffusion-based SR methods that require 200 steps.
Scale distillation significantly improves performance, especially with few steps, by providing a more accurate and noise-adaptive target for training.
Fine-tuning the decoder on top of the frozen one-step diffusion model further enhances results. |
The model's performance with extremely low-resolution images can be further improved.
Exploring the application of scale distillation to other inverse imaging problems, such as image inpainting, is a promising future direction. |
image super-resolution, diffusion models, stable diffusion, scale distillation, fast inference |
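The iterative teacher-student progression described in the entry above can be sketched as a simple schedule of magnification scales. The starting scale, the target scale, and the doubling factor below are assumptions chosen for illustration. The sketch lists only which model teaches which and performs no training.

```python
def distillation_stages(start_scale, target_scale, factor=2):
    """List (teacher_scale, student_scale) pairs up to the target scale.

    Each student is trained at a higher magnification than its teacher and
    then serves as the teacher of the next stage (assumed schedule).
    """
    if start_scale <= 0 or factor <= 1:
        raise ValueError("start_scale must be positive and factor greater than 1")
    if target_scale < start_scale:
        raise ValueError("target_scale must not be smaller than start_scale")
    stages = []
    teacher = start_scale
    while teacher < target_scale:
        student = min(teacher * factor, target_scale)
        stages.append((teacher, student))
        teacher = student
    return stages


stages = distillation_stages(2, 8)
```

With these assumed values the schedule has two stages: a 2x teacher supervises a 4x student, which then supervises the final 8x model.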
2401.17181
Report |
Transfer Learning for Text Diffusion Models |
Kehang Han, Kathleen Kenealy, Aditya Barua, Noah Fiedel, Noah Constant |
In this report, we explore the potential for text diffusion to replace
autoregressive (AR) decoding for the training and deployment of large language
models (LLMs). We are particularly interested to see whether pretrained AR
models can be transformed into text diffusion models through a lightweight
adaptation procedure we call "AR2Diff". We begin by establishing a strong
baseline setup for training text diffusion models. Comparing across multiple
architectures and pretraining objectives, we find that training a decoder-only
model with a prefix LM objective is best or near-best across several tasks.
Building on this finding, we test various transfer learning setups for text
diffusion models. On machine translation, we find that text diffusion
underperforms the standard AR approach. However, on code synthesis and
extractive QA, we find diffusion models trained from scratch outperform AR
models in many cases. We also observe quality gains from AR2Diff -- adapting AR
models to use diffusion decoding. These results are promising given that text
diffusion is relatively underexplored and can be significantly faster than AR
decoding for long text generation. |
This paper investigates whether pretrained autoregressive (AR) large language models (LLMs) can be adapted for non-autoregressive text generation with text diffusion, using a lightweight adaptation procedure called "AR2Diff". |
This work aims to address the limitations of autoregressive decoding in LLMs, particularly its inefficiency in long text generation, by exploring the feasibility of text diffusion as a faster alternative. |
The authors compare different model architectures, pretraining objectives, and transfer learning strategies for text diffusion. They also introduce AR2Diff, a method to adapt pretrained AR models for diffusion, and evaluate its performance against AR and diffusion baselines on machine translation, question answering, and code synthesis tasks. |
Decoder-only models pretrained with a prefix language modeling objective are found to be most suitable for text diffusion.
Text diffusion models can achieve competitive performance with autoregressive models on code synthesis and question answering tasks, but not on machine translation.
AR2Diff, especially with longer adaptation stages, can further improve the performance of diffusion models, often surpassing pure diffusion baselines and sometimes approaching autoregressive baselines. |
The study primarily focuses on a limited set of tasks and datasets.
Further research is needed to explore the full potential of caching and other optimization techniques to enhance the inference speed of text diffusion. |
text generation, diffusion models, non-autoregressive models, large language models, transfer learning |
2401.17053
Report |
BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation |
Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, Pan Ji |
We present BlockFusion, a diffusion-based model that generates 3D scenes as
unit blocks and seamlessly incorporates new blocks to extend the scene.
BlockFusion is trained using datasets of 3D blocks that are randomly cropped
from complete 3D scene meshes. Through per-block fitting, all training blocks
are converted into hybrid neural fields: a tri-plane containing the
geometry features, followed by a Multi-layer Perceptron (MLP) for decoding the
signed distance values. A variational auto-encoder is employed to compress the
tri-planes into the latent tri-plane space, on which the denoising diffusion
process is performed. Diffusion applied to the latent representations allows
for high-quality and diverse 3D scene generation. To expand a scene during
generation, one needs only to append empty blocks to overlap with the current
scene and extrapolate existing latent tri-planes to populate new blocks. The
extrapolation is done by conditioning the generation process with the feature
samples from the overlapping tri-planes during the denoising iterations. Latent
tri-plane extrapolation produces semantically and geometrically meaningful
transitions that harmoniously blend with the existing scene. A 2D layout
conditioning mechanism is used to control the placement and arrangement of
scene elements. Experimental results indicate that BlockFusion is capable of
generating diverse, geometrically consistent and unbounded large 3D scenes with
unprecedented high-quality shapes in both indoor and outdoor scenarios. |
Presents BlockFusion, a novel method for generating expansive 3D scenes using latent tri-plane representation and diffusion models. It extrapolates new latent codes for unseen regions, enabling the generation of out-of-bound content. |
Existing 3D scene generation methods struggle to generate coherent and expansive scenes due to limited capacity and lack of extrapolation capabilities. |
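1. Fit each training block to a tri-plane neural field and compress the tri-planes into latent tri-planes using a variational autoencoder (VAE). 2. Train a diffusion model on these latent tri-planes. 3. To expand a scene, append empty blocks that overlap the current scene and extrapolate their latent tri-planes by conditioning the denoising iterations on feature samples from the overlapping tri-planes. 4. Decode the extrapolated latent tri-planes to generate new 3D content, with a 2D layout condition controlling the placement of scene elements. |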
Generates coherent and expansive 3D scenes with diverse layouts and styles.
Outperforms existing methods in terms of scene consistency, diversity, and visual quality.
Demonstrates strong generalization ability, enabling the generation of novel content beyond the training set. |
Limited control over the generated content.
Computational cost for large scenes. |
3d scene generation, diffusion model, latent representation, tri-plane, extrapolation |
2401.16861
Report |
Repositioning the Subject within Image |
Yikai Wang, Chenjie Cao, Ke Fan, Qiaole Dong, Yifan Li, Xiangyang Xue, Yanwei Fu |
Current image manipulation primarily centers on static manipulation, such as
replacing specific regions within an image or altering its overall style. In
this paper, we introduce an innovative dynamic manipulation task, subject
repositioning. This task involves relocating a user-specified subject to a
desired position while preserving the image's fidelity. Our research reveals
that the fundamental sub-tasks of subject repositioning, which include filling
the void left by the repositioned subject, reconstructing obscured portions of
the subject and blending the subject to be consistent with surrounding areas,
can be effectively reformulated as a unified, prompt-guided inpainting task.
Consequently, we can employ a single diffusion generative model to address
these sub-tasks using various task prompts learned through our proposed task
inversion technique. Additionally, we integrate pre-processing and
post-processing techniques to further enhance the quality of subject
repositioning. These elements together form our SEgment-gEnerate-and-bLEnd
(SEELE) framework. To assess SEELE's effectiveness in subject repositioning, we
assemble a real-world subject repositioning dataset called ReS. Results of
SEELE on ReS demonstrate its efficacy. |
This paper introduces the novel task of subject repositioning in images and proposes the SEELE framework to address it. |
Subject repositioning enables dynamic object manipulation within images, pushing beyond static editing techniques. |
SEELE employs a single diffusion model guided by learned task prompts (task inversion) to tackle sub-tasks like subject removal, completion, and harmonization. |
SEELE effectively repositions subjects in diverse scenes, outperforming Stable Diffusion variants on the ReS dataset.
Task inversion proves valuable for adapting a single diffusion model to multiple sub-tasks, improving consistency and quality.
SEELE's modular design allows for the incorporation of components like depth estimation and matting, enhancing realism. |
SEELE's performance depends on the accuracy of individual components, requiring manual intervention in case of errors.
Developing models for open-vocabulary amodal mask generation is crucial for improved subject completion with occlusions. |
subject repositioning, image manipulation, diffusion models, task inversion, image inpainting |
2401.16764
Report |
BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion |
Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li |
Witnessing the evolution of text-to-image diffusion models, significant
strides have been made in text-to-3D generation. Currently, two primary
paradigms dominate the field of text-to-3D: the feed-forward generation
solutions, capable of swiftly producing 3D assets but often yielding coarse
results, and the Score Distillation Sampling (SDS) based solutions, known for
generating high-fidelity 3D assets albeit at a slower pace. The synergistic
integration of these methods holds substantial promise for advancing 3D
generation techniques. In this paper, we present BoostDream, a highly efficient
plug-and-play 3D refining method designed to transform coarse 3D assets into
high-quality ones. The BoostDream framework comprises three distinct processes: (1)
We introduce 3D model distillation that fits differentiable representations
from the 3D assets obtained through feed-forward generation. (2) A novel
multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion
model to refine the 3D assets. (3) We propose to use prompt and multi-view
consistent normal maps as guidance in refinement. Our extensive experiments are
conducted on different differentiable 3D representations, revealing that
BoostDream excels in generating high-quality 3D assets rapidly, overcoming the
Janus problem compared to conventional SDS-based methods. This breakthrough
signifies a substantial advancement in both the efficiency and quality of 3D
generation processes. |
This paper introduces BoostDream, a highly efficient plug-and-play 3D refining method for transforming coarse 3D assets into high-quality ones by combining advantages of feed-forward and SDS-based methods. |
Current text-to-3D generation methods suffer from either coarse results (feed-forward methods) or slow generation speed (SDS-based methods). BoostDream aims to address this trade-off by enabling efficient generation of high-quality 3D assets. |
BoostDream consists of three stages: (1) 3D model distillation for initializing differentiable 3D representations from coarse assets; (2) Multi-view SDS loss utilizing a multi-view aware 2D diffusion model for refinement; and (3) Refinement guided by prompt and multi-view consistent normal maps. |
BoostDream excels in generating high-quality 3D assets rapidly.
It effectively mitigates the Janus problem encountered in conventional SDS-based methods.
It demonstrates strong generalizability by being applicable to various 3D differentiable representations. |
The current implementation relies on existing 2D diffusion models, inheriting their limitations and biases.
Future work could explore optimizing the multi-view rendering system for improved efficiency and exploring alternative control conditions beyond normal maps. |
3d generation, text-to-3d, diffusion models, differentiable rendering, multi-view synthesis |
2401.16762
Report |
Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization |
Henglei Lv, Jiayu Xiao, Liang Li, Qingming Huang |
Diffusion-based text-to-image personalization has achieved great success in
generating subjects specified by users among various contexts. Even so,
existing finetuning-based methods still suffer from model overfitting, which
greatly harms the generative diversity, especially when given subject images
are few. To this end, we propose Pick-and-Draw, a training-free semantic
guidance approach to boost identity consistency and generative diversity for
personalization methods. Our approach consists of two components: appearance
picking guidance and layout drawing guidance. As for the former, we construct
an appearance palette with visual features from the reference image, where we
pick local patterns for generating the specified subject with consistent
identity. As for layout drawing, we outline the subject's contour by referring
to a generative template from the vanilla diffusion model, and inherit the
strong image prior to synthesize diverse contexts according to different text
conditions. The proposed approach can be applied to any personalized diffusion
models and requires as few as a single reference image. Qualitative and
quantitative experiments show that Pick-and-Draw consistently improves identity
consistency and generative diversity, pushing the trade-off between subject
fidelity and image-text fidelity to a new Pareto frontier. |
Proposes Pick-and-Draw, a training-free semantic guidance approach to enhance identity consistency and generative diversity for text-to-image personalization models. |
Existing finetuning-based personalization methods suffer from model overfitting, harming generative diversity, especially with limited subject images. |
Uses appearance picking guidance (extracts visual features from the reference image to guide subject generation) and layout drawing guidance (uses the subject's contour from the vanilla diffusion model as a template for diverse context synthesis). |
Pick-and-Draw consistently improves identity consistency and generative diversity across different personalization methods.
Quantitative evaluation on DreamBench dataset shows significant improvement in subject fidelity and image-text alignment.
Directly applying Pick-and-Draw to vanilla Stable Diffusion yields surprisingly favorable outcomes, showing potential for training-free single-image personalization. |
Pick-and-Draw may fail when the Stable Diffusion template provides incorrect layout priors.
Incomplete appearance transfer may occur if the generated subject significantly differs from the reference. |
text-to-image personalization, diffusion models, semantic guidance, appearance transfer, layout drawing |
2401.16741
Report |
MESA: Matching Everything by Segmenting Anything |
Yesheng Zhang, Xu Zhao |
Feature matching is a crucial task in the field of computer vision, which
involves finding correspondences between images. Previous studies achieve
remarkable performance using learning-based feature comparison. However, the
pervasive presence of matching redundancy between images gives rise to
unnecessary and error-prone computations in these methods, imposing limitations
on their accuracy. To address this issue, we propose MESA, a novel approach to
establish precise area (or region) matches for efficient matching redundancy
reduction. MESA first leverages the advanced image understanding capability of
SAM, a state-of-the-art foundation model for image segmentation, to obtain
image areas with implicit semantics. Then, a multi-relational graph is proposed
to model the spatial structure of these areas and construct their scale
hierarchy. Based on graphical models derived from the graph, the area matching
is reformulated as an energy minimization task and effectively resolved.
Extensive experiments demonstrate that MESA yields substantial precision
improvement for multiple point matchers in indoor and outdoor downstream tasks,
e.g. +13.61% for DKM in indoor pose estimation. |
MESA, a novel method for precise area matching based on the Segment Anything Model (SAM), is proposed to effectively reduce matching redundancy in feature matching and promote accurate point matching. |
Feature matching suffers from matching redundancy, limiting the accuracy of existing methods. Although matching redundancy can be reduced by high-level image understanding, existing methods are either computationally expensive or rely on impractical semantic segmentation. |
MESA constructs a multi-relational Area Graph (AG) to model spatial structures and scale hierarchy of image areas segmented by SAM. Leveraging AG, MESA formulates area matching as an energy minimization problem within a Markov Random Field framework and solves it efficiently using Graph Cut and a learned area similarity model. A global matching energy refinement is further introduced to enhance area matching accuracy by considering the AG structures of both input images. |
MESA significantly outperforms the previous semantic segmentation-based area matching method (SGAM) on the ScanNet1500 benchmark.
MESA remarkably boosts the accuracy of both semi-dense and dense point matchers for indoor and outdoor relative pose estimation, achieving state-of-the-art results on ScanNet1500 and MegaDepth1500 benchmarks.
MESA effectively improves the performance of various point matchers in visual odometry tasks on the KITTI360 dataset. |
MESA does not fully exploit the potential of SAM features for area matching.
The speed of MESA can be further improved for latency-sensitive applications. |
feature matching, matching redundancy, area matching, segment anything model (sam), graphical model |
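The entry above frames area matching as energy minimization over a graphical model. The toy sketch below shows the general shape of such a problem: each source area picks a target area, a unary cost measures dissimilarity, and a pairwise cost penalizes neighboring source areas that map to unrelated target areas. The energy terms, the data, and all names are assumptions invented for this example. It also swaps in brute-force enumeration for the Graph Cut solver named above, which is only workable at this tiny size.

```python
from itertools import product


def total_energy(assignment, unary, edges, adjacent_targets, smooth_weight):
    """Sum of unary costs plus a penalty per inconsistent neighboring pair."""
    energy = sum(unary[s][t] for s, t in enumerate(assignment))
    for a, b in edges:
        ta, tb = assignment[a], assignment[b]
        related = (ta, tb) in adjacent_targets or (tb, ta) in adjacent_targets
        if ta != tb and not related:
            energy += smooth_weight
    return energy


def minimize_energy(unary, edges, adjacent_targets, smooth_weight=1.0):
    """Exhaustively search all assignments and return (assignment, energy)."""
    num_targets = len(unary[0])
    best = None
    for assignment in product(range(num_targets), repeat=len(unary)):
        energy = total_energy(assignment, unary, edges, adjacent_targets, smooth_weight)
        if best is None or energy < best[1]:
            best = (assignment, energy)
    return best


# unary[s][t]: dissimilarity between source area s and target area t (toy values).
unary = [
    [0.1, 0.8, 0.9],
    [0.9, 0.5, 0.4],
    [0.9, 0.8, 0.1],
]
edges = [(0, 1), (1, 2)]             # neighboring source areas
adjacent_targets = {(0, 1), (1, 2)}  # neighboring target areas
best_assignment, best_energy = minimize_energy(unary, edges, adjacent_targets)
```

In this toy instance the lowest unary cost alone would send source area 1 to target 2, but the pairwise term makes the structurally consistent assignment cheaper overall.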
2401.16663
Report |
VR-GS: A Physical Dynamics-Aware Interactive Gaussian Splatting System in Virtual Reality |
Ying Jiang, Chang Yu, Tianyi Xie, Xuan Li, Yutao Feng, Huamin Wang, Minchen Li, Henry Lau, Feng Gao, Yin Yang, Chenfanfu Jiang |
As consumer Virtual Reality (VR) and Mixed Reality (MR) technologies gain
momentum, there's a growing focus on the development of engagements with 3D
virtual content. Unfortunately, traditional techniques for content creation,
editing, and interaction within these virtual spaces are fraught with
difficulties. They tend to be not only engineering-intensive but also require
extensive expertise, which adds to the frustration and inefficiency in virtual
object manipulation. Our proposed VR-GS system represents a leap forward in
human-centered 3D content interaction, offering a seamless and intuitive user
experience. By developing a physical dynamics-aware interactive Gaussian
Splatting in a Virtual Reality setting, and constructing a highly efficient
two-level embedding strategy alongside deformable body simulations, VR-GS
ensures real-time execution with highly realistic dynamic responses. The
components of our Virtual Reality system are designed for high efficiency and
effectiveness, starting from detailed scene reconstruction and object
segmentation, advancing through multi-view image in-painting, and extending to
interactive physics-based editing. The system also incorporates real-time
deformation embedding and dynamic shadow casting, ensuring a comprehensive and
engaging virtual experience. Our project page is available at:
https://yingjiang96.github.io/VR-GS/. |
Presents VR-GS, a novel system for real-time, physics-based interaction with 3D scenes represented by Gaussian Splatting. |
Traditional 3D content creation is complex and not user-friendly. VR-GS offers an intuitive and accessible way to interact with and edit high-fidelity 3D scenes in real-time. |
Combines 3D Gaussian Splatting with Extended Position Based Dynamics (XPBD) using a novel two-level embedding strategy. This allows for real-time simulation of deformable objects represented by Gaussians, enhanced by segmentation, inpainting, and dynamic shadow mapping. |
Achieves real-time performance while maintaining high visual fidelity in physics-based interactions.
The two-level embedding strategy effectively mitigates spiky artifacts common in deformed Gaussian Splatting.
User study confirms significant improvements in immersion and realism compared to traditional transform-based interactions. |
Rendering high-fidelity Gaussian Splatting in VR at high resolutions can cause latency issues.
Physical parameters are currently manually defined, limiting accessibility for non-expert users. |
gaussian splatting, neural radiance fields, virtual reality, physics-based simulation, real-time interaction |
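For the VR-GS entry above, the XPBD solver it builds on can be illustrated with a minimal sketch of one iteration for a single distance constraint between two particles. This is the generic XPBD update, not the VR-GS implementation: the function name `xpbd_distance_step`, its parameters, and the choice of a distance constraint are assumptions made for illustration, and the two-level Gaussian embedding is not modeled.

```python
import math

def xpbd_distance_step(p1, p2, w1, w2, rest_len, compliance, dt, lam):
    """One XPBD iteration for the constraint C = |p1 - p2| - rest_len.

    p1, p2: positions as lists of floats; w1, w2: inverse masses;
    compliance: inverse stiffness (0 means a rigid constraint);
    lam: accumulated Lagrange multiplier for this constraint.
    Hypothetical helper for illustration only.
    """
    d = [a - b for a, b in zip(p1, p2)]
    dist = math.sqrt(sum(c * c for c in d))
    if dist == 0.0 or (w1 + w2) == 0.0:
        return p1, p2, lam  # degenerate case: nothing to correct
    n = [c / dist for c in d]  # gradient of C w.r.t. p1 (and -n w.r.t. p2)
    c_val = dist - rest_len
    alpha_tilde = compliance / (dt * dt)  # time-step-scaled compliance
    dlam = (-c_val - alpha_tilde * lam) / (w1 + w2 + alpha_tilde)
    new_p1 = [a + w1 * dlam * g for a, g in zip(p1, n)]
    new_p2 = [b - w2 * dlam * g for b, g in zip(p2, n)]
    return new_p1, new_p2, lam + dlam
```

With zero compliance and equal inverse masses, a single step moves both particles symmetrically until the rest length is restored; a positive compliance yields a softer, time-step-independent response.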
2401.16575
Report |
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking |
Ivana Beňová, Jana Košecká, Michal Gregor, Martin Tamajka, Marcel Veselý, Marián Šimko |
The dominant probing approaches rely on the zero-shot performance of
image-text matching tasks to gain a finer-grained understanding of the
representations learned by recent multimodal image-language transformer models.
The evaluation is carried out on carefully curated datasets focusing on
counting, relations, attributes, and others. This work introduces an
alternative probing strategy called guided masking. The proposed approach
ablates different modalities using masking and assesses the model's ability to
predict the masked word with high accuracy. We focus on studying multimodal
models that consider regions of interest (ROI) features obtained by object
detectors as input tokens. We probe the understanding of verbs using guided
masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models
can predict the correct verb with high accuracy. This contrasts with previous
conclusions drawn from image-text matching probing techniques that frequently
fail in situations requiring verb understanding. The code for all experiments
will be publicly available at https://github.com/ivana-13/guided_masking. |
This paper introduces 'guided masking', a novel probing technique for evaluating multimodal vision-language transformer models. This method involves masking specific tokens in captions, particularly verbs, and assessing the model's ability to predict them accurately, offering a more nuanced understanding of the model's reasoning compared to traditional image-text matching. |
Understanding the fine-grained capabilities of multimodal transformers, especially their grasp of linguistic nuances like verb understanding, is crucial for advancing their interpretability and performance. Existing methods like image-text matching have limitations in providing such insights, necessitating alternative probing techniques like guided masking. |
The authors employ 'guided masking' by masking verbs in image captions and evaluating the model's prediction accuracy for the masked word. Additionally, they ablate visual tokens representing the action's subject to assess the model's grounding of verbs in visual features. |
Guided masking reveals that the studied models (ViLBERT, LXMERT, UNITER, VisualBERT) achieve over 75% accuracy in predicting masked verbs, indicating a better understanding of verbs than previously suggested by image-text matching methods.
Ablating visual tokens associated with the verb's subject leads to a performance drop, highlighting the models' grounding of verbs in visual information.
The study demonstrates the limitations of image-text matching for probing, showing instances where models correctly classify mismatched pairs based on object recognition rather than verb understanding. |
The study primarily focuses on models utilizing ROI features from object detectors, limiting generalizability to models employing different visual feature representations.
Future work can extend guided masking to probe other linguistic aspects, such as objects, attributes, or counting, providing a comprehensive understanding of multimodal transformer capabilities. |
multimodal transformers, probing techniques, verb understanding, vision-language models, guided masking |
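The guided-masking probe described in the entry above can be sketched in a few lines: replace the verb token with a mask token, ask the model for the masked word, and score exact matches. The names `mask_verb`, `masked_verb_accuracy`, and `predict_fn` are hypothetical, the `[MASK]` string is an assumed placeholder, and `predict_fn` stands in for whatever multimodal model is being probed; this is not the authors' code.

```python
def mask_verb(tokens, verb_index, mask_token="[MASK]"):
    """Return a copy of the caption with the verb masked, plus the target verb."""
    masked = list(tokens)
    target = masked[verb_index]
    masked[verb_index] = mask_token
    return masked, target

def masked_verb_accuracy(examples, predict_fn):
    """Fraction of captions whose masked verb is predicted exactly.

    examples: list of (tokens, verb_index) pairs.
    predict_fn(masked_tokens, position) -> predicted word; a stand-in for
    a real model, which would also receive the image region features.
    """
    if not examples:
        return 0.0
    correct = 0
    for tokens, verb_index in examples:
        masked, target = mask_verb(tokens, verb_index)
        if predict_fn(masked, verb_index) == target:
            correct += 1
    return correct / len(examples)
```

Ablating the visual tokens of the verb's subject, as the entry describes, would correspond to calling the same scoring loop with a `predict_fn` that withholds those region features.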
2401.16468
Report |
InstructIR: High-Quality Image Restoration Following Human Instructions |
Marcos V. Conde, Gregor Geigle, Radu Timofte |
Image restoration is a fundamental problem that involves recovering a
high-quality clean image from its degraded observation. All-In-One image
restoration models can effectively restore images from various types and levels
of degradation using degradation-specific information as prompts to guide the
restoration model. In this work, we present the first approach that uses
human-written instructions to guide the image restoration model. Given natural
language prompts, our model can recover high-quality images from their degraded
counterparts, considering multiple degradation types. Our method, InstructIR,
achieves state-of-the-art results on several restoration tasks including image
denoising, deraining, deblurring, dehazing, and (low-light) image enhancement.
InstructIR improves +1dB over previous all-in-one restoration methods.
Moreover, our dataset and results represent a novel benchmark for new research
on text-guided image restoration and enhancement. Our code, datasets and models
are available at: https://github.com/mv-lab/InstructIR |
This paper introduces InstructIR, the first text-guided deep learning model for blind image restoration using human-written instructions to guide restoration. |
Existing all-in-one restoration models, while effective, rely on image-based degradation classification and don't leverage users' understanding of what needs fixing. This work leverages the potential of text guidance for improved image restoration. |
The authors train InstructIR on a dataset of over 10,000 GPT-4-generated prompts paired with degraded/clean images. A text encoder (sentence transformer) maps instructions to embeddings, guiding a NAFNet-based image model enhanced with a novel Instruction Condition Block (ICB) for task-specific feature adaptation. |
InstructIR achieves state-of-the-art results on five restoration tasks, outperforming previous all-in-one models by +1dB PSNR.
The model generalizes well to various human-written instructions, demonstrating robustness to different language styles and levels of detail.
The integration of instructions allows for selective restoration, enabling users to target specific degradations in an image. |
InstructIR's current implementation might not achieve the same level of perceptual quality as diffusion-based restoration models.
The model struggles with images containing multiple real-world degradations and is limited to in-distribution degradation types. |
image restoration, text-guided image editing, instruction following, blind image restoration, all-in-one restoration |
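To put the +1dB PSNR figure from the entry above in perspective, the following sketch computes PSNR from a mean squared error and the MSE ratio implied by a given PSNR gain. The function names and the assumption that pixel values lie in [0, 1] are illustrative choices, not details from the paper.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10 ** (-gain_db / 10.0)
```

Since PSNR is logarithmic in the error, a 1 dB gain corresponds to multiplying the MSE by `mse_ratio_for_gain(1.0)`, roughly a 20% reduction.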
2401.16456
Report |
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design |
Seokju Yun, Youngmin Ro |
Recently, efficient Vision Transformers have shown great performance with low
latency on resource-constrained devices. Conventionally, they use 4x4 patch
embeddings and a 4-stage structure at the macro level, while utilizing
sophisticated attention with multi-head configuration at the micro level. This
paper aims to address computational redundancy at all design levels in a
memory-efficient manner. We discover that using larger-stride patchify stem not
only reduces memory access costs but also achieves competitive performance by
leveraging token representations with reduced spatial redundancy from the early
stages. Furthermore, our preliminary analyses suggest that attention layers in
the early stages can be substituted with convolutions, and several attention
heads in the latter stages are computationally redundant. To handle this, we
introduce a single-head attention module that inherently prevents head
redundancy and simultaneously boosts accuracy by parallelly combining global
and local information. Building upon our solutions, we introduce SHViT, a
Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy
tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x
faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device,
respectively, while being 1.3% more accurate. For object detection and instance
segmentation on MS COCO using Mask-RCNN head, our model achieves performance
comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone
latency on GPU and mobile device, respectively. |
This paper proposes SHViT, a Single-Head Vision Transformer, that achieves state-of-the-art speed-accuracy tradeoff by addressing computational redundancy in both macro and micro architectural design. |
Efficient Vision Transformers are crucial for resource-constrained devices, and existing methods suffer from redundancies in architectural design, limiting their efficiency. |
The authors analyze spatial and channel redundancy in ViT architectures. They propose a larger-stride patchify stem and a novel Single-Head Self-Attention (SHSA) module to mitigate these redundancies. |
SHViT achieves state-of-the-art speed and accuracy on ImageNet-1k classification, outperforming models like EfficientNet-B0 and MobileViTv2.
It demonstrates superior performance on object detection and instance segmentation tasks using RetinaNet and Mask-RCNN, surpassing EfficientViT and PoolFormer.
SHViT exhibits consistent performance across diverse platforms, including GPUs, CPUs, and mobile devices, with notable speed improvements in ONNX runtime. |
While effective, the architecture's macro design might limit its ability to capture fine-grained features.
Future work includes exploring cost-effective ways to utilize fine-grained features and integrating the single-head design into existing sophisticated attention mechanisms. |
vision transformer, efficient architecture, single-head attention, resource-constrained devices, computer vision |
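The effect of a larger-stride patchify stem, as described in the SHViT entry above, can be seen from simple token-count arithmetic. The sketch below assumes a 224 x 224 input and compares the conventional stride of 4 with a stride of 16; both the input size and the stride of 16 are assumptions chosen for illustration, since the summary only says "larger-stride".

```python
def num_tokens(image_size, stride):
    """Number of tokens produced by a non-overlapping patchify stem."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return side * side

conventional = num_tokens(224, 4)   # 56 x 56 grid
larger = num_tokens(224, 16)        # 14 x 14 grid
token_ratio = conventional // larger
# Self-attention cost grows with the square of the token count.
attention_pair_ratio = (conventional ** 2) // (larger ** 2)
```

Under these assumptions the first stage processes 16 times fewer tokens, and any global attention over them would touch 256 times fewer token pairs, which is the kind of memory-access saving the entry refers to.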
2401.16420
Report |
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model |
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang |
We introduce InternLM-XComposer2, a cutting-edge vision-language model
excelling in free-form text-image composition and comprehension. This model
goes beyond conventional vision-language understanding, adeptly crafting
interleaved text-image content from diverse inputs like outlines, detailed
textual specifications, and reference images, enabling highly customizable
content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach
that applies additional LoRA parameters exclusively to image tokens to preserve
the integrity of pre-trained language knowledge, striking a balance between
precise vision understanding and text composition with literary talent.
Experimental results demonstrate the superiority of InternLM-XComposer2 based
on InternLM2-7B in producing high-quality long-text multi-modal content and its
exceptional vision-language understanding performance across various
benchmarks, where it not only significantly outperforms existing multimodal
models but also matches or even surpasses GPT-4V and Gemini Pro in certain
assessments. This highlights its remarkable proficiency in the realm of
multimodal understanding. The InternLM-XComposer2 model series with 7B
parameters are publicly available at
https://github.com/InternLM/InternLM-XComposer. |
InternLM-XComposer2 is a cutting-edge vision-language model excelling in free-form text-image composition and comprehension, surpassing its predecessor and even competing with GPT-4V and Gemini Pro. |
This model enables highly customizable content creation by crafting interleaved text-image content from diverse inputs like outlines, textual specifications, and reference images. |
The model leverages a Partial LoRA (PLoRA) approach, applying additional LoRA parameters solely to image tokens, and benefits from a high-quality and diverse dataset for training. |
InternLM-XComposer2 based on InternLM2-7B outperforms existing open-source MLLMs by a significant margin.
It matches or surpasses GPT-4V and Gemini Pro in several benchmarks.
It excels in creating high-quality long-text multimodal content and exhibits exceptional vision-language understanding. |
The model's performance on college-level benchmarks, while impressive, still has room for improvement.
Future work could explore the impact of higher-resolution image inputs on text-image composition tasks. |
vision-language model, multimodal understanding, text-image composition, large language model, partial lora |
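The Partial LoRA idea in the entry above, where the low-rank update is applied only to image tokens, can be sketched as a linear layer with a per-token switch. This is a minimal pure-Python illustration under assumed shapes; `plora_linear`, `matvec`, and the matrix names `W`, `A`, `B` are hypothetical and do not come from the InternLM-XComposer2 codebase.

```python
def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def plora_linear(tokens, is_image, W, A, B):
    """Linear layer with a LoRA branch applied to image tokens only.

    tokens: list of d_in vectors; is_image: list of bools, one per token;
    W: frozen d_out x d_in weight; A: r x d_in and B: d_out x r low-rank factors.
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(W, x)  # shared pretrained path for every token
        if img:
            delta = matvec(B, matvec(A, x))  # low-rank update, image tokens only
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs
```

Text tokens pass through the frozen weight unchanged, which is how the approach aims to preserve pretrained language knowledge while adapting the visual pathway.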
2401.16157
Report |
Spatial-Aware Latent Initialization for Controllable Image Generation |
Wenqiang Sun, Teng Li, Zehong Lin, Jun Zhang |
Recently, text-to-image diffusion models have demonstrated impressive ability
to generate high-quality images conditioned on the textual input. However,
these models struggle to accurately adhere to textual instructions regarding
spatial layout information. While previous research has primarily focused on
aligning cross-attention maps with layout conditions, they overlook the impact
of the initialization noise on the layout guidance. To achieve better layout
control, we propose leveraging a spatial-aware initialization noise during the
denoising process. Specifically, we find that the inverted reference image with
finite inversion steps contains valuable spatial awareness regarding the
object's position, resulting in similar layouts in the generated images. Based
on this observation, we develop an open-vocabulary framework to customize a
spatial-aware initialization noise for each layout condition. Without modifying
other modules except the initialization noise, our approach can be seamlessly
integrated as a plug-and-play module within other training-free layout guidance
frameworks. We evaluate our approach quantitatively and qualitatively on the
available Stable Diffusion model and COCO dataset. Equipped with the
spatial-aware latent initialization, our method significantly improves the
effectiveness of layout guidance while preserving high-quality content. |
This paper introduces a novel approach for enhancing layout control in text-to-image generation by leveraging a spatial-aware initialization noise during the denoising process of diffusion models. |
Existing text-to-image diffusion models struggle to accurately adhere to textual instructions regarding spatial layout, limiting their ability to generate images that precisely match user specifications. |
The method utilizes the DDIM inversion latent, which retains spatial information from a reference image, as the initialization noise for the image generation process. This spatial-aware latent guides the model to generate objects at the desired positions. An additional attention guidance process further refines the layout during sampling. |
The proposed method significantly improves layout accuracy as measured by IoU and mAP@0.5, outperforming state-of-the-art zero-shot layout guidance methods.
It maintains competitive image quality as assessed by CLIP score.
The approach is efficient, achieving better layout control in fewer optimization steps compared to previous methods. |
The method might experience challenges in maintaining prompt alignment due to the focus on spatial guidance, potentially leading to a slight decrease in CLIP score.
The choice of background significantly influences generation quality, with pure white backgrounds posing challenges. |
text-to-image generation, diffusion models, layout control, ddim inversion, spatial-aware latent |
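The entry above reports layout accuracy via IoU between the requested box and the box of the generated object. A minimal sketch of box IoU follows; the `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions for illustration, and the detector that would produce the generated object's box is not shown.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A metric such as mAP@0.5 then counts a generated object as correctly placed when its IoU with the target box is at least 0.5.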
2401.16144
Report |
Divide and Conquer: Rethinking the Training Paradigm of Neural Radiance Fields |
Rongkai Ma, Leo Lebrat, Rodrigo Santa Cruz, Gil Avraham, Yan Zuo, Clinton Fookes, Olivier Salvado |
Neural radiance fields (NeRFs) have exhibited potential in synthesizing
high-fidelity views of 3D scenes but the standard training paradigm of NeRF
presupposes an equal importance for each image in the training set. This
assumption poses a significant challenge for rendering specific views
presenting intricate geometries, thereby resulting in suboptimal performance.
In this paper, we take a closer look at the implications of the current
training paradigm and redesign it for superior rendering quality by
NeRFs. Dividing input views into multiple groups based on their visual
similarities and training individual models on each of these groups enables
each model to specialize on specific regions without sacrificing speed or
efficiency. Subsequently, the knowledge of these specialized models is
aggregated into a single entity via a teacher-student distillation paradigm,
enabling spatial efficiency for online rendering. Empirically, we evaluate our
novel training framework on two publicly available datasets, namely NeRF
synthetic and Tanks&Temples. Our evaluation demonstrates that our DaC training
pipeline enhances the rendering quality of a state-of-the-art baseline model
while exhibiting convergence to a superior minimum. |
This paper introduces DaC, a novel training pipeline for Neural Radiance Fields (NeRFs) that leverages a divide and conquer strategy to improve rendering quality, especially for scenes with intricate geometries. |
Standard NeRF training treats all views equally, limiting the rendering quality for complex scenes. DaC aims to overcome this limitation by enabling specialized learning of different scene regions. |
DaC divides input views into groups based on visual similarity and trains expert NeRF models on each group. Subsequently, it aggregates the knowledge from these experts into a single model via teacher-student distillation for efficient rendering. |
DaC consistently outperforms the standard NeRF training pipeline on both synthetic and real-world benchmark datasets.
Dividing scenes into 4 partitions strikes a good balance between performance and efficiency.
A balanced number of iterations for distillation and fine-tuning stages yields optimal results. |
The current implementation primarily focuses on static scenes and might require adaptations for dynamic scenarios.
Future work will explore extending DaC to dynamic scenes and continual learning setups. |
neural radiance fields, nerf, novel view synthesis, divide and conquer, knowledge distillation |
2401.16087
Report |
High Resolution Image Quality Database |
Huang Huang, Qiang Wan, Jari Korhonen |
With technology for digital photography and high resolution displays rapidly
evolving and gaining popularity, there is a growing demand for blind image
quality assessment (BIQA) models for high resolution images. Unfortunately, the
publicly available large scale image quality databases used for training BIQA
models contain mostly low or general resolution images. Since image resizing
affects image quality, we assume that the accuracy of BIQA models trained on
low resolution images would not be optimal for high resolution images.
Therefore, we created a new high resolution image quality database (HRIQ),
consisting of 1120 images with resolution of 2880x2160 pixels. We conducted a
subjective study to collect the subjective quality ratings for HRIQ in a
controlled laboratory setting, resulting in accurate MOS at high resolution. To
demonstrate the importance of a high resolution image quality database for
training BIQA models to predict mean opinion scores (MOS) of high resolution
images accurately, we trained and tested several traditional and deep learning
based BIQA methods on different resolution versions of our database. The
database is publicly available at https://github.com/jarikorhonen/hriq. |
This paper introduces HRIQ, a new high-resolution image quality database containing 1120 images with authentic distortions, rated by 175 users in a controlled lab environment. |
Existing large-scale image quality databases primarily contain low-resolution images, limiting the development of BIQA models for high-resolution displays where subtle distortions are more perceptible. |
Researchers collected high-resolution images, conducted a subjective quality assessment study in a lab setting, analyzed data for outliers, and evaluated various traditional and deep learning-based BIQA models on different resolution versions of the database. |
Traditional BIQA methods perform poorly on HRIQ across all resolutions.
Deep learning BIQA models show better performance, but their accuracy varies with resolution.
The proposed HR-BIQA, specifically designed for high-resolution images, achieves state-of-the-art performance on the full-resolution database. |
Limited diversity in test user demographics (primarily college students).
HR-BIQA, while effective for high-resolution, exhibits lower performance on low-resolution images due to its patch-based approach.
Future work can explore alternative BIQA architectures optimized for both high and low-resolution images. |
image quality assessment, high resolution images, image database, subjective quality assessment, biqa |
2401.15977
Report |
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling |
Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li |
We introduce Motion-I2V, a novel framework for consistent and controllable
image-to-video generation (I2V). In contrast to previous methods that directly
learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into
two stages with explicit motion modeling. For the first stage, we propose a
diffusion-based motion field predictor, which focuses on deducing the
trajectories of the reference image's pixels. For the second stage, we propose
motion-augmented temporal attention to enhance the limited 1-D temporal
attention in video latent diffusion models. This module can effectively
propagate the reference image's features to synthesized frames with the guidance of
predicted trajectories from the first stage. Compared with existing methods,
Motion-I2V can generate more consistent videos even in the presence of large
motion and viewpoint variation. By training a sparse trajectory ControlNet for
the first stage, Motion-I2V lets users precisely control motion
trajectories and motion regions with sparse trajectory and region annotations.
This offers more controllability of the I2V process than solely relying on
textual instructions. Additionally, Motion-I2V's second stage naturally
supports zero-shot video-to-video translation. Both qualitative and
quantitative comparisons demonstrate the advantages of Motion-I2V over prior
approaches in consistent and controllable image-to-video generation. Please see
our project page at https://xiaoyushi97.github.io/Motion-I2V/. |
Presents Motion-I2V, a novel two-stage framework for consistent and controllable image-to-video generation with explicit motion modeling. |
Existing I2V methods struggle to maintain temporal consistency and offer limited controllability. Motion-I2V addresses these limitations. |
A diffusion-based motion field predictor (stage 1) deduces pixel trajectories. A motion-augmented temporal attention mechanism (stage 2) enhances video generation using predicted motions. |
Generates temporally consistent videos even with large motions, outperforming state-of-the-art methods.
Offers fine-grained control over motion using sparse trajectory guidance and region-specific animation.
Enables zero-shot video-to-video translation by leveraging motion from source videos. |
Generated videos exhibit medium brightness due to limitations in noise scheduling.
Future work includes exploring improved noise scheduling and further enhancing controllability. |
image-to-video generation, diffusion models, motion modeling, controllable generation, video-to-video translation |
2401.15975
Report |
StableIdentity: Inserting Anybody into Anywhere at First Sight |
Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu |
Recent advances in large pretrained text-to-image models have shown
unprecedented capabilities for high-quality human-centric generation; however,
customizing face identity is still an intractable problem. Existing methods
cannot ensure stable identity preservation and flexible editability, even with
several images for each subject during training. In this work, we propose
StableIdentity, which allows identity-consistent recontextualization with just
one face image. More specifically, we employ a face encoder with an identity
prior to encode the input face, and then land the face representation into a
space with an editable prior, which is constructed from celeb names. By
incorporating identity prior and editability prior, the learned identity can be
injected anywhere with various contexts. In addition, we design a masked
two-phase diffusion loss to boost the pixel-level perception of the input face
and maintain the diversity of generation. Extensive experiments demonstrate our
method outperforms previous customization methods. In addition, the learned
identity can be flexibly combined with the off-the-shelf modules such as
ControlNet. Notably, to the best of our knowledge, we are the first to directly inject
the identity learned from a single image into video/3D generation without
finetuning. We believe that the proposed StableIdentity is an important step to
unify image, video, and 3D customized generation models. |
This paper presents StableIdentity, a novel framework that allows for identity-consistent customization of human subjects in text-to-image generation using only a single face image. |
Existing methods for customizing the identity of human subjects in generated images struggle with maintaining consistent identity and flexibility across different contexts, especially when trained on limited data. |
StableIdentity leverages an encoder pretrained on face recognition for identity prior, and constructs an editable embedding space from celebrity names for editability prior. It also employs a masked two-phase diffusion loss to enhance identity preservation and generation diversity. |
StableIdentity outperforms state-of-the-art methods in identity preservation, text-image consistency, and generation quality.
The learned identity can be seamlessly integrated with other image manipulation modules like ControlNet.
StableIdentity demonstrates impressive generalization ability by successfully injecting learned identities into video and 3D generation models without finetuning. |
The method inherits limitations of the base Stable Diffusion model, such as potential hand anomalies.
The performance of video customization is limited by the capabilities of current text-to-video generation models. |
text-to-image generation, diffusion models, identity customization, one-shot learning, video and 3d generation |
2401.15947
Report |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models |
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Munan Ning, Li Yuan |
Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs)
effectively improves downstream task performances. However, existing scaling
methods enable all model parameters to be active for each token in the
calculation, which brings massive training and inference costs. In this work,
we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This
strategy innovatively addresses the common issue of performance degradation in
multi-modal sparsity learning, consequently constructing a sparse model with an
outrageous number of parameters but a constant computational cost. Furthermore,
we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely
activates only the top-k experts through routers during deployment, keeping the
remaining experts inactive. Extensive experiments show the significant
performance of MoE-LLaVA in a variety of visual understanding and object
hallucination benchmarks. Remarkably, with only approximately 3B sparsely
activated parameters, MoE-LLaVA demonstrates performance comparable to the
LLaVA-1.5-7B on various visual understanding datasets and even surpasses the
LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to
establish a baseline for sparse LVLMs and provide valuable insights for future
research in developing more efficient and effective multi-modal learning
systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA. |
This paper proposes MoE-LLaVA, a novel sparse Large Vision-Language Model (LVLM) architecture based on Mixture of Experts (MoE), and MoE-Tuning, a three-stage training strategy to address the performance degradation issue in multi-modal sparsity learning. |
Scaling LVLMs improves performance but incurs high computational costs. MoE-LLaVA aims to achieve comparable performance with significantly fewer activated parameters, thus reducing computational overhead. |
MoE-LLaVA uses a three-stage training approach: 1) MLP training for visual token adaptation, 2) LLM training for multi-modal understanding, and 3) MoE layer training with FFN-initialized experts. This facilitates gradual transition to a sparse model. During inference, only the top-k experts are activated by a router. |
MoE-LLaVA achieves comparable performance to state-of-the-art LVLMs on visual understanding benchmarks with only ~3B sparsely activated parameters, significantly fewer than dense models.
It outperforms LLaVA-1.5-13B on object hallucination benchmark (POPE) with only 2.2B activated parameters.
Analysis reveals that MoE-LLaVA learns specific patterns in expert activation and modality preferences, demonstrating effective sparse multi-modal learning. |
Training stability, particularly with 16-bit precision, poses a challenge.
Limited multi-modal instruction tuning data hinders exploration of larger MoE-LLaVA models (e.g., 10B+ parameters). |
large vision-language models, mixture of experts, sparse models, multi-modal learning, object hallucination |
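The top-k expert routing described in the MoE-LLaVA entry above can be sketched as follows: a router scores every expert, only the k highest-scoring experts run, and their outputs are mixed by router weight. This is a generic sparse MoE layer, not the released implementation. The router logits are passed in directly instead of being computed from the token, the weights are renormalized over the selected experts, and the experts are assumed to preserve the input dimension; all three are simplifying assumptions, as is the name `moe_forward`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Run only the top-k experts for one token and mix their outputs.

    x: token vector (list of floats); router_logits: one score per expert
    (in a real layer these come from a learned linear router applied to x);
    experts: list of callables mapping a vector to a vector of the same size.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over selected experts (assumption)
    out = [0.0] * len(x)
    for i in top:
        weight = probs[i] / norm
        y = experts[i](x)  # inactive experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, top
```

Because only k experts are evaluated per token, compute stays roughly constant as the number of experts (and hence total parameters) grows, which is the trade-off the entry highlights.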
2401.15914
Report |
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization |
Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang |
Existing vision-language models exhibit strong generalization on a variety of
visual domains and tasks. However, such models mainly perform zero-shot
recognition in a closed-set manner, and thus struggle to handle open-domain
visual concepts by design. There are recent finetuning methods, such as prompt
learning, that not only study the discrimination between in-distribution (ID)
and out-of-distribution (OOD) samples, but also show some improvements in both
ID and OOD accuracies. In this paper, we first demonstrate that vision-language
models, after long enough finetuning but without proper regularization, tend to
overfit the known classes in the given dataset, with degraded performance on
unknown classes. Then we propose a novel approach OGEN to address this pitfall,
with the main focus on improving the OOD GENeralization of finetuned models.
Specifically, a class-conditional feature generator is introduced to synthesize
OOD features using just the class name of any unknown class. Such synthesized
features will provide useful knowledge about unknowns and help regularize the
decision boundary between ID and OOD data when optimized jointly. Equally
important is our adaptive self-distillation mechanism to regularize our feature
generation model during joint optimization, i.e., adaptively transferring
knowledge between model states to further prevent overfitting. Experiments
validate that our method yields convincing gains in OOD generalization
performance in different settings. Code: https://github.com/apple/ml-ogen. |
This paper addresses the overfitting issue in finetuned vision-language models for improved out-of-distribution (OOD) generalization. |
Existing vision-language models, while demonstrating strong generalization capabilities, often overfit to known classes during finetuning, hindering their performance on novel, unseen classes, which is crucial for real-world applications and safety. |
The paper proposes OGEN, a novel approach that: 1) Introduces a class-conditional feature generator to synthesize image features for unknown classes based solely on their names, leveraging the aligned image-text feature spaces of models like CLIP. 2) Employs an adaptive self-distillation mechanism during training, utilizing past model checkpoints as teachers to guide the current model and prevent overfitting on known classes while improving generalization to unknown ones. |
OGEN consistently improves new class accuracy across various prompt learning baselines, significantly boosting performance on datasets with substantial inter-class variations.
The approach maintains or enhances base class accuracy, demonstrating its ability to balance performance on both known and unknown classes.
Ablation studies validate the contribution of both the feature generator and the adaptive self-distillation mechanism to OGEN’s effectiveness. |
The paper primarily focuses on prompt learning methods for finetuning; future work could explore its applicability to other finetuning techniques like adaptor tuning.
While OGEN shows promise in improving OOD generalization, exploring its capabilities in quantifying uncertainty and evaluating on established OOD detection benchmarks is a potential future direction. |
out-of-distribution generalization, vision-language models, prompt learning, feature synthesis, self-distillation |
2401.15885
Report |
Rectify the Regression Bias in Long-Tailed Object Detection |
Ke Zhu, Minghao Fu, Jie Shao, Tianyu Liu, Jianxin Wu |
Long-tailed object detection faces great challenges because of its extremely
imbalanced class distribution. Recent methods mainly focus on the
classification bias and its loss function design, while ignoring the subtle
influence of the regression branch. This paper shows that the regression bias
exists and does adversely and seriously impact the detection accuracy. While
existing methods fail to handle the regression bias, the class-specific
regression head for rare classes is hypothesized to be the main cause of it in
this paper. As a result, three kinds of viable solutions to cater for the rare
categories are proposed, including adding a class-agnostic branch, clustering
heads and merging heads. The proposed methods bring consistent and
significant improvements over existing long-tailed detection methods,
especially in rare and common classes. The proposed method achieves
state-of-the-art performance in the large vocabulary LVIS dataset with
different backbones and architectures. It generalizes well to more difficult
evaluation metrics, relatively balanced datasets, and the mask branch. This is
the first attempt to reveal and explore rectifying of the regression bias in
long-tailed object detection. |
This paper reveals the detrimental impact of regression bias in long-tailed object detection and introduces three novel methods to mitigate it by enhancing regression for rare categories. |
Existing long-tailed object detection methods primarily address classification bias, neglecting the significant impact of the regression branch on detection accuracy, especially for rare categories. |
The authors leverage the observation that class-agnostic regression heads benefit rare categories and propose three solutions: 1) adding a class-agnostic branch alongside class-specific ones, 2) clustering similar regression heads based on object scale, and 3) merging heads of specific categories. |
Rectifying regression bias consistently improves performance across various existing long-tailed detection methods, particularly for rare categories.
The proposed method achieves state-of-the-art results on the LVIS dataset with different backbones and architectures, demonstrating its effectiveness.
The method generalizes well to various evaluation metrics, relatively balanced datasets (COCO, COCO-LT), and even to the mask prediction branch in instance segmentation. |
The performance improvement from mitigating regression bias is less pronounced when applied to stronger baselines, suggesting potential limitations in backbone model capacity.
Adapting the proposed regression methods to one-stage object detectors, which typically employ class-agnostic regression heads, requires further exploration. |
long-tailed learning, object detection, regression bias, class-agnostic regression, lvis dataset |
2401.15859
Report |
Diffusion Facial Forgery Detection |
Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli |
Detecting diffusion-generated images has recently grown into an emerging
research area. Existing diffusion-based datasets predominantly focus on general
image generation. However, facial forgeries, which pose a more severe social
risk, have remained less explored thus far. To address this gap, this paper
introduces DiFF, a comprehensive dataset dedicated to face-focused
diffusion-generated images. DiFF comprises over 500,000 images that are
synthesized using thirteen distinct generation methods under four conditions.
In particular, this dataset leverages 30,000 carefully collected textual and
visual prompts, ensuring the synthesis of images with both high fidelity and
semantic consistency. We conduct extensive experiments on the DiFF dataset via
a human test and several representative forgery detection methods. The results
demonstrate that the binary detection accuracy of both human observers and
automated detectors often falls below 30%, shedding light on the challenges in
detecting diffusion-generated facial forgeries. Furthermore, we propose an edge
graph regularization approach to effectively enhance the generalization
capability of existing detectors. |
This paper introduces DiFF, the first large-scale dataset for diffusion-based facial forgery detection, containing over 500,000 images synthesized using 13 methods under 4 conditions (Text-to-Image, Image-to-Image, Face Swapping, Face Editing). |
Existing diffusion-based datasets focus on general image generation and lack the scale and diversity needed to train robust facial forgery detectors. |
Researchers collected pristine celebrity images, generated diverse textual and visual prompts, and synthesized forgeries using various diffusion models while maintaining semantic consistency. |
Human observers and automated detectors struggle to identify diffusion-generated facial forgeries, often falling below 30% accuracy.
Detectors exhibit significant performance drops in cross-domain settings, highlighting the challenge of generalizing across forgery types.
The proposed Edge Graph Regularization (EGR) method, incorporating edge graphs into image processing, significantly improves detector generalizability, achieving up to 10% AUC improvement. |
DiFF currently focuses on facial forgeries, limiting its generalizability to other domains.
Future work includes expanding DiFF with more methods and conditions, and exploring new tasks like traceability and retrieval of diffusion-generated images. |
diffusion models, facial forgery detection, dataset, edge graph regularization, deepfakes |
2401.15841
Report |
2L3: Lifting Imperfect Generated 2D Images into Accurate 3D |
Yizheng Chen, Rengan Xie, Qi Ye, Sen Yang, Zixuan Xie, Tianxiao Chen, Rong Li, Yuchi Huo |
Reconstructing 3D objects from a single image is an intriguing but
challenging problem. One promising solution is to utilize multi-view (MV) 3D
reconstruction to fuse generated MV images into consistent 3D objects. However,
the generated images usually suffer from inconsistent lighting, misaligned
geometry, and sparse views, leading to poor reconstruction quality. To cope
with these problems, we present a novel 3D reconstruction framework that
leverages intrinsic decomposition guidance, transient-mono prior guidance, and
view augmentation to cope with the three issues, respectively. Specifically, we
first leverage intrinsic decomposition to decouple the shading information from the generated images to
reduce the impact of inconsistent lighting; then, we introduce mono prior with
view-dependent transient encoding to enhance the reconstructed normal; and
finally, we design a view augmentation fusion strategy that minimizes
pixel-level loss in generated sparse views and semantic loss in augmented
random views, resulting in view-consistent geometry and detailed textures. Our
approach, therefore, enables the integration of a pre-trained MV image
generator and a neural network-based volumetric signed distance function (SDF)
representation for a single image to 3D object reconstruction. We evaluate our
framework on various datasets and demonstrate its superior performance in both
quantitative and qualitative assessments, signifying a significant advancement
in 3D object reconstruction. Compared with the latest state-of-the-art method
SyncDreamer (Liu et al., 2023), we reduce the Chamfer Distance error by
about 36% and improve PSNR by about 30%. |
This paper introduces a novel multi-view 3D reconstruction method specifically designed for imperfect, 'dreamed' images generated by off-the-shelf models. |
Existing 3D reconstruction methods struggle with the inconsistencies (lighting, geometry, view sparsity) present in images generated by current multi-view generation models. This work aims to bridge this gap and enable high-quality 3D reconstruction from such imperfect data. |
The framework employs a two-stage reconstruction process. Stage 1 reconstructs geometry and albedo using monocular normal priors, per-frame normal encoding, intrinsic decomposition guidance, and view augmentation. Stage 2 reconstructs shaded texture using per-frame color encoding and the geometry from Stage 1. |
Significantly improved 3D reconstruction quality (up to 36% lower CD error and 30% higher PSNR) compared to using basic NeuS reconstruction with state-of-the-art generation models.
Effective handling of inconsistent lighting, misaligned geometry, and view sparsity issues common in generated images.
Generalizability and robustness demonstrated through successful application to various multi-view generation models and out-of-domain images. |
Reliance on pre-trained models for normal estimation and decomposition, which might introduce limitations depending on their performance.
Further research on reducing reliance on pre-trained models and exploring end-to-end training for improved performance. |
3d reconstruction, multi-view synthesis, neural rendering, image generation, intrinsic image decomposition |
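The two metrics quoted in the 2L3 record above, Chamfer Distance (CD) and PSNR, can be sketched as follows. Conventions vary between papers (squared versus unsquared distances, sum versus mean), so the exact variant here is an assumption and may differ from the one used in the evaluation. The inputs are plain Python lists for illustration.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets (lists of 3D tuples).

    Assumption: mean of unsquared nearest-neighbour distances in each direction.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def psnr(image_a, image_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((x - y) ** 2 for x, y in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Lower CD means the reconstructed geometry is closer to the reference, while higher PSNR means the rendered images are closer to the reference views.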
2401.15708
Report |
Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding |
Jianxiang Lu, Cong Xie, Hui Guo |
As large-scale text-to-image generation models have made remarkable progress
in the field of text-to-image generation, many fine-tuning methods have been
proposed. However, these models often struggle with novel objects, especially
with one-shot scenarios. Our proposed method aims to address the challenges of
generalizability and fidelity in an object-driven way, using only a single
input image and the object-specific regions of interest. To improve
generalizability and mitigate overfitting, in our paradigm, a prototypical
embedding is initialized based on the object's appearance and its class, before
fine-tuning the diffusion model. And during fine-tuning, we propose a
class-characterizing regularization to preserve prior knowledge of object
classes. To further improve fidelity, we introduce an object-specific loss, which
can also be used to implant multiple objects. Overall, our proposed object-driven
method for implanting new objects can integrate seamlessly with existing
concepts as well as with high fidelity and generalization. Our method
outperforms several existing works. The code will be released. |
This paper presents a novel object-driven one-shot fine-tuning method for text-to-image diffusion models using prototypical embedding, aiming to improve generalizability and fidelity in generating images of user-specified objects. |
Existing fine-tuning methods struggle with novel objects in one-shot scenarios, often leading to overfitting or low fidelity in generated images. This method addresses these challenges by enabling the accurate implantation of user-specified objects into a generative model using only a single image while maintaining the model's generalization ability. |
The method utilizes prototypical embedding initialized based on the object's appearance and class to improve generalizability. It employs class-characterizing regularization during fine-tuning to preserve prior knowledge of object classes. Additionally, it introduces an object-specific loss function supervised by the object in the input image to enhance fidelity. |
The method effectively mitigates overfitting and enables the generation of images that accurately reflect the user-specified object.
It preserves the prior knowledge of object classes, leading to improved diversity and naturalness in the synthesized images.
The object-specific loss function enhances fidelity by focusing on the object region during training and supports the implantation of multiple objects. |
The method may exhibit limitations in handling objects with complex edges, leading to potential degradation in the quality of generated image edges.
Fidelity might be slightly compromised when implanting smaller objects. Future work will focus on improving mask acquisition and incorporating a multi-scale perception mechanism. |
object-driven, one-shot, diffusion model, prototypical embedding, text-to-image synthesis |
2401.15688
Report |
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation |
Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li |
Despite significant advancements in text-to-image models for generating
high-quality images, these methods still struggle to ensure the controllability
of text prompts over images in the context of complex text prompts, especially
when it comes to retaining object attributes and relationships. In this paper,
we propose CompAgent, a training-free approach for compositional text-to-image
generation, with a large language model (LLM) agent as its core. The
fundamental idea underlying CompAgent is premised on a divide-and-conquer
methodology. Given a complex text prompt containing multiple concepts including
objects, attributes, and relationships, the LLM agent initially decomposes it,
which entails the extraction of individual objects, their associated
attributes, and the prediction of a coherent scene layout. These individual
objects can then be independently conquered. Subsequently, the agent performs
reasoning by analyzing the text, plans and employs the tools to compose these
isolated objects. The verification and human feedback mechanism is finally
incorporated into our agent to further correct the potential attribute errors
and refine the generated images. Guided by the LLM agent, we propose a
tuning-free multi-concept customization model and a layout-to-image generation
model as the tools for concept composition, and a local image editing method as
the tool to interact with the agent for verification. The scene layout controls
the image generation process among these tools to prevent confusion among
multiple objects. Extensive experiments demonstrate the superiority of our
approach for compositional text-to-image generation: CompAgent achieves more
than 10% improvement on T2I-CompBench, a comprehensive benchmark for
open-world compositional T2I generation. The extension to various related tasks
also illustrates the flexibility of our CompAgent for potential applications. |
This paper proposes CompAgent, a training-free approach for compositional text-to-image generation using an LLM agent for divide-and-conquer image synthesis based on complex text prompts. |
Existing text-to-image models struggle to accurately represent object attributes and relationships within complex scenes, limiting their controllability. |
An LLM agent decomposes complex text prompts into individual objects and scene layouts, then leverages a toolkit including multi-concept customization, layout-to-image generation, and local image editing tools to compose the final image. A verification and feedback mechanism further enhances accuracy. |
CompAgent shows significant improvement in compositional text-to-image generation, achieving over 10% improvement on the T2I-CompBench benchmark.
The LLM agent effectively plans and selects appropriate tools based on text prompt analysis, improving object attribute binding and relationship representation.
The method exhibits flexibility for extension to tasks like multi-concept customization, image editing, and object placement. |
The reliance on multiple tools and models could increase computational cost.
Further exploration of LLM agents with enhanced reasoning and planning capabilities could lead to improved performance. |
text-to-image generation, compositional generation, llm agent, image editing, layout-to-image |
2401.15687
Report |
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance |
Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu |
The synthesis of 3D facial animations from speech has garnered considerable
attention. Due to the scarcity of high-quality 4D facial data and
well-annotated abundant multi-modality labels, previous methods often suffer
from limited realism and a lack of flexible conditioning. We address this
challenge through a trilogy. We first introduce Generalized Neural Parametric
Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial
geometry and images to a highly generalized expression latent space, decoupling
expressions and identities. Then, we utilize GNPFA to extract high-quality
expressions and accurate head poses from a large array of videos. This presents
the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial
animation dataset with well-annotated emotional and style labels. Finally, we
propose Media2Face, a diffusion model in GNPFA latent space for co-speech
facial animation generation, accepting rich multi-modality guidances from
audio, text, and image. Extensive experiments demonstrate that our model not
only achieves high fidelity in facial animation synthesis but also broadens the
scope of expressiveness and style adaptability in 3D facial animation. |
This paper proposes Media2Face, a diffusion-based model that generates realistic and expressive 3D facial animations from diverse media inputs, including audio, text, and images. |
Existing methods for synthesizing 3D facial animations from speech often lack realism and flexible conditioning due to limited training data and control mechanisms. This work aims to overcome these limitations and generate more compelling and controllable animations. |
The authors introduce a new neural representation called GNPFA to capture fine-grained facial expressions and head poses. They use GNPFA to build M2F-D, a large and diverse 4D facial animation dataset. Then, they train Media2Face, a latent diffusion model, on M2F-D to generate animations conditioned on audio, text, and image inputs using a multi-classifier-free guidance approach. |
Media2Face achieves state-of-the-art performance in lip synchronization accuracy, facial expression stylization, and rhythmic head movement synthesis.
The model allows for keyframe editing and CLIP-guided style editing, enabling fine-grained control over the generated animations.
User studies confirm that Media2Face generates more realistic and expressive animations than existing methods. |
The current implementation of real-time generation is limited to 30fps.
The model might struggle with generating animations for unseen languages or highly exaggerated expressions. |
facial animation, diffusion models, speech synthesis, multi-modal learning, computer graphics |
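The Media2Face record above mentions a multi-classifier-free guidance approach for combining audio, text, and image conditions. One common way to combine several conditions in classifier-free guidance is to add a separately scaled correction per condition to the unconditional prediction. The sketch below shows that generic formulation; whether Media2Face uses exactly this form is an assumption, and the function name and scale values are hypothetical.

```python
def multi_condition_cfg(eps_uncond, eps_conds, scales):
    """Combine noise predictions from several conditions.

    guided = eps_uncond + sum_c scale_c * (eps_c - eps_uncond)

    eps_uncond: unconditional prediction (list of floats)
    eps_conds:  one prediction per condition, same length as eps_uncond
    scales:     one guidance scale per condition
    """
    guided = list(eps_uncond)
    for eps_c, scale in zip(eps_conds, scales):
        for i, (cond, uncond) in enumerate(zip(eps_c, eps_uncond)):
            guided[i] += scale * (cond - uncond)
    return guided


# Toy two-dimensional predictions for two hypothetical conditions.
guided = multi_condition_cfg(
    eps_uncond=[0.0, 1.0],
    eps_conds=[[1.0, 1.0], [0.0, 3.0]],
    scales=[2.0, 0.5],
)
```

Setting a scale to zero removes that condition's influence, which is what makes per-condition control possible at inference time.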
2401.15652
Report |
Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach |
Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, Junchi Yan |
Image outpainting aims to generate the content of an input sub-image beyond
its original boundaries. It is an important task in content generation yet
remains an open problem for generative models. This paper pushes the technical
frontier of image outpainting in two directions that have not been resolved in
literature: 1) outpainting with arbitrary and continuous multiples (without
restriction), and 2) outpainting in a single step (even for large expansion
multiples). Moreover, we develop a method that does not depend on a pre-trained
backbone network, which is in contrast commonly required by the previous SOTA
outpainting methods. The arbitrary multiple outpainting is achieved by
utilizing randomly cropped views from the same image during training to capture
arbitrary relative positional information. Specifically, by feeding one view
and positional embeddings as queries, we can reconstruct another view. At
inference, we generate images with arbitrary expansion multiples by inputting
an anchor image and its corresponding positional embeddings. The one-step
outpainting ability here is particularly noteworthy in contrast to previous
methods that need to be performed for $N$ times to obtain a final multiple
which is $N$ times of its basic and fixed multiple. We evaluate the proposed
approach (called PQDiff as we adopt a diffusion-based generator as our
embodiment, under our proposed Positional Query scheme) on
public benchmarks, demonstrating its superior performance over state-of-the-art
approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the
Scenery (21.512), Building Facades (25.310), and WikiArts
(36.212) datasets. Furthermore, under the 2.25x, 5x and 11.7x
outpainting settings, PQDiff only takes 40.6%, 20.3% and
10.2% of the time of the benchmark state-of-the-art (SOTA) method. |
This paper proposes PQDiff, a novel image outpainting method that utilizes relative positional queries and a diffusion-based generator to achieve outpainting with arbitrary, continuous multiples in a single step. |
Image outpainting, while important for content generation, is limited by existing methods that require multiple steps for large expansions and lack flexibility in specifying expansion multiples. PQDiff addresses these limitations with improved efficiency and controllability. |
PQDiff leverages a positional query scheme, randomly cropping training images to create anchor and target views. This allows the model to learn arbitrary relative positional information and generate images with continuous expansion multiples in one step. |
PQDiff achieves state-of-the-art FID scores on Scenery, Building Facades, and WikiArts datasets for 11.7x outpainting.
Significantly faster generation speed compared to previous methods, requiring only 10.2% of the time for 11.7x outpainting.
Demonstrates the ability to outpaint at arbitrary positions within the image, not just surrounding regions. |
The performance of PQDiff can be influenced by the random crop ratio used during training.
Further exploration of integrating pre-trained models into the PQDiff framework for enhanced consistency is a potential avenue for future work. |
image outpainting, diffusion models, positional embeddings, generative models, content generation |
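The PQDiff record above describes training on randomly cropped anchor and target views so that the model sees arbitrary relative positions. A minimal sketch of that bookkeeping follows. The box format, the crop-ratio range, and the idea of expressing the target box in the anchor's normalized coordinates are assumptions for illustration, not the paper's exact positional embedding.

```python
import random


def random_crop_box(width, height, min_ratio, max_ratio, rng):
    """Sample a crop box (x0, y0, x1, y1) covering a random fraction of each side."""
    ratio = rng.uniform(min_ratio, max_ratio)
    w, h = width * ratio, height * ratio
    x0 = rng.uniform(0, width - w)
    y0 = rng.uniform(0, height - h)
    return (x0, y0, x0 + w, y0 + h)


def relative_box(anchor, target):
    """Express the target box in the anchor's normalized coordinates.

    The anchor itself maps to (0, 0, 1, 1); values outside [0, 1] mean the
    target extends beyond the anchor, i.e. the region to be outpainted.
    """
    ax0, ay0, ax1, ay1 = anchor
    aw, ah = ax1 - ax0, ay1 - ay0
    tx0, ty0, tx1, ty1 = target
    return ((tx0 - ax0) / aw, (ty0 - ay0) / ah,
            (tx1 - ax0) / aw, (ty1 - ay0) / ah)


# Training-style sampling: two random views of the same 256 x 256 image.
rng = random.Random(0)
anchor_view = random_crop_box(256, 256, 0.15, 0.5, rng)
target_view = random_crop_box(256, 256, 0.15, 0.5, rng)
query = relative_box(anchor_view, target_view)
```

At inference, the same relative coordinates could describe any target region around a given anchor image, which matches the record's point that the expansion multiple is not fixed in advance.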
2401.15636
Report |
FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models |
Feihong He, Gang Li, Mengyuan Zhang, Leilei Yan, Lingyu Si, Fanzhang Li |
The rapid development of generative diffusion models has significantly
advanced the field of style transfer. However, most current style transfer
methods based on diffusion models typically involve a slow iterative
optimization process, e.g., model fine-tuning and textual inversion of style
concept. In this paper, we introduce FreeStyle, an innovative style transfer
method built upon a pre-trained large diffusion model, requiring no further
optimization. Besides, our method enables style transfer only through a text
description of the desired style, eliminating the necessity of style images.
Specifically, we propose a dual-stream encoder and single-stream decoder
architecture, replacing the conventional U-Net in diffusion models. In the
dual-stream encoder, two distinct branches take the content image and style
text prompt as inputs, achieving content and style decoupling. In the decoder,
we further modulate features from the dual streams based on a given content
image and the corresponding style text prompt for precise style transfer. Our
experimental results demonstrate high-quality synthesis and fidelity of our
method across various content images and style text prompts. The code and more
results are available at our project
website: https://freestylefreelunch.github.io/. |
This paper introduces FreeStyle, a novel text-guided style transfer method that leverages pre-trained large text-guided diffusion models to perform style transfer without any optimization or the need for reference style images. |
Existing style transfer methods based on diffusion models rely on time-consuming optimization processes or require reference style images, limiting their practicality. FreeStyle addresses these limitations by directly utilizing the style generation capabilities of pre-trained diffusion models. |
FreeStyle employs a dual-stream encoder and a single-stream decoder architecture. The dual-stream encoder separately processes the content image and style text prompt, while the single-stream decoder modulates and fuses the extracted features for style transfer. |
FreeStyle generates high-quality stylized images with accurate style expression and content preservation across diverse content images and style text prompts.
Qualitative comparisons demonstrate that FreeStyle outperforms existing methods in terms of visual quality, artistic consistency, and robustness.
Quantitative evaluations using CLIP Score and human preference studies further validate FreeStyle's superiority over state-of-the-art methods. |
FreeStyle's performance is influenced by the quality and diversity of the pre-trained diffusion model used.
Fine-grained control over specific style elements within the image might require further exploration. |
style transfer, diffusion models, text-guided synthesis, training-free, feature modulation |
2401.15318
Report |
Gaussian Splashing: Dynamic Fluid Synthesis with Gaussian Splatting |
Yutao Feng, Xiang Feng, Yintong Shang, Ying Jiang, Chang Yu, Zeshun Zong, Tianjia Shao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, Yin Yang |
We demonstrate the feasibility of integrating physics-based animations of
solids and fluids with 3D Gaussian Splatting (3DGS) to create novel effects in
virtual scenes reconstructed using 3DGS. Leveraging the coherence of the
Gaussian splatting and position-based dynamics (PBD) in the underlying
representation, we manage rendering, view synthesis, and the dynamics of solids
and fluids in a cohesive manner. Similar to Gaussian shader, we enhance each
Gaussian kernel with an added normal, aligning the kernel's orientation with
the surface normal to refine the PBD simulation. This approach effectively
eliminates spiky noises that arise from rotational deformation in solids. It
also allows us to integrate physically based rendering to augment the dynamic
surface reflections on fluids. Consequently, our framework is capable of
realistically reproducing surface highlights on dynamic fluids and facilitating
interactions between scene objects and fluids from new views. For more
information, please visit our project page at
https://amysteriouscat.github.io/GaussianSplashing/. |
Gaussian Splashing (GSP) is a novel framework that integrates physics-based animation of fluids and solids with 3D Gaussian Splatting (3DGS) for creating dynamic effects in reconstructed 3D scenes. |
Existing NeRF/3DGS-based dynamic scene reconstruction methods lack the ability to realistically simulate and render fluid-solid interactions, limiting their applications. |
GSP combines position-based dynamics (PBD) with 3DGS. It uses Gaussian kernels for both scene representation and PBD discretization. The framework employs anisotropy loss to maintain rendering quality under large deformations and integrates a Gaussian shader for dynamic specular reflection. It also utilizes AI inpainting to fill missing textures caused by object displacement. |
GSP enables realistic two-way coupled fluid-solid interaction within 3DGS scenes.
It achieves high-quality rendering of dynamic fluids with specular highlights.
The framework allows for interactive scene editing, such as transforming objects into fluids. |
The current PBD-based simulation, while versatile, has limitations in physical accuracy and could be enhanced with more sophisticated meshless methods.
Fluid rendering, particularly the handling of refraction and the computational cost associated with a large number of fluid particles, requires further improvement. |
3d gaussian splatting, fluid simulation, position-based dynamics, dynamic scene reconstruction, novel view synthesis |
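The Gaussian Splashing record above builds on position-based dynamics (PBD). The generic PBD loop is: predict positions from velocities, project constraints directly on the predicted positions, then recover velocities from the position change. The sketch below shows that loop in 2D with a simple distance constraint. It is a swapped-in toy constraint for illustration; the paper's fluid and solid constraints and its Gaussian-kernel discretization are not reproduced here.

```python
import math


def pbd_step(positions, velocities, inv_masses, constraints, dt,
             gravity=(0.0, -9.8), iterations=10):
    """One PBD time step in 2D.

    constraints: list of (i, j, rest_length) distance constraints.
    inv_masses:  inverse mass per particle; 0 pins a particle in place.
    """
    # 1. Predict positions with explicit integration of external forces.
    predicted = []
    for (x, y), (vx, vy), w in zip(positions, velocities, inv_masses):
        if w > 0:
            vx += dt * gravity[0]
            vy += dt * gravity[1]
        predicted.append([x + dt * vx, y + dt * vy])

    # 2. Iteratively project each distance constraint onto the predictions.
    for _ in range(iterations):
        for i, j, rest in constraints:
            dx = predicted[j][0] - predicted[i][0]
            dy = predicted[j][1] - predicted[i][1]
            dist = math.hypot(dx, dy)
            w_sum = inv_masses[i] + inv_masses[j]
            if dist == 0 or w_sum == 0:
                continue
            corr = (dist - rest) / (dist * w_sum)
            predicted[i][0] += inv_masses[i] * corr * dx
            predicted[i][1] += inv_masses[i] * corr * dy
            predicted[j][0] -= inv_masses[j] * corr * dx
            predicted[j][1] -= inv_masses[j] * corr * dy

    # 3. Derive velocities from the corrected positions.
    new_velocities = [((p[0] - x) / dt, (p[1] - y) / dt)
                      for p, (x, y) in zip(predicted, positions)]
    return [tuple(p) for p in predicted], new_velocities
```

For example, a particle attached to a pinned particle by a unit-length constraint swings like a pendulum while the constraint keeps the two a fixed distance apart.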
2401.14828
Report |
TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts |
Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan |
Text-driven 3D scene editing has gained significant attention owing to its
convenience and user-friendliness. However, existing methods still lack
accurate control of the specified appearance and location of the editing result
due to the inherent limitations of the text description. To this end, we
propose a 3D scene editing framework, TIP-Editor, that accepts both text and
image prompts and a 3D bounding box to specify the editing region. With the
image prompt, users can conveniently specify the detailed appearance/style of
the target content in complement to the text description, enabling accurate
control of the appearance. Specifically, TIP-Editor employs a stepwise 2D
personalization strategy to better learn the representation of the existing
scene and the reference image, in which a localization loss is proposed to
encourage correct object placement as specified by the bounding box.
Additionally, TIP-Editor utilizes explicit and flexible 3D Gaussian splatting as
the 3D representation to facilitate local editing while keeping the background
unchanged. Extensive experiments have demonstrated that TIP-Editor conducts
accurate editing following the text and image prompts in the specified bounding
box region, consistently outperforming the baselines in editing quality, and
the alignment to the prompts, qualitatively and quantitatively. |
Presents TIP-Editor, a 3D scene editing framework that allows users to edit existing scenes using both text and image prompts within a user-specified 3D bounding box, offering accurate control over the appearance and location of the edit. |
Existing text-driven 3D scene editing methods lack accurate control over the appearance and location of the editing result due to the inherent limitations of the text description. |
TIP-Editor employs a stepwise 2D personalization strategy to learn representations of the existing scene and the reference image. It utilizes explicit and flexible 3D Gaussian splatting (GS) for the 3D scene representation, facilitating local editing while preserving the background. A localization loss is introduced during personalization to ensure accurate object placement. |
TIP-Editor accurately captures unique characteristics specified in the reference images, offering superior controllability.
It supports sequential editing, allowing multiple modifications without noticeable quality degradation.
Both qualitative and quantitative evaluations demonstrate TIP-Editor's superiority in editing quality, visual fidelity, and user satisfaction compared to existing methods. |
The reliance on coarse bounding box input can be problematic in complex scenes where bounding boxes might include unwanted elements.
Extracting a smooth and accurate mesh from GS-represented scenes for further geometric manipulation remains a challenge. |
3d scene editing, text-guided image editing, image-guided image editing, 3d gaussian splatting, score distillation sampling |
2401.14754
Report |
VJT: A Video Transformer on Joint Tasks of Deblurring, Low-light Enhancement and Denoising |
Yuxiang Hui, Yang Liu, Yaofang Liu, Fan Jia, Jinshan Pan, Raymond Chan, Tieyong Zeng |
The video restoration task aims to recover high-quality videos from low-quality
observations. It comprises various important sub-tasks, such as video
denoising, deblurring and low-light enhancement, since video often faces
different types of degradation, such as blur, low light, and noise. Even worse,
these kinds of degradation could happen simultaneously when taking videos in
extreme environments. This poses significant challenges if one wants to remove
these artifacts at the same time. In this paper, to the best of our knowledge,
we are the first to propose an efficient end-to-end video transformer approach
for the joint task of video deblurring, low-light enhancement, and denoising.
This work builds a novel multi-tier transformer where each tier uses a
different level of degraded video as a target to learn the features of video
effectively. Moreover, we carefully design a new tier-to-tier feature fusion
scheme to learn video features incrementally and accelerate the training
process with a suitable adaptive weighting scheme. We also provide a new
Multiscene-Lowlight-Blur-Noise (MLBN) dataset, which is generated according to
the characteristics of the joint task based on the RealBlur dataset and YouTube
videos to simulate realistic scenes as far as possible. We have conducted
extensive experiments, compared with many previous state-of-the-art methods, to
show the effectiveness of our approach clearly. |
This paper proposes Video Joint Task (VJT), a novel multi-tier video transformer framework for the joint task of video deblurring, low-light enhancement, and denoising. |
Real-world videos often suffer from multiple degradations simultaneously (blur, low light, noise), necessitating a joint approach for optimal restoration. |
The VJT employs a multi-tier decoder structure with feature fusion between tiers to progressively learn features for the three subtasks. An adaptive weighting scheme balances the multiple loss functions, accelerating training and enhancing results. |
VJT outperforms state-of-the-art methods (e.g., RVRT, LEDNet) on the proposed Multiscene-Lowlight-Blur-Noise (MLBN) dataset, achieving a PSNR of 25.45 dB and an SSIM of 0.8083.
The multi-tier architecture with feature fusion significantly improves restoration quality compared to single-tier methods.
The adaptive weighting scheme effectively balances the loss functions, leading to faster training convergence and improved performance compared to fixed-weight methods. |
The computational cost of the multi-tier transformer architecture is relatively high, limiting real-time applicability.
The MLBN dataset, while designed to approximate real-world scenes, is still synthetic and may not fully capture the complexities of real-world degradations. |
video restoration, video deblurring, low-light enhancement, video denoising, video transformer |
2401.14425
Report |
No Longer Trending on Artstation: Prompt Analysis of Generative AI Art |
Jon McCormack, Maria Teresa Llano, Stephen James Krol, Nina Rajcic |
Image generation using generative AI is rapidly becoming a major new source
of visual media, with billions of AI generated images created using diffusion
models such as Stable Diffusion and Midjourney over the last few years. In this
paper we collect and analyse over 3 million prompts and the images they
generate. Using natural language processing, topic analysis and visualisation
methods we aim to understand collectively how people are using text prompts,
the impact of these systems on artists, and more broadly on the visual cultures
they promote. Our study shows that prompting focuses largely on surface
aesthetics, reinforcing cultural norms, popular conventional representations
and imagery. We also find that many users focus on popular topics (such as
making colouring books, fantasy art, or Christmas cards), suggesting that the
dominant use for the systems analysed is recreational rather than artistic. |
This paper investigates the use of text prompts in text-to-image (TTI) AI art generation, analyzing over 3 million prompts from Stable Diffusion and Midjourney (2022-2023) to understand user trends and the impact of these systems on visual culture. |
The rapid adoption of TTI systems raises concerns about bias, artistic homogenization, and the impact on human artists. Understanding how people utilize these systems is crucial to assess their influence on visual art and culture. |
The study employs natural language processing, topic analysis, and data visualization techniques to analyze prompt datasets from Stable Diffusion and Midjourney. It examines trends in prompt usage, stylistic references, artist mentions, and the content of generated images. |
Prompting in TTI systems often prioritizes achieving desired visual aesthetics over conveying unique artistic ideas, as evidenced by the prevalence of terms like 'cinematic lighting' and 'photorealistic'.
Analysis reveals a significant bias toward popular and conventional artistic styles, potentially leading to aesthetic homogenization and the reinforcement of existing norms.
The study finds a dominant focus on generating images of women, particularly in genres like fantasy art and anime, highlighting potential biases and the reinforcement of stereotypes. |
The study is limited to data from Stable Diffusion and Midjourney, and future research should include data from other popular TTI systems like DALL-E and Leonardo.
Future work could investigate the agency exerted by TTI systems on human users and how their inherent properties might shape future image production. |
generative ai, prompting, visual arts & culture, text-to-image, ai art |
2401.14405
Report |
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities |
Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue |
We propose to improve transformers of a specific modality with irrelevant
data from other modalities, e.g., improve an ImageNet model with audio or point
cloud datasets. We would like to highlight that the data samples of the target
modality are irrelevant to the other modalities, which distinguishes our method
from other works utilizing paired (e.g., CLIP) or interleaved data of different
modalities. We propose a methodology named Multimodal Pathway - given a target
modality and a transformer designed for it, we use an auxiliary transformer
trained with data of another modality and construct pathways to connect
components of the two models so that data of the target modality can be
processed by both models. In this way, we utilize the universal
sequence-to-sequence modeling abilities of transformers obtained from two
modalities. As a concrete implementation, we use a modality-specific tokenizer
and task-specific head as usual but utilize the transformer blocks of the
auxiliary model via a proposed method named Cross-Modal Re-parameterization,
which exploits the auxiliary weights without any inference costs. On the image,
point cloud, video, and audio recognition tasks, we observe significant and
consistent performance improvements with irrelevant data from other modalities.
The code and models are available at https://github.com/AILab-CVC/M2PT. |
This paper proposes Multimodal Pathway, a framework to improve the performance of a transformer on a specific modality using irrelevant data from other modalities. |
Existing multimodal learning methods rely on paired or interleaved data, requiring strong relevance between samples. This work explores improving models with irrelevant data, addressing an open problem in the field. |
The method uses two transformers, one trained on the target modality and another on an auxiliary modality. Cross-Modal Re-parameterization connects the models, allowing the target model to leverage the auxiliary model's weights during training without inference cost. |
M2PT consistently improves performance across image, video, point cloud, and audio modalities.
The method is effective even when auxiliary model weights are fixed during fine-tuning, demonstrating the transferability of learned knowledge.
Empirical studies suggest the improvements stem from the auxiliary model's ability to enhance hierarchical representations, not just better initialization. |
The theoretical explanation behind the performance improvements needs further investigation.
Future work will explore extending Multimodal Pathways to CNNs and cross-architecture scenarios. |
multimodal learning, transformer, re-parameterization, modality-complementary knowledge, hierarchical representation |
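The Cross-Modal Re-parameterization idea in the Multimodal Pathway entry above (using auxiliary weights with no added inference cost) can be illustrated with a small sketch. It assumes, purely for illustration, that the auxiliary weights enter a linear layer as a scaled additive term `w_target + lam * w_aux`; the names `merge_weights`, `two_branch_forward`, and `lam` are hypothetical and not taken from the released M2PT code. The point is only that a two-branch layer of this form can be folded into a single weight matrix after training.

```python
def matmul(x, w):
    """Multiply an (n x d_in) matrix by a (d_in x d_out) matrix (nested lists)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for row in x]


def two_branch_forward(x, w_target, w_aux, lam):
    """Training-time view: target branch plus a scaled auxiliary branch."""
    y_target = matmul(x, w_target)
    y_aux = matmul(x, w_aux)
    return [[t + lam * a for t, a in zip(rt, ra)] for rt, ra in zip(y_target, y_aux)]


def merge_weights(w_target, w_aux, lam):
    """Inference-time view: fold the auxiliary weights into one matrix."""
    return [[t + lam * a for t, a in zip(rt, ra)] for rt, ra in zip(w_target, w_aux)]


# Toy values, chosen only for illustration.
x = [[1.0, 2.0], [0.5, -1.0]]
w_target = [[0.2, -0.1], [0.4, 0.3]]
w_aux = [[1.0, 0.0], [-0.5, 2.0]]
lam = 0.1  # assumed scale on the auxiliary weights

branch_out = two_branch_forward(x, w_target, w_aux, lam)
merged_out = matmul(x, merge_weights(w_target, w_aux, lam))
max_diff = max(abs(p - q) for rp, rq in zip(branch_out, merged_out) for p, q in zip(rp, rq))
```

Because the two computations agree up to rounding, the merged layer has the same shape and cost as the original target layer, which is what "without any inference costs" would mean under this assumption.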
2401.14404
Report |
Deconstructing Denoising Diffusion Models for Self-Supervised Learning |
Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He |
In this study, we examine the representation learning abilities of Denoising
Diffusion Models (DDM) that were originally purposed for image generation. Our
philosophy is to deconstruct a DDM, gradually transforming it into a classical
Denoising Autoencoder (DAE). This deconstructive procedure allows us to explore
how various components of modern DDMs influence self-supervised representation
learning. We observe that only a very few modern components are critical for
learning good representations, while many others are nonessential. Our study
ultimately arrives at an approach that is highly simplified and to a large
extent resembles a classical DAE. We hope our study will rekindle interest in a
family of classical methods within the realm of modern self-supervised
learning. |
This paper investigates the representation learning capabilities of Denoising Diffusion Models (DDMs) by deconstructing them into classical Denoising Autoencoders (DAEs). It identifies key components contributing to DDM's representation learning and proposes a simplified DAE architecture. |
The study aims to understand how various components of modern DDMs affect self-supervised representation learning and to bridge the gap between classical DAEs and modern DDMs. |
The authors deconstruct a DDM step-by-step, simplifying the tokenizer and removing DDM-specific components to approach a classical DAE while evaluating the representation learning performance at each step. |
A low-dimensional latent space, rather than tokenizer specifics, is crucial for DDM's representation learning.
A simple DAE with patch-wise PCA tokenizer and multi-level noise achieves competitive self-supervised learning performance.
DDM's representation learning capability stems primarily from the denoising process, not diffusion. |
Autoencoder-based methods, including the proposed one, still lag behind contrastive learning.
The study primarily focuses on ImageNet and linear probing protocol. |
denoising diffusion models, denoising autoencoders, self-supervised learning, representation learning, computer vision |
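The "multi-level noise" ingredient named in the entry above can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes the latent (for example, a patch-wise PCA projection) is already given as a list of floats, that noise is purely additive Gaussian, and that the noise level is drawn uniformly from a fixed list. The helper names and the level values are hypothetical.

```python
import random


def corrupt(latent, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each entry."""
    return [z + sigma * rng.gauss(0.0, 1.0) for z in latent]


def make_training_pair(latent, noise_levels, rng):
    """Pick one noise level and return (noisy input, clean target, level)."""
    sigma = rng.choice(noise_levels)
    return corrupt(latent, sigma, rng), list(latent), sigma


rng = random.Random(0)
latent = [0.5, -1.2, 0.3, 0.0]        # stand-in for a low-dimensional latent
noise_levels = [0.0, 0.1, 0.5, 1.0]   # assumed set of noise levels
noisy, target, sigma = make_training_pair(latent, noise_levels, rng)
```

A denoising autoencoder would then be trained to map `noisy` back to `target`; the network itself is omitted here.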
2401.14398
Report |
pix2gestalt: Amodal Segmentation by Synthesizing Wholes |
Ege Ozguroglu, Ruoshi Liu, Dídac Surís, Dian Chen, Achal Dave, Pavel Tokmakov, Carl Vondrick |
We introduce pix2gestalt, a framework for zero-shot amodal segmentation,
which learns to estimate the shape and appearance of whole objects that are
only partially visible behind occlusions. By capitalizing on large-scale
diffusion models and transferring their representations to this task, we learn
a conditional diffusion model for reconstructing whole objects in challenging
zero-shot cases, including examples that break natural and physical priors,
such as art. As training data, we use a synthetically curated dataset
containing occluded objects paired with their whole counterparts. Experiments
show that our approach outperforms supervised baselines on established
benchmarks. Our model can furthermore be used to significantly improve the
performance of existing object recognition and 3D reconstruction methods in the
presence of occlusions. |
Introduces pix2gestalt, a framework for zero-shot amodal segmentation that leverages pre-trained diffusion models to estimate the shape and appearance of partially occluded objects. |
Amodal completion is crucial for various applications in vision, graphics, and robotics. Existing methods struggle to generalize beyond closed-world settings. |
Fine-tunes a pre-trained diffusion model on a synthetic dataset of occluded objects paired with their whole counterparts. The model takes an RGB image and a point prompt as input and generates the whole object behind occlusions. |
Achieves state-of-the-art amodal segmentation results in a zero-shot setting, outperforming supervised baselines on established benchmarks.
Significantly improves the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.
Generates diverse and plausible completions, handling uncertainty in occlusion scenarios. |
The model is limited in situations requiring commonsense or physical reasoning.
Future work could explore incorporating such reasoning abilities into the model. |
amodal segmentation, zero-shot learning, diffusion models, object recognition, 3d reconstruction |
2401.14391
Report |
Rethinking Patch Dependence for Masked Autoencoders |
Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg |
In this work, we re-examine inter-patch dependencies in the decoding
mechanism of masked autoencoders (MAE). We decompose this decoding mechanism
for masked patch reconstruction in MAE into self-attention and cross-attention.
Our investigations suggest that self-attention between mask patches is not
essential for learning good representations. To this end, we propose a novel
pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE).
CrossMAE's decoder leverages only cross-attention between masked and visible
tokens, with no degradation in downstream performance. This design also enables
decoding only a small subset of mask tokens, boosting efficiency. Furthermore,
each decoder block can now leverage different encoder features, resulting in
improved representation learning. CrossMAE matches MAE in performance with 2.5
to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet
classification and COCO instance segmentation under the same compute. Code and
models: https://crossmae.github.io |
Introduces CrossMAE, a masked autoencoder that uses cross-attention between visible and masked image patches for reconstruction, eliminating self-attention among masked patches. |
Self-attention in the decoder of masked autoencoders is computationally expensive and may not be necessary for good representation learning. |
CrossMAE replaces self-attention with cross-attention in the decoder, enabling partial reconstruction and incorporating inter-block attention to leverage features from multiple encoder blocks. |
CrossMAE achieves comparable or superior performance to MAE on ImageNet classification and COCO instance segmentation with 2.5-3.7x less decoding compute.
Partial reconstruction, decoding only a subset of masked patches, maintains performance while boosting efficiency.
Inter-block attention, allowing decoder blocks to leverage features from different encoder blocks, further improves representation learning. |
Future work could explore more efficient inter-block attention mechanisms.
The role of self-attention in masked visual pretraining, and potential alternatives to it, remains to be investigated. |
masked autoencoders, self-supervised learning, cross-attention, vision transformers, representation learning |
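The decoder design in the CrossMAE entry above, where mask tokens attend only to visible tokens, can be sketched with plain scaled dot-product cross-attention. This is a minimal illustration under simplifying assumptions: learned query, key, and value projections, multiple heads, and the inter-block feature weighting are all omitted, and the token values are toy numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(mask_queries, visible_keys, visible_values):
    """Each mask query attends to visible tokens only, never to other mask queries."""
    dim = len(visible_keys[0])
    outputs = []
    for q in mask_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in visible_keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, visible_values))
                        for j in range(len(visible_values[0]))])
    return outputs


visible = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy visible-token features
mask_queries = [[0.5, 0.5], [2.0, -1.0]]         # toy mask-token queries

all_decoded = cross_attention(mask_queries, visible, visible)
subset_decoded = cross_attention(mask_queries[:1], visible, visible)
```

Since no query depends on any other query, decoding a subset of mask tokens gives the same result for those tokens as decoding all of them, which is what makes partial reconstruction possible in this simplified setting.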
2401.14257
Report |
Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation |
Minglin Chen, Weihao Yuan, Yukun Wang, Zhe Sheng, Yisheng He, Zilong Dong, Liefeng Bo, Yulan Guo |
Recently, text-to-3D approaches have achieved high-fidelity 3D content
generation using text description. However, the generated objects are
stochastic and lack fine-grained control. Sketches provide a cheap approach to
introduce such fine-grained control. Nevertheless, it is challenging to achieve
flexible control from these sketches due to their abstraction and ambiguity. In
this paper, we present a multi-view sketch-guided text-to-3D generation
framework (namely, Sketch2NeRF) to add sketch control to 3D generation.
Specifically, our method leverages pretrained 2D diffusion models (e.g., Stable
Diffusion and ControlNet) to supervise the optimization of a 3D scene
represented by a neural radiance field (NeRF). We propose a novel synchronized
generation and reconstruction method to effectively optimize the NeRF. In the
experiments, we collected two kinds of multi-view sketch datasets to evaluate
the proposed method. We demonstrate that our method can synthesize
3D-consistent content with fine-grained sketch control while remaining faithful
to text prompts. Extensive results show that our method achieves
state-of-the-art performance in terms of sketch similarity and text alignment. |
Presents Sketch2NeRF, a novel framework for multi-view sketch-guided 3D object generation using neural radiance fields (NeRF) optimized with pretrained 2D diffusion models (Stable Diffusion and ControlNet) for fine-grained control. |
Addresses limitations of existing text-to-3D methods that lack fine-grained controllability and introduces a method for generating 3D objects from multi-view sketches, a more intuitive and expressive way to specify object structure than text. |
Leverages pretrained 2D diffusion models to supervise the optimization of a NeRF, employing a novel synchronized generation and reconstruction mechanism. An annealed time schedule enhances generation quality by gradually reducing noise during optimization. |
Generates high-fidelity 3D objects that accurately reflect the structure and details of the input multi-view sketches.
Exhibits better 3D consistency than text-to-3D methods, alleviating issues like the 'Janus' problem.
Achieves state-of-the-art performance in terms of sketch similarity and text alignment on collected multi-view sketch datasets. |
The quality of generated objects degrades with increased noise in sketch poses, highlighting a dependence on accurate sketch alignment.
The generation process is computationally expensive, taking around 2 hours on a single NVIDIA RTX 3090 GPU. |
text-to-3d, sketch-based 3d generation, neural radiance fields (nerf), diffusion models, controllable generation |
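The annealed time schedule mentioned in the Sketch2NeRF entry above can be sketched as an upper bound on the sampled diffusion timestep that shrinks as optimization proceeds. The linear shape and the endpoint values below are assumptions chosen for illustration; the entry only states that noise is gradually reduced during optimization.

```python
def annealed_t_max(step, total_steps, t_start=0.98, t_end=0.5):
    """Linearly lower the maximum (normalized) diffusion timestep over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


# Upper bounds at a few points of a hypothetical 100-step optimization run.
schedule = [annealed_t_max(s, 100) for s in (0, 25, 50, 75, 100)]
```

Early steps allow high noise levels (coarse structure), while later steps restrict sampling to lower noise levels (fine detail).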
2401.14159
Report |
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks |
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang |
We introduce Grounded SAM, which uses Grounding DINO as an open-set object
detector to combine with the segment anything model (SAM). This integration
enables the detection and segmentation of any regions based on arbitrary text
inputs and opens a door to connecting various vision models. As shown in Fig.1,
a wide range of vision tasks can be achieved by using the versatile Grounded
SAM pipeline. For example, an automatic annotation pipeline based solely on
input images can be realized by incorporating models such as BLIP and Recognize
Anything. Additionally, incorporating Stable Diffusion allows for controllable
image editing, while the integration of OSX facilitates promptable 3D human
motion analysis. Grounded SAM also shows superior performance on
open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in
the wild) zero-shot benchmark with the combination of Grounding DINO-Base and
SAM-Huge models. |
Introduces "Grounded SAM," a framework that combines Grounding DINO (an open-set object detector) with the Segment Anything Model (SAM) for open-vocabulary object detection and segmentation using text inputs. |
Addresses the limitations of existing visual perception models in handling complex open-world scenarios, specifically targeting the challenge of open-set segmentation. |
Leverages the strengths of Grounding DINO for text-to-box mapping and SAM for box-to-mask mapping, effectively achieving text-to-mask segmentation. Further extends Grounded SAM by integrating other models for tasks like automatic image annotation, image editing, and human motion analysis. |
Enables detection and segmentation of objects in images based on arbitrary text inputs, including long-tail categories.
Achieves state-of-the-art performance on the SegInW (Segmentation in the wild) zero-shot benchmark, demonstrating superior open-vocabulary segmentation capabilities.
Provides a versatile framework for building diverse AI systems by integrating other expert models for applications like automatic image annotation, controllable image editing, and human motion analysis. |
Reliance on the accuracy of the underlying expert models (e.g., Grounding DINO, SAM).
Potential limitations in handling complex scenes with overlapping or partially visible objects. |
open-vocabulary segmentation, grounded segmentation, open-world vision, foundation model assembling, multimodal ai |
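The text-to-box followed by box-to-mask composition described in the Grounded SAM entry above can be sketched as a two-stage pipeline. The `toy_detector` and `toy_segmenter` functions below are hypothetical stand-ins and do not reflect the real Grounding DINO or SAM interfaces; only the way the two stages are chained is the point of the example.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive
Mask = List[List[int]]            # binary mask, same size as the image


def text_to_masks(image, text: str,
                  detect: Callable[[list, str], List[Box]],
                  segment: Callable[[list, Box], Mask]) -> List[Tuple[Box, Mask]]:
    """Stage 1: text -> boxes. Stage 2: each box -> mask."""
    boxes = detect(image, text)
    return [(box, segment(image, box)) for box in boxes]


def toy_detector(image, text):
    """Stand-in detector: pretends the prompted object is in the top-left 2x2 area."""
    return [(0, 0, 2, 2)]


def toy_segmenter(image, box):
    """Stand-in segmenter: fills the whole box as the object mask."""
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]


image = [[0] * 4 for _ in range(3)]   # 3 x 4 dummy image
results = text_to_masks(image, "a dog", toy_detector, toy_segmenter)
```

Passing the two stages in as callables mirrors the "assembling" idea: either model could be swapped for another expert model without changing the pipeline function.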
2401.14069
Report |
Neural Sinkhorn Gradient Flow |
Huminhao Zhu, Fangyikang Wang, Chao Zhang, Hanbin Zhao, Hui Qian |
Wasserstein Gradient Flows (WGF) with respect to specific functionals have
been widely used in the machine learning literature. Recently, neural networks
have been adopted to approximate certain intractable parts of the underlying
Wasserstein gradient flow and result in efficient inference procedures. In this
paper, we introduce the Neural Sinkhorn Gradient Flow (NSGF) model, which
parametrizes the time-varying velocity field of the Wasserstein gradient flow
w.r.t. the Sinkhorn divergence to the target distribution starting from a given
source distribution. We utilize the velocity field matching training scheme in
NSGF, which only requires samples from the source and target distribution to
compute an empirical velocity field approximation. Our theoretical analyses
show that as the sample size increases to infinity, the mean-field limit of the
empirical approximation converges to the true underlying velocity field. To
further enhance model efficiency on high-dimensional tasks, a two-phase NSGF++
model is devised, which first follows the Sinkhorn flow to approach the image
manifold quickly ($\le 5$ NFEs) and then refines the samples along a simple
straight flow. Numerical experiments with synthetic and real-world benchmark
datasets support our theoretical results and demonstrate the effectiveness of
the proposed methods. |
Introduces Neural Sinkhorn Gradient Flow (NSGF), a model that uses neural networks to approximate the velocity field of the Wasserstein Gradient Flow with respect to the Sinkhorn divergence for efficient inference between probability distributions. |
WGFs are important for machine learning, but existing methods can be computationally expensive. NSGF offers an efficient alternative by using neural networks to approximate the flow. |
The authors utilize a velocity field matching training scheme, which learns the velocity field by minimizing the difference between a neural network approximation and an empirical velocity field estimated from samples of the source and target distributions. |
Theoretical analysis shows that the mean-field limit of the empirical velocity field approximation converges to the true underlying velocity field as sample size increases.
A two-phase NSGF++ model improves efficiency on high-dimensional tasks by combining Sinkhorn flow and straight flow.
Experiments on synthetic and real-world datasets demonstrate the effectiveness of NSGF and NSGF++. |
The current analysis focuses on the mean-field limit and assumes a specific form for the empirical approximation.
Future work could explore different neural network architectures and training objectives to further improve the model's performance. |
wasserstein gradient flow, sinkhorn divergence, neural networks, probability distributions, inference |
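The Sinkhorn divergence that the NSGF entry above builds on can be illustrated on small discrete distributions. The sketch assumes 1D points, a squared-distance cost, and the common debiased form S(a, b) = OT_eps(a, b) - 0.5 OT_eps(a, a) - 0.5 OT_eps(b, b), with OT_eps evaluated from the dual potentials after a fixed number of Sinkhorn iterations. It does not compute the velocity field or train a network, and the function names and toy data are hypothetical.

```python
import math


def entropic_ot(xs, ys, a, b, eps=1.0, iters=200):
    """Entropic OT value between weighted 1D point clouds, via Sinkhorn scaling."""
    n, m = len(xs), len(ys)
    kernel = [[math.exp(-((x - y) ** 2) / eps) for y in ys] for x in xs]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternate updates so that each marginal constraint is satisfied in turn.
        u = [1.0 / sum(kernel[i][j] * b[j] * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * a[i] * u[i] for i in range(n)) for j in range(m)]
    f = [eps * math.log(ui) for ui in u]   # dual potential on the source points
    g = [eps * math.log(vj) for vj in v]   # dual potential on the target points
    return sum(ai * fi for ai, fi in zip(a, f)) + sum(bj * gj for bj, gj in zip(b, g))


def sinkhorn_divergence(xs, ys, a, b, eps=1.0):
    """Debiased entropic OT: subtract the two self-transport terms."""
    return (entropic_ot(xs, ys, a, b, eps)
            - 0.5 * entropic_ot(xs, xs, a, a, eps)
            - 0.5 * entropic_ot(ys, ys, b, b, eps))


source = [0.0, 1.0]      # toy source samples
target = [2.0, 3.0]      # toy target samples
weights = [0.5, 0.5]     # uniform weights on both sides

div_far = sinkhorn_divergence(source, target, weights, weights)
div_self = sinkhorn_divergence(source, source, weights, weights)
```

The self-divergence vanishes by construction, while the divergence between the two separated point sets is positive, which is the property a flow toward the target distribution would rely on.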
2401.13992
Report |
Diffusion-based Data Augmentation for Object Counting Problems |
Zhen Wang, Yuelei Li, Jia Wan, Nuno Vasconcelos |
Crowd counting is an important problem in computer vision due to its wide
range of applications in image understanding. Currently, this problem is
typically addressed using deep learning approaches, such as Convolutional
Neural Networks (CNNs) and Transformers. However, deep networks are data-driven
and are prone to overfitting, especially when the available labeled crowd
dataset is limited. To overcome this limitation, we have designed a pipeline
that utilizes a diffusion model to generate extensive training data. We are the
first to generate images conditioned on a location dot map (a binary dot map
that specifies the location of human heads) with a diffusion model. We are also
the first to use these diverse synthetic data to augment the crowd counting
models. Our proposed smoothed density map input for ControlNet significantly
improves ControlNet's performance in generating crowds in the correct
locations. Also, our proposed counting loss for the diffusion model effectively
minimizes the discrepancies between the location dot map and the crowd images
generated. Additionally, our innovative guidance sampling further directs the
diffusion process toward regions where the generated crowd images align most
accurately with the location dot map. Collectively, we have enhanced
ControlNet's ability to generate specified objects from a location dot map,
which can be used for data augmentation in various counting problems. Moreover,
our framework is versatile and can be easily adapted to all kinds of counting
problems. Extensive experiments demonstrate that our framework improves the
counting performance on the ShanghaiTech, NWPU-Crowd, UCF-QNRF, and TRANCOS
datasets, showcasing its effectiveness. |
This paper presents a novel framework leveraging diffusion models for data augmentation in object counting tasks, enhancing the training of counting models by generating synthetic images with precise control over object location and density. |
Existing crowd counting datasets are limited in size, leading to overfitting in deep learning models. This framework addresses this challenge by synthesizing diverse and realistic training images, improving model generalization and performance. |
The framework utilizes a pre-trained diffusion model (ControlNet) with several key modifications: 1) Density maps derived from location dot maps are used as input to guide object generation. 2) A counting loss function enforces accurate object placement during training. 3) A counting-guided sampling strategy refines object locations in generated images. |
The method generates synthetic crowd images that accurately reflect the specified density and spatial distribution from location dot maps.
Training counting models with the augmented dataset leads to significant performance improvements across various crowd counting benchmarks (ShanghaiTech, NWPU-Crowd, UCF-QNRF).
The framework demonstrates versatility by effectively augmenting data for vehicle counting on the TRANCOS dataset, highlighting its adaptability to different object counting tasks. |
There might be a trade-off between image quality and strict adherence to location maps due to modifications in the loss function and sampling process.
Future work could explore techniques to further improve generated image quality while maintaining accurate object correspondence. |
data augmentation, object counting, diffusion models, crowd counting, controlnet |
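The step in the entry above that turns a location dot map into a smoothed density map can be sketched as placing one normalized Gaussian per annotated point. This is a minimal illustration, assuming an isotropic Gaussian with a fixed `sigma` and per-dot normalization inside the image so that the map sums to the object count; the exact smoothing used for the ControlNet input may differ.

```python
import math


def smoothed_density_map(dots, height, width, sigma=1.0):
    """Spread each (row, col) dot into a Gaussian blob that sums to 1."""
    density = [[0.0] * width for _ in range(height)]
    for (r, c) in dots:
        kernel = [[math.exp(-((y - r) ** 2 + (x - c) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        total = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


dots = [(2, 2), (5, 6), (5, 7)]            # toy head locations
density = smoothed_density_map(dots, height=8, width=10, sigma=1.0)
count = sum(sum(row) for row in density)   # integrates back to the number of dots
```

Because each blob is normalized, summing the map recovers the count, which is the quantity a counting loss would compare against.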
2401.13974
Report |
BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models |
Senthil Purushwalkam, Akash Gokul, Shafiq Joty, Nikhil Naik |
Recent text-to-image generation models have demonstrated incredible success
in generating images that faithfully follow input prompts. However, the
requirement of using words to describe a desired concept provides limited
control over the appearance of the generated concepts. In this work, we address
this shortcoming by proposing an approach to enable personalization
capabilities in existing text-to-image diffusion models. We propose a novel
architecture (BootPIG) that allows a user to provide reference images of an
object in order to guide the appearance of a concept in the generated images.
The proposed BootPIG architecture makes minimal modifications to a pretrained
text-to-image diffusion model and utilizes a separate UNet model to steer the
generations toward the desired appearance. We introduce a training procedure
that allows us to bootstrap personalization capabilities in the BootPIG
architecture using data generated from pretrained text-to-image models, LLM
chat agents, and image segmentation models. In contrast to existing methods
that require several days of pretraining, the BootPIG architecture can be
trained in approximately 1 hour. Experiments on the DreamBooth dataset
demonstrate that BootPIG outperforms existing zero-shot methods while being
comparable with test-time finetuning approaches. Through a user study, we
validate the preference for BootPIG generations over existing methods both in
maintaining fidelity to the reference object's appearance and aligning with
textual prompts. |
This paper proposes BootPIG, a novel architecture that enables zero-shot subject-driven generation in text-to-image models by injecting learned reference image features into a pretrained diffusion model. |
Personalized image generation, the ability to generate images of specific objects in user-defined contexts, has numerous applications but current methods require time-consuming finetuning or lack fidelity to the reference object. |
BootPIG uses two UNets: one extracts features from reference images and the other (modified with Reference Self-Attention layers) generates images conditioned on these features. The model is trained using a novel bootstrapping procedure that generates synthetic training data from pretrained text-to-image models, chat agents, and segmentation models. |
BootPIG outperforms existing zero-shot methods and achieves comparable performance to test-time finetuned methods on standard metrics.
User studies demonstrate a preference for BootPIG generations over existing methods in terms of both subject fidelity and prompt fidelity.
BootPIG can be trained efficiently, requiring only approximately 1 hour on 16 A100 GPUs. |
BootPIG may struggle with prompts that significantly modify the subject's appearance or require fine-grained details.
The method inherits limitations and biases from the underlying generative model. |
text-to-image generation, personalized image generation, subject-driven generation, diffusion models, zero-shot learning |
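The Reference Self-Attention idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not BootPIG's implementation: it assumes a single head, identity query/key/value projections, and that reference features are injected by concatenating them to the base tokens along the token axis before attention. The function name `reference_self_attention` is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_self_attention(base_tokens, ref_tokens):
    """Attend from base tokens over base + reference tokens.

    Assumptions (not from the source): single head, identity projections.
    Queries come only from the base tokens; keys and values come from the
    concatenation of base and reference tokens.
    """
    dim = len(base_tokens[0])
    keys_values = base_tokens + ref_tokens  # concatenate along the token axis
    out = []
    for q in base_tokens:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append([
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(dim)
        ])
    return out


base = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 1.0]]
with_ref = reference_self_attention(base, ref)
without_ref = reference_self_attention(base, [])
```

The output keeps the shape of the base tokens, so a layer like this can stand in for ordinary self-attention while still letting reference features influence every position.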
2401.13942
Report |
StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models |
Mohan Zhou, Yalong Bai, Qing Yang, Tiejun Zhao |
The ability to fine-tune generative models for text-to-image generation tasks
is crucial, particularly facing the complexity involved in accurately
interpreting and visualizing textual inputs. While LoRA is efficient for
language model adaptation, it often falls short in text-to-image tasks due to
the intricate demands of image generation, such as accommodating a broad
spectrum of styles and nuances. To bridge this gap, we introduce StyleInject, a
specialized fine-tuning approach tailored for text-to-image models. StyleInject
comprises multiple parallel low-rank parameter matrices, maintaining the
diversity of visual features. It dynamically adapts to varying styles by
adjusting the variance of visual features based on the characteristics of the
input signal. This approach significantly minimizes the impact on the original
model's text-image alignment capabilities while adeptly adapting to various
styles in transfer learning. StyleInject proves particularly effective in
learning from and enhancing a range of advanced, community-fine-tuned
generative models. Our comprehensive experiments, including both small-sample
and large-scale data fine-tuning as well as base model distillation, show that
StyleInject surpasses traditional LoRA in both text-image semantic consistency
and human preference evaluation, all while ensuring greater parameter
efficiency. |
Introduces StyleInject, a parameter-efficient fine-tuning approach for text-to-image diffusion models that improves upon LoRA by dynamically adapting to various styles while maintaining semantic consistency. |
Addresses the limitations of LoRA in text-to-image generation, which often struggles with stylistic diversity and preserving text-image alignment. |
Employs dynamic multi-style adaptation with a style router for instance-wise feature adaptation and uses AdaIN for style transfer, enabling fine-grained control over visual features. |
Outperforms LoRA in data-driven fine-tuning, achieving better text-image semantic consistency and human preference scores.
Effectively distills knowledge from community-fine-tuned SDMs, transferring stylistic elements while maintaining the original model's capabilities.
Demonstrates improved performance in DreamBooth, enabling the generation of customized subjects with higher quality and consistency. |
The optimal number of training epochs can vary significantly across different experimental settings, requiring careful monitoring and potential early stopping.
Further research could explore extending StyleInject to other generative models beyond diffusion models. |
text-to-image generation, diffusion models, parameter efficient tuning, style transfer, model distillation |
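The summary above mentions AdaIN as the mechanism for adjusting feature statistics. A minimal sketch of the standard AdaIN operation on a single feature channel follows; it re-normalizes the content features to the mean and standard deviation of the style features. How StyleInject wires this into its parallel low-rank branches and style router is not shown here, and the `eps` value is an assumption.

```python
import statistics


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization for one channel.

    Normalizes `content` to zero mean and unit variance, then rescales it
    to the mean and standard deviation of `style`.
    """
    mu_c = statistics.fmean(content)
    sd_c = statistics.pstdev(content)
    mu_s = statistics.fmean(style)
    sd_s = statistics.pstdev(style)
    return [sd_s * (v - mu_c) / (sd_c + eps) + mu_s for v in content]


content = [1.0, 2.0, 3.0, 4.0]
style = [10.0, 20.0, 30.0, 40.0]
stylized = adain(content, style)
```

After the call, `stylized` has (up to `eps`) the same mean and standard deviation as `style` while preserving the relative ordering of the content values.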
2401.13795
Report |
Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All |
Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar |
As online shopping grows, the ability for buyers to virtually visualize
products in their settings, a phenomenon we define as "Virtual Try-All", has
become crucial. Recent diffusion models inherently contain a world model,
rendering them suitable for this task within an inpainting context. However,
traditional image-conditioned diffusion models often fail to capture the
fine-grained details of products. In contrast, personalization-driven models
such as DreamPaint are good at preserving the item's details but they are not
optimized for real-time applications. We present "Diffuse to Choose," a novel
diffusion-based image-conditioned inpainting model that efficiently balances
fast inference with the retention of high-fidelity details in a given reference
item while ensuring accurate semantic manipulations in the given scene content.
Our approach is based on incorporating fine-grained features from the reference
image directly into the latent feature maps of the main diffusion model,
along with a perceptual loss to further preserve the reference item's
details. We conduct extensive testing on both in-house and publicly available
datasets, and show that Diffuse to Choose is superior to existing zero-shot
diffusion inpainting methods as well as few-shot diffusion personalization
algorithms like DreamPaint. |
Introduces "Diffuse to Choose" (DTC), a novel diffusion-based image-conditioned inpainting model for Virtual Try-All that balances fast inference with high-fidelity detail retention. |
Addresses the need for an efficient and effective solution for virtual product visualization in online shopping, enabling customers to digitally "try" any product in any setting. |
Incorporates fine-grained features from the reference image into the latent feature maps of the main diffusion model using a secondary U-Net encoder and affine transformations. Also utilizes perceptual loss for improved feature alignment. |
DTC surpasses existing zero-shot diffusion inpainting methods like Paint By Example.
DTC matches the performance of few-shot diffusion personalization algorithms like DreamPaint while enabling real-time inference.
DTC effectively handles in-the-wild images and references, preserves fine-grained product details, and ensures seamless integration into target scenes. |
DTC might struggle with very fine-grained details, particularly text engravings, due to limitations of the VAE decoder.
The model might alter human poses due to its pose-agnostic nature, potentially causing discrepancies in full-body coverage. |
diffusion models, image inpainting, virtual try-on, e-commerce, computer vision |
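The method field above mentions injecting reference features through affine transformations. One common form of such injection is a feature-wise affine modulation, sketched below per channel. This is an assumption-laden illustration rather than DTC's actual layer: the function name `film_inject` and the per-channel weights `w_scale` and `w_shift` are hypothetical, and real models would predict the scale and shift with learned layers over feature maps.

```python
def film_inject(main_features, ref_features, w_scale, w_shift):
    """Modulate main-network channels with a per-channel affine transform.

    Hypothetical weights map each reference feature to a scale offset and
    a shift; a zero reference feature leaves the main feature unchanged.
    """
    out = []
    for h, r, ws, wb in zip(main_features, ref_features, w_scale, w_shift):
        gamma = 1.0 + ws * r  # scale centered at identity
        beta = wb * r         # additive shift
        out.append(gamma * h + beta)
    return out


modulated = film_inject(
    main_features=[2.0, 3.0],
    ref_features=[1.0, 0.0],
    w_scale=[0.5, 0.5],
    w_shift=[0.1, 0.1],
)
```

Centering the scale at 1 means the main diffusion features pass through untouched wherever the reference signal is zero, which keeps the pretrained behavior as the starting point.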
2401.13641
Report |
How Good is ChatGPT at Face Biometrics? A First Look into Recognition, Soft Biometrics, and Explainability |
Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia |
Large Language Models (LLMs), such as GPT developed by OpenAI, have already
shown astonishing results, introducing quick changes in our society. This has
been intensified by the release of ChatGPT, which allows anyone to interact in a
simple conversational way with LLMs, without any experience in the field
needed. As a result, ChatGPT has been rapidly applied to many different tasks
such as code and song writing, education, virtual assistants, etc., showing
impressive results for tasks for which it was not trained (zero-shot learning).
The present study aims to explore the ability of ChatGPT, based on the recent
GPT-4 multimodal LLM, for the task of face biometrics. In particular, we
analyze the ability of ChatGPT to perform tasks such as face verification,
soft-biometrics estimation, and explainability of the results. ChatGPT could be
very valuable to further increase the explainability and transparency of
automatic decisions in human scenarios. Experiments are carried out in order to
evaluate the performance and robustness of ChatGPT, using popular public
benchmarks and comparing the results with state-of-the-art methods in the
field. The results achieved in this study show the potential of LLMs such as
ChatGPT for face biometrics, especially to enhance explainability. For
reproducibility reasons, we release all the code on GitHub. |
This paper presents the first study exploring the capabilities of ChatGPT, specifically the GPT-4 multimodal LLM, for face biometrics tasks including face verification, soft biometrics estimation, and result explainability. |
ChatGPT's rapid adoption and impressive zero-shot learning capabilities make it important to assess its potential in face biometrics, a field crucial for security and human-computer interaction. |
The study uses ChatGPT's API with specifically designed prompts to perform face verification and soft biometrics estimation on various benchmark databases, comparing its performance with state-of-the-art models. The explainability of ChatGPT’s outputs is analyzed qualitatively. |
ChatGPT demonstrates promising results for face verification in controlled environments, but its performance declines in challenging scenarios such as surveillance or extreme conditions.
ChatGPT shows potential for soft biometrics estimation, outperforming some specialized models on certain attributes like age and ethnicity in LFW, and gender in MAAD-Face.
ChatGPT exhibits the ability to provide textual explanations for its decisions, enhancing the transparency of its outputs, despite occasional inaccuracies. |
The study is limited by the computational cost and API request limitations of ChatGPT, restricting the number of experiments.
Further research is needed to explore bias mitigation techniques in ChatGPT for fairer face biometrics applications. |
large language models, chatgpt, face recognition, soft biometrics, explainability |
2401.13627
Report |
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild |
Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong |
We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image
restoration method that harnesses generative prior and the power of model
scaling. Leveraging multi-modal techniques and advanced generative prior,
SUPIR marks a significant advance in intelligent and realistic image
restoration. As a pivotal catalyst within SUPIR, model scaling dramatically
enhances its capabilities and demonstrates new potential for image restoration.
We collect a dataset comprising 20 million high-resolution, high-quality images
for model training, each enriched with descriptive text annotations. SUPIR
provides the capability to restore images guided by textual prompts, broadening
its application scope and potential. Moreover, we introduce negative-quality
prompts to further improve perceptual quality. We also develop a
restoration-guided sampling method to suppress the fidelity issue encountered
in generative-based restoration. Experiments demonstrate SUPIR's exceptional
restoration effects and its novel capacity to manipulate restoration through
textual prompts. |
This paper proposes SUPIR, the largest-ever image restoration method, achieving high-fidelity and intelligent restoration through model scaling, a novel adaptor, a large image-text dataset, and restoration-guided sampling. |
Existing IR methods are limited by the scale of generative models and often lack the intelligence for targeted restoration. Model scaling significantly enhances model capability, pushing the boundaries of image restoration quality and intelligence. |
The authors utilize the StableDiffusion-XL as the generative prior and design a large-scale adaptor with a ZeroSFT connector. They collect 20 million high-resolution images with text annotations for training and introduce negative-quality samples/prompts for quality enhancement. A restoration-guided sampling method is developed to ensure fidelity. |
SUPIR achieves state-of-the-art performance on non-reference assessment metrics, indicating superior perceptual quality.
It offers flexible control over restoration through textual prompts, enabling targeted restoration and manipulation.
Extensive experiments on both synthetic and real-world data validate the effectiveness and superiority of the method. |
Negative prompts might introduce artifacts when low-quality inputs lack semantic clarity.
Full-reference metrics show limitations in evaluating high-fidelity restoration, necessitating new evaluation methods. |
image restoration, generative prior, model scaling, textual prompt, diffusion models |
2401.13601
Report |
MM-LLMs: Recent Advances in MultiModal Large Language Models |
Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu |
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or
outputs via cost-effective training strategies. The resulting models not only
preserve the inherent reasoning and decision-making capabilities of LLMs but
also empower a diverse range of MM tasks. In this paper, we provide a
comprehensive survey aimed at facilitating further research of MM-LLMs.
Initially, we outline general design formulations for model architecture and
training pipeline. Subsequently, we introduce a taxonomy encompassing 122
MM-LLMs, each characterized by its specific formulations. Furthermore, we
review the performance of selected MM-LLMs on mainstream benchmarks and
summarize key training recipes to enhance the potency of MM-LLMs. Finally, we
explore promising directions for MM-LLMs while concurrently maintaining a
real-time tracking website for the latest developments in the field. We hope
that this survey contributes to the ongoing advancement of the MM-LLMs domain. |
This paper presents a comprehensive survey of MultiModal Large Language Models (MM-LLMs), focusing on their recent advancements in bridging language models with other modalities. |
MM-LLMs represent a significant advancement in AI, striving to combine the reasoning and decision-making capabilities of LLMs with the rich information content of various modalities (e.g., image, video, audio). |
The authors provide a detailed analysis of MM-LLM design, encompassing model architecture (with five key components) and training pipelines (including Multimodal Pre-Training and Instruction Tuning). |
The paper introduces a taxonomy of 122 SOTA MM-LLMs, categorized by functionality and design.
It reviews the performance of major MM-LLMs on 18 VL benchmarks, providing a comparative analysis of their capabilities.
The authors distill key training recipes for enhancing MM-LLMs based on insights from state-of-the-art models. |
The paper acknowledges the rapidly evolving nature of MM-LLMs and potential omissions, addressed by maintaining a dedicated website for real-time updates.
The paper provides concise overviews of individual MM-LLMs due to space limitations, committing to more detailed information on their website. |
multimodal learning, large language models, vision-language, multimodal instruction tuning, survey |
2401.13560
Report |
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation |
Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, Lei Zhu |
The Transformer architecture has shown a remarkable ability in modeling
global relationships. However, it poses a significant computational challenge
when processing high-dimensional medical images. This hinders its development
and widespread adoption in this task. Mamba, a State Space Model (SSM), has
recently emerged as a notable approach for modeling long-range dependencies in
sequences, excelling in the natural language processing field with its remarkable
memory efficiency and computational speed. Inspired by its success, we
introduce SegMamba, a novel 3D medical image Segmentation Mamba model,
designed to effectively capture long-range dependencies
within whole volume features at every scale. Our SegMamba, in contrast to
Transformer-based methods, excels in whole volume feature modeling from a state
space model standpoint, maintaining superior processing speed, even with volume
features at a resolution of 64 x 64 x 64. Comprehensive experiments
on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our
SegMamba. The code for SegMamba is available at:
https://github.com/ge-xing/SegMamba |
This paper introduces SegMamba, a novel 3D medical image segmentation model based on the Mamba architecture for capturing long-range dependencies within whole-volume features efficiently. |
Modeling global relationships in 3D medical image segmentation is crucial but computationally challenging. Transformer-based methods, while effective, struggle with high-resolution images. SegMamba addresses this challenge by leveraging the Mamba architecture for memory-efficient and fast long-range dependency modeling. |
SegMamba combines a U-shaped structure with the Mamba architecture. It incorporates a tri-orientated Mamba (ToM) module for multi-directional feature modeling and a gated spatial convolution (GSC) module to enhance spatial feature representation. |
SegMamba achieves state-of-the-art performance on BraTS2023, AIIB2023, and the newly proposed CRC-500 datasets.
Ablation studies demonstrate the effectiveness of the GSC and ToM modules in improving segmentation accuracy.
SegMamba exhibits superior computational efficiency compared to transformer-based methods, even with high-resolution input. |
The paper acknowledges potential limitations in evaluating the generalizability of SegMamba due to the limited number of datasets used.
Future work may explore extending SegMamba to multi-modal medical image segmentation.
Future work may also investigate the integration of alternative spatial feature extraction modules within the SegMamba framework. |
3d medical image segmentation, state space models, mamba, long-range dependencies, computational efficiency |
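The tri-orientated Mamba (ToM) module is described above as modeling features in multiple directions. A sequence model needs the 3D volume flattened into a 1D token order, so one way to picture this is as three different flattening orders of the same volume. The sketch below assumes the three orders are a forward scan, its reverse, and a scan that runs across slices first; these specific orders and the function name are assumptions for illustration and are not confirmed by the text above.

```python
def tri_orientated_sequences(volume):
    """Flatten a depth x height x width volume into three token orders.

    Assumed orders: forward (slice by slice), reverse (forward reversed),
    and inter-slice (depth index varies fastest).
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    forward = [
        volume[d][h][w]
        for d in range(depth) for h in range(height) for w in range(width)
    ]
    reverse = forward[::-1]
    inter_slice = [
        volume[d][h][w]
        for h in range(height) for w in range(width) for d in range(depth)
    ]
    return forward, reverse, inter_slice


# A 2 x 2 x 2 toy volume whose voxel value encodes its position.
toy = [[[4 * d + 2 * h + w for w in range(2)] for h in range(2)] for d in range(2)]
fwd, rev, inter = tri_orientated_sequences(toy)
```

In the inter-slice order, voxels at the same in-plane position in neighboring slices become adjacent tokens, which is what would let a sequential model relate them directly.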
2401.13555
Report |
Benchmarking the Fairness of Image Upsampling Methods |
Mike Laszkiewicz, Imant Daunhawer, Julia E. Vogt, Asja Fischer, Johannes Lederer |
Recent years have witnessed a rapid development of deep generative models for
creating synthetic media, such as images and videos. While the practical
applications of these models in everyday tasks are enticing, it is crucial to
assess the inherent risks regarding their fairness. In this work, we introduce
a comprehensive framework for benchmarking the performance and fairness of
conditional generative models. We develop a set of
metrics, inspired by their supervised fairness
counterparts, to evaluate the models on their fairness and
diversity. Focusing on the specific application of image upsampling, we create
a benchmark covering a wide variety of modern upsampling methods. As part of
the benchmark, we introduce UnfairFace, a subset of FairFace that replicates
the racial distribution of common large-scale face datasets. Our empirical
study highlights the importance of using an unbiased training set and reveals
variations in how the algorithms respond to dataset imbalances. Alarmingly, we
find that none of the considered methods produces statistically fair and
diverse results. All experiments can be reproduced using our provided
repository. |
The paper introduces a comprehensive framework for benchmarking the performance and fairness of conditional generative models, focusing on image upsampling. |
Assessing the fairness of generative models is crucial to mitigate potential biases in applications like image enhancement, which can have societal impacts. |
The authors propose novel fairness metrics (RDP, PR, UCPR) inspired by supervised fairness counterparts, alongside traditional performance measures. They create a benchmark using a subset of the FairFace dataset, called UnfairFace, mimicking racial distribution biases in common datasets. |
Training data bias significantly affects the fairness of image upsampling models across all races.
Denoising Diffusion Restoration Models (DDRM) show the most significant fairness discrepancies between biased and unbiased datasets.
While some models demonstrate better fairness, statistical tests reveal that none achieve statistically significant fairness, emphasizing the need for further research. |
The evaluation is limited to 128x128 resolution images due to the lack of fairness labels in higher-resolution datasets.
The definition of fairness relies on race labels, which are inherently complex and subject to limitations in representation and granularity. |
conditional generative models, computer vision, image upsampling, fairness, dataset bias |
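To make the notion of "statistically fair" outputs more concrete, the sketch below computes Pearson's chi-squared statistic for how far a set of predicted group labels deviates from a uniform distribution over groups. This is a generic goodness-of-fit statistic chosen for illustration; it is not the paper's RDP, PR, or UCPR definition, and the group names are placeholders.

```python
from collections import Counter


def chi_squared_uniform(labels, groups):
    """Pearson chi-squared statistic against a uniform group distribution.

    `labels` are the group labels assigned to generated samples; `groups`
    lists every group that should be represented equally.
    """
    counts = Counter(labels)
    expected = len(labels) / len(groups)
    return sum((counts.get(g, 0) - expected) ** 2 / expected for g in groups)


groups = ["a", "b", "c"]  # placeholder group names
balanced = chi_squared_uniform(["a", "b", "c"] * 3, groups)
skewed = chi_squared_uniform(["a"] * 6 + ["b"] * 3, groups)
```

A statistic of zero means the groups are perfectly balanced; larger values indicate stronger imbalance and can be compared against a chi-squared critical value with `len(groups) - 1` degrees of freedom.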
2401.13388
Report |
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion |
Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao |
Existing text-to-image diffusion models primarily generate images from text
prompts. However, the inherent conciseness of textual descriptions poses
challenges in faithfully synthesizing images with intricate details, such as
specific entities or scenes. This paper presents UNIMO-G, a simple multimodal
conditional diffusion framework that operates on multimodal prompts with
interleaved textual and visual inputs, which demonstrates a unified ability for
both text-driven and subject-driven image generation. UNIMO-G comprises two
core components: a Multimodal Large Language Model (MLLM) for encoding
multimodal prompts, and a conditional denoising diffusion network for
generating images based on the encoded multimodal input. We leverage a
two-stage training strategy to effectively train the framework: first
pre-training on large-scale text-image pairs to develop conditional image
generation capabilities, and then instruction tuning with multimodal prompts to
achieve unified image generation proficiency. A well-designed data processing
pipeline involving language grounding and image segmentation is employed to
construct multi-modal prompts. UNIMO-G excels in both text-to-image generation
and zero-shot subject-driven synthesis, and is notably effective in generating
high-fidelity images from complex multimodal prompts involving multiple image
entities. |
This paper introduces UNIMO-G, a novel multimodal conditional diffusion framework for image generation using interleaved textual and visual prompts. |
Existing text-to-image models struggle to generate images with intricate details due to the limitations of concise textual descriptions. UNIMO-G addresses this by enabling more control and detail through multimodal prompts. |
UNIMO-G leverages a Multimodal Large Language Model (MLLM) to encode multimodal prompts and a conditional denoising diffusion network for image generation. It is trained in two stages: pre-training on text-image pairs for basic generation and fine-tuning with multimodal prompts for enhanced controllability. |
UNIMO-G outperforms existing VL-to-image models in text-to-image generation on MS-COCO.
It excels in zero-shot single-entity subject-driven generation, achieving state-of-the-art results on DreamBench.
UNIMO-G exhibits superior performance in zero-shot multi-entity subject-driven generation, as demonstrated on the newly introduced MultiBench. |
UNIMO-G shares common limitations with other image generation models, such as occasional inaccuracies in complex compositions and limitations in visual faithfulness.
The potential for misuse, particularly in creating deepfakes, raises ethical concerns. |
multimodal image generation, diffusion models, multimodal large language models, subject-driven generation, zero-shot learning |
2401.13363
Report |
Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons |
Zhe Xu, Kun Wei, Xu Yang, Cheng Deng |
Human dance generation (HDG) aims to synthesize realistic videos from images
and sequences of driving poses. Despite great success, existing methods are
limited to generating videos of a single person with specific backgrounds,
while the generalizability for real-world scenarios with multiple persons and
complex backgrounds remains unclear. To systematically measure the
generalizability of HDG models, we introduce a new task, dataset, and
evaluation protocol of compositional human dance generation (cHDG). Evaluating
the state-of-the-art methods on cHDG, we empirically find that they fail to
generalize to real-world scenarios. To tackle the issue, we propose a novel
zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos
consistent with arbitrary multiple persons and background while precisely
following the driving poses. Specifically, in contrast to straightforward DDIM
or null-text inversion, we first present a pose-aware inversion method to
obtain the noisy latent code and initialization text embeddings, which can
accurately reconstruct the composed reference image. Since directly generating
videos from them will lead to severe appearance inconsistency, we propose a
compositional augmentation strategy to generate augmented images and utilize
them to optimize a set of generalizable text embeddings. In addition, we
design consistency-guided sampling to encourage the background and
keypoints of the estimated clean image at each reverse step to be close to
those of the reference image, further improving the temporal consistency of
generated videos. Extensive qualitative and quantitative results demonstrate
the effectiveness and superiority of our approach. |
This paper introduces a novel dataset for compositional human dance generation (cHDG) and proposes a new zero-shot method for this task that leverages text embeddings optimized on augmented data. |
cHDG is a challenging task with no previous work, making this research significant for advancing the field. |
The proposed method utilizes a pretrained Stable Diffusion model and optimizes text embeddings on augmented data with varying numbers of people and backgrounds. This allows the model to learn generalizable representations for cHDG. |
The proposed method achieves state-of-the-art performance on cHDG benchmarks, outperforming both supervised and zero-shot baselines in terms of temporal consistency and pose accuracy.
The approach demonstrates superior performance in a user study, indicating higher overall generation quality.
The method is efficient in terms of storage, requiring only optimized text embeddings instead of storing entire models. |
The study primarily focuses on a limited set of 10 persons, 10 backgrounds, and 10 pose sequences, which might not fully represent the diversity in real-world scenarios.
Further investigation is required to explore the impact of a larger and more diverse dataset on the generalizability of the proposed method. |
compositional human dance generation, zero-shot learning, text embeddings, stable diffusion, data augmentation |
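The abstract above refers to the "estimated clean image at each reverse step". In standard DDPM/DDIM-style samplers, this estimate follows from the noisy sample, the predicted noise, and the cumulative noise schedule value. The sketch below shows that standard relation on a flat list of values; it assumes the usual noise-prediction parameterization and does not reproduce the paper's guidance terms on background and keypoints.

```python
import math


def predict_clean_sample(x_t, eps_pred, alpha_bar_t):
    """Estimate x_0 from x_t and predicted noise.

    Inverts the forward process
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / scale for x, e in zip(x_t, eps_pred)]


# Round trip: noise a known clean sample, then recover it with the true noise.
x0 = [0.5, -1.0]
eps = [0.1, 0.2]
alpha_bar = 0.64
x_t = [
    math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * b
    for a, b in zip(x0, eps)
]
x0_hat = predict_clean_sample(x_t, eps, alpha_bar)
```

A guidance scheme can then compare `x0_hat` with a reference image and nudge the sampling trajectory toward agreement, which is the role the abstract assigns to consistency-guided sampling.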
2401.13329
Report |
Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval |
Dezhao Luo, Shaogang Gong, Jiabo Huang, Hailin Jin, Yang Liu |
Video Moment Retrieval (VMR) requires precise modelling of fine-grained
moment-text associations to capture intricate visual-language relationships.
Due to the lack of a diverse and generalisable VMR dataset to facilitate
learning scalable moment-text associations, existing methods resort to joint
training on both source and target domain videos for cross-domain applications.
Meanwhile, recent developments in vision-language multimodal models pre-trained
on large-scale image-text and/or video-text pairs are only based on coarse
associations (weakly labelled). They are inadequate to provide fine-grained
moment-text correlations required for cross-domain VMR. In this work, we solve
the problem of unseen cross-domain VMR, where certain visual and textual
concepts do not overlap across domains, by only utilising target domain
sentences (text prompts) without accessing their videos. To that end, we
explore generative video diffusion for fine-grained editing of source videos
controlled by the target sentences, enabling us to simulate target domain
videos. We address two problems in video editing for optimising unseen domain
VMR: (1) generation of high-quality simulation videos of different moments with
subtle distinctions, (2) selection of simulation videos that complement
existing source training videos without introducing harmful noise or
unnecessary repetitions. On the first problem, we formulate a two-stage video
diffusion generation controlled simultaneously by (1) the original video
structure of a source video, (2) subject specifics, and (3) a target sentence
prompt. This ensures fine-grained variations between video moments. On the
second problem, we introduce a hybrid selection mechanism that combines two
quantitative metrics for noise filtering and one qualitative metric for
leveraging VMR prediction on simulation video selection. |
This document provides a template and guidelines for formatting papers submitted to a conference (likely associated with the IEEE Computer Society Press). |
It ensures consistency in formatting for publication and provides instructions on handling elements like blind review, citations, and figure placement. |
The paper outlines specific formatting requirements for margins, fonts, headings, references, and more. It emphasizes the use of LaTeX and provides code snippets for various formatting needs. |
The document clarifies the importance of adhering to a strict 8-page limit for submitted papers.
It emphasizes the need for clear, numbered equations and consistent referencing styles.
The guide offers detailed instructions on anonymizing submissions for blind review, including handling self-citations and unpublished work. |
The document primarily focuses on LaTeX, potentially limiting accessibility for authors using other systems.
While detailed, the guide might benefit from visual examples of correctly formatted elements. |
latex, academic-writing, paper-formatting, conference-submission, ieee |
2401.13307
Report |
ChatterBox: Multi-round Multimodal Referring and Grounding |
Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye |
In this study, we establish a baseline for a new task named multimodal
multi-round referring and grounding (MRG), opening up a promising direction for
instance-level multimodal dialogues. We present a new benchmark and an
efficient vision-language model for this purpose. The new benchmark, named
CB-300K, spans challenges including multi-round dialogue, complex spatial
relationships among multiple instances, and consistent reasoning, which are
beyond those shown in existing benchmarks. The proposed model, named
ChatterBox, utilizes a two-branch architecture to collaboratively handle vision
and language tasks. By tokenizing instance regions, the language branch
acquires the ability to perceive referential information. Meanwhile, ChatterBox
feeds a query embedding in the vision branch to a token receiver for visual
grounding. A two-stage optimization strategy is devised, making use of both
CB-300K and auxiliary external data to improve the model's stability and
capacity for instance-level understanding. Experiments show that ChatterBox
outperforms existing models in MRG both quantitatively and qualitatively,
paving a new path towards multimodal dialogue scenarios with complicated and
precise interactions. Code, data, and model are available at:
https://github.com/sunsmarterjie/ChatterBox. |
This paper introduces a new task called multi-round multimodal referring and grounding (MRG) for instance-level multimodal dialogues and presents a new benchmark and an efficient vision-language model, ChatterBox, to facilitate research in this direction. |
A powerful multimodal agent should understand logically related questions and perform basic vision-aware tasks like referring and grounding, which few existing models can do effectively. |
The ChatterBox model employs a two-branch architecture, with one branch handling language logic and the other focusing on visual feature extraction and recognition for grounding. A two-stage optimization strategy leverages both the new benchmark data and auxiliary data to enhance the model's stability and instance-level understanding. |
ChatterBox outperforms previous models in MRG tasks both quantitatively and qualitatively, showing a better understanding of multi-round dialogues and reasoning.
The model effectively performs single-round referring expression and visual grounding tasks, surpassing prior models in benchmark evaluations.
Diagnostic studies confirm that the newly collected benchmark data and pronoun replacement during training contribute significantly to the model's improved performance in MRG tasks. |
The model's design, while effective for referring and grounding, requires further engineering to support tasks beyond these.
Future work could explore training a universal tokenizer for vision-language understanding to enhance the model's capabilities. |
multimodal dialogue, referring expression, visual grounding, vision-language model, instance-level understanding |
2401.13221
Report |
Unified-Width Adaptive Dynamic Network for All-In-One Image Restoration |
Yimin Xu, Nanxi Gao, Zhongyun Shan, Fei Chao, Rongrong Ji |
In contrast to traditional image restoration methods, all-in-one image
restoration techniques are gaining increased attention for their ability to
restore images affected by diverse and unknown corruption types and levels.
However, contemporary all-in-one image restoration methods overlook task-wise
difficulties and employ the same networks to reconstruct images afflicted by
diverse degradations. This practice leads to an underestimation of the task
correlations and suboptimal allocation of computational resources. To elucidate
task-wise complexities, we introduce a novel concept positing that intricate
image degradation can be represented in terms of elementary degradation.
Building upon this foundation, we propose an innovative approach, termed the
Unified-Width Adaptive Dynamic Network (U-WADN), consisting of two pivotal
components: a Width Adaptive Backbone (WAB) and a Width Selector (WS). The WAB
incorporates several nested sub-networks with varying widths, which facilitates
the selection of the most apt computations tailored to each task, thereby
striking a balance between accuracy and computational efficiency during
runtime. For different inputs, the WS automatically selects the most
appropriate sub-network width, taking into account both task-specific and
sample-specific complexities. Extensive experiments across a variety of image
restoration tasks demonstrate that the proposed U-WADN achieves better
performance while simultaneously reducing FLOPs by up to 32.3% and providing
approximately 15.7% real-time acceleration. The code has been made available
at https://github.com/xuyimin0926/U-WADN. |
This paper presents a novel Unified-Width Adaptive Dynamic Network (U-WADN) designed for all-in-one image restoration, dynamically allocating computational resources based on both task-specific and sample-specific difficulties. |
Current all-in-one image restoration methods treat all degradations equally, leading to suboptimal resource allocation. This paper introduces a method to assess and leverage task-wise complexity for improved efficiency. |
The U-WADN uses a Width Adaptive Backbone (WAB) with nested sub-networks of varying widths and a Width Selector (WS) to choose the appropriate sub-network for each sample based on its task and complexity. |
U-WADN outperforms state-of-the-art methods in PSNR/SSIM across five image restoration tasks, particularly excelling in complex tasks like dehazing and deraining.
It reduces FLOPs by up to 32.3% and delivers approximately 15.7% real-time acceleration compared to the baseline.
The proposed method allows for a flexible trade-off between performance and efficiency by adjusting the sparsity target. |
The current work focuses on 'noisy-rain-hazy' scenarios; exploring other restoration tasks is left for future work.
The selection of the optimal sparsity target is based on empirical analysis; developing a more systematic approach is desirable. |
image restoration, all-in-one network, dynamic neural network, resource allocation, task-specific complexity |
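The nested-width idea in the U-WADN record above can be illustrated with a small sketch. This is not the authors' implementation: the class `NestedWidthLinear`, the function `select_width`, the width ratios, and the averaging rule are all hypothetical stand-ins, and the real Width Selector is a learned component rather than a fixed heuristic. The sketch only shows how narrower sub-networks can reuse the leading units of a wider layer, so that an easier input is served with fewer computations.

```python
import random


class NestedWidthLinear:
    """Toy linear layer whose narrower sub-networks reuse the leading
    output units of the full layer (hypothetical illustration)."""

    def __init__(self, in_features, out_features, seed=0):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-1.0, 1.0) for _ in range(in_features)]
            for _ in range(out_features)
        ]

    def forward(self, x, width_ratio):
        # Only the first `active` output units are computed, so cost
        # scales with the selected width.
        active = max(1, int(len(self.weight) * width_ratio))
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight[:active]]


def select_width(task_difficulty, sample_difficulty, ratios=(0.25, 0.5, 1.0)):
    """Hypothetical selector: average two difficulty scores in [0, 1]
    and map the result to one of the available width ratios."""
    score = 0.5 * (task_difficulty + sample_difficulty)
    index = min(int(score * len(ratios)), len(ratios) - 1)
    return ratios[index]


layer = NestedWidthLinear(in_features=4, out_features=8)
x = [1.0, 2.0, 3.0, 4.0]
full = layer.forward(x, 1.0)
half = layer.forward(x, select_width(0.5, 0.5))
```

Because the sub-networks are nested, the half-width output equals the first half of the full-width output, which is what lets a single set of weights serve every width.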
2401.13203
Report |
Style-Consistent 3D Indoor Scene Synthesis with Decoupled Objects |
Yunfan Zhang, Hong Huang, Zhiwei Xiong, Zhiqi Shen, Guosheng Lin, Hao Wang, Nicholas Vun |
Controllable 3D indoor scene synthesis stands at the forefront of
technological progress, offering various applications like gaming, film, and
augmented/virtual reality. The capability to stylize and de-couple objects
within these scenarios is a crucial factor, providing an advanced level of
control throughout the editing process. This control extends not just to
manipulating geometric attributes like translation and scaling but also
includes managing appearances, such as stylization. Current methods for scene
stylization are limited to applying styles to the entire scene, without the
ability to separate and customize individual objects. Addressing the
intricacies of this challenge, we introduce a unique pipeline designed for
synthesizing 3D indoor scenes. Our approach involves strategically placing objects
within the scene, utilizing information from professionally designed bounding
boxes. Significantly, our pipeline prioritizes maintaining style consistency
across multiple objects within the scene, ensuring a cohesive and visually
appealing result aligned with the desired aesthetic. The core strength of our
pipeline lies in its ability to generate 3D scenes that are not only visually
impressive but also exhibit features like photorealism, multi-view consistency,
and diversity. These scenes are crafted in response to various natural language
prompts, demonstrating the versatility and adaptability of our model. |
This paper proposes a novel 3D indoor scene synthesis pipeline that generates decoupled mesh objects with consistent styles using text prompts or single-view images, allowing for individual object stylization and manipulation. |
Controllable 3D indoor scene synthesis is crucial for applications like gaming, film, and VR/AR, and this pipeline offers enhanced control over object stylization and placement within a scene. |
The pipeline utilizes mesh representations for objects, employs a cascaded stylization approach for multi-object style consistency, leverages ChatGPT for object placement reasoning based on bounding boxes, and allows for user control over object manipulation within the scene. |
The pipeline generates high-fidelity 3D indoor scenes with consistent styles across multiple objects.
It outperforms existing methods in terms of visual quality, style consistency, and user control, as demonstrated through qualitative and quantitative comparisons and user studies.
The decoupled mesh representation enables flexible object manipulation and scene editing capabilities. |
Further exploration of style supervision from the whole scene is needed.
Incorporating optimization algorithms for object arrangement, such as LEGO-Net, could enhance scene composition. |
3d scene synthesis, style transfer, mesh generation, text-to-3d, indoor scene understanding |
2401.13011
Report |
CCA: Collaborative Competitive Agents for Image Editing |
Tiankai Hang, Shuyang Gu, Dong Chen, Xin Geng, Baining Guo |
This paper presents a novel generative model, Collaborative Competitive
Agents (CCA), which leverages the capabilities of multiple Large Language
Model (LLM)-based agents to execute complex tasks. Drawing inspiration from
Generative Adversarial Networks (GANs), the CCA system employs two equal-status
generator agents and a discriminator agent. The generators independently
process user instructions and generate results, while the discriminator
evaluates the outputs, and provides feedback for the generator agents to
further reflect and improve the generation results. Unlike previous
generative models, our system can obtain the intermediate steps of generation.
This allows each generator agent to learn from other successful executions due
to its transparency, enabling a collaborative competition that enhances the
quality and robustness of the system's results. The primary focus of this study
is image editing, demonstrating the CCA's ability to handle intricate
instructions robustly. The paper's main contributions include the introduction
of a multi-agent-based generative model with controllable intermediate steps
and iterative optimization, a detailed examination of agent relationships, and
comprehensive experiments on image editing. Code is available at
https://github.com/TiankaiHang/CCA. |
This paper introduces Collaborative Competitive Agents (CCA), a novel generative model leveraging multiple Large Language Models (LLMs) as agents to perform complex tasks, particularly image editing. |
Existing generative models struggle with complex, compound tasks and lack transparency in the generation process, hindering learning from other models. CCA addresses these challenges. |
Inspired by GANs, CCA uses two generator agents and one discriminator agent. Generators process instructions and produce results, while the discriminator evaluates and provides feedback. This process iterates until satisfactory results are achieved. |
CCA demonstrates robust handling of intricate image editing instructions, outperforming previous methods.
The study highlights the importance of collaboration and competition among agents for improved results.
A hierarchical tool configuration enables effective tool utilization by the agents. |
The current implementation primarily focuses on image editing, with potential for broader applications.
Future work can explore optimizing agent communication and feedback mechanisms for enhanced efficiency. |
generative models, multi-agent systems, large language models, image editing, collaboration and competition |
2401.12979
Report |
GALA: Generating Animatable Layered Assets from a Single Scan |
Taeksoo Kim, Byungjun Kim, Shunsuke Saito, Hanbyul Joo |
We present GALA, a framework that takes as input a single-layer clothed 3D
human mesh and decomposes it into complete multi-layered 3D assets. The outputs
can then be combined with other assets to create novel clothed human avatars
with any pose. Existing reconstruction approaches often treat clothed humans as
a single layer of geometry and overlook the inherent compositionality of humans
with hairstyles, clothing, and accessories, thereby limiting the utility of the
meshes for downstream applications. Decomposing a single-layer mesh into
separate layers is a challenging task because it requires the synthesis of
plausible geometry and texture for the severely occluded regions. Moreover,
even with successful decomposition, meshes are not normalized in terms of poses
and body shapes, which prevents coherent composition with novel identities and poses.
To address these challenges, we propose to leverage the general knowledge of a
pretrained 2D diffusion model as geometry and appearance prior for humans and
other assets. We first separate the input mesh using the 3D surface
segmentation extracted from multi-view 2D segmentations. Then we synthesize the
missing geometry of different layers in both posed and canonical spaces using a
novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete
inpainting high-fidelity 3D geometry, we also apply the same SDS loss to its
texture to obtain the complete appearance including the initially occluded
regions. Through a series of decomposition steps, we obtain multiple layers of
3D assets in a shared canonical space normalized in terms of poses and human
shapes, hence supporting effortless composition to novel identities and
reanimation with novel poses. Our experiments demonstrate the effectiveness of
our approach for decomposition, canonicalization, and composition tasks
compared to existing solutions. |
GALA decomposes a single-layer clothed 3D human scan into complete multi-layered 3D assets, enabling 3D garment transfer and avatar customization in any pose. |
Existing 3D human reconstruction methods often produce single-layer meshes, limiting their use in applications like virtual try-on or avatar customization that require layered and animatable assets. |
The method leverages a pre-trained 2D diffusion model as a geometry and appearance prior. It separates the input mesh using multi-view 2D segmentation and synthesizes missing geometry in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Texture inpainting using SDS completes the appearance. |
Outperforms state-of-the-art text-driven 3D editing methods in decomposition tasks.
Enables robust canonicalization of clothed humans from a single scan, surpassing existing methods.
Successfully transfers garments and reposes decomposed assets to create novel, animatable avatars. |
Currently generates a static canonical shape, limiting the accurate reposing of loose clothing.
Relies on accurate 2D segmentation, which can be a bottleneck. |
3d garment transfer, avatar customization, 3d decomposition, score distillation sampling, diffusion models |
2401.12978
Report |
Zero-Shot Learning for the Primitives of 3D Affordance in General Objects |
Hyeonwoo Kim, Sookwan Han, Patrick Kwon, Hanbyul Joo |
One of the major challenges in AI is teaching machines to precisely respond to
and utilize environmental functionalities, thereby achieving the affordance
awareness that humans possess. Despite its importance, the field has been
lagging in terms of learning, especially in 3D, as annotating affordance
is a laborious process due to the numerous variations of human-object
interaction. The low availability of affordance data limits the learning in
terms of generalization for object categories, and also simplifies the
representation of affordance, capturing only a fraction of the affordance. To
overcome these challenges, we propose a novel, self-supervised method to
generate the 3D affordance examples given only a 3D object, without any manual
annotations. The method starts by capturing the 3D object into images and
creating 2D affordance images by inserting humans into the image via inpainting
diffusion models, where we present the Adaptive Mask algorithm to enable human
insertion without altering the original details of the object. The method
subsequently lifts inserted humans back to 3D to create 3D human-object pairs,
where the depth ambiguity is resolved within a depth optimization framework
that utilizes pre-generated human postures from multiple viewpoints. We also
provide a novel affordance representation defined on relative orientations and
proximity between dense human and object points, that can be easily aggregated
from any 3D HOI datasets. The proposed representation serves as a primitive
that can be manifested to conventional affordance representations via simple
transformations, ranging from physically exerted affordances to nonphysical
ones. We demonstrate the efficacy of our method and representation by
generating the 3D affordance samples and deriving high-quality affordance
examples from the representation, including contact, orientation, and spatial
occupancies. |
This paper introduces a novel self-supervised method for generating 3D affordance examples and a new primitive representation for 3D affordance, enabling zero-shot learning of object functionality from 3D objects. |
Current affordance learning methods struggle with generalization to diverse interactions and limited data availability. This work aims to overcome these challenges by generating affordance data without manual annotation and utilizing a richer representation. |
The method generates 2D affordance examples by inserting humans into object renderings using inpainting diffusion models with a novel Adaptive Mask algorithm. These 2D examples are then lifted to 3D using human pose estimation and depth optimization. A new affordance representation based on relative orientations and proximity between human and object points is proposed. |
Adaptive Mask Inpainting preserves original object details during human insertion, leading to more realistic affordance examples.
Depth optimization using multiview cues significantly improves the quality of 3D affordance samples.
The proposed primitive representation can effectively derive various affordance cues like contact, orientation tendency, and spatial occupancy. |
The method might exhibit spatial bias inherited from the inpainting diffusion models.
Modeling dexterous interactions, like grasping, remains challenging due to limitations in diffusion and 3D human prediction models. |
affordance learning, zero-shot learning, 3d vision, human-object interaction, diffusion models |
2401.12945
Report |
Lumiere: A Space-Time Diffusion Model for Video Generation |
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri |
We introduce Lumiere -- a text-to-video diffusion model designed for
synthesizing videos that portray realistic, diverse and coherent motion -- a
pivotal challenge in video synthesis. To this end, we introduce a Space-Time
U-Net architecture that generates the entire temporal duration of the video at
once, through a single pass in the model. This is in contrast to existing video
models which synthesize distant keyframes followed by temporal super-resolution
-- an approach that inherently makes global temporal consistency difficult to
achieve. By deploying both spatial and (importantly) temporal down- and
up-sampling and leveraging a pre-trained text-to-image diffusion model, our
model learns to directly generate a full-frame-rate, low-resolution video by
processing it in multiple space-time scales. We demonstrate state-of-the-art
text-to-video generation results, and show that our design easily facilitates a
wide range of content creation tasks and video editing applications, including
image-to-video, video inpainting, and stylized generation. |
Introduces Lumiere, a text-to-video diffusion model that synthesizes videos with realistic, diverse, and coherent motion by generating the entire temporal duration at once using a Space-Time U-Net (STUNet) architecture. |
Addresses the limitations of existing video models that rely on temporal super-resolution, which hinders global temporal consistency and realistic motion generation. |
Employs a STUNet that downsamples in both space and time, processes information in a compact representation, and leverages a pre-trained text-to-image diffusion model. It utilizes Multidiffusion for temporally consistent spatial super-resolution. |
Achieves state-of-the-art text-to-video generation with superior motion quality.
Facilitates various video content creation tasks like image-to-video, video inpainting, and stylized generation.
Demonstrates consistent video editing capabilities using off-the-shelf editing methods like SDEdit. |
Limited to generating single-shot videos without scene transitions.
Relies on a pixel-space T2I model, necessitating a spatial super-resolution module. |
text-to-video generation, diffusion models, space-time u-net, video inpainting, stylized video generation |
2401.12915
Report |
Red Teaming Visual Language Models |
Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu |
VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language
Models) to accept multimodal inputs. Since it has been verified that LLMs can
be induced to generate harmful or inaccurate content through specific test
cases (termed as Red Teaming), how VLMs perform in similar scenarios,
especially with their combination of textual and visual inputs, remains a
question. To explore this problem, we present a novel red teaming dataset
RTVLM, which encompasses 10 subtasks (e.g., image misleading, multi-modal
jail-breaking, face fairness, etc.) under 4 primary aspects (faithfulness,
privacy, safety, fairness). Our RTVLM is the first red-teaming dataset to
benchmark current VLMs in terms of these 4 different aspects. Detailed analysis
shows that 10 prominent open-sourced VLMs struggle with red teaming to
different degrees and have a performance gap of up to 31% with GPT-4V. Additionally,
we simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning
(SFT) using RTVLM, and this bolsters the model's performance by 10% on the RTVLM
test set and 13% on MM-Hal, without a noticeable decline on MM-Bench,
surpassing other LLaVA-based models with regular alignment data. This reveals
that current open-sourced VLMs still lack red teaming alignment. Our code and
datasets will be open-sourced. |
This paper introduces RTVLM, the first red teaming dataset for vision-language models (VLMs) focusing on vulnerabilities in image-text understanding. |
VLMs, combining text and image processing, raise safety and ethical concerns, requiring a systematic benchmark like RTVLM for evaluation and improvement. |
RTVLM comprises 5,200 image-question pairs across 10 subtasks under faithfulness, privacy, safety, and fairness categories, annotated by humans and GPT-4. |
Open-sourced VLMs significantly lag behind GPT-4V in handling red teaming scenarios, showing up to a 31% performance gap.
VLMs are particularly susceptible to misleading information presented through images.
Current VLMs lack adequate alignment for red teaming, highlighting the need for dedicated training data. |
The current version of RTVLM primarily focuses on English-based question-image pairs.
Future work should explore more complex and subtle red teaming scenarios. |
vision-language models, red teaming, benchmarking, safety, fairness |
2401.12902
Report |
Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning? |
Cheng Han, Qifan Wang, Yiming Cui, Wenguan Wang, Lifu Huang, Siyuan Qi, Dongfang Liu |
As the scale of vision models continues to grow, the emergence of Visual
Prompt Tuning (VPT) as a parameter-efficient transfer learning technique has
gained attention due to its superior performance compared to traditional
full-finetuning. However, the conditions favoring VPT (the "when") and the
underlying rationale (the "why") remain unclear. In this paper, we conduct a
comprehensive analysis across 19 distinct datasets and tasks. To understand the
"when" aspect, we identify the scenarios where VPT proves favorable by two
dimensions: task objectives and data distributions. We find that VPT is
preferable when there is 1) a substantial disparity between the original and
the downstream task objectives (e.g., transitioning from classification to
counting), or 2) a similarity in data distributions between the two tasks
(e.g., both involve natural images). In exploring the "why" dimension, our
results indicate VPT's success cannot be attributed solely to overfitting and
optimization considerations. The unique way VPT preserves original features and
adds parameters appears to be a pivotal factor. Our study provides insights
into VPT's mechanisms, and offers guidance for its optimal utilization. |
This paper investigates when and why visual prompt tuning (VPT) outperforms full finetuning (FT) in transfer learning for vision tasks. |
Understanding the conditions favoring VPT over traditional FT is crucial for efficient transfer learning in large-scale vision models. |
The authors conduct experiments on 19 datasets from VTAB-1k, analyzing the impact of task objectives, data distributions, and dataset size on the performance of VPT and FT. |
VPT is preferred when there's a large disparity between original and downstream task objectives or high similarity in data distributions, especially with limited data.
Overfitting doesn't fully explain VPT's success, and additional parameters alone don't guarantee better optimization.
Preserving original features while adding task-specific parameters is crucial for VPT's effectiveness. |
The study focuses on image classification, limiting generalizability to other vision tasks.
Further exploration of visual explanations for VPT's advantage is needed. |
visual prompt tuning, full finetuning, transfer learning, vision models, parameter efficiency |
2401.12900
Report |
PSAvatar: A Point-based Morphable Shape Model for Real-Time Head Avatar Animation with 3D Gaussian Splatting |
Zhongyuan Zhao, Zhenyu Bao, Qing Li, Guoping Qiu, Kanglin Liu |
Despite much progress, achieving real-time high-fidelity head avatar
animation is still difficult and existing methods have to trade off between
speed and quality. 3DMM based methods often fail to model non-facial structures
such as eyeglasses and hairstyles, while neural implicit models suffer from
deformation inflexibility and rendering inefficiency. Although 3D Gaussian has
been demonstrated to possess promising capability for geometry representation
and radiance field reconstruction, applying 3D Gaussian in head avatar creation
remains a major challenge since it is difficult for 3D Gaussian to model the
head shape variations caused by changing poses and expressions. In this paper,
we introduce PSAvatar, a novel framework for animatable head avatar creation
that utilizes discrete geometric primitives to create a parametric morphable
shape model and employs 3D Gaussian for fine detail representation and high
fidelity rendering. The parametric morphable shape model is a Point-based
Morphable Shape Model (PMSM) which uses points instead of meshes for 3D
representation to achieve enhanced representation flexibility. The PMSM first
converts the FLAME mesh to points by sampling on the surfaces as well as off
the meshes to enable the reconstruction of not only surface-like structures but
also complex geometries such as eyeglasses and hairstyles. By aligning these
points with the head shape in an analysis-by-synthesis manner, the PMSM makes
it possible to utilize 3D Gaussian for fine detail representation and
appearance modeling, thus enabling the creation of high-fidelity avatars. We
show that PSAvatar can reconstruct high-fidelity head avatars of a variety of
subjects and the avatars can be animated in real-time (≥ 25 fps at a
resolution of 512 × 512). |
This paper introduces PSAvatar, a novel framework for creating animatable head avatars that combines a point-based morphable shape model (PMSM) with 3D Gaussian representation. |
Achieving real-time high-fidelity head avatar animation is challenging due to trade-offs between speed and quality in existing methods. This method aims to overcome limitations in modeling non-facial features and improve rendering efficiency. |
A PMSM is built upon FLAME to model shape variations from pose and expressions, utilizing points for flexible 3D representation. Then, 3D Gaussians are employed for fine detail representation and appearance modeling during rendering. |
PSAvatar reconstructs high-fidelity head avatars, accurately capturing complex geometries like hair strands and eyeglasses.
The method enables real-time animation of the avatars (≥ 25 fps at 512 × 512 resolution).
Quantitative and qualitative evaluations demonstrate superior performance compared to state-of-the-art methods like IMAvatar, INSTA, and PointAvatar. |
The reliance on FLAME for initialization may limit the reconstruction of highly unstructured hairstyles.
Future work could explore personalized PMSM initialization to improve representation capability further. |
head avatar, 3d gaussian, point-based morphable shape model, real-time animation, high-fidelity rendering |
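The PSAvatar record above describes converting a mesh to points by sampling both on the surface and off it. A minimal sketch of that step for a single triangle follows. It is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, uniform barycentric sampling is a standard technique chosen here for convenience, and generating off-mesh points by offsetting along the face normal is an assumption about how "off the meshes" sampling might be realized.

```python
import math
import random


def sample_point_on_triangle(a, b, c, rng):
    """Uniformly sample a point inside triangle (a, b, c) using
    square-root barycentric coordinates."""
    r1, r2 = rng.random(), rng.random()
    s = math.sqrt(r1)
    u, v, w = 1.0 - s, s * (1.0 - r2), s * r2
    return tuple(u * pa + v * pb + w * pc for pa, pb, pc in zip(a, b, c))


def unit_normal(a, b, c):
    """Unit normal of triangle (a, b, c) from the cross product of two edges."""
    e1 = [q - p for p, q in zip(a, b)]
    e2 = [q - p for p, q in zip(a, c)]
    n = (
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    )
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:
        raise ValueError("degenerate triangle has no normal")
    return tuple(x / length for x in n)


def sample_on_and_off(a, b, c, offsets, rng):
    """Return one surface sample shifted by each offset along the normal.
    An offset of 0.0 keeps the point on the surface (assumed scheme)."""
    p = sample_point_on_triangle(a, b, c, rng)
    n = unit_normal(a, b, c)
    return [tuple(pi + d * ni for pi, ni in zip(p, n)) for d in offsets]


triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
points = sample_on_and_off(*triangle, offsets=[0.0, 0.1], rng=random.Random(0))
```

The off-surface points give the representation room to cover structures such as hair or eyeglasses that do not lie on the base mesh.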
2401.12596
Report |
UniHDA: A Unified and Versatile Framework for Multi-Modal Hybrid Domain Adaptation |
Hengjia Li, Yang Liu, Yuqi Lin, Zhanwei Zhang, Yibo Zhao, Weihang Pan, Tu Zheng, Zheng Yang, Yuchun Jiang, Boxi Wu, Deng Cai |
Recently, generative domain adaptation has achieved remarkable progress,
enabling us to adapt a pre-trained generator to a new target domain. However,
existing methods simply adapt the generator to a single target domain and are
limited to a single modality, either text-driven or image-driven. Moreover,
they cannot maintain good consistency with the source domain, which impedes the
inheritance of the diversity. In this paper, we propose UniHDA, a
unified and versatile framework for generative hybrid domain
adaptation with multi-modal references from multiple domains. We use the CLIP
encoder to project multi-modal references into a unified embedding space and
then linearly interpolate the direction vectors from multiple target domains to
achieve hybrid domain adaptation. To ensure consistency with the
source domain, we propose a novel cross-domain spatial structure (CSS) loss
that maintains detailed spatial structure information between the source and target
generators. Experiments show that the adapted generator can synthesise realistic
images with various attribute compositions. Additionally, our framework is
generator-agnostic and versatile to multiple generators, e.g., StyleGAN, EG3D,
and Diffusion Models. |
This paper introduces UniHDA, a unified and versatile framework for multi-modal hybrid domain adaptation in generative models. |
Existing methods are limited to adapting to a single target domain and modality, often overfitting to domain-specific attributes and failing to maintain consistency with the source domain. UniHDA addresses these limitations by enabling adaptation to hybrid domains with multi-modal references (text and image) while preserving source domain diversity. |
UniHDA leverages CLIP to project multi-modal references into a unified embedding space. It then linearly interpolates direction vectors in this space to achieve hybrid domain adaptation. To maintain consistency, UniHDA introduces a cross-domain spatial structure loss that preserves detailed spatial information between source and target generators. |
UniHDA successfully adapts pre-trained generators (StyleGAN, Diffusion models, EG3D) to hybrid domains, synthesizing realistic images with integrated characteristics from multiple domains.
It outperforms existing methods in terms of both generation quality (e.g., CLIP Score, Structural Consistency Score) and efficiency (model size and training time).
The proposed cross-domain spatial structure loss is shown to be crucial for maintaining consistency and inheriting diversity from the source domain. |
UniHDA's reliance on CLIP during training might introduce potential bias for some domains.
Future work could focus on eliminating this bias and further exploring multi-modal hybrid domain adaptation. |
generative domain adaptation, multi-modal adaptation, hybrid domain adaptation, generative models, clip |
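The interpolation step in the UniHDA record above can be sketched with plain vectors. This is a toy illustration only: the two-dimensional embeddings are made-up numbers standing in for CLIP embeddings, the function names are hypothetical, and the choice of weights that sum to 1 is an assumption rather than something the record specifies.

```python
def direction(source_embedding, target_embedding):
    """Direction vector from the source domain to one target domain."""
    return [t - s for s, t in zip(source_embedding, target_embedding)]


def interpolate_directions(directions, weights):
    """Weighted linear combination of per-domain direction vectors."""
    if len(directions) != len(weights):
        raise ValueError("need exactly one weight per direction")
    dim = len(directions[0])
    return [
        sum(w * d[i] for w, d in zip(weights, directions))
        for i in range(dim)
    ]


# Made-up embeddings; in practice these would come from an image or text encoder.
source = [1.0, 0.0]
target_a = [2.0, 0.0]   # e.g. an image reference for one domain
target_b = [1.0, 2.0]   # e.g. a text reference for another domain

hybrid = interpolate_directions(
    [direction(source, target_a), direction(source, target_b)],
    weights=[0.5, 0.5],
)
```

Changing the weights shifts the hybrid direction toward one target domain or the other, which is the knob the record refers to as attribute composition.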
2401.12592
Report |
RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos |
Hongchi Xia, Yang Fu, Sifei Liu, Xiaolong Wang |
We introduce a new RGB-D object dataset captured in the wild called
WildRGB-D. Unlike most existing real-world object-centric datasets which only
come with RGB capturing, the direct capture of the depth channel allows better
3D annotations and broader downstream applications. WildRGB-D comprises
large-scale category-level RGB-D object videos, which are taken using an iPhone
to go around the objects in 360 degrees. It contains around 8500 recorded
objects and nearly 20000 RGB-D videos across 46 common object categories. These
videos are taken with diverse cluttered backgrounds with three setups to cover
as many real-world scenarios as possible: (i) a single object in one video;
(ii) multiple objects in one video; and (iii) an object with a static hand in
one video. The dataset is annotated with object masks, real-world scale camera
poses, and reconstructed aggregated point clouds from RGB-D videos. We benchmark
four tasks with WildRGB-D including novel view synthesis, camera pose
estimation, object 6D pose estimation, and object surface reconstruction. Our
experiments show that the large-scale capture of RGB-D objects provides a large
potential to advance 3D object learning. Our project page is
https://wildrgbd.github.io/. |
This paper introduces WildRGB-D, a novel large-scale RGB-D object dataset captured in the wild, featuring 8500 tabletop objects across 46 categories in 20K videos with 360-degree views. |
Existing real-world object datasets often lack depth information, limiting 3D annotation accuracy and downstream applications. WildRGB-D addresses this gap by providing real-world scale camera poses, object masks, and point clouds, enabling advancements in 3D object learning. |
The dataset was created by capturing RGB-D videos of objects using iPhones. Automatic annotations were generated using SLAM algorithms for camera poses and point clouds, and a combination of Grounding-DINO, Segment-Anything, and XMem for object masks. |
Depth information in WildRGB-D consistently improves novel view synthesis, especially for generalizable NeRF models.
WildRGB-D enables learning generalizable camera pose estimation models that perform well on unseen object categories.
The dataset facilitates accurate object surface reconstruction, with depth information significantly boosting performance and SDF-based methods showing superior results. |
Current WildRGB-D lacks object 6D pose annotations, which are planned for future crowdsourcing efforts.
Further exploration is needed to address the limitations of translation prediction in camera pose estimation observed in the experiments. |
rgb-d dataset, object recognition, 3d object learning, novel view synthesis, camera pose estimation |
2401.12511
Report |
Convolutional Initialization for Data-Efficient Vision Transformers |
Jianqiao Zheng, Xueqian Li, Simon Lucey |
Training vision transformer networks on small datasets poses challenges. In
contrast, convolutional neural networks (CNNs) can achieve state-of-the-art
performance by leveraging their architectural inductive bias. In this paper, we
investigate whether this inductive bias can be reinterpreted as an
initialization bias within a vision transformer network. Our approach is
motivated by the finding that random impulse filters can achieve almost
comparable performance to learned filters in CNNs. We introduce a novel
initialization strategy for transformer networks that can achieve comparable
performance to CNNs on small datasets while preserving its architectural
flexibility. |
This paper introduces a novel initialization strategy for Vision Transformer (ViT) networks, drawing inspiration from the effectiveness of random impulse filters in Convolutional Neural Networks (CNNs). |
ViTs often struggle with small datasets compared to CNNs due to CNNs' inherent architectural inductive bias. This work aims to bridge this performance gap by reinterpreting CNNs' inductive bias as an initialization bias within ViTs, thereby improving their data efficiency. |
The authors analyze the performance of various spatial mixing filters in ConvMixer, showing that random impulse filters can achieve competitive results. Based on this, they propose initializing the attention maps of ViTs as random impulse convolution filters. They evaluate different ViT model variations and compare their impulse initialization with random and mimetic initializations. |
Random impulse filters are as effective as learned filters in ConvMixer when only channel mixing is learned, as long as linear independence and redundancy in channels are met.
Initializing ViT attention maps as random impulse convolution filters significantly improves performance on small datasets like CIFAR-10, CIFAR-100, and SVHN, surpassing both random and mimetic initializations.
The proposed impulse initialization also leads to faster convergence compared to other initialization methods. |
Determining the optimal scale of self-attention and weight normalization hyperparameters for the initialization process is challenging.
Adapting the impulse initialization strategy to the original ViT structure without the proposed modifications requires further investigation. |
vision transformer, convolutional neural network, initialization, inductive bias, data efficiency |
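The impulse initialization summarized in the record above can be illustrated with a small sketch. It only shows the target pattern: a random impulse filter (all zeros except a single 1) and the attention matrix over a token grid that would reproduce the same spatial shift. The function names, the zero-padding choice, and the cross-correlation convention are assumptions made for illustration; how the paper maps such a pattern onto query/key weights is not described in this summary and is not shown here.

```python
import random


def random_impulse_filter(k, rng):
    """Return a k x k filter with a single 1 at a random position,
    plus the offset of that position relative to the filter centre."""
    i, j = rng.randrange(k), rng.randrange(k)
    kernel = [[0.0] * k for _ in range(k)]
    kernel[i][j] = 1.0
    return kernel, (i - k // 2, j - k // 2)


def impulse_attention_map(height, width, offset):
    """Build an (H*W) x (H*W) matrix in which each token attends only to
    the token shifted by `offset` (cross-correlation convention).
    Tokens whose source falls outside the grid get an all-zero row,
    which corresponds to zero padding (an assumption of this sketch)."""
    dy, dx = offset
    n = height * width
    attn = [[0.0] * n for _ in range(n)]
    for y in range(height):
        for x in range(width):
            sy, sx = y + dy, x + dx
            if 0 <= sy < height and 0 <= sx < width:
                attn[y * width + x][sy * width + sx] = 1.0
    return attn


rng = random.Random(0)
kernel, offset = random_impulse_filter(3, rng)
attn = impulse_attention_map(4, 4, offset)  # 16 x 16 target attention pattern
```

Each attention head would receive its own random offset, so that different heads look at different neighbouring tokens, mirroring the different channels of a bank of impulse filters.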
2401.12503
Report |
Small Language Model Meets with Reinforced Vision Vocabulary |
Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang |
Playing Large Vision Language Models (LVLMs) in 2023 is trendy among the AI
community. However, the relatively large number of parameters (more than 7B) of
popular LVLMs makes it difficult to train and deploy on consumer GPUs,
discouraging many researchers with limited resources. Imagine how cool it would
be to experience all the features of current LVLMs on an old GTX1080ti (our
only game card). Accordingly, we present Vary-toy in this report, a small-size
Vary along with Qwen-1.8B as the base ``large'' language model. In Vary-toy, we
introduce an improved vision vocabulary, allowing the model to not only possess
all features of Vary but also gather more generality. Specifically, we replace
negative samples of natural images with positive sample data driven by object
detection in the procedure of generating vision vocabulary, more sufficiently
utilizing the capacity of the vocabulary network and enabling it to efficiently
encode visual information corresponding to natural objects. For experiments,
Vary-toy can achieve 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1%
accuracy on RefCOCO, and 29% on MMVet. The code will be publicly available on
the homepage. |
This paper introduces Vary-toy, a small-size Large Vision Language Model (LVLM) based on Qwen-1.8B, designed to be trained and deployed on consumer GPUs while retaining the features of larger LVLMs. |
Existing LVLMs often have a large number of parameters, making them difficult to train and deploy on consumer-grade hardware. Vary-toy addresses this issue by providing a smaller model that can be utilized by researchers with limited resources. |
The authors propose Vary-tiny+, an improved vision vocabulary generation pipeline that incorporates both dense textual data and natural object location data, enhancing the model's ability to encode visual information. They combine this vocabulary with a 1.8B language model to create Vary-toy. |
Vary-toy achieves 65.6% ANLS on DocVQA, comparable to the 7B Qwen-VL-chat.
It attains 59.1% accuracy on ChartQA, surpassing the 7B mPLUG-DocOwl.
Vary-toy achieves 88.1% accuracy on RefCOCO val, on par with the 7B Qwen-VL-chat. |
The generation ability of the 1.8B model is relatively poor and needs to be strengthened.
The authors suggest exploring the potential of replacing CLIP by adding a large amount of weakly labeled image caption data during the vision vocabulary generation process. |
large vision language models, vision vocabulary, object detection, document ocr, resource-constrained environments |
2401.12425
Report |
The Neglected Tails in Vision-Language Models |
Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong |
Vision-language models (VLMs) excel in zero-shot recognition but their
performance varies greatly across different visual concepts. For example,
although CLIP achieves impressive accuracy on ImageNet (60-80%), its
performance drops below 10% for more than ten concepts like night snake,
presumably due to their limited presence in the pretraining data. However,
measuring the frequency of concepts in VLMs' large-scale datasets is
challenging. We address this by using large language models (LLMs) to count the
number of pretraining texts that contain synonyms of these concepts. Our
analysis confirms that popular datasets, such as LAION, exhibit a long-tailed
concept distribution, yielding biased performance in VLMs. We also find that
downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and
text-to-image models (e.g., Stable Diffusion), often fail to recognize or
generate images of rare concepts identified by our method. To mitigate the
imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented
Learning (REAL). First, instead of prompting VLMs using the original class
names, REAL uses their most frequent synonyms found in pretraining texts. This
simple change already outperforms costly human-engineered and LLM-enriched
prompts over nine benchmark datasets. Second, REAL trains a linear classifier
on a small yet balanced set of pretraining data retrieved using concept
synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage
and 10,000x less training time! |
The paper investigates the long-tailed issue in vision-language models (VLMs) and proposes Retrieval-Augmented Learning (REAL) to improve zero-shot recognition. |
VLMs, despite their strong capabilities, often exhibit biased performance due to the long-tailed concept distribution in their pretraining data. |
The paper uses LLMs to estimate concept frequency in VLM pretraining data and proposes two REAL variants: REAL-Prompt (uses the most frequent concept synonym in prompts) and REAL-Linear (trains a linear classifier on a balanced subset of retrieved pretraining data). |
REAL-Prompt outperforms existing prompting methods by simply replacing concept names with their most frequent synonyms.
REAL-Linear achieves state-of-the-art zero-shot recognition accuracy, surpassing previous methods while using significantly less storage and training time.
REAL improves both head and tail class accuracy and can be combined with existing prompting and retrieval-augmented methods for even better performance. |
The concept frequency estimation method's precision and recall cannot be accurately evaluated due to the lack of ground-truth annotations in pretraining data.
The frequency estimation relies solely on textual captions and may miss visual concepts present in images but not explicitly mentioned in captions. |
vision-language models, zero-shot learning, long-tail distribution, retrieval-augmented learning, prompt engineering |
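The REAL-Prompt idea in the record above (prompting with the most frequent synonym instead of the original class name) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt template, and the synonym counts are all invented for the example, and in the paper the counts come from pretraining captions rather than a hand-written dictionary.

```python
def most_frequent_synonym(synonym_counts):
    """Return the synonym with the highest caption count."""
    return max(synonym_counts, key=synonym_counts.get)


def build_prompts(concept_synonym_counts, template="a photo of a {}."):
    """Map each original class name to a prompt built from its most
    frequent synonym (the template is a hypothetical choice)."""
    return {
        concept: template.format(most_frequent_synonym(counts))
        for concept, counts in concept_synonym_counts.items()
    }


# Counts below are invented purely for illustration.
toy_counts = {
    "cash machine": {"cash machine": 120, "cash dispenser": 35, "ATM": 9500},
    "laptop": {"laptop": 8000, "notebook computer": 300},
}
prompts = build_prompts(toy_counts)
```

The resulting prompts would then be passed to the VLM's text encoder in place of prompts built from the original class names.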
2401.12233
Report |
Memorization in Self-Supervised Learning Improves Downstream Generalization |
Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch |
Self-supervised learning (SSL) has recently received significant attention
due to its ability to train high-performance encoders purely on unlabeled
data, often scraped from the internet. This data can still be sensitive and
empirical evidence suggests that SSL encoders memorize private information of
their training data and can disclose them at inference time. Since existing
theoretical definitions of memorization from supervised learning rely on
labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a
framework for defining memorization within SSL. Our definition compares the
difference in alignment of representations for data points and their augmented
views returned by both encoders that were trained on these data points and
encoders that were not. Through comprehensive empirical analysis on diverse
encoder architectures and datasets we highlight that even though SSL relies on
large datasets and strong augmentations (both known in supervised learning as
regularization techniques that reduce overfitting), significant fractions of
training data points still experience high memorization. Through our empirical
results, we show that this memorization is essential for encoders to achieve
higher generalization performance on different downstream tasks. |
This paper proposes SSLMem, a novel framework for defining and analyzing memorization in self-supervised learning (SSL) encoders. |
Memorization in SSL is unexplored, and existing definitions from supervised learning rely on labels, making them unsuitable for SSL. |
SSLMem leverages data augmentations and alignment, common elements in SSL, to quantify memorization by comparing alignment differences between encoders trained with and without specific data points. Extensive experiments were conducted across various architectures, SSL methods, and datasets. |
Significant memorization exists in SSL encoders, especially for atypical data points, similar to observations in supervised learning.
SSL methods and architectures exhibit consistent memorization patterns, differing from those in supervised learning.
Memorization in SSL encoders is crucial for downstream generalization across diverse tasks and data distributions, highlighting its importance for SSL's success. |
The theoretical link between memorization and generalization in SSL needs further investigation.
Exploring approaches to mitigate privacy risks associated with memorization in SSL is crucial. |
self-supervised learning, memorization, representation learning, generalization, data augmentation |
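The alignment-based definition in the record above can be made concrete with a toy sketch. It assumes that alignment is measured as the mean pairwise Euclidean distance between the representations of a point's augmented views, and that the memorization score is the difference between an encoder trained without the point and one trained with it. The distance, the lack of normalization, and the toy encoders are assumptions of this sketch and may differ from the paper's exact definition.

```python
import math
from itertools import combinations


def l2(u, v):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def alignment_distance(encoder, views):
    """Mean pairwise distance between representations of augmented views.
    Smaller values mean the views are better aligned."""
    reps = [encoder(v) for v in views]
    pairs = list(combinations(reps, 2))
    return sum(l2(u, v) for u, v in pairs) / len(pairs)


def memorization_score(encoder_with, encoder_without, views):
    """Positive when the encoder trained on the point aligns its
    augmented views more tightly than the encoder that never saw it."""
    return (alignment_distance(encoder_without, views)
            - alignment_distance(encoder_with, views))


# Toy stand-ins for real encoders (hypothetical): the "trained with"
# encoder collapses the views of this point much closer together.
encoder_with = lambda v: [0.1 * t for t in v]
encoder_without = lambda v: list(v)
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # augmented views of one point
score = memorization_score(encoder_with, encoder_without, views)
```

In practice both encoders would be full SSL models trained on datasets that differ only in whether the candidate points are included, and the score would be averaged over several training runs.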
2401.12217
Report |
Exploring Simple Open-Vocabulary Semantic Segmentation |
Zihang Lai |
Open-vocabulary semantic segmentation models aim to accurately assign a
semantic label to each pixel in an image from a set of arbitrary
open-vocabulary texts. In order to learn such pixel-level alignment, current
approaches typically rely on a combination of (i) image-level VL model (e.g.
CLIP), (ii) ground truth masks, and (iii) custom grouping encoders. In this
paper, we introduce S-Seg, a novel model that can achieve surprisingly strong
performance without depending on any of the above elements. S-Seg leverages
pseudo-mask and language to train a MaskFormer, and can be easily trained from
publicly available image-text datasets. Contrary to prior works, our model
directly trains for pixel-level features and language alignment. Once trained,
S-Seg generalizes well to multiple testing datasets without requiring
fine-tuning. In addition, S-Seg has the extra benefits of scalability with data
and consistent improvement when augmented with self-training. We believe that
our simple yet effective approach will serve as a solid baseline for future
research. |
S-Seg is a novel open-vocabulary semantic segmentation model that achieves strong performance without relying on existing large image-level alignment models, manually annotated segmentation labels, or custom grouping encoders. |
Open-vocabulary semantic segmentation is challenging because it requires assigning accurate semantic labels to each pixel in an image using arbitrary open-vocabulary texts, rather than a fixed set of classes. |
S-Seg leverages pseudo-masks generated through self-supervised clustering and language embeddings from noisy web texts to train a MaskFormer model. |
Achieves competitive results on Pascal VOC, Pascal Context, and COCO datasets.
Demonstrates scalability with data, showing consistent performance improvements with larger datasets.
Benefits significantly from self-training, leading to an average improvement of 5.5% mIoU over three datasets. |
Performance on segmenting smaller objects could be further improved.
Exploration of more advanced pseudo-mask generation techniques could lead to better supervision. |
open-vocabulary, semantic segmentation, weakly-supervised learning, pseudo-masks, maskformer |
2401.12175
Report |
Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM |
Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang |
Reconstructing 3D humans from a single image has been extensively
investigated. However, existing approaches often fall short on capturing fine
geometry and appearance details, hallucinating occluded parts with plausible
details, and achieving generalization across unseen and in-the-wild datasets.
We present Human-LRM, a diffusion-guided feed-forward model that predicts the
implicit field of a human from a single image. Leveraging the power of the
state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e.,
Stable Diffusion), our method is able to capture human without any template
prior, e.g., SMPL, and effectively enhance occluded parts with rich and
realistic details. Our approach first uses a single-view LRM model with an
enhanced geometry decoder to get the triplane NeRF representation. The novel
view renderings from the triplane NeRF provide strong geometry and color prior,
from which we generate photo-realistic details for the occluded parts using a
diffusion model. The generated multiple views then enable reconstruction with
high-quality geometry and appearance, leading to superior overall performance
compared to all existing human reconstruction methods. |
Presents Human-LRM, a template-free diffusion-guided model for reconstructing detailed 3D humans from single images. |
Existing methods struggle to capture fine details, hallucinate occluded parts realistically, and generalize across diverse datasets. Human-LRM overcomes these limitations. |
Uses a three-stage approach: 1) Enhanced LRM predicts coarse geometry and color. 2) Conditional diffusion model generates high-fidelity novel views guided by coarse renderings. 3) Multi-view reconstruction model generates final 3D human using diffused views. |
Outperforms previous methods in geometry reconstruction on THuman 2.0, Alloy++, and X-Human datasets.
Achieves better appearance reconstruction (PSNR, SSIM, LPIPS) than volumetric methods on THuman 2.0.
Exhibits superior generalization to challenging poses compared to SMPL-based methods. |
Fine details like facial and hand features are not perfectly captured.
Future work includes exploring more powerful representations or refinement techniques. |
3d human reconstruction, single-view reconstruction, diffusion models, neural radiance fields, novel view synthesis |
2401.12168
Report |
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities |
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia |
Understanding and reasoning about spatial relationships is a fundamental
capability for Visual Question Answering (VQA) and robotics. While Vision
Language Models (VLM) have demonstrated remarkable performance in certain VQA
benchmarks, they still lack capabilities in 3D spatial reasoning, such as
recognizing quantitative relationships of physical objects like distances or
size differences. We hypothesize that VLMs' limited spatial reasoning
capability is due to the lack of 3D spatial knowledge in training data and aim
to solve this problem by training VLMs with Internet-scale spatial reasoning
data. To this end, we present a system to facilitate this approach. We first
develop an automatic 3D spatial VQA data generation framework that scales up to
2 billion VQA examples on 10 million real-world images. We then investigate
various factors in the training recipe, including data quality, training
pipeline, and VLM architecture. Our work features the first internet-scale 3D
spatial reasoning dataset in metric space. By training a VLM on such data, we
significantly enhance its ability on both qualitative and quantitative spatial
VQA. Finally, we demonstrate that this VLM unlocks novel downstream
applications in chain-of-thought spatial reasoning and robotics due to its
quantitative estimation capability. Project website:
https://spatial-vlm.github.io/ |
This paper introduces SpatialVLM, a vision-language model trained on a large-scale synthetic dataset of spatial reasoning visual question answering (VQA) pairs, significantly enhancing the spatial reasoning capabilities of VLMs. |
Current VLMs struggle with spatial reasoning tasks crucial for real-world applications like robotics and AR. This research aims to bridge this gap by equipping VLMs with human-like spatial understanding. |
The authors develop a pipeline to generate spatial VQA data by leveraging off-the-shelf computer vision models to extract object-centric contexts, lift 2D images to 3D point clouds, and synthesize diverse qualitative and quantitative spatial reasoning questions and answers. |
SpatialVLM achieves significantly higher accuracy than baseline VLMs on both qualitative and quantitative spatial reasoning VQA benchmarks.
Co-training on spatial VQA data does not degrade the model's performance on general VQA tasks, indicating the potential for VLMs to benefit from such specialized data.
The study demonstrates the potential of SpatialVLM for novel applications, including serving as a dense reward annotator in robotics and enabling chain-of-thought reasoning for complex spatial tasks. |
The accuracy of SpatialVLM's quantitative spatial reasoning is limited by the accuracy of the underlying depth estimation model used in data generation.
The current work primarily focuses on direct spatial reasoning, and future research could explore more complex spatial relations and reasoning tasks. |
vision-language models, spatial reasoning, visual question answering, data augmentation, robotics |
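The data-generation step in the record above (lifting 2D detections to 3D and synthesizing quantitative question-answer pairs) can be sketched with a pinhole camera model. This is a simplified illustration under stated assumptions: the camera intrinsics, object names, pixel coordinates, and question template are all hypothetical, and a real pipeline would obtain depth from an estimation model and use whole object point clouds rather than single pixels.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera
    frame using the pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def make_distance_qa(name_a, point_a, name_b, point_b):
    """Create one quantitative spatial QA pair from two 3D points
    (the wording of the template is a hypothetical choice)."""
    d = math.dist(point_a, point_b)
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"About {d:.2f} meters."
    return question, answer


# Hypothetical intrinsics and object centres, invented for illustration.
fx = fy = 500.0
cx, cy = 320.0, 240.0
cup = backproject(320, 240, 2.0, fx, fy, cx, cy)
book = backproject(820, 240, 2.0, fx, fy, cx, cy)
question, answer = make_distance_qa("cup", cup, "book", book)
```

Because the answers are derived from metric 3D coordinates, any error in the estimated depth propagates directly into the generated labels, which matches the limitation noted in the record.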
2401.12051
Report |
CloSe: A 3D Clothing Segmentation Dataset and Model |
Dimitrije Antić, Garvita Tiwari, Batuhan Ozcomlekci, Riccardo Marin, Gerard Pons-Moll |
3D clothing modeling and datasets play a crucial role in the entertainment,
animation, and digital fashion industries. Existing work often lacks detailed
semantic understanding or uses synthetic datasets, lacking realism and
personalization. To address this, we first introduce CloSe-D: a novel
large-scale dataset containing 3D clothing segmentation of 3167 scans, covering
a range of 18 distinct clothing classes. Additionally, we propose CloSe-Net,
the first learning-based 3D clothing segmentation model for fine-grained
segmentation from colored point clouds. CloSe-Net uses local point features,
body-clothing correlation, and a garment-class and point features-based
attention module, improving performance over baselines and prior work. The
proposed attention module enables our model to learn appearance and
geometry-dependent clothing prior from data. We further validate the efficacy
of our approach by successfully segmenting publicly available datasets of
people in clothing. We also introduce CloSe-T, a 3D interactive tool for
refining segmentation labels. Combining the tool with CloSe-Net in a continual
learning setup demonstrates improved generalization on real-world data.
Dataset, model, and tool can be found at
https://virtualhumans.mpi-inf.mpg.de/close3dv24/. |
This paper introduces CloSe-D, a large-scale 3D clothing segmentation dataset, and CloSe, a novel 3D clothing segmentation model that predicts fine-grained clothing labels directly from colored point clouds, leveraging human body priors and clothing class-based attention. |
Existing 3D clothing datasets often lack detailed semantic understanding or realism, hindering the development of robust methods for comprehending digital clothing. This work addresses this gap by providing a high-quality, fine-grained dataset and a novel model that outperforms prior art. |
CloSe-D is created by manually refining segmentation labels of 3D scans using an interactive tool, CloSeTool. The CloSe model incorporates a point cloud encoder (DGCNN), a canonical body encoder based on SMPL, a clothing encoder with a learnable codebook and attention mechanism, and a segmentation decoder. It's trained with cross-entropy loss and refined using continual learning with user feedback from CloSeTool. |
CloSe-D contains segmentation labels for ~3000 scans and 18 garment categories, making it the first real-world dataset with such fine-grained detail.
CloSe significantly outperforms state-of-the-art part segmentation methods (DGCNN, DeltaConv) and prior 3D clothing segmentation methods (MGN, GIM3D) on various datasets.
The interactive tool, CloSeTool, facilitates efficient data annotation and model refinement, enhancing generalization to out-of-distribution datasets. |
The current method requires garment class as input, which requires preprocessing. Future work could integrate clothing prediction directly into the network.
The continual learning framework could be further explored by integrating more recent strategies, such as EWC, to enhance network generalization. |
3d clothing segmentation, dataset, deep learning, computer vision, human-computer interaction |
2401.11949
Report |
Feature Denoising Diffusion Model for Blind Image Quality Assessment |
Xudong Li, Jingyuan Zheng, Runze Hu, Yan Zhang, Ke Li, Yunhang Shen, Xiawu Zheng, Yutao Liu, ShengChuan Zhang, Pingyang Dai, Rongrong Ji |
Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line
with human perception, without reference benchmarks. Currently, deep learning
BIQA methods typically depend on using features from high-level tasks for
transfer learning. However, the inherent differences between BIQA and these
high-level tasks inevitably introduce noise into the quality-aware features. In
this paper, we take an initial step towards exploring the diffusion model for
feature denoising in BIQA, namely Perceptual Feature Diffusion for IQA
(PFD-IQA), which aims to remove noise from quality-aware features.
Specifically, (i) We propose a Perceptual Prior Discovery and Aggregation
module to establish two auxiliary tasks to discover potential low-level
features in images that are used to aggregate perceptual text conditions for
the diffusion model. (ii) We propose a Perceptual Prior-based Feature
Refinement strategy, which matches noisy features to predefined denoising
trajectories and then performs exact feature denoising based on text
conditions. Extensive experiments on eight standard BIQA datasets demonstrate
the superior performance to the state-of-the-art BIQA methods, i.e., achieving
the PLCC values of 0.935 (vs. 0.905 in KADID) and 0.922 (vs. 0.894 in LIVEC). |
This paper proposes PFD-IQA, a novel BIQA framework that utilizes a diffusion model for the first time to denoise quality-aware features, enhancing their representation for accurate image quality assessment. |
Existing deep learning BIQA methods often struggle to accurately assess image quality due to noise and excessive focus on high-level features from pre-trained models. This work addresses the need for effective filtering of quality-irrelevant information from features in BIQA. |
PFD-IQA consists of two key modules: (1) Perceptual Prior Discovery and Aggregation (PDA): Uses auxiliary tasks to discover distortion and quality level priors, then aggregates perceptual text embeddings to guide the diffusion model. (2) Perceptual Prior-based Diffusion Refinement (PDR): Employs teacher pseudo-features to predefine denoising trajectories, matches student features to these trajectories using adaptive noise alignment, and refines features through text-conditioned denoising. |
PFD-IQA outperforms 14 state-of-the-art BIQA methods on eight benchmark datasets, demonstrating its effectiveness and superiority.
Cross-dataset validation shows strong generalization ability of PFD-IQA, achieving best performance on most tested datasets.
Qualitative analysis using GradCAM visualizations confirms PFD-IQA's ability to effectively focus on quality degradation areas, unlike competing methods. |
The reliance on a pre-trained teacher model might limit the model's performance when presented with out-of-distribution images or distortions.
Future work could explore incorporating more diverse and fine-grained perceptual priors to further enhance the model's sensitivity to subtle quality degradations. |
blind image quality assessment (biqa), diffusion models, feature denoising, perceptual priors, image quality |
2401.11739
Report |
EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models |
Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim |
Diffusion models have recently received increasing research attention for
their remarkable transfer abilities in semantic segmentation tasks. However,
generating fine-grained segmentation masks with diffusion models often requires
additional training on annotated datasets, leaving it unclear to what extent
pre-trained diffusion models alone understand the semantic relations of their
generated images. To address this question, we leverage the semantic knowledge
extracted from Stable Diffusion (SD) and aim to develop an image segmentor
capable of generating fine-grained segmentation maps without any additional
training. The primary difficulty stems from the fact that semantically
meaningful feature maps typically exist only in the spatially lower-dimensional
layers, which poses a challenge in directly extracting pixel-level semantic
relations from these feature maps. To overcome this issue, our framework
identifies semantic correspondences between image pixels and spatial locations
of low-dimensional feature maps by exploiting SD's generation process and
utilizes them for constructing image-resolution segmentation maps. In extensive
experiments, the produced segmentation maps are demonstrated to be well
delineated and capture detailed parts of the images, indicating the existence
of highly accurate pixel-level semantic knowledge in diffusion models. |
This paper presents an unsupervised image segmentor that generates fine-grained segmentation maps solely from the semantic knowledge of a pre-trained diffusion model (Stable Diffusion). |
This is important because it investigates the extent to which pre-trained diffusion models understand semantic relations in images, without relying on additional training data like annotations. |
The method involves generating low-resolution segmentation maps from semantically meaningful feature maps of the diffusion model and then upscaling them to image resolution by identifying semantic correspondences between pixels and low-resolution masks. This is achieved by analyzing how local changes in low-dimensional feature maps affect pixel values in generated images. |
The generated segmentation maps are well-delineated and capture detailed object parts, demonstrating the existence of highly accurate pixel-level semantic knowledge in diffusion models.
The method outperforms existing unsupervised semantic segmentation methods on various datasets, especially when evaluated with a modified protocol that better assesses pixel embedding quality.
Integrating the framework with annotation-free open-vocabulary segmentation models significantly improves their performance, highlighting the accuracy of the generated segmentation masks. |
The framework struggles to segment extremely small objects due to potential information compression in lower-dimensional layers.
Feature representations may encode attributes beyond object meanings, leading to over-segmentation of elements like sky and ground. |
diffusion models, unsupervised semantic segmentation, open-vocabulary segmentation, stable diffusion, semantic knowledge |
2401.11708
Report |
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs |
Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui |
Diffusion models have exhibited exceptional performance in text-to-image
generation and editing. However, existing methods often face challenges when
handling complex text prompts that involve multiple objects with multiple
attributes and relationships. In this paper, we propose a brand new
training-free text-to-image generation/editing framework, namely Recaption,
Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning
ability of multimodal LLMs to enhance the compositionality of text-to-image
diffusion models. Our approach employs the MLLM as a global planner to
decompose the process of generating complex images into multiple simpler
generation tasks within subregions. We propose complementary regional diffusion
to enable region-wise compositional generation. Furthermore, we integrate
text-guided image generation and editing within the proposed RPG in a
closed-loop fashion, thereby enhancing generalization ability. Extensive
experiments demonstrate our RPG outperforms state-of-the-art text-to-image
diffusion models, including DALL-E 3 and SDXL, particularly in multi-category
object composition and text-image semantic alignment. Notably, our RPG
framework exhibits wide compatibility with various MLLM architectures (e.g.,
MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available
at: https://github.com/YangLing0818/RPG-DiffusionMaster |
This paper presents RPG (Recaption, Plan, Generate), a training-free text-to-image generation/editing framework that leverages multimodal LLMs (MLLMs) to enhance the compositionality of diffusion models. |
Existing diffusion models struggle to accurately handle complex prompts involving multiple objects, attributes, and relationships. RPG addresses this limitation by using MLLMs for better prompt understanding and region-wise image generation. |
RPG uses MLLMs for: (1) **Recaptioning:** Decomposing complex prompts into subprompts with detailed descriptions and analyzing image-prompt discrepancies for editing. (2) **CoT Planning:** Dividing the image into subregions and assigning subprompts to each region. (3) **Complementary Regional Diffusion:** Independently generating image content for each region based on assigned prompts and merging them to create the final image. |
RPG significantly outperforms state-of-the-art text-to-image models (e.g., DALL-E 3, SDXL) on compositional prompts, achieving better attribute binding, numeric accuracy, and complex relationship representation.
The hierarchical regional diffusion in RPG allows for increasingly complex image generation by further dividing subregions.
RPG is generalizable and compatible with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). |
The performance of RPG is dependent on the capabilities of the chosen MLLM and diffusion model.
Future work can explore incorporating more complex modalities as input conditions and extending RPG to more real-world applications. |
text-to-image generation, diffusion models, multimodal llms, compositional generation, image editing |
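The region-wise merge step described in the RPG record above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `merge_regions`, the box format, and the rule of averaging overlapping regions are all assumptions, since the record does not specify how regional results are combined.

```python
import numpy as np

def merge_regions(region_latents, boxes, height, width):
    """Paste per-region latents onto one canvas (hypothetical helper).

    region_latents: list of arrays shaped (channels, h_i, w_i)
    boxes: list of (top, left, bottom, right) in canvas coordinates
    Overlapping regions are averaged, which is an assumption of this sketch.
    """
    channels = region_latents[0].shape[0]
    canvas = np.zeros((channels, height, width))
    weight = np.zeros((1, height, width))
    for latent, (top, left, bottom, right) in zip(region_latents, boxes):
        canvas[:, top:bottom, left:right] += latent
        weight[:, top:bottom, left:right] += 1.0
    # Avoid dividing by zero where no region covers the canvas.
    return canvas / np.maximum(weight, 1.0)
```

In a real pipeline each regional latent would come from a diffusion step conditioned on that region's subprompt; here they are plain arrays so the merge logic can be read in isolation.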
2401.11633
Report |
Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss |
Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, Clinton Fookes |
The fusion of vision and language has brought about a transformative shift in
computer vision through the emergence of Vision-Language Models (VLMs).
However, the resource-intensive nature of existing VLMs poses a significant
challenge. We need an accessible method for developing the next generation of
VLMs. To address this issue, we propose Zoom-shot, a novel method for
transferring the zero-shot capabilities of CLIP to any pre-trained vision
encoder. We do this by exploiting the multimodal information (i.e. text and
image) present in the CLIP latent space through the use of specifically
designed multimodal loss functions. These loss functions are (1)
cycle-consistency loss and (2) our novel prompt-guided knowledge distillation
loss (PG-KD). PG-KD combines the concept of knowledge distillation with CLIP's
zero-shot classification, to capture the interactions between text and image
features. With our multimodal losses, we train a $\textbf{linear mapping}$
between the CLIP latent space and the latent space of a pre-trained vision
encoder, for only a $\textbf{single epoch}$. Furthermore, Zoom-shot is entirely
unsupervised and is trained using $\textbf{unpaired}$ data. We test the
zero-shot capabilities of a range of vision encoders augmented as new VLMs, on
coarse and fine-grained classification datasets, outperforming the previous
state-of-the-art in this problem domain. In our ablations, we find Zoom-shot
allows for a trade-off between data and compute during training; and our
state-of-the-art results can be obtained by reducing training from 20% to 1% of
the ImageNet training data with 20 epochs. All code and models are available on
GitHub. |
This paper introduces Zoom-shot, a novel method that transfers CLIP's zero-shot capabilities to pre-trained vision encoders by training a linear mapping using multimodal loss functions. |
Developing new VLMs from scratch is computationally expensive. Zoom-shot offers an accessible method for augmenting existing vision encoders with zero-shot capabilities, democratizing VLM development. |
Zoom-shot uses cycle-consistency loss and a novel prompt-guided knowledge distillation loss (PG-KD) to train a linear mapping between CLIP's latent space and the latent space of a pre-trained vision encoder. |
Zoom-shot achieves state-of-the-art zero-shot performance on various datasets, outperforming previous methods like Linear Aligner.
Zoom-shot training demonstrates a trade-off between compute and data, enabling effective learning even with limited data.
The distribution of training images significantly impacts Zoom-shot performance, highlighting the importance of diverse and representative training data. |
Zoom-shot performance on fine-grained datasets still lags behind CLIP, indicating limitations in covering specific latent space regions.
Further investigation into optimizing the text subspace within the source latent space could enhance zero-shot performance when mapping CLIP text features. |
vision-language models, zero-shot classification, knowledge distillation, cross-modal alignment, clip |
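The two losses named in the Zoom-shot record above can be sketched in NumPy. This is a hedged illustration under stated assumptions: the record gives no formulas, so the KL-divergence form of PG-KD, the mean-squared-error form of the cycle-consistency loss, the temperature value, and all function names are assumptions rather than the paper's definitions.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pg_kd_loss(clip_image_feats, mapped_feats, text_feats, temperature=0.07):
    """Assumed form: KL(teacher || student) over zero-shot class distributions.

    teacher: CLIP image features scored against prompt (text) features
    student: linearly mapped encoder features scored against the same prompts
    """
    text = l2_normalize(text_feats)
    teacher = softmax(l2_normalize(clip_image_feats) @ text.T / temperature)
    student = softmax(l2_normalize(mapped_feats) @ text.T / temperature)
    eps = 1e-12
    kl = np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps)), axis=-1)
    return float(np.mean(kl))

def cycle_consistency_loss(feats, forward_map, backward_map):
    """Assumed form: map features forward and back, then penalise the difference."""
    reconstructed = feats @ forward_map @ backward_map
    return float(np.mean((reconstructed - feats) ** 2))
```

Both maps are plain matrices here, matching the record's statement that only a linear mapping is trained.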
2401.11239
Report |
Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles |
Yanlong Zang, Han Yang, Jiaxu Miao, Yi Yang |
Image-based virtual try-on systems, which fit new garments onto human
portraits, are gaining research attention. An ideal pipeline should preserve
the static features of clothes (like textures and logos) while also generating
dynamic elements (e.g., shadows, folds) that adapt to the model's pose and
environment. Previous works fail specifically in generating dynamic features,
as they trivially preserve the warped in-shop clothes by compositing them with
a predicted alpha mask. To break the dilemma of over-preservation and texture
loss, we propose a novel diffusion-based Product-level virtual try-on pipeline,
i.e., PLTON, which can preserve the fine details of logos and embroideries
while producing realistic clothes shading and wrinkles. The main insights are
threefold: 1) Adaptive Dynamic Rendering: We take a pre-trained diffusion model
as a generative prior and tame it with image features, training a dynamic
extractor from scratch to generate dynamic tokens that preserve high-fidelity
semantic information. Due to the strong generative power of the diffusion
prior, we can generate realistic clothes shadows and wrinkles. 2) Static
Characteristics Transformation: High-frequency Map (HF-Map) is our fundamental
insight for static representation. PLTON first warps in-shop clothes to the
target model pose by a traditional warping network, and uses a high-pass filter
to extract an HF-Map for preserving static cloth features. The HF-Map is used
to generate modulation maps through our static extractor, which are injected
into a fixed U-net to synthesize the final result. To enhance retention, a
Two-stage Blended Denoising method is proposed to guide the diffusion process
for correct spatial layout and color. PLTON is finetuned only with our
collected small-size try-on dataset. Extensive quantitative and qualitative
experiments on 1024 x 768 datasets demonstrate the superiority of our
framework in mimicking real clothes dynamics. |
This paper introduces PLTON, a novel diffusion-based virtual try-on system that excels at preserving static garment details (textures, logos) while realistically rendering dynamic features (shadows, folds) adapted to pose and environment. |
Existing virtual try-on methods struggle to balance the preservation of static clothes details with the realistic generation of dynamic features, often leading to unrealistic outputs. |
PLTON utilizes a two-stage approach: 1) Adaptive Dynamic Rendering extracts dynamic features from input clothes and uses them to guide a pre-trained diffusion model. 2) Static Characteristics Transformation extracts static features from a high-frequency map of the warped garment and injects them into the diffusion model to ensure their preservation. |
PLTON generates more realistic and visually appealing virtual try-on results than state-of-the-art methods.
The method demonstrates robustness to inaccurate human parsing and suboptimal garment warping.
PLTON achieves state-of-the-art quantitative results on high-resolution datasets, as measured by FID and LPIPS metrics. |
The reliance on CLIP input size limits the resolution of processed clothing images, potentially leading to information loss.
Future work could explore alternative solutions to address the resolution limitation and further enhance the preservation of fine details. |
virtual try-on, diffusion models, deep learning, computer vision, fashion |
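The HF-Map extraction mentioned in the PLTON record above relies on a high-pass filter. A minimal sketch of one common way to build such a filter is to subtract a blurred copy from the image. The box blur, the `radius` parameter, and the function names are assumptions for illustration; the record does not say which high-pass filter the authors use.

```python
import numpy as np

def box_blur(image, radius=2):
    """Mean filter over a (2*radius+1)^2 window with edge padding (2D input)."""
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros(image.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)

def high_frequency_map(image, radius=2):
    """High-pass response: the image minus its low-frequency (blurred) part."""
    return image.astype(float) - box_blur(image, radius)
```

Flat regions map to zero while edges, logos, and fine texture survive, which is the property the record attributes to the HF-Map.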
2401.11115
Report |
MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation |
Nhat M. Hoang, Kehong Gong, Chuan Guo, Michael Bi Mi |
Controllable generation of 3D human motions becomes an important topic as the
world embraces digital transformation. Existing works, though making promising
progress with the advent of diffusion models, heavily rely on meticulously
captured and annotated (e.g., text) high-quality motion corpus, a
resource-intensive endeavor in the real world. This motivates our proposed
MotionMix, a simple yet effective weakly-supervised diffusion model that
leverages both noisy and unannotated motion sequences. Specifically, we
separate the denoising objectives of a diffusion model into two stages:
obtaining conditional rough motion approximations in the initial $T-T^*$ steps
by learning the noisy annotated motions, followed by the unconditional
refinement of these preliminary motions during the last $T^*$ steps using
unannotated motions. Notably, though learning from two sources of imperfect
data, our model does not compromise motion generation quality compared to fully
supervised approaches that access gold data. Extensive experiments on several
benchmarks demonstrate that our MotionMix, as a versatile framework,
consistently achieves state-of-the-art performances on text-to-motion,
action-to-motion, and music-to-dance tasks. Project page:
https://nhathoang2002.github.io/MotionMix-page/ |
This paper presents MotionMix, a weakly-supervised diffusion model for controllable 3D human motion generation that leverages both noisy annotated and clean unannotated motion sequences. |
Current diffusion models for motion generation rely on high-quality annotated motion data, which is expensive and time-consuming to obtain. MotionMix addresses this by effectively utilizing more accessible noisy and unannotated data. |
MotionMix employs a two-stage denoising process. It first generates rough motion approximations guided by conditions using noisy data, then refines them using clean unannotated data in a later stage. |
MotionMix achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks despite being trained on weakly-supervised data.
The method demonstrates robustness to different noisy data ratios and noise injection levels.
Experiments show MotionMix can even surpass the performance of fully supervised models trained on perfect data. |
Performance on smaller datasets might be slightly worse than fully supervised methods.
The denoising pivot, while robust within a range, requires careful tuning for optimal results. |
motion generation, diffusion models, weakly-supervised learning, 3d human motion, data efficiency |
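The two-stage split described in the MotionMix record above can be made concrete with a small scheduling sketch. It only labels which timesteps are conditional and which are unconditional; the function name and the string labels are hypothetical, and no actual denoiser is modelled.

```python
def denoising_schedule(total_steps, pivot):
    """Return (timestep, mode) pairs from t = T down to t = 1.

    The first T - T* steps (t > T*) are conditional; the last T* steps
    (t <= T*) are unconditional refinement. `pivot` plays the role of T*.
    """
    if not 0 <= pivot <= total_steps:
        raise ValueError("pivot must lie between 0 and total_steps")
    schedule = []
    for t in range(total_steps, 0, -1):
        mode = "conditional" if t > pivot else "unconditional"
        schedule.append((t, mode))
    return schedule
```

For example, `denoising_schedule(10, 3)` marks timesteps 10 through 4 as conditional and timesteps 3 through 1 as unconditional.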
2401.11078
Report |
UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures |
Mingyuan Zhou, Rakib Hyder, Ziwei Xuan, Guojun Qi |
Recent advances in 3D avatar generation have gained significant attention.
These breakthroughs aim to produce more realistic animatable avatars, narrowing
the gap between virtual and real-world experiences. Most existing works
employ Score Distillation Sampling (SDS) loss, combined with a differentiable
renderer and text condition, to guide a diffusion model in generating 3D
avatars. However, SDS often generates oversmoothed results with few facial
details, thereby lacking diversity compared with ancestral sampling. On the
other hand, other works generate 3D avatars from a single image, where the
challenges of unwanted lighting effects, perspective views, and inferior image
quality make it difficult to reliably reconstruct the 3D face meshes with the
aligned complete textures. In this paper, we propose a novel 3D avatar
generation approach termed UltrAvatar with enhanced fidelity of geometry, and
superior quality of physically based rendering (PBR) textures without unwanted
lighting. To this end, the proposed approach presents a diffuse color
extraction model and an authenticity guided texture diffusion model. The former
removes the unwanted lighting effects to reveal true diffuse colors so that the
generated avatars can be rendered under various lighting conditions. The latter
follows two gradient-based guidances for generating PBR textures to render
diverse face-identity features and details better aligning with 3D mesh
geometry. We demonstrate the effectiveness and robustness of the proposed
method, outperforming the state-of-the-art methods by a large margin in the
experiments. |
Presents UltrAvatar, a novel 3D avatar generation approach that enhances fidelity of geometry and quality of physically based rendering (PBR) textures without unwanted lighting. |
Addresses limitations of existing methods that struggle with unwanted lighting effects, perspective views, and inferior image quality in single-image 3D avatar generation. |
Introduces a diffuse color extraction (DCE) model to remove lighting effects and an authenticity guided texture diffusion model (AGT-DM) to generate high-quality, aligned PBR textures. |
UltrAvatar generates high-quality, diverse 3D avatars with true colors and sharp details.
Outperforms state-of-the-art methods in text-to-avatar and image-to-avatar generation based on FID, KID, and CLIP Score metrics.
Demonstrates superior performance in qualitative evaluation using GPT-4V for photo-realism, artifact minimization, and text-prompt alignment. |
Relies on accurate face parsing for optimal DCE model performance.
Limited control over specific facial features during generation. |
3d avatar generation, diffuse color extraction, texture diffusion model, photometric guidance, edge guidance |
2401.11067
Report |
Make-A-Shape: a Ten-Million-scale 3D Shape Model |
Ka-Hei Hui, Aditya Sanghi, Arianna Rampini, Kamal Rahimi Malekshan, Zhengzhe Liu, Hooman Shayani, Chi-Wing Fu |
Significant progress has been made in training large generative models for
natural language and images. Yet, the advancement of 3D generative models is
hindered by their substantial resource demands for training, along with
inefficient, non-compact, and less expressive representations. This paper
introduces Make-A-Shape, a new 3D generative model designed for efficient
training on a vast scale, capable of utilizing 10 million publicly available
shapes. Technically, we first introduce a wavelet-tree representation to
compactly encode shapes by formulating the subband coefficient filtering scheme
to efficiently exploit coefficient relations. We then make the representation
generatable by a diffusion model by devising the subband coefficients packing
scheme to lay out the representation in a low-resolution grid. Further, we
derive the subband adaptive training strategy to train our model to effectively
learn to generate coarse and detail wavelet coefficients. Last, we extend our
framework to be controlled by additional input conditions to enable it to
generate shapes from assorted modalities, e.g., single/multi-view images, point
clouds, and low-resolution voxels. In our extensive set of experiments, we
demonstrate various applications, such as unconditional generation, shape
completion, and conditional generation on a wide range of modalities. Our
approach not only surpasses the state of the art in delivering high-quality
results but also efficiently generates shapes within a few seconds, often
achieving this in just 2 seconds for most conditions. |
This paper introduces Make-A-Shape, a novel 3D generative model trained on a massive dataset of over 10 million publicly available 3D shapes.
Make-A-Shape can generate high-quality 3D shapes in just 2 seconds. |
Existing 3D generative models lag behind their 2D counterparts due to high resource demands, inefficient representations, and limitations in capturing shape complexity.
Make-A-Shape addresses these challenges, enabling efficient large-scale training and high-quality 3D shape generation. |
The paper introduces: (i) a compact and expressive wavelet-tree representation for 3D shapes, (ii) a subband coefficient packing scheme for making the representation compatible with diffusion models, and (iii) a subband adaptive training strategy for effectively learning both coarse and detail wavelet coefficients. |
Make-A-Shape consistently outperforms existing state-of-the-art methods in image-to-3D generation tasks, demonstrating superior quality in terms of both global structure and local details.
Make-A-Shape exhibits robustness to the sparsity of input point clouds, generating high-quality shapes even with limited point information.
The proposed wavelet-tree representation and adaptive training strategy are crucial for achieving high-quality generation, surpassing baselines that rely on only coarse shape information or simple loss functions. |
The model currently lacks a mechanism to ensure balanced representation across different object categories, leading to potential biases in the generated shapes.
While the model excels in generating geometry, incorporating texture generation without relying on computationally expensive optimizations remains an open challenge. |
3d generative model, diffusion model, wavelet representation, large-scale 3d shape generation, conditional 3d shape generation |
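The split into coarse and detail wavelet coefficients mentioned in the Make-A-Shape record above can be illustrated with a one-level Haar transform on a 1D signal. This is only a sketch of the general idea: the Haar wavelet, the 1D setting, and the function names are assumptions, and the record does not state which wavelet family or 3D layout the authors use.

```python
import numpy as np

def haar_decompose(signal):
    """One-level orthonormal Haar transform: returns (coarse, detail)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size % 2 != 0:
        raise ValueError("signal length must be even")
    even, odd = signal[0::2], signal[1::2]
    coarse = (even + odd) / np.sqrt(2.0)   # low-frequency subband
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency subband
    return coarse, detail

def haar_reconstruct(coarse, detail):
    """Invert haar_decompose exactly."""
    out = np.empty(2 * coarse.size)
    out[0::2] = (coarse + detail) / np.sqrt(2.0)
    out[1::2] = (coarse - detail) / np.sqrt(2.0)
    return out
```

Smooth signals concentrate their energy in the coarse subband and leave the detail subband near zero, which is why the two kinds of coefficient can reasonably be treated differently during training.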
2401.10891
Report |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data |
Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao |
This work presents Depth Anything, a highly practical solution for robust
monocular depth estimation. Without pursuing novel technical modules, we aim to
build a simple yet powerful foundation model dealing with any images under any
circumstances. To this end, we scale up the dataset by designing a data engine
to collect and automatically annotate large-scale unlabeled data (~62M), which
significantly enlarges the data coverage and thus is able to reduce the
generalization error. We investigate two simple yet effective strategies that
make data scaling-up promising. First, a more challenging optimization target
is created by leveraging data augmentation tools. It compels the model to
actively seek extra visual knowledge and acquire robust representations.
Second, an auxiliary supervision is developed to enforce the model to inherit
rich semantic priors from pre-trained encoders. We evaluate its zero-shot
capabilities extensively, including six public datasets and randomly captured
photos. It demonstrates impressive generalization ability. Further, through
fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs
are set. Our better depth model also results in a better depth-conditioned
ControlNet. Our models are released at
https://github.com/LiheYoung/Depth-Anything. |
This paper introduces Depth Anything, a highly practical model for robust monocular depth estimation that leverages the power of large-scale unlabeled data. |
A foundation model for depth estimation is crucial for various applications like robotics, autonomous driving, and VR, but is currently underexplored due to the difficulty in obtaining large-scale depth datasets. |
The authors design a data engine to collect and automatically annotate 62M unlabeled images using a pre-trained depth estimation model. They enhance training by challenging the student model with strongly perturbed unlabeled images and by incorporating semantic priors from a frozen DINOv2 encoder. |
Depth Anything exhibits superior zero-shot depth estimation capability compared to MiDaS v3.1 across six diverse datasets.
When fine-tuned with metric depth information, it significantly outperforms previous state-of-the-art methods on NYUv2 and KITTI.
The pre-trained encoder demonstrates strong performance in semantic segmentation tasks, highlighting its potential as a multi-task encoder. |
The current model size is limited to ViT-Large and could benefit from further scaling up to ViT-Giant.
Training resolution of 512x512 might be insufficient for real-world applications, and increasing it to 700+ or 1000+ could be beneficial. |
monocular depth estimation, foundation model, self-supervised learning, semantic segmentation, zero-shot learning |
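The auxiliary supervision in the Depth Anything record above pulls the model's features toward those of a frozen pre-trained encoder. One plausible form is a cosine-similarity alignment loss, sketched below. The cosine form and the function name are assumptions; the record only states that semantic priors are inherited from a frozen DINOv2 encoder.

```python
import numpy as np

def feature_alignment_loss(student_feats, teacher_feats):
    """Assumed form: mean of (1 - cosine similarity) over feature vectors.

    student_feats, teacher_feats: arrays shaped (num_vectors, dim).
    The teacher is treated as frozen, so only the student would receive gradients.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    cosine = np.sum(s * t, axis=-1)
    return float(np.mean(1.0 - cosine))
```

The loss is 0 when the two feature sets point in the same directions and 2 when they point in opposite directions, regardless of feature magnitude.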
2401.10889
Report |
Synthesizing Moving People with 3D Control |
Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman, Alexei A. Efros, Jitendra Malik |
In this paper, we present a diffusion model-based framework for animating
people from a single image for a given target 3D motion sequence. Our approach
has two core components: a) learning priors about invisible parts of the human
body and clothing, and b) rendering novel body poses with proper clothing and
texture. For the first part, we learn an in-filling diffusion model to
hallucinate unseen parts of a person given a single image. We train this model
on texture map space, which makes it more sample-efficient since it is
invariant to pose and viewpoint. Second, we develop a diffusion-based rendering
pipeline, which is controlled by 3D human poses. This produces realistic
renderings of novel poses of the person, including clothing, hair, and
plausible in-filling of unseen regions. This disentangled approach allows our
method to generate a sequence of images that are faithful to the target motion
in the 3D pose and, to the input image in terms of visual similarity. In
addition to that, the 3D control allows various synthetic camera trajectories
to render a person. Our experiments show that our method is resilient in
generating prolonged motions and varied challenging and complex poses compared
to prior methods. Please check our website for more details:
https://boyiliee.github.io/3DHM.github.io/. |
This paper proposes 3DHM, a two-stage diffusion model-based framework for animating a person from a single image to imitate a target 3D motion sequence. |
The task of animating a person from a single image to imitate another's actions is challenging and requires a deep understanding of human pose, appearance, and clothing. |
3DHM uses a two-stage approach: 1) A diffusion model learns to in-fill unseen regions of a partial texture map extracted from the input image. 2) A second diffusion model renders realistic images from intermediate renderings generated using the complete texture map and 3D poses extracted from the target motion sequence. |
3DHM outperforms baselines in terms of frame-wise and video-level generation quality metrics (PSNR, SSIM, FID, LPIPS, L1, FID-VID, FVD).
3DHM demonstrates high pose accuracy, preserving the target motion faithfully.
3DHM generalizes well to unseen human images and motions from various sources, including 3D human videos, YouTube videos, and text input. |
The model currently generates frames independently, potentially leading to temporal inconsistencies.
Training on larger and more diverse datasets could further enhance the model's ability to reconstruct detailed textures. |
human animation, diffusion models, texture inpainting, 3d human pose, motion imitation |
2401.10831
Report |
Understanding Video Transformers via Universal Concept Discovery |
Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov |
This paper studies the problem of concept-based interpretability of
transformer representations for videos. Concretely, we seek to explain the
decision-making process of video transformers based on high-level,
spatiotemporal concepts that are automatically discovered. Prior research on
concept-based interpretability has concentrated solely on image-level tasks.
Comparatively, video models deal with the added temporal dimension, increasing
complexity and posing challenges in identifying dynamic concepts over time. In
this work, we systematically address these challenges by introducing the first
Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose
an efficient approach for unsupervised identification of units of video
transformer representations - concepts, and ranking their importance to the
output of a model. The resulting concepts are highly interpretable, revealing
spatio-temporal reasoning mechanisms and object-centric representations in
unstructured video models. Performing this analysis jointly over a diverse set
of supervised and self-supervised representations, we discover that some of
these mechanisms are universal in video transformers. Finally, we show that VTCD
can be used for fine-grained action recognition and video object segmentation. |
This paper introduces VTCD, the first concept discovery algorithm specifically designed for interpreting video transformer representations. VTCD identifies high-level, spatiotemporal concepts learned by video transformers and quantifies their importance for model predictions. |
Understanding how video transformers process information is crucial for addressing concerns about transparency, fairness, and potential biases in AI systems, particularly as these models are increasingly deployed in real-world applications. |
VTCD employs SLIC clustering in the feature space to efficiently generate spatiotemporal tubelet proposals. These tubelets are then clustered using Convex Non-negative Matrix Factorization (CNMF) to identify concepts. To assess concept importance, the authors introduce CRIS, a robust method that masks concepts and measures the impact on model performance. |
VTCD successfully discovers human-interpretable spatiotemporal concepts, including object tracking, event detection, and positional cues.
The authors discover universal 'Rosetta concepts' shared across diverse video transformer models, revealing common mechanisms such as early-layer spatiotemporal basis representations and late-layer object-centric representations.
VTCD enables applications like model pruning for improved efficiency and zero-shot video object segmentation by leveraging the discovered concepts. |
The SLIC compactness hyperparameter in VTCD requires manual tuning for different models.
Calculating the Rosetta score becomes computationally demanding as the number of models analyzed increases. |
concept-based interpretability, video transformers, concept discovery, spatiotemporal reasoning, rosetta concepts |
2401.10822
Report |
ActAnywhere: Subject-Aware Video Background Generation |
Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang |
Generating video background that is tailored to foreground subject motion is an
important problem for the movie industry and visual effects community. This
task involves synthesizing background that aligns with the motion and
appearance of the foreground subject, while also complying with the artist's
creative intention. We introduce ActAnywhere, a generative model that automates
this process, which traditionally requires tedious manual effort. Our model
leverages the power of large-scale video diffusion models, and is specifically
tailored for this task. ActAnywhere takes a sequence of foreground subject
segmentation as input and an image that describes the desired scene as
condition, to produce a coherent video with realistic foreground-background
interactions while adhering to the condition frame. We train our model on a
large-scale dataset of human-scene interaction videos. Extensive evaluations
demonstrate the superior performance of our model, significantly outperforming
baselines. Moreover, we show that ActAnywhere generalizes to diverse
out-of-distribution samples, including non-human subjects. Please visit our
project webpage at https://actanywhere.github.io. |
This paper introduces a novel task of automated subject-aware video background generation and proposes a diffusion-based model called ActAnywhere to address it. ActAnywhere generates coherent video backgrounds that adapt to the motion of a foreground subject, guided by a single condition frame depicting the desired background. |
This work offers a valuable tool for the film and VFX industry, enabling faster iteration of ideas and creative storytelling by automatically synthesizing realistic background interactions for acting subjects in diverse scenes, which was previously a tedious and expensive manual process. |
The model leverages a latent video diffusion model with cross-frame attention for temporal reasoning. It takes as input a foreground subject segmentation sequence, masks, and a single condition frame to generate a composite video with a hallucinated background. |
ActAnywhere generates high-quality videos with realistic subject-background interactions, camera motions, lighting, and shadows.
The model demonstrates strong generalization capability, extending to out-of-distribution data including non-human subjects.
ActAnywhere exhibits emergent capabilities for general video inpainting and robustness to inaccurate foreground segmentation masks. |
The model might fail to correct inaccurate details present in the provided condition frame.
Further exploration is needed to address potential biases present in the training data and prevent malicious use. |
video generation, diffusion models, video editing, subject-aware synthesis, foreground-background interaction |
2401.10404
Report |
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution |
Xin Yuan, Jinoo Baek, Keyang Xu, Omer Tov, Hongliang Fei |
We propose an efficient diffusion-based text-to-video super-resolution (SR)
tuning approach that leverages the readily learned capacity of a pixel-level
image diffusion model to capture spatial information for video generation. To
accomplish this goal, we design an efficient architecture by inflating the
weights of the text-to-image SR model into our video generation framework.
Additionally, we incorporate a temporal adapter to ensure temporal coherence
across video frames. We investigate different tuning approaches based on our
inflated architecture and report trade-offs between computational costs and
super-resolution quality. Empirical evaluation, both quantitative and
qualitative, on the Shutterstock video dataset, demonstrates that our approach
is able to perform text-to-video SR generation with good visual quality and
temporal consistency. To evaluate temporal coherence, we also present
visualizations in video format in
https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing . |
Proposes "Inflation with Diffusion", an efficient tuning approach for diffusion-based text-to-video super-resolution that reuses the spatial capacity of a pretrained pixel-level text-to-image SR model. |
Leveraging an already trained image SR model avoids relearning spatial information for video, but temporal coherence across frames must still be ensured. |
Inflates the weights of the text-to-image SR model into a video generation architecture and adds a temporal adapter to ensure temporal coherence across frames; several tuning approaches on the inflated architecture are compared. |
Performs text-to-video SR with good visual quality and temporal consistency on the Shutterstock video dataset.
Reports trade-offs between computational cost and super-resolution quality across tuning approaches.
Provides video-format visualizations for evaluating temporal coherence. |
Limited diversity in generated video content due to reliance on pretrained models.
Further exploration needed for handling more complex text prompts and video content. |
text-to-video generation, super-resolution, diffusion models, temporal consistency, video generation |
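The weight inflation described in the abstract above reuses image-model layers on video input. A minimal sketch of one common way to do this is to fold the time axis into the batch axis, apply the image layer to every frame, and unfold again. The function name and the tensor layout (batch, frames, channels, height, width) are assumptions, and the temporal adapter is not modelled here.

```python
import numpy as np

def apply_per_frame(video, image_layer):
    """Apply a 2D image layer to each frame of a video tensor.

    video: array shaped (batch, frames, channels, height, width)
    image_layer: callable mapping (n, c, h, w) -> (n, c', h', w')
    """
    batch, frames = video.shape[:2]
    flat = video.reshape(batch * frames, *video.shape[2:])  # fold time into batch
    out = image_layer(flat)
    return out.reshape(batch, frames, *out.shape[1:])       # unfold time again
```

Because each frame is processed independently, the spatial weights can stay exactly as trained on images, and any cross-frame mixing has to come from a separate temporal module.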
2401.10229
Report |
OMG-Seg: Is One Model Good Enough For All Segmentation? |
Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy |
In this work, we address various segmentation tasks, each traditionally
tackled by distinct or partially unified models. We propose OMG-Seg, One Model
that is Good enough to efficiently and effectively handle all the segmentation
tasks, including image semantic, instance, and panoptic segmentation, as well
as their video counterparts, open vocabulary settings, prompt-driven,
interactive segmentation like SAM, and video object segmentation. To our
knowledge, this is the first model to handle all these tasks in one model and
achieve satisfactory performance. We show that OMG-Seg, a transformer-based
encoder-decoder architecture with task-specific queries and outputs, can
support over ten distinct segmentation tasks and yet significantly reduce
computational and parameter overhead across various tasks and datasets. We
rigorously evaluate the inter-task influences and correlations during
co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg. |
OMG-Seg is proposed as a unified segmentation model capable of handling various tasks, including image and video segmentation, open-vocabulary settings, and interactive segmentation, all within a single framework, significantly reducing computational and parameter overhead. |
A unified model eliminates task-specific design constraints and allows for knowledge sharing across different segmentation tasks, offering a more versatile and efficient approach. |
OMG-Seg utilizes a frozen CLIP visual encoder as the backbone and a shared encoder-decoder transformer architecture with task-specific queries. It employs unified query representation for image/tube masks, labels, IDs, and visual prompts, enabling diverse segmentation tasks within one model. |
OMG-Seg achieves competitive performance on image, video, open-vocabulary, and interactive segmentation settings across eight diverse datasets.
Joint co-training on multiple datasets leads to improved performance, particularly in video segmentation tasks, and significantly reduces model parameters.
The shared decoder design in OMG-Seg proves to be efficient as it aligns optimization objectives, benefiting video datasets with short clips. |
The frozen architecture, while enabling open-vocabulary capabilities, may limit performance on specific tasks.
Future work involves scaling up the model, incorporating more datasets, and potentially adding a text path for language-driven segmentation tasks. |
segmentation, unified model, open vocabulary, interactive segmentation, video segmentation |
2401.10228
Report |
RAP-SAM: Towards Real-Time All-Purpose Segment Anything |
Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang |
Advanced by transformer architecture, vision foundation models (VFMs) achieve
remarkable progress in performance and generalization ability. Segment Anything
Model (SAM) is one remarkable model that can achieve generalized segmentation.
However, most VFMs cannot run in real time, which makes it difficult to transfer
them into several products. On the other hand, current real-time segmentation
mainly has one purpose, such as semantic segmentation on the driving scene. We
argue that diverse outputs are needed for real applications. Thus, this work
explores a new real-time segmentation setting, named all-purpose segmentation
in real-time, to transfer VFMs in real-time deployment. It contains three
different tasks, including interactive segmentation, panoptic segmentation, and
video segmentation. We aim to use one model to achieve the above tasks in
real-time. We first benchmark several strong baselines. Then, we present
Real-Time All Purpose SAM (RAP-SAM). It contains an efficient encoder and an
efficient decoupled decoder to perform prompt-driven decoding. Moreover, we
further explore different training strategies and tuning methods to boost
co-training performance further. Our code and model are available at
https://github.com/xushilin1/RAP-SAM/. |
This paper introduces 'all-purpose segmentation', a new real-time segmentation setting encompassing interactive, panoptic, and video segmentation within a single model. |
Current vision foundation models often lack real-time capability, and existing real-time segmentation methods focus on single applications, limiting their practicality. All-purpose real-time segmentation addresses these limitations, enabling diverse applications like real-time editing, tracking, and segmentation. |
The paper proposes 'RAP-SAM' (Real-Time All-Purpose SAM), featuring an efficient encoder, a unified decoder with pooling-based dynamic convolution, and lightweight decoupled adapters to balance performance across tasks. It leverages joint co-training on COCO and YouTube-VIS datasets. |
RAP-SAM achieves the best speed and accuracy trade-off among benchmarked real-time methods on all three segmentation tasks.
Joint co-training with image and video data improves video instance segmentation performance.
The proposed asymmetric adapter design effectively balances performance for object queries and prompt queries. |
Performance balance across image, video, and interactive segmentation requires further improvement.
Future work includes model acceleration for edge deployment, exploring diverse knowledge distillation techniques, and incorporating various visual prompts like mask prompts. |
real-time segmentation, all-purpose segmentation, interactive segmentation, panoptic segmentation, video instance segmentation |
2401.10227
Report |
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting |
Wouter Van Gansbeke, Bert De Brabandere |
Panoptic and instance segmentation networks are often trained with
specialized object detection modules, complex loss functions, and ad-hoc
post-processing steps to handle the permutation-invariance of the instance
masks. This work builds upon Stable Diffusion and proposes a latent diffusion
approach for panoptic segmentation, resulting in a simple architecture which
omits these complexities. Our training process consists of two steps: (1)
training a shallow autoencoder to project the segmentation masks to latent
space; (2) training a diffusion model to allow image-conditioned sampling in
latent space. The use of a generative model unlocks the exploration of mask
completion or inpainting, which has applications in interactive segmentation.
The experimental validation yields promising results for both panoptic
segmentation and mask inpainting. While not setting a new state-of-the-art, our
model's simplicity, generality, and mask completion capability are desirable
properties. |
This paper presents LDMSeg, a novel approach for panoptic segmentation and mask inpainting using latent diffusion models, building upon Stable Diffusion. |
The proposed method simplifies panoptic segmentation by avoiding specialized object detection modules, complex loss functions, and ad-hoc post-processing. |
LDMSeg employs a two-stage process: (1) training a shallow autoencoder to project segmentation masks to a latent space and (2) training a diffusion model conditioned on image latents for image-guided mask generation. |
LDMSeg effectively generates non-overlapping instance masks, achieving promising panoptic segmentation results.
The model demonstrates inherent mask inpainting capabilities, successfully completing sparse segmentation masks.
LDMSeg outperforms some general-purpose frameworks while being simpler and more computationally efficient. |
The model may miss small objects due to the latent space projection.
Inference is slower compared to specialized segmentation models due to the diffusion process.
Future work includes exploring higher resolution latents and open-vocabulary detection. |
panoptic segmentation, mask inpainting, latent diffusion models, generative models, stable diffusion |
2401.10226
Report |
Towards Language-Driven Video Inpainting via Multimodal Large Language Models |
Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy |
We introduce a new task -- language-driven video inpainting, which uses
natural language instructions to guide the inpainting process. This approach
overcomes the limitations of traditional video inpainting methods that depend
on manually labeled binary masks, a process often tedious and labor-intensive.
We present the Remove Objects from Videos by Instructions (ROVI) dataset,
containing 5,650 videos and 9,091 inpainting results, to support training and
evaluation for this task. We also propose a novel diffusion-based
language-driven video inpainting framework, the first end-to-end baseline for
this task, integrating Multimodal Large Language Models to understand and
execute complex language-based inpainting requests effectively. Our
comprehensive results showcase the dataset's versatility and the model's
effectiveness in various language-instructed inpainting scenarios. We will make
datasets, code, and models publicly available. |
This paper introduces a novel task: language-driven video inpainting, aiming to replace manual mask annotations with natural language instructions. |
Current video inpainting methods heavily rely on tedious and time-consuming manual mask annotations, which limits their applicability. This new task leverages the flexibility and richness of natural language for more effective video inpainting. |
A new dataset, ROVI, is created containing video, removal expression, and inpainted video triplets. A diffusion-based model (LGVI) is proposed, incorporating temporal attention and a mask decoder. An MLLM-enhanced version, LGVI-I, handles interactive inpainting requests. |
LGVI outperforms existing language-driven image editing methods and achieves comparable results to multi-stage video inpainting methods on the referring video inpainting task.
LGVI-I, enhanced with an MLLM, shows superior performance on the interactive video inpainting task, effectively handling complex chat-style user requests.
The proposed method demonstrates robustness in handling challenging scenarios, such as inpainting multiple or non-existent objects. |
The model may struggle with ambiguous language descriptions or complex scenes where precise object identification is difficult.
Real-time processing and model scalability for diverse video types and languages are areas for future improvement. |
video inpainting, language-driven editing, multimodal learning, diffusion models, large language models |
2401.10222
Report |
Supervised Fine-tuning in turn Improves Visual Foundation Models |
Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan |
Image-text training like CLIP has dominated the pretraining of vision
foundation models in recent years. Subsequent efforts have been made to
introduce region-level visual learning into CLIP's pretraining but face
scalability challenges due to the lack of large-scale region-level datasets.
Drawing inspiration from supervised fine-tuning (SFT) in natural language
processing such as instruction tuning, we explore the potential of fine-grained
SFT in enhancing the generalization of vision foundation models after their
pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash
the fine-grained knowledge of vision foundation models. In ViSFT, the vision
foundation model is enhanced by performing visual joint learning on some
in-domain tasks and then tested on out-of-domain benchmarks. After updating
with ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over
4.4B parameters shows improvements across various out-of-domain benchmarks
including vision and vision-linguistic scenarios. |
This paper proposes ViSFT (Vision Supervised Fine-Tuning), a two-stage method to enhance the representation and generalization of vision foundation models, drawing inspiration from SFT in NLP (e.g., instruction tuning). |
Existing methods like RegionCLIP face scalability issues due to lack of large-scale region-level datasets. ViSFT addresses this by leveraging fine-grained SFT to improve vision models after pretraining. |
ViSFT uses a two-stage process: 1) Independently train in-domain task heads (detection, segmentation, captioning) on COCO with frozen backbone. 2) Introduce LoRA to the backbone, freeze task heads, and jointly train on all tasks, transferring knowledge to LoRA. |
ViSFT improves optical character recognition accuracy by at least 2.5 points.
Grounded object identification exhibits an enhancement ranging from 0.3 to 0.6 points, especially for smaller models.
ViSFT enhances zero-shot image classification, few-shot learning, image-text retrieval, and visual question answering. |
The impact of incorporating more diverse datasets with fine-grained annotations, beyond COCO, remains unexplored.
The study primarily focuses on the vision transformer within the CLIP model, and the impact on the text encoder is left for future research. |
vision foundation models, supervised fine-tuning, image-text representation learning, multi-task training, lora |
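The second ViSFT stage summarized above adds LoRA to a frozen backbone. The following is a minimal NumPy sketch of the general LoRA idea, not the authors' implementation: a frozen weight matrix receives a trainable low-rank update. The class name `LoRALinear`, the rank, and the scaling factor `alpha` are illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                             # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))               # trainable, zero-initialized
        self.scale = alpha / rank                        # assumed scaling convention

    def forward(self, x):
        # y = x W^T + scale * (x A^T) B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # The low-rank update can be folded into the frozen weight after training.
        return self.weight + self.scale * self.B @ self.A


rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))
layer = LoRALinear(W)
x = rng.normal(size=(2, 5))
```

Because `B` starts at zero, the layer initially reproduces the frozen backbone exactly; only `A` and `B` would be updated during joint training.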
2401.10208
Report |
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer |
Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai |
Developing generative models for interleaved image-text data has both
research and practical value. It requires models to understand the interleaved
sequences and subsequently generate images and text. However, existing attempts
are limited by the issue that the fixed number of visual tokens cannot
efficiently capture image details, which is particularly problematic in the
multi-image scenarios. To address this, this paper presents MM-Interleaved, an
end-to-end generative model for interleaved image-text data. It introduces a
multi-scale and multi-image feature synchronizer module, allowing direct access
to fine-grained image features in the previous context during the generation
process. MM-Interleaved is end-to-end pre-trained on both paired and
interleaved image-text corpora. It is further enhanced through a supervised
fine-tuning phase, wherein the model improves its ability to follow complex
multi-modal instructions. Experiments demonstrate the versatility of
MM-Interleaved in recognizing visual details following multi-modal instructions
and generating consistent images following both textual and visual conditions.
Code and models are available at
https://github.com/OpenGVLab/MM-Interleaved. |
This paper proposes MM-Interleaved, an end-to-end generative model for interleaved image-text data that addresses the limitation of fixed visual tokens by using a multi-scale and multi-image feature synchronizer module (MMFS). |
Developing generative models for interleaved image-text data (e.g. news, blogs) is important because this format is ubiquitous online and necessitates models to comprehend interleaved sequences to generate images and text. |
MM-Interleaved leverages a Visual Foundation Model (VFM) for image tokenization, a Large Language Model (LLM) for multi-modal context feature extraction enhanced by MMFS, and a Diffusion Model (DM) for image generation conditioned on LLM outputs and fine-grained features from MMFS. |
MM-Interleaved achieves state-of-the-art results on various multi-modal comprehension benchmarks including image captioning, visual question answering, and visual dialogue.
The model demonstrates competitive zero-shot text-to-image generation capabilities compared to existing methods.
MM-Interleaved effectively handles segmentation-to-image translation and visual storytelling, showcasing its ability to generate realistic images with precise alignment and maintain semantic consistency in generated image sequences. |
The quality and quantity of publicly available interleaved image-text data are currently limited.
The model may encounter challenges related to hallucination and potential bias in generated content due to noise in the training data. |
interleaved image-text generation, multi-modal feature synchronizer, large language models, diffusion models, visual storytelling |
2401.10171
Report |
SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild |
Andreas Engelhardt, Amit Raj, Mark Boss, Yunzhi Zhang, Abhishek Kar, Yuanzhen Li, Deqing Sun, Ricardo Martin Brualla, Jonathan T. Barron, Hendrik P. A. Lensch, Varun Jampani |
We present SHINOBI, an end-to-end framework for the reconstruction of shape,
material, and illumination from object images captured with varying lighting,
pose, and background. Inverse rendering of an object based on unconstrained
image collections is a long-standing challenge in computer vision and graphics
and requires a joint optimization over shape, radiance, and pose. We show that
an implicit shape representation based on a multi-resolution hash encoding
enables faster and robust shape reconstruction with joint camera alignment
optimization that outperforms prior work. Further, to enable the editing of
illumination and object reflectance (i.e. material) we jointly optimize BRDF
and illumination together with the object's shape. Our method is class-agnostic
and works on in-the-wild image collections of objects to produce relightable 3D
assets for several use cases such as AR/VR, movies, games, etc. Project page:
https://shinobi.aengelhardt.com Video:
https://www.youtube.com/watch?v=iFENQ6AcYd8&feature=youtu.be |
SHINOBI reconstructs shape, material, and illumination from in-the-wild object images with varying lighting, pose, and background. |
It enables the creation of relightable 3D assets from casually captured images for applications in AR/VR, movies, and games. |
Uses a multi-resolution hash encoding for shape representation, jointly optimizes camera parameters, and incorporates BRDF optimization with a per-view importance weighting scheme. |
Outperforms prior work in view synthesis and relighting quality on the NAVI in-the-wild dataset.
Achieves faster optimization, reducing runtime by a factor of 3 compared to SAMURAI.
Enables high-frequency detail reconstruction in shape, material, and illumination. |
May struggle with highly symmetric objects and extremely specular materials.
High-frequency detail reconstruction can be limited in some regions due to misaligned views and illumination representation. |
3d reconstruction, neural rendering, inverse rendering, camera pose estimation, brdf estimation |
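The SHINOBI entry above relies on a multi-resolution hash encoding for shape. As a rough sketch of how such an encoding indexes its feature table, the snippet below hashes integer voxel coordinates in the style popularized by Instant-NGP. Whether SHINOBI uses these exact constants or this table size is an assumption; the function name `hash_index` is hypothetical.

```python
# Hashing primes as used in Instant-NGP-style encodings (assumed here).
PRIMES = (1, 2654435761, 805459861)


def hash_index(ix, iy, iz, table_size):
    """Map integer voxel coordinates to a slot in a fixed-size feature table."""
    mixed = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return mixed % table_size


def corner_indices(ix, iy, iz, table_size):
    """Table slots for the 8 corners of the voxel whose lowest corner is (ix, iy, iz)."""
    return [
        hash_index(ix + dx, iy + dy, iz + dz, table_size)
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    ]


TABLE_SIZE = 2 ** 14  # illustrative table size per resolution level
slots = corner_indices(3, 7, 11, TABLE_SIZE)
```

In a full encoding, the feature vectors stored at these eight slots would be trilinearly interpolated, and the results from all resolution levels concatenated.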
2401.10166
Report |
VMamba: Visual State Space Model |
Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu |
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have long
been the predominant backbone networks for visual representation learning.
While ViTs have recently gained prominence over CNNs due to their superior
fitting capabilities, their scalability is largely constrained by the quadratic
complexity of attention computation. Inspired by the capability of Mamba in
efficiently modeling long sequences, we propose VMamba, a generic vision
backbone model aiming to reduce the computational complexity to linear while
retaining ViTs' advantageous features. To enhance VMamba's adaptability in
processing vision data, we introduce the Cross-Scan Module (CSM) to enable 1D
selective scanning in 2D image space with global receptive fields.
Additionally, we make further improvements in implementation details and
architectural designs to enhance VMamba's performance and boost its inference
speed. Extensive experimental results demonstrate VMamba's promising
performance across various visual perception tasks, highlighting its pronounced
advantages in input scaling efficiency compared to existing benchmark models.
Source code is available at https://github.com/MzeroMiko/VMamba. |
Proposes VMamba, a novel vision backbone network based on State Space Models (SSMs) for efficient visual representation learning, aiming to achieve linear computational complexity while retaining the global receptive fields and dynamic weights of Vision Transformers (ViTs). |
Addresses the limitations of ViTs, whose quadratic complexity hinders scalability, and CNNs, which lack global receptive fields and dynamic weights, by introducing an alternative foundation model for efficient visual representation learning. |
Introduces the Cross-Scan Module (CSM) to adapt the 1D selective scanning of S6 models to 2D vision data, enabling global receptive fields without increasing complexity. Improves VMamba's efficiency through optimized implementation details and architectural design. |
Achieves superior or competitive performance on ImageNet-1K classification compared to benchmark models like ResNet, ViT, and Swin, while maintaining linear computational complexity.
Demonstrates strong performance in downstream tasks, achieving competitive results on COCO object detection and ADE20K semantic segmentation.
Exhibits remarkable input scaling efficiency, showing linear growth in FLOPs with increasing input size, unlike ViT-based models that show quadratic growth. |
The bidirectional scanning pattern exhibits instability during training, requiring further investigation and potential solutions.
Future work includes exploring larger-scale VMamba models and extending its application to more diverse vision tasks. |
vision backbone, state space models, linear complexity, global receptive field, cross-scan module |
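The Cross-Scan Module in the VMamba entry above adapts 1D selective scanning to 2D data. The sketch below shows one plausible reading of the scan and merge steps only: the feature map is unfolded along four traversal orders (row-major, column-major, and their reverses) and later folded back and summed. The selective-scan (S6) step between the two is omitted, and the function names are illustrative rather than taken from the VMamba code.

```python
import numpy as np


def cross_scan(feature_map):
    """Unfold an (H, W) map into four 1D sequences of length H*W."""
    row_major = feature_map.reshape(-1)
    col_major = feature_map.T.reshape(-1)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def cross_merge(sequences, height, width):
    """Map the four sequences back onto the (H, W) grid and sum them."""
    row_major, col_major, row_rev, col_rev = sequences
    merged = row_major.reshape(height, width).copy()
    merged += col_major.reshape(width, height).T
    merged += row_rev[::-1].reshape(height, width)
    merged += col_rev[::-1].reshape(width, height).T
    return merged


grid = np.arange(6).reshape(2, 3)
scans = cross_scan(grid)
```

Each position appears once in every sequence, so the cost of scanning stays linear in the number of pixels while every pixel can receive context from all four directions.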
2401.10150
Report |
Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation |
Changgu Chen, Junwei Shu, Lianggangxu Chen, Gaoqi He, Changbo Wang, Yang Li |
Recent large-scale pre-trained diffusion models have demonstrated a powerful
generative ability to produce high-quality videos from detailed text
descriptions. However, exerting control over the motion of objects in videos
generated by any video diffusion model is a challenging problem. In this paper,
we propose a novel zero-shot moving object trajectory control framework,
Motion-Zero, to enable a bounding-box-trajectories-controlled text-to-video
diffusion model. To this end, an initial noise prior module is designed to
provide a position-based prior to improve the stability of the appearance of
the moving object and the accuracy of position. In addition, based on the
attention map of the U-net, spatial constraints are directly applied to the
denoising process of diffusion models, which further ensures the positional and
spatial consistency of moving objects during the inference. Furthermore,
temporal consistency is guaranteed with a proposed shift temporal attention
mechanism. Our method can be flexibly applied to various state-of-the-art video
diffusion models without any training process. Extensive experiments
demonstrate our proposed method can control the motion trajectories of objects
and generate high-quality videos. |
This paper introduces Motion-Zero, a zero-shot framework that allows for bounding-box-trajectory control of object motion within pre-trained video diffusion models. |
This addresses the challenge of precisely manipulating object trajectories in generated videos, enabling more control over video generation without extensive training or specialized datasets. |
The framework uses an Initial Noise Prior Module (INPM) for position-based prior, applies Spatial Constraints (SC) through attention maps for position accuracy, and employs a Shift Temporal Attention Mechanism (STAM) to maintain motion continuity. |
Motion-Zero allows for precise control over object trajectories using user-defined bounding boxes.
The framework maintains the generative quality of the underlying pre-trained video diffusion models.
Quantitative and qualitative results, including user studies, demonstrate that Motion-Zero outperforms baseline methods and rivals pre-trained models like MotionCtrl in control capabilities. |
The generative performance of Motion-Zero is limited by the capabilities of the underlying pre-trained video diffusion model.
Currently, the trajectory control lacks semantic interaction with the generated video scene, requiring user-defined paths instead of automated navigation based on scene understanding and prompts. |
video diffusion models, motion trajectory control, zero-shot learning, text-to-video generation, controllable video synthesis |
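The Motion-Zero entry above conditions generation on bounding-box trajectories and applies spatial constraints through attention maps. The sketch below illustrates only the bookkeeping such a setup might need: turning a start and end box into per-frame boxes and rasterizing each into a binary mask at attention-map resolution. Linear interpolation, normalized coordinates, and both function names are assumptions for illustration, not details from the paper.

```python
def interpolate_boxes(start_box, end_box, num_frames):
    """Linearly interpolate (x0, y0, x1, y1) boxes given in normalized [0, 1] coordinates."""
    boxes = []
    for frame in range(num_frames):
        t = frame / (num_frames - 1) if num_frames > 1 else 0.0
        boxes.append(tuple(s + t * (e - s) for s, e in zip(start_box, end_box)))
    return boxes


def box_to_mask(box, height, width):
    """Rasterize a normalized box into a binary mask; a cell is on if its center is inside."""
    x0, y0, x1, y1 = box
    return [
        [
            1 if x0 <= (col + 0.5) / width < x1 and y0 <= (row + 0.5) / height < y1 else 0
            for col in range(width)
        ]
        for row in range(height)
    ]


trajectory = interpolate_boxes((0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0), num_frames=3)
masks = [box_to_mask(box, height=4, width=4) for box in trajectory]
```

A mask of this kind could then be used to encourage attention for the object's prompt token inside the box and discourage it outside, frame by frame.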
2401.10061
Report |
DiffusionGPT: LLM-Driven Text-to-Image Generation System |
Jie Qin, Jie Wu, Weifeng Chen, Yuxi Ren, Huixia Li, Hefeng Wu, Xuefeng Xiao, Rui Wang, Shilei Wen |
Diffusion models have opened up new avenues for the field of image
generation, resulting in the proliferation of high-quality models shared on
open-source platforms. However, a major challenge persists: current
text-to-image systems are often unable to handle diverse inputs, or are limited
to single-model results. Current unified attempts often fall into two
orthogonal aspects: i) parsing diverse prompts at the input stage; ii) activating
an expert model to produce the output. To combine the best of both worlds, we propose
DiffusionGPT, which leverages Large Language Models (LLM) to offer a unified
generation system capable of seamlessly accommodating various types of prompts
and integrating domain-expert models. DiffusionGPT constructs domain-specific
Trees for various generative models based on prior knowledge. When provided
with an input, the LLM parses the prompt and employs the Tree-of-Thought to
guide the selection of an appropriate model, thereby relaxing input constraints
and ensuring exceptional performance across diverse domains. Moreover, we
introduce Advantage Databases, where the Tree-of-Thought is enriched with human
feedback, aligning the model selection process with human preferences. Through
extensive experiments and comparisons, we demonstrate the effectiveness of
DiffusionGPT, showcasing its potential for pushing the boundaries of image
synthesis in diverse domains. |
This paper presents DiffusionGPT, a novel unified image generation system that leverages Large Language Models (LLMs) to handle diverse prompts and integrate domain-expert models for superior image synthesis. |
Existing text-to-image systems struggle with diverse prompt types and often provide limited results due to reliance on single models with varying domain expertise. |
DiffusionGPT utilizes LLMs to parse prompts, build and search domain-specific model trees (Tree-of-Thought), select optimal models based on human feedback (Advantage Databases), and execute image generation with prompt extension. |
DiffusionGPT generates more realistic and semantically aligned images compared to baseline models like SD1.5 and SDXL.
Quantitative evaluation using image-reward and aesthetic scores demonstrates significant improvements over baseline models.
User studies confirm a strong preference for images generated by DiffusionGPT, highlighting its superior quality and alignment with user intent. |
Limited feedback incorporation in LLM optimization for prompt parsing and model selection.
Dependence on a finite set of model candidates, limiting the diversity and quality of potential outputs. |
image generation, diffusion models, large language models, prompt engineering, human feedback |
2401.10039
Report |
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition |
Guangzhao Dai, Xiangbo Shu, Wenhao Wu, Rui Yan, Jiachao Zhang |
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have
shown impressive performance in various visual recognition tasks. This
advancement paves the way for notable performance in Zero-Shot Egocentric
Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global
video-text matching task, which often leads to suboptimal alignment of vision
and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs,
emphasizing fine-grained concept-description alignment that capitalizes on the
rich semantic and contextual details in egocentric videos. In this paper, we
introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for
ZS-EAR, designed to enhance the fine-grained alignment of concept and
description between vision and language. Extensive experiments demonstrate
GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric
video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%),
and CharadesEgo (31.5%, +2.6%). |
This paper introduces GPT4Ego, a novel Vision-Language Model (VLM) framework for Zero-Shot Egocentric Action Recognition (ZS-EAR) that prioritizes fine-grained alignment between vision and language. |
Existing VLM-based ZS-EAR approaches treat the task as coarse-grained global video-text matching, leading to suboptimal alignment and limiting performance. |
GPT4Ego leverages two key components: 1) Ego-oriented Text Prompting (EgoTP) enhances text-contextual semantics by using ChatGPT to generate diverse textual descriptions from class names. 2) Ego-oriented Visual Parsing (EgoVP) utilizes SAM to parse refined visual concepts from video frames, enhancing vision-contextual semantics. |
GPT4Ego significantly outperforms state-of-the-art methods on EK100, EGTEA, and CharadesEgo benchmarks.
Both EgoTP and EgoVP individually contribute to performance gains, with their combination leading to the most significant improvement.
GPT4Ego effectively captures fine-grained semantic alignment between vision and language, as demonstrated by qualitative analysis. |
The current implementation relies on external models like ChatGPT and SAM, limiting its computational efficiency.
Future work could explore joint training of the VLM with the text generation and visual parsing modules for improved synergy. |
egocentric action recognition, zero-shot learning, vision-language learning, chatgpt, segment anything model (sam) |
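The GPT4Ego entry above enriches each class name with several generated descriptions before matching against video features. The sketch below shows one simple way such descriptions could be used for zero-shot prediction: average the normalized description embeddings per class and pick the class with the highest cosine similarity to the video embedding. The averaging rule, the toy vectors, and the function names are assumptions; real embeddings would come from the VLM's text and video encoders.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_predict(video_embedding, class_description_embeddings):
    """Return (best class index, per-class cosine scores)."""
    prototypes = np.stack([
        l2_normalize(l2_normalize(descs).mean(axis=0))
        for descs in class_description_embeddings
    ])
    scores = prototypes @ l2_normalize(video_embedding)
    return int(np.argmax(scores)), scores


# Toy stand-ins for encoder outputs: two classes, two descriptions each.
descriptions = [
    np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),  # class 0
    np.array([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),  # class 1
]
video = np.array([0.1, 0.9, 0.1])
predicted, scores = zero_shot_predict(video, descriptions)
```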
2401.10005
Report |
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation |
Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada |
The increasing demand for intelligent systems capable of interpreting and
reasoning about visual content requires the development of Large Multi-Modal
Models (LMMs) that are not only accurate but also have explicit reasoning
capabilities. This paper presents a novel approach to imbue an LMM with the
ability to conduct explicit reasoning based on visual content and textual
instructions. We introduce a system that can ask a question to acquire
necessary knowledge, thereby enhancing the robustness and explicability of the
reasoning process. Our method comprises the development of a novel dataset
generated by a Large Language Model (LLM), designed to promote chain-of-thought
reasoning combined with a question-asking mechanism. We designed an LMM, which
has high capabilities on region awareness to address the intricate requirements
of image-text alignment. The model undergoes a three-stage training phase,
starting with large-scale image-text alignment using large-scale datasets,
followed by instruction tuning, and fine-tuning with a focus on
chain-of-thought reasoning. The results demonstrate a stride toward a more
robust, accurate, and interpretable LMM, capable of reasoning explicitly and
seeking information proactively when confronted with ambiguous visual input. |
This paper introduces a novel approach to enhance Large Multi-Modal Models (LMMs) by integrating an explicit Chain-of-Reasoning (CoR) process and the ability to generate clarifying questions during reasoning, aiming for more reliable and interpretable visual content interpretation. |
Current LMMs often suffer from hallucination, producing outputs not aligned with the input, and lack the ability to explain their reasoning. This work aims to address these limitations by incorporating explicit reasoning and question-asking capabilities. |
The authors create a new dataset containing reasoning steps with uncertainty scores, prompting LLM-generated questions when uncertainty is high. They then develop an LMM with improved region awareness, trained in three stages: image-text alignment, instruction tuning, and CoR fine-tuning. |
The model successfully generates explicit reasoning steps and asks relevant questions when encountering uncertainty.
Integrating question-asking significantly improves performance on knowledge-based VQA tasks like OK-VQA.
Current LMMs still struggle with generating perfectly consistent and coherent long reasoning steps, highlighting an area for future research. |
The model's performance on long reasoning steps needs further improvement to match the accuracy of direct answer generation.
Future work could explore alternative methods for acquiring external knowledge beyond relying solely on LLMs like GPT-4. |
large multi-modal models, chain-of-reasoning, question generation, visual reasoning, explainable ai |
2401.09985
Report |
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens |
Xiaofeng Wang, Zheng Zhu, Guan Huang, Boyuan Wang, Xinze Chen, Jiwen Lu |
World models play a crucial role in understanding and predicting the dynamics
of the world, which is essential for video generation. However, existing world
models are confined to specific scenarios such as gaming or driving, limiting
their ability to capture the complexity of general world dynamic environments.
Therefore, we introduce WorldDreamer, a pioneering world model to foster a
comprehensive comprehension of general world physics and motions, which
significantly enhances the capabilities of video generation. Drawing
inspiration from the success of large language models, WorldDreamer frames
world modeling as an unsupervised visual sequence modeling challenge. This is
achieved by mapping visual inputs to discrete tokens and predicting the masked
ones. During this process, we incorporate multi-modal prompts to facilitate
interaction within the world model. Our experiments show that WorldDreamer
excels in generating videos across different scenarios, including natural
scenes and driving environments. WorldDreamer showcases versatility in
executing tasks such as text-to-video conversion, image-to-video synthesis, and
video editing. These results underscore WorldDreamer's effectiveness in
capturing dynamic elements within diverse general world environments. |
Introducing *WorldDreamer*, the first general world model for video generation that effectively learns general world motion and physics from visual data. |
Existing world models are limited to specific scenarios, hindering their ability to capture the complexity of general world dynamics crucial for versatile video generation. |
*WorldDreamer* leverages VQGAN for visual tokenization and employs a novel Spatial Temporal Patchwise Transformer (STPT) to predict masked visual tokens. Multi-modal prompts, including text and action embeddings, guide the generation process. |
*WorldDreamer* excels in generating high-fidelity videos across diverse scenarios, including natural scenes and driving environments.
It exhibits versatility in various tasks such as text-to-video conversion, image-to-video synthesis, video editing, and action-to-video generation.
The model demonstrates significant speed advantages over diffusion-based methods, achieving video generation with considerably fewer iterations. |
The model currently operates at a resolution of 256x256, leaving room for improvement in generating higher-resolution videos.
Further exploration of more intricate masking strategies could potentially enhance the model's ability to capture and generate complex motions. |
video generation, world models, vision transformer, multi-modal learning, generative ai |
2401.09865
Report |
Improving fine-grained understanding in image-text pre-training |
Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrović |
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple
method for pretraining more fine-grained multimodal representations from
image-text pairs. Given that multiple image patches often correspond to single
words, we propose to learn a grouping of image patches for every token in the
caption. To achieve this, we use a sparse similarity metric between image
patches and language tokens and compute for each token a language-grouped
vision embedding as the weighted average of patches. The token and
language-grouped vision embeddings are then contrasted through a fine-grained
sequence-wise loss that only depends on individual samples and does not require
other batch samples as negatives. This enables more detailed information to be
learned in a computationally inexpensive manner. SPARC combines this
fine-grained loss with a contrastive loss between global image and text
embeddings to learn representations that simultaneously encode global and local
information. We thoroughly evaluate our proposed method and show improved
performance over competing approaches both on image-level tasks relying on
coarse-grained information, e.g. classification, as well as region-level tasks
relying on fine-grained information, e.g. retrieval, object detection, and
segmentation. Moreover, SPARC improves model faithfulness and captioning in
foundational vision-language models. |
The paper proposes SPARC, a new objective for multimodal pre-training that improves fine-grained understanding in vision-language models. |
Existing methods for learning fine-grained visual representations are computationally expensive, unstable, and often rely on pre-trained models, making it difficult to isolate the benefits of fine-grained objectives. |
SPARC learns language-grouped vision embeddings by aggregating image patches corresponding to individual words in the caption using a sparse similarity metric. It combines a fine-grained contrastive loss on these embeddings with a global image-text contrastive loss. |
SPARC outperforms or matches competing methods on zero-shot image classification across ImageNet and its variants.
SPARC achieves superior performance on zero-shot image-to-text and text-to-image retrieval on Flickr30k and MSCOCO datasets.
SPARC shows significant improvements on fine-grained localization tasks such as open-vocabulary object detection and zero-shot semantic segmentation. |
Exploring different sparsification approaches and leveraging bounding boxes/segmentation masks for learning patch groupings could further improve performance.
Further investigation is needed to evaluate SPARC encoders within multimodal foundational models like Flamingo, BLIP, and PALI. |
multimodal learning, contrastive learning, vision-language models, fine-grained understanding, image-text retrieval |
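The language-grouped vision embedding described in the SPARC entry above (a sparse similarity between one caption token and the image patches, followed by a weighted average of the patches) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `language_grouped_embedding`, the min-max normalisation, and the `1/P` sparsity threshold (with `P` the number of patches) are assumptions made for the example, and the toy vectors stand in for real encoder outputs.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def language_grouped_embedding(token, patches):
    """Return (embedding, weights) for a single caption token.

    token:   list of floats, the token embedding
    patches: list of patch embeddings, each the same length as token
    """
    sims = [dot(token, p) for p in patches]

    # Min-max normalise the similarities to [0, 1] (assumed choice).
    lo, hi = min(sims), max(sims)
    if hi == lo:
        norm = [1.0] * len(sims)  # every patch is equally similar
    else:
        norm = [(s - lo) / (hi - lo) for s in sims]

    # Sparsify: zero out patches below the assumed threshold 1/P.
    threshold = 1.0 / len(patches)
    sparse = [s if s >= threshold else 0.0 for s in norm]

    # Renormalise so the surviving weights sum to one.
    total = sum(sparse)
    weights = [s / total for s in sparse]

    # Weighted average of the patch embeddings.
    dim = len(patches[0])
    embedding = [
        sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)
    ]
    return embedding, weights


# Toy example: four 2-D "patches" and one 2-D "token".
token = [1.0, 0.0]
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
embedding, weights = language_grouped_embedding(token, patches)
print(weights)
print(embedding)
```

In this toy case the patch pointing away from the token receives zero weight, so it does not contribute to the grouped embedding. The fine-grained loss would then contrast each token embedding with its grouped vision embedding within a single sample; that step is not shown here.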
2401.09861
Report |
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models |
Li Sun, Liuan Wang, Jun Sun, Takayuki Okatani |
Recent advancements in Multimodal Large Language Models (MLLMs) have
significantly enhanced the comprehension of multimedia content, bringing
together diverse modalities such as text, images, and videos. However, a
critical challenge faced by these models, especially when processing video
inputs, is the occurrence of hallucinations - erroneous perceptions or
interpretations, particularly at the event level. This study introduces an
innovative method to address event-level hallucinations in MLLMs, focusing on
specific temporal understanding in video content. Our approach leverages a
novel framework that extracts and utilizes event-specific information from both
the event query and the provided video to refine MLLMs' response. We propose a
unique mechanism that decomposes on-demand event queries into iconic actions.
Subsequently, we employ models like CLIP and BLIP2 to predict specific
timestamps for event occurrences. Our evaluation, conducted using the
Charades-STA dataset, demonstrates a significant reduction in temporal
hallucinations and an improvement in the quality of event-related responses.
This research not only provides a new perspective in addressing a critical
limitation of MLLMs but also contributes a quantitatively measurable method for
evaluating MLLMs in the context of temporal-related questions. |
This paper introduces a novel framework to mitigate event-level temporal hallucinations in Multimodal Large Language Models (MLLMs) when processing video inputs, improving accuracy in answering temporal event queries. |
MLLMs, while proficient in understanding multimedia content, often suffer from hallucinations, particularly in accurately perceiving event timings and sequences in videos, leading to erroneous interpretations. |
The proposed method decomposes event queries into iconic actions, uses CLIP and BLIP2 models to predict specific timestamps for these actions, and corrects the MLLM's responses using these timestamps as factual evidence. |
Significantly reduces temporal hallucinations in MLLMs' responses to event-related questions.
Demonstrates superior performance in predicting event occurrence timestamps compared to baseline MLLMs and random predictions.
Shows substantial improvement in predicting the order of multiple events within a video. |
The current evaluation is limited to the Charades-STA dataset, potentially limiting the generalizability of the findings.
Future work can explore incorporating more sophisticated temporal reasoning mechanisms to further enhance the accuracy of event sequencing. |
multimodal large language models, temporal hallucination, video understanding, event sequencing, clip, blip2 |
2401.09794
Report |
Wavelet-Guided Acceleration of Text Inversion in Diffusion-Based Image Editing |
Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo |
In the field of image editing, Null-text Inversion (NTI) enables fine-grained
editing while preserving the structure of the original image by optimizing null
embeddings during the DDIM sampling process. However, the NTI process is
time-consuming, taking more than two minutes per image. To address this, we
introduce an innovative method that maintains the principles of the NTI while
accelerating the image editing process. We propose the WaveOpt-Estimator, which
determines the text optimization endpoint based on frequency characteristics.
Utilizing wavelet transform analysis to identify the image's frequency
characteristics, we can limit text optimization to specific timesteps during
the DDIM sampling process. By adopting the Negative-Prompt Inversion (NPI)
concept, a target prompt representing the original image serves as the initial
text value for optimization. This approach maintains performance comparable to
NTI while reducing the average editing time by over 80% compared to the NTI
method. Our method presents a promising approach for efficient, high-quality
image editing based on diffusion models. |
Presents WaveOpt-Estimator, a novel method to accelerate Null-Text Inversion (NTI) for efficient image editing with diffusion models. |
NTI enables fine-grained image editing while preserving the original structure but suffers from long processing times. |
Analyzes the relationship between image frequency components and NTI optimization endpoints using wavelet transform. Employs this analysis to train WaveOpt-Estimator which predicts optimal stopping points for NTI optimization, significantly reducing processing time. |
Images with different frequency characteristics exhibit varying optimal NTI optimization endpoints.
WaveOpt-Estimator accurately predicts optimization endpoints with a Mean Absolute Error (MAE) of 2.9 timesteps.
Applying WaveOpt-Estimator to NTI achieves an 80% reduction in processing time compared to standard NTI while maintaining high image quality (PSNR ratio > 0.9). |
The current implementation primarily focuses on image reconstruction without extensive evaluation on diverse editing prompts.
Exploration of other frequency analysis techniques beyond wavelet transform could further enhance the WaveOpt-Estimator's performance. |
image editing, diffusion models, null-text inversion, wavelet transform, optimization |
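To make the notion of an image's frequency characteristics in the WaveOpt-Estimator entry above more concrete, here is a small sketch that measures how much of an image's energy falls in the detail (high-frequency) subbands of a single-level 2D Haar transform. The choice of the Haar wavelet, the function name `high_frequency_ratio`, and the energy-ratio summary are assumptions for illustration only; the summary above does not state which wavelet or which features the estimator actually uses.

```python
def high_frequency_ratio(img):
    """Share of energy in the three detail subbands of a one-level 2D Haar transform.

    img: list of rows of floats with even height and even width.
    Returns a value in [0, 1]; 0.0 for an all-zero image.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    low_energy = 0.0
    high_energy = 0.0
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            e, f = img[r + 1][c], img[r + 1][c + 1]
            # Orthonormal Haar coefficients of the 2x2 block.
            approx = (a + b + e + f) / 2.0      # block average (low-pass)
            col_diff = (a - b + e - f) / 2.0    # left column minus right column
            row_diff = (a + b - e - f) / 2.0    # top row minus bottom row
            diag_diff = (a - b - e + f) / 2.0   # diagonal detail
            low_energy += approx ** 2
            high_energy += col_diff ** 2 + row_diff ** 2 + diag_diff ** 2

    total = low_energy + high_energy
    return high_energy / total if total > 0 else 0.0


flat = [[3.0, 3.0], [3.0, 3.0]]
checker = [[1.0, -1.0], [-1.0, 1.0]]
print(high_frequency_ratio(flat))
print(high_frequency_ratio(checker))
```

A flat image has all of its energy in the low-pass band, while a checkerboard has all of it in the detail bands. A feature of this kind is one plausible input for predicting where text optimization can stop.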
2401.09742
Report |
Image Translation as Diffusion Visual Programmers |
Cheng Han, James C. Liang, Qifan Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Ying Nian Wu, Dongfang Liu |
We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic
image translation framework. Our proposed DVP seamlessly embeds a
condition-flexible diffusion model within the GPT architecture, orchestrating a
coherent sequence of visual programs (i.e., computer vision models) for various
pro-symbolic steps, which span RoI identification, style transfer, and position
manipulation, facilitating transparent and controllable image translation
processes. Extensive experiments demonstrate DVP's remarkable performance,
surpassing concurrent methods. This success can be attributed to several key
features of DVP: First, DVP achieves condition-flexible translation via
instance normalization, enabling the model to eliminate sensitivity caused by
the manual guidance and optimally focus on textual descriptions for
high-quality content generation. Second, the framework enhances in-context
reasoning by deciphering intricate high-dimensional concepts in feature spaces
into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]),
allowing for localized, context-free editing while maintaining overall
coherence. Last but not least, DVP improves systemic controllability and
explainability by offering explicit symbolic representations at each
programming stage, empowering users to intuitively interpret and modify
results. Our research marks a substantial step towards harmonizing artificial
image translation processes with cognitive intelligence, promising broader
applications. |
This paper introduces Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework that combines a condition-flexible diffusion model with the GPT architecture for controllable and explainable image manipulation. |
Existing diffusion-based image translation methods suffer from limitations such as condition-rigid learning, context-free incompetence, and system opacity. This paper addresses these limitations by enabling more flexible and interpretable image translation. |
DVP leverages GPT to generate visual programs consisting of computer vision models for RoI identification, style transfer, and position manipulation. It utilizes instance normalization to enhance condition-flexibility and decomposes complex concepts into symbols for in-context reasoning. |
DVP achieves state-of-the-art performance on image translation benchmarks, demonstrating high fidelity and quality.
Instance normalization guidance in DVP's diffusion model enhances robustness and eliminates the need for manual guidance scale parameter tuning.
The visual programming paradigm enables context-free editing, allowing for specific RoI modifications while preserving overall image coherence. |
DVP struggles with image translation in challenging situations like poor photometric conditions and occluded objects, suggesting a need for specialized datasets and improved object segmentation.
While instance normalization guidance is effective for text-guided diffusion, its application in broader image generation tasks requires further exploration. |
image translation, diffusion models, visual programming, neuro-symbolic ai, explainable ai |
2401.09732
Report |
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation |
Zesen Cheng, Kehan Li, Hao Li, Peng Jin, Chang Liu, Xiawu Zheng, Rongrong Ji, Jie Chen |
Temporally locating objects with arbitrary class texts is the primary pursuit
of open-vocabulary Video Instance Segmentation (VIS). Because of the
insufficient vocabulary of video data, previous methods leverage image-text
pretraining models to recognize object instances by separately aligning each
frame and class texts, ignoring the correlation between frames. As a result,
the separation breaks the instance movement context of videos, causing inferior
alignment between video and text. To tackle this issue, we propose to link
frame-level instance representations as a Brownian Bridge to model instance
dynamics and align bridge-level instance representation to class texts for more
precise open-vocabulary VIS (BriVIS). Specifically, we build our system upon
a frozen video segmentor to generate frame-level instance queries, and design
Temporal Instance Resampler (TIR) to generate queries with temporal context
from frame queries. To mold instance queries to follow Brownian bridge and
accomplish alignment with class texts, we design Bridge-Text Alignment (BTA) to
learn discriminative bridge-level representations of instances via contrastive
objectives. Setting MinVIS as the basic video segmentor, BriVIS surpasses the
Open-vocabulary SOTA (OV2Seg) by a clear margin. For example, on the
challenging large-vocabulary VIS dataset (BURST), BriVIS achieves 7.43 mAP and
exhibits 49.49% improvement compared to OV2Seg (4.97 mAP). |
This paper proposes BriVIS, an open-vocabulary video instance segmentation method that leverages instance dynamics by modeling instance features as a Brownian Bridge and aligning the bridge center with class text embeddings. |
Existing open-vocabulary VIS methods rely on aligning individual frames with class texts, neglecting the crucial temporal context of instance movement in videos. |
BriVIS utilizes a frozen video segmentor to generate frame-level instance queries and employs a Temporal Instance Resampler (TIR) to capture temporal context. A Bridge-Text Alignment (BTA) module then links these features as a Brownian Bridge, aligning the bridge center with corresponding class texts via contrastive learning. |
BriVIS significantly outperforms previous open-vocabulary VIS methods, achieving a 49.49% improvement on the BURST dataset.
Analysis shows BriVIS effectively handles instances spanning long durations, indicating robust temporal modeling.
BriVIS demonstrates competitive performance against closed-vocabulary VIS methods, highlighting its strong vocabulary generalization ability. |
The reliance on offline processing due to the Brownian Bridge modeling poses challenges for long videos or video streams.
The implicit modeling of temporal context within the CLIP visual space limits its applicability to complex video tasks demanding profound temporal reasoning. |
open-vocabulary video instance segmentation, brownian bridge, temporal context modeling, contrastive learning, vision-language pretraining |
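For readers unfamiliar with the Brownian bridge used in the BriVIS entry above, the sketch below computes the mean and variance of a standard Brownian bridge pinned at a start feature (time 0) and an end feature (time T). At time t the mean is the linear interpolation between the two endpoints and the variance is t(T - t)/T, which is zero at both ends and largest at the center. The function names are hypothetical, the 2-D vectors stand in for frame-level instance features, and this does not reproduce the paper's Bridge-Text Alignment objective.

```python
def bridge_mean(z_start, z_end, t, T):
    """Mean of a Brownian bridge at time t, pinned at z_start (t=0) and z_end (t=T)."""
    if T <= 0 or not 0 <= t <= T:
        raise ValueError("require T > 0 and 0 <= t <= T")
    alpha = t / T
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]


def bridge_variance(t, T):
    """Per-dimension variance of a standard Brownian bridge at time t."""
    if T <= 0 or not 0 <= t <= T:
        raise ValueError("require T > 0 and 0 <= t <= T")
    return t * (T - t) / T


# Toy example: features of the first and last frame of a 5-frame clip (T = 4).
z_first, z_last = [0.0, 0.0], [4.0, 2.0]
center = bridge_mean(z_first, z_last, t=2, T=4)
print(center)                  # the bridge center
print(bridge_variance(2, 4))   # uncertainty is largest at the center
```

Under this model, intermediate frame features are expected to stay close to the line between the endpoints, with the most slack in the middle of the clip.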
2401.09720
Report |
GaussianBody: Clothed Human Reconstruction via 3d Gaussian Splatting |
Mengtian Li, Shengxiang Yao, Zhifeng Xie, Keyu Chen |
In this work, we propose a novel clothed human reconstruction method called
GaussianBody, based on 3D Gaussian Splatting. Compared with the costly neural
radiance based models, 3D Gaussian Splatting has recently demonstrated great
performance in terms of training time and rendering quality. However, applying
the static 3D Gaussian Splatting model to the dynamic human reconstruction
problem is non-trivial due to complicated non-rigid deformations and rich cloth
details. To address these challenges, our method considers explicit pose-guided
deformation to associate dynamic Gaussians across the canonical space and the
observation space; introducing a physically-based prior with regularized
transformations helps mitigate ambiguity between the two spaces. During the
training process, we further propose a pose refinement strategy to update the
pose regression for compensating the inaccurate initial estimation and a
split-with-scale mechanism to enhance the density of regressed point clouds.
The experiments validate that our method can achieve state-of-the-art
photorealistic novel-view rendering results with high-quality details for
dynamic clothed human bodies, along with explicit geometry reconstruction. |
This paper introduces GaussianBody, a novel method for clothed human reconstruction from monocular RGB videos, leveraging 3D Gaussian Splatting (3D-GS) for efficient high-fidelity reconstruction. |
Existing methods struggle to balance high-fidelity reconstruction with fast training and rendering. This work addresses this by adapting 3D-GS for dynamic human reconstruction, enabling fast, detailed, and animatable human modeling. |
The method utilizes SMPL for pose-guided deformation of canonical Gaussians. A physically-based prior regularizes Gaussian transformations, ensuring geometric consistency. A split-with-scale strategy enhances point cloud density, and pose refinement improves SMPL parameter accuracy. |
GaussianBody achieves state-of-the-art results in novel view synthesis on PeopleSnapshot dataset, outperforming baselines in PSNR, SSIM, and LPIPS metrics.
The method generates high-quality point clouds that capture intricate clothing and body details, enabling accurate representation of non-rigid deformations.
Ablation studies validate the contribution of the physically-based prior, pose refinement, and split-with-scale strategies to the reconstruction quality. |
The current implementation faces challenges in novel pose synthesis due to sparse Gaussians and limitations in capturing complex non-rigid cloth deformations.
Further investigation is needed to improve the integration of deformation MLPs for more robust and accurate non-rigid deformation handling. |
3d human reconstruction, gaussian splatting, novel view synthesis, monocular reconstruction, physically-based priors |
2401.09673
Report |
Artwork Protection Against Neural Style Transfer Using Locally Adaptive Adversarial Color Attack |
Zhongliang Guo, Junhao Dong, Yifei Qian, Kaixuan Wang, Weiye Li, Ziheng Guo, Yuheng Wang, Yanli Li, Ognjen Arandjelović, Lei Fang |
Neural style transfer (NST) generates new images by combining the style of
one image with the content of another. However, unauthorized NST can exploit
artwork, raising concerns about artists' rights and motivating the development
of proactive protection methods. We propose Locally Adaptive Adversarial Color
Attack (LAACA), empowering artists to protect their artwork from unauthorized
style transfer by processing before public release. By delving into the
intricacies of human visual perception and the role of different frequency
components, our method strategically introduces frequency-adaptive
perturbations in the image. These perturbations significantly degrade the
generation quality of NST while maintaining an acceptable level of visual
change in the original image, ensuring that potential infringers are
discouraged from using the protected artworks because of their poor NST
generation quality. Additionally, existing metrics often overlook the
importance of color fidelity in evaluating color-mattered tasks, such as the
quality of NST-generated images, which is crucial in the context of artistic
works. To comprehensively assess the color-mattered tasks, we propose the
Adversarial Color Distance Metric (ACDM), designed to quantify the color
difference of images pre- and post-manipulations. Experimental results confirm
that attacking NST using LAACA results in visually inferior style transfer, and
the ACDM can efficiently measure color-mattered tasks. By providing artists
with a tool to safeguard their intellectual property, our work relieves the
socio-technical challenges posed by the misuse of NST in the art community. |
This paper introduces LAACA, a novel method to protect artwork from unauthorized neural style transfer by subtly perturbing style images to disrupt style transfer quality while maintaining visual fidelity. |
Unauthorized neural style transfer poses risks to artists' rights, demanding proactive protection methods for digital artworks. |
LAACA leverages frequency domain analysis to strategically embed perturbations in high-frequency areas of style images, maximizing disruption to style transfer while minimizing perceptual changes to the original artwork. Additionally, a new metric, ACDM, is proposed to quantify color differences in images pre- and post-manipulation, addressing the limitations of existing metrics in evaluating color-sensitive tasks. |
LAACA effectively disrupts the quality of style transfer across five different NST methods, leading to visually inferior results.
LAACA preserves the visual integrity of the original artwork, with minimal perceptible changes introduced by the adversarial perturbations.
ACDM demonstrates superior performance compared to existing metrics like SSIMc and LPIPS in capturing color differences relevant for evaluating color-mattered tasks such as NST. |
The current implementation primarily focuses on color disruption, and future work could explore incorporating texture-based disruptions for enhanced protection.
Further research could investigate the generalization of LAACA to other domains beyond artistic style transfer where content-style separation is relevant. |
adversarial attack, neural style transfer, copyright protection, image processing, computer vision |
2401.09603
Report |
Rethinking FID: Towards a Better Evaluation Metric for Image Generation |
Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, Sanjiv Kumar |
As with many machine learning problems, the progress of image generation
methods hinges on good evaluation metrics. One of the most popular is the
Fréchet Inception Distance (FID). FID estimates the distance between a
distribution of Inception-v3 features of real images, and those of images
generated by the algorithm. We highlight important drawbacks of FID:
Inception's poor representation of the rich and varied content generated by
modern text-to-image models, incorrect normality assumptions, and poor sample
complexity. We call for a reevaluation of FID's use as the primary quality
metric for generated images. We empirically demonstrate that FID contradicts
human raters, it does not reflect gradual improvement of iterative
text-to-image models, it does not capture distortion levels, and that it
produces inconsistent results when varying the sample size. We also propose an
alternative new metric, CMMD, based on richer CLIP embeddings and the maximum
mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased
estimator that does not make any assumptions on the probability distribution of
the embeddings and is sample efficient. Through extensive experiments and
analysis, we demonstrate that FID-based evaluations of text-to-image models may
be unreliable, and that CMMD offers a more robust and reliable assessment of
image quality. |
This paper argues that the commonly used Fréchet Inception Distance (FID) for evaluating image generation models has significant limitations and proposes an alternative metric called CMMD (CLIP-MMD). |
FID, despite being widely adopted, shows discrepancies with human perception and fails to accurately capture improvements in iterative image generation models or under complex image distortions. |
The paper analyzes FID's limitations, especially its reliance on normality assumptions for Inception embeddings which are often violated. It then proposes CMMD, which utilizes CLIP embeddings and the Maximum Mean Discrepancy (MMD) distance for a more robust and reliable evaluation. |
Human evaluation shows that FID contradicts human perception of image quality, while CMMD aligns better with human judgment.
CMMD accurately reflects the gradual improvement in iterative image generation models like Muse and Stable Diffusion, unlike FID which shows inconsistent behavior.
CMMD effectively captures image quality degradation under complex distortions in the latent space where FID fails. |
The bandwidth parameter of the Gaussian RBF kernel in CMMD is fixed at 10 for consistency; the paper reports that its impact is empirically insignificant but leaves further investigation to future work.
Future work includes exploring other kernels for MMD and conducting more comprehensive human evaluations. |
image generation, evaluation metrics, fréchet inception distance (fid), clip embeddings, maximum mean discrepancy (mmd) |
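The distance underlying CMMD in the entry above, the maximum mean discrepancy with a Gaussian RBF kernel, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the kernel is written as exp(-||x - y||^2 / (2 * sigma^2)) with sigma = 10 taken from the bandwidth mentioned above, the function names are hypothetical, and the short 2-D lists stand in for real CLIP embeddings. The estimator below is the standard unbiased one, which leaves the diagonal terms out of the within-set sums.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=10.0):
    """Unbiased estimate of squared MMD between two sets of embeddings."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each set needs at least two samples")
    # Within-set sums skip i == j, which is what makes the estimate unbiased.
    k_xx = sum(rbf_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(rbf_kernel(ys[i], ys[j], sigma)
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


# Toy "embeddings": a reference set, a slightly shifted set, and a distant set.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near = [[0.1, 0.0], [1.1, 0.0], [0.1, 1.0], [1.1, 1.0]]
far = [[20.0, 20.0], [21.0, 20.0], [20.0, 21.0], [21.0, 21.0]]
print(mmd2_unbiased(real, near))
print(mmd2_unbiased(real, far))
```

The distant set yields a much larger value than the slightly shifted one. Because the estimator is unbiased, it can be slightly negative when the two sets are very close, and it involves no normality assumption on the embeddings.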
2401.09419
Report |
GARField: Group Anything with Radiance Fields |
Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, Angjoo Kanazawa |
Grouping is inherently ambiguous due to the multiple levels of granularity in
which one can decompose a scene -- should the wheels of an excavator be
considered separate or part of the whole? We present Group Anything with
Radiance Fields (GARField), an approach for decomposing 3D scenes into a
hierarchy of semantically meaningful groups from posed image inputs. To do this
we embrace group ambiguity through physical scale: by optimizing a
scale-conditioned 3D affinity feature field, a point in the world can belong to
different groups of different sizes. We optimize this field from a set of 2D
masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine
hierarchy, using scale to consistently fuse conflicting masks from different
viewpoints. From this field we can derive a hierarchy of possible groupings via
automatic tree construction or user interaction. We evaluate GARField on a
variety of in-the-wild scenes and find it effectively extracts groups at many
levels: clusters of objects, objects, and various subparts. GARField inherently
represents multi-view consistent groupings and produces higher fidelity groups
than the input SAM masks. GARField's hierarchical grouping could have exciting
downstream applications such as 3D asset extraction or dynamic scene
understanding. See the project website at https://www.garfield.studio/ |
Presents GARField, a method that decomposes 3D scenes into a hierarchy of semantically meaningful groups from posed images by optimizing a scale-conditioned 3D affinity feature field. |
Grouping is inherently ambiguous due to multiple levels of granularity; GARField addresses this by using physical scale to consolidate groups into a hierarchy. |
Distills 2D segmentation masks from SAM into a 3D volumetric scale-conditioned affinity field, using contrastive loss and containment auxiliary loss to ensure transitivity and containment properties. Hierarchical decomposition is achieved via recursive clustering at descending scales. |
Effectively extracts groups at multiple levels (clusters of objects, objects, subparts).
Produces consistent 3D groupings, often improving upon the quality of input 2D segmentation masks.
Enables applications like 3D asset extraction and interactive segmentation. |
Limited by the quality and coverage of input 2D masks.
Current tree generation is naive and can lead to spurious small groups. |
3d scene understanding, hierarchical grouping, scale-conditioned affinity field, nerf, segmentation |
2401.09417
Report |
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model |
Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang |
Recently the state space models (SSMs) with efficient hardware-aware designs,
i.e., the Mamba deep learning model, have shown great potential for long
sequence modeling. Meanwhile building efficient and generic vision backbones
purely upon SSMs is an appealing direction. However, representing visual data
is challenging for SSMs due to the position-sensitivity of visual data and the
requirement of global context for visual understanding. In this paper, we show
that the reliance on self-attention for visual representation learning is not
necessary and propose a new generic vision backbone with bidirectional Mamba
blocks (Vim), which marks the image sequences with position embeddings and
compresses the visual representation with bidirectional state space models. On
ImageNet classification, COCO object detection, and ADE20k semantic
segmentation tasks, Vim achieves higher performance compared to
well-established vision transformers like DeiT, while also demonstrating
significantly improved computation & memory efficiency. For example, Vim is
2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch
inference to extract features on images with a resolution of 1248$\times$1248.
The results demonstrate that Vim is capable of overcoming the computation &
memory constraints on performing Transformer-style understanding for
high-resolution images and it has great potential to be the next-generation
backbone for vision foundation models. Code is available at
https://github.com/hustvl/Vim. |
This paper proposes a novel vision backbone named Vision Mamba (Vim) built upon bidirectional Mamba blocks, marking a departure from self-attention reliance in visual representation learning. |
This approach aims to address the challenges faced by state space models (SSMs) in visual data representation, particularly the position-sensitivity of visual data and the need for global context. It holds promise for efficient and generic vision backbones based on SSMs. |
Vim utilizes position embeddings to encode spatial information within image sequences and leverages bidirectional state space models to compress visual representations. |
Vim outperforms established vision transformers like DeiT in ImageNet classification, COCO object detection, and ADE20k semantic segmentation.
Vim demonstrates superior computational and memory efficiency compared to DeiT, especially with high-resolution images.
For instance, Vim is 2.8 times faster than DeiT and uses 86.8% less GPU memory during batch inference on images with a 1248x1248 resolution. |
The paper does not explicitly mention limitations.
Future work could focus on exploring the applicability of Vim in other vision tasks and datasets beyond those investigated in the paper. |
vision transformer, state space model, mamba, vision backbone, efficient deep learning |
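A toy sketch of the bidirectional scanning idea described above, under strong simplifying assumptions: a scalar linear recurrence with fixed parameters `a`, `b`, `c` (all hypothetical values) stands in for the Mamba block, which in practice is selective (input-dependent) and hardware-aware. Summing the forward and backward outputs is also an assumption made for illustration, not a statement of how Vim combines the two directions.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], a: float = 0.5,
                       b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run the recurrence left-to-right and right-to-left, then sum (assumed merge)."""
    forward = ssm_scan(xs, a, b, c)
    # Reverse the sequence, scan, then reverse back so positions line up.
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# A single "active" patch at the end of the sequence.
tokens = [0.0, 0.0, 0.0, 1.0]
forward_only = ssm_scan(tokens, 0.5, 1.0, 1.0)
both_directions = bidirectional_scan(tokens)
```

With a forward-only scan, the first three positions receive no information from the last patch; the backward pass is what lets every position see context on both sides, which is the motivation for bidirectionality in image sequences.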
2401.09416
Report |
TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion |
Yu-Ying Yeh, Jia-Bin Huang, Changil Kim, Lei Xiao, Thu Nguyen-Phuoc, Numair Khan, Cheng Zhang, Manmohan Chandraker, Carl S Marshall, Zhao Dong, Zhengqin Li |
We present TextureDreamer, a novel image-guided texture synthesis method to
transfer relightable textures from a small number of input images (3 to 5) to
target 3D shapes across arbitrary categories. Texture creation is a pivotal
challenge in vision and graphics. Industrial companies hire experienced artists
to manually craft textures for 3D assets. Classical methods require densely
sampled views and accurately aligned geometry, while learning-based methods are
confined to category-specific shapes within the dataset. In contrast,
TextureDreamer can transfer highly detailed, intricate textures from real-world
environments to arbitrary objects with only a few casually captured images,
potentially significantly democratizing texture creation. Our core idea,
personalized geometry-aware score distillation (PGSD), draws inspiration from
recent advancements in diffusion models, including personalized modeling for
texture information extraction, variational score distillation for detailed
appearance synthesis, and explicit geometry guidance with ControlNet. Our
integration and several essential modifications substantially improve the
texture quality. Experiments on real images spanning different categories show
that TextureDreamer can successfully transfer highly realistic, semantically
meaningful texture to arbitrary objects, surpassing the visual quality of
previous state-of-the-art. |
This paper presents TextureDreamer, a novel image-guided texture synthesis method that transfers relightable textures from a few input images (3-5) to target 3D shapes. |
Texture creation is crucial for realistic 3D content, but existing methods require dense views or are category-specific. This method offers a more accessible approach for diverse objects. |
The method combines personalized Dreambooth fine-tuning for texture extraction, variational score distillation (VSD) for realistic appearance, and ControlNet for geometry-aware generation (PGSD). |
Transfers highly detailed textures from real-world images to arbitrary objects.
Generates semantically meaningful textures that align with target geometry.
Outperforms state-of-the-art methods in qualitative and quantitative evaluations. |
May bake in lighting from input images into textures.
Struggles with transferring special and non-repeated textures. |
texture synthesis, diffusion models, image-guided, 3d content creation, neural rendering |
2401.09414
Report |
Vlogger: Make Your Dream A Vlog |
Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang |
In this work, we present Vlogger, a generic AI system for generating a
minute-level video blog (i.e., vlog) from user descriptions. Unlike short
videos lasting a few seconds, a vlog often contains a complex storyline with
diversified scenes, which is challenging for most existing video generation
approaches. To break through this bottleneck, our Vlogger smartly leverages
Large Language Model (LLM) as Director and decomposes a long video generation
task of vlog into four key stages, where we invoke various foundation models to
play the critical roles of vlog professionals, including (1) Script, (2) Actor,
(3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings,
our Vlogger can generate vlogs through explainable cooperation of top-down
planning and bottom-up shooting. Moreover, we introduce a novel video diffusion
model, ShowMaker, which serves as a videographer in our Vlogger for generating
the video snippet of each shooting scene. By incorporating Script and Actor
attentively as textual and visual prompts, it can effectively enhance
spatial-temporal coherence in the snippet. Besides, we design a concise mixed
training paradigm for ShowMaker, boosting its capacity for both T2V generation
and prediction. Finally, the extensive experiments show that our method
achieves state-of-the-art performance on zero-shot T2V generation and
prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs
from open-world descriptions, without loss of video coherence on script and
actor. The code and model are available at
https://github.com/zhuangshaobin/Vlogger. |
This paper introduces Vlogger, an AI system that uses large language models (LLMs) and foundation models to automatically generate minute-level, coherent video blogs (vlogs) from user descriptions. |
Existing video generation methods struggle to create long, coherent videos with diverse scenes and complex storylines, which are characteristic of vlogs. Vlogger addresses these limitations. |
Vlogger decomposes vlog generation into four stages: (1) Script creation with LLM as director, (2) Actor design with a character designer, (3) Video shooting with a novel video diffusion model (ShowMaker), and (4) Voiceover using a text-to-speech model. |
Vlogger achieves state-of-the-art performance on zero-shot text-to-video generation and prediction tasks.
It generates vlogs longer than 5 minutes from open-world descriptions, maintaining coherence in script and actor portrayal.
The novel ShowMaker component demonstrates effectiveness in generating controllable-duration video snippets with strong spatial-temporal coherence. |
The current implementation relies on multiple foundation models, which can be computationally expensive.
Future work will explore generating higher-resolution vlogs and incorporating more sophisticated editing techniques. |
video generation, large language models, video diffusion models, vlog generation, foundation models |
2401.09413
Report |
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images |
Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic |
We describe an approach to predict open-vocabulary 3D semantic voxel
occupancy map from input 2D images with the objective of enabling 3D grounding,
segmentation and retrieval of free-form language queries. This is a challenging
problem because of the 2D-3D ambiguity and the open-vocabulary nature of the
target tasks, where obtaining annotated training data in 3D is difficult. The
contributions of this work are three-fold. First, we design a new model
architecture for open-vocabulary 3D semantic occupancy prediction. The
architecture consists of a 2D-3D encoder together with occupancy prediction and
3D-language heads. The output is a dense voxel map of 3D grounded language
embeddings enabling a range of open-vocabulary tasks. Second, we develop a
tri-modal self-supervised learning algorithm that leverages three modalities:
(i) images, (ii) language and (iii) LiDAR point clouds, and enables training
the proposed architecture using a strong pre-trained vision-language model
without the need for any 3D manual language annotations. Finally, we
demonstrate quantitatively the strengths of the proposed model on several
open-vocabulary tasks: Zero-shot 3D semantic segmentation using existing
datasets; 3D grounding and retrieval of free-form language queries, using a
small dataset that we propose as an extension of nuScenes. You can find the
project page here https://vobecant.github.io/POP3D. |
This paper proposes POP3D, a novel method for open-vocabulary 3D semantic occupancy prediction from 2D images, enabling 3D grounding, segmentation, and retrieval of objects based on free-form language queries. |
This approach addresses the limitations of traditional 3D semantic occupancy prediction methods that rely on manually annotated 3D data and are restricted to a predefined set of object classes. |
POP3D employs a tri-modal self-supervised learning algorithm, leveraging images, LiDAR point clouds, and a pre-trained image-language network (MaskCLIP+). The architecture consists of a 2D-3D encoder, an occupancy prediction head, and a 3D-language head. |
POP3D achieves superior occupancy prediction compared to a fully supervised counterpart, demonstrating the effectiveness of the proposed tri-modal self-supervised learning approach.
The method demonstrates strong performance on zero-shot 3D semantic segmentation, showcasing its open-vocabulary capabilities.
POP3D exhibits promising results for language-driven 3D grounding and retrieval tasks, enabling interaction with 3D scenes using natural language queries. |
The model's performance is limited by the resolution of the voxel grid, hindering its ability to detect small objects.
The architecture does not natively support image sequences, which could be beneficial for reasoning about occluded objects and dynamic scenes. Future work could explore incorporating temporal information into the model. |
3d semantic occupancy prediction, open-vocabulary learning, tri-modal self-supervised learning, language-driven 3d grounding, zero-shot semantic segmentation |
2401.09340
Report |
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding |
Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang |
3D vision-language grounding, which focuses on aligning language with the 3D
physical environment, stands as a cornerstone in the development of embodied
agents. In comparison to recent advancements in the 2D domain, grounding
language in 3D scenes faces several significant challenges: (i) the inherent
complexity of 3D scenes due to the diverse object configurations, their rich
attributes, and intricate relationships; (ii) the scarcity of paired 3D
vision-language data to support grounded learning; and (iii) the absence of a
unified learning framework to distill knowledge from grounded 3D data. In this
work, we aim to address these three major challenges in 3D vision-language by
examining the potential of systematically upscaling 3D vision-language learning
in indoor environments. We introduce the first million-scale 3D vision-language
dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising
2.5M vision-language pairs derived from both human annotations and our scalable
scene-graph-based generation approach. We demonstrate that this scaling allows
for a unified pre-training framework, Grounded Pre-training for Scenes (GPS),
for 3D vision-language learning. Through extensive experiments, we showcase the
effectiveness of GPS by achieving state-of-the-art performance on all existing
3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is
unveiled through zero-shot transfer experiments in the challenging 3D
vision-language tasks. Project website: https://scene-verse.github.io. |
This work presents SceneVerse, the first million-scale 3D vision-language dataset for grounded scene understanding, and GPS (Grounded Pre-training for Scenes), a unified pre-training framework based on multi-level contrastive alignment. |
Grounding language in 3D scenes is crucial for embodied agents but faces challenges due to scene complexity, data scarcity, and lack of unified learning frameworks. |
SceneVerse is created by combining existing 3D scene data with automatically generated scene-language pairs using scene graphs and LLMs. GPS leverages this data with multi-level contrastive alignment for object-level, scene-level, and referral-object-level grounding. |
GPS achieves state-of-the-art results on all existing 3D visual grounding benchmarks.
Pre-trained GPS shows strong zero-shot generalization capabilities for grounded scene understanding.
Scaling data, especially with realistic scenes, significantly benefits 3D visual grounding and other 3D understanding tasks like semantic segmentation. |
The domain gap between real and synthetic scenes poses challenges for generalization.
Future work should focus on collecting more diverse, realistic, and large-scale 3D scenes. |
3d vision-language grounding, 3d scene understanding, million-scale dataset, contrastive learning, zero-shot transfer |
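A minimal sketch of what one level of contrastive alignment can look like, assuming a generic symmetric InfoNCE-style loss over matched scene/text feature pairs. The function names, the temperature value, and the two-dimensional toy features are all hypothetical; the text above does not specify the exact loss used by GPS, so this illustrates the general technique rather than the paper's formulation.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cross_entropy_row(logits: Sequence[float], target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def contrastive_loss(scene_feats: List[List[float]],
                     text_feats: List[List[float]],
                     temperature: float = 0.1) -> float:
    """Symmetric InfoNCE: pair i in one modality should match pair i in the other."""
    n = len(scene_feats)
    sim = [[dot(s, t) / temperature for t in text_feats] for s in scene_feats]
    # Scene-to-text: each row of the similarity matrix is a classification over texts.
    s2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-scene: each column is a classification over scenes.
    t2s = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
              for i in range(n)) / n
    return 0.5 * (s2t + t2s)


scenes = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(scenes, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(scenes, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each scene feature is closest to its own description and large when the descriptions are swapped. A multi-level variant would apply a loss of this shape separately at each granularity (for example object, scene, and referred object) and combine the terms.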
2401.09084
Report |
UniVG: Towards UNIfied-modal Video Generation |
Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao |
Diffusion based video generation has received extensive attention and
achieved considerable success within both the academic and industrial
communities. However, current efforts are mainly concentrated on
single-objective or single-task video generation, such as generation driven by
text, by image, or by a combination of text and image. This cannot fully meet
the needs of real-world application scenarios, as users are likely to input
images and text conditions in a flexible manner, either individually or in
combination. To address this, we propose a Unified-modal Video Generation
system that is capable of handling multiple video generation tasks across text
and image modalities. To this end, we revisit the various video generation
tasks within our system from the perspective of generative freedom, and
classify them into high-freedom and low-freedom video generation categories.
For high-freedom video generation, we employ Multi-condition Cross Attention to
generate videos that align with the semantics of the input images or text. For
low-freedom video generation, we introduce Biased Gaussian Noise to replace the
pure random Gaussian Noise, which helps to better preserve the content of the
input conditions. Our method achieves the lowest Fréchet Video Distance (FVD)
on the public academic benchmark MSR-VTT, surpasses the current open-source
methods in human evaluations, and is on par with the current closed-source
method Gen2. For more samples, visit https://univg-baidu.github.io. |
This paper presents UniVG, a unified video generation system that handles multiple video generation tasks (e.g., text-to-video, image-to-video) within a single framework. |
Current video generation models are limited to single-objective or single-task pipelines, lacking flexibility to meet diverse user needs who might input text and image conditions in various combinations. |
UniVG categorizes video generation tasks by "generative freedom": high-freedom (e.g., text/image-to-video) uses Multi-condition Cross Attention, and low-freedom (e.g., image animation, super-resolution) employs Biased Gaussian Noise for better content preservation. |
UniVG achieves the lowest FVD on MSR-VTT benchmark, surpassing open-source methods.
Human evaluations show UniVG is on par with the closed-source Gen2 and outperforms other open-source methods.
Ablation studies confirm the effectiveness of Multi-condition Cross Attention and Biased Gaussian Noise for their respective categories. |
The current model struggles to generate videos with a large amount of motion, potentially due to limitations in training data.
Future work will explore alternative solutions for Biased Gaussian Noise and extend its application to other low-freedom video generation tasks. |
video generation, diffusion models, multi-modal generation, image animation, video super-resolution |
2401.09050
Report |
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior |
Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang |
Score distillation sampling (SDS) and its variants have greatly boosted the
development of text-to-3D generation, but remain vulnerable to geometry collapse
and poor textures. To solve this issue, we first analyze SDS in depth and
find that its distillation sampling process indeed corresponds to the
trajectory sampling of a stochastic differential equation (SDE): SDS samples
along an SDE trajectory to yield a less noisy sample which then serves as a
guidance to optimize a 3D model. However, the randomness in SDE sampling often
leads to a diverse and unpredictable sample which is not always less noisy, and
thus is not a consistently correct guidance, explaining the vulnerability of
SDS. Since for any SDE, there always exists an ordinary differential equation
(ODE) whose trajectory sampling can deterministically and consistently converge
to the desired target point as the SDE, we propose a novel and effective
"Consistent3D" method that explores the ODE deterministic sampling prior for
text-to-3D generation. Specifically, at each training iteration, given a
rendered image by a 3D model, we first estimate its desired 3D score function
by a pre-trained 2D diffusion model, and build an ODE for trajectory sampling.
Next, we design a consistency distillation sampling loss which samples along
the ODE trajectory to generate two adjacent samples and uses the less noisy
sample to guide another more noisy one for distilling the deterministic prior
into the 3D model. Experimental results show the efficacy of our Consistent3D
in generating high-fidelity and diverse 3D objects and large-scale scenes, as
shown in Fig. 1. The codes are available at
https://github.com/sail-sg/Consistent3D. |
This paper presents Consistent3D, a novel text-to-3D generation method that uses deterministic sampling priors to address the geometry collapse and poor texture issues in Score Distillation Sampling (SDS). |
SDS, while effective, suffers from geometry collapse and poor textures due to the inherent randomness in its SDE-based sampling process. |
Consistent3D leverages the deterministic nature of ODE trajectories, proposing a Consistency Distillation Sampling (CDS) loss to distill deterministic priors into a 3D model. It utilizes a fixed noise perturbation and samples adjacent points on the ODE trajectory to guide 3D model optimization. |
Generates high-fidelity and diverse 3D objects and large-scale scenes.
Outperforms existing methods like DreamFusion, Magic3D, and ProlificDreamer in both qualitative and quantitative evaluations.
Effectively addresses the randomness issue in SDS, providing more consistent and reliable guidance for 3D model optimization. |
Reliance on pre-trained diffusion models without 3D priors might limit performance in complex scenarios.
Potential bias transfer from pre-trained models needs further investigation. |
text-to-3d generation, score distillation sampling, diffusion models, ordinary differential equations, deterministic sampling |
2401.09048
Report |
Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis |
Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong |
Addressing the limitations of text as a source of accurate layout
representation in text-conditional diffusion models, many works incorporate
additional signals to condition certain attributes within a generated image.
Although successful, previous works do not account for the specific
localization of said attributes extended into the three dimensional plane. In
this context, we present a conditional diffusion model that integrates control
over three-dimensional object placement with disentangled representations of
global stylistic semantics from multiple exemplar images. Specifically, we
first introduce depth disentanglement training to leverage the
relative depth of objects as an estimator, allowing the model to identify the
absolute positions of unseen objects through the use of synthetic image
triplets. We also introduce soft guidance, a method for imposing
global semantics onto targeted regions without the use of any additional
localization cues. Our integrated framework, Compose and Conquer
(CnC), unifies these techniques to localize multiple conditions in a
disentangled manner. We demonstrate that our approach allows perception of
objects at varying depths while offering a versatile framework for composing
localized objects with different global semantics. Code:
https://github.com/tomtom1103/compose-and-conquer/ |
The paper introduces Compose and Conquer (CnC), a text-conditional diffusion model that integrates control over 3D object placement with disentangled global stylistic semantics from multiple exemplar images. |
Existing text-conditional diffusion models struggle with accurate 3D object placement and localizing global semantics from multiple sources. CnC addresses these limitations. |
CnC utilizes two novel techniques: Depth Disentanglement Training (DDT) for 3D object placement and 'soft guidance' for localizing global semantics. DDT leverages synthetic image triplets to teach the model relative depth, while soft guidance selectively masks cross-attention layers to inject regional semantics. |
CnC outperforms baseline models in generating images with accurate 3D object placement and localized global semantics.
The model demonstrates strong reconstruction ability, faithfully recreating objects in varying depths.
Soft guidance effectively prevents concept bleeding, ensuring global semantics are applied to targeted regions without unintended overlaps. |
The current framework limits the number of conditions and disentangled spatial grounds.
Future work includes decomposing images into finer depth primitives and exploring the 'middle ground'. |
diffusion models, 3d object placement, global semantic localization, depth disentanglement training, soft guidance |
2401.09047
Report |
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models |
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan |
Text-to-video generation aims to produce a video based on a given prompt.
Recently, several commercial video models have been able to generate plausible
videos with minimal noise, excellent details, and high aesthetic scores.
However, these models rely on large-scale, well-filtered, high-quality videos
that are not accessible to the community. Many existing research works, which
train models using the low-quality WebVid-10M dataset, struggle to generate
high-quality videos because the models are optimized to fit WebVid-10M. In this
work, we explore the training scheme of video models extended from Stable
Diffusion and investigate the feasibility of leveraging low-quality videos and
synthesized high-quality images to obtain a high-quality video model. We first
analyze the connection between the spatial and temporal modules of video models
and the distribution shift to low-quality videos. We observe that full training
of all modules results in a stronger coupling between spatial and temporal
modules than only training temporal modules. Based on this stronger coupling,
we shift the distribution to higher quality without motion degradation by
finetuning spatial modules with high-quality images, resulting in a generic
high-quality video model. Evaluations are conducted to demonstrate the
superiority of the proposed method, particularly in picture quality, motion,
and concept composition. |
This paper presents a method to train high-quality video diffusion models without relying on large-scale, high-quality video datasets. |
Existing research on text-to-video generation struggles to produce high-quality videos due to the reliance on low-quality datasets like WebVid-10M, while commercial models use private high-quality data inaccessible to the public. |
The authors analyze the connection between spatial and temporal modules in video diffusion models and leverage it to overcome data limitations. They propose a two-stage pipeline: 1) Fully train a video model with low-quality videos. 2) Fine-tune the spatial modules of the trained model using synthesized high-quality images. |
The method generates videos with comparable visual quality to commercial models trained on high-quality videos.
The proposed training strategy maintains good motion consistency without significant degradation.
Using synthesized images with complex concepts for finetuning improves the model's concept composition ability. |
The motion quality, while improved, is not yet on par with models trained on large-scale, high-quality video data.
The research focuses on a specific video diffusion model architecture based on Stable Diffusion; generalizability to other architectures needs further exploration. |
text-to-video generation, video diffusion models, data limitations, stable diffusion, concept composition |
2401.08973
Report |
OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality |
Aditya Sharma, Luke Yoffe, Tobias Höllerer |
One key challenge in Augmented Reality is the placement of virtual content in
natural locations. Most existing automated techniques can only work with a
closed-vocabulary, fixed set of objects. In this paper, we introduce and
evaluate several methods for automatic object placement using recent advances
in open-vocabulary vision-language models. Through a multifaceted evaluation,
we identify a new state-of-the-art method, OCTO+. We also introduce a benchmark
for automatically evaluating the placement of virtual objects in augmented
reality, alleviating the need for costly user studies. Through this, in
addition to human evaluations, we find that OCTO+ places objects in a valid
region over 70% of the time, outperforming other methods on a range of metrics. |
This paper introduces OCTO+, a novel pipeline for automatically placing virtual objects in augmented reality scenes using open-vocabulary vision-language models, and PEARL, a benchmark for evaluating these placements. |
Automatic and natural object placement is crucial for AR applications but challenging due to the need for open-vocabulary understanding and reasoning about object relationships. |
OCTO+ uses a 3-stage approach: 1) Image Understanding (RAM++ with Grounding DINO filtering), 2) Reasoning (GPT-4 to select the most natural placement surface), and 3) Locating (Grounded-Segment-Anything to determine 2D coordinates). |
OCTO+ outperforms previous methods, including GPT-4V and OCTOPUS, in placing objects naturally.
The proposed PEARL-Score metric, based on placement within valid regions and distance from edges, aligns with human judgment.
Human evaluation confirms that OCTO+ achieves natural placements comparable to expert annotations. |
Current pipeline is slow, taking up to 10 seconds per placement.
Placement logic doesn't consider object-specific orientations or more complex spatial relationships. |
object placement, open-vocabulary, benchmark, mixed reality, vision-language models |
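The "valid region" criterion mentioned above can be sketched as a point-in-mask check. This is a simplified illustration only: the binary mask layout, the `(x, y)` pixel convention, and both function names are assumptions, and the distance-from-edges component of the PEARL-Score is not modeled here.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # mask[row][col] == 1 marks a valid pixel (assumed layout)


def is_valid_placement(mask: Mask, point: Tuple[int, int]) -> bool:
    """True if the (x, y) pixel lies inside the image and on a valid region."""
    x, y = point
    in_bounds = 0 <= y < len(mask) and 0 <= x < len(mask[0])
    return in_bounds and mask[y][x] == 1


def valid_rate(masks: List[Mask], points: List[Tuple[int, int]]) -> float:
    """Fraction of placements that land in their image's valid region."""
    hits = sum(is_valid_placement(m, p) for m, p in zip(masks, points))
    return hits / len(points)


# Toy 3x4 image whose valid surface (e.g. a table top) is a 2x2 block.
table_top = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
placements = [(1, 1), (3, 0), (2, 2), (5, 5)]
rate = valid_rate([table_top] * 4, placements)
```

Here two of the four hypothetical placements fall on the valid surface, one lands on an invalid pixel, and one is outside the image.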
2401.08937
Report |
ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization |
Weiyao Wang, Pierre Gleize, Hao Tang, Xingyu Chen, Kevin J Liang, Matt Feiszli |
Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View
Synthesis (NVS) given a set of 2D images. However, NeRF training requires
accurate camera pose for each input view, typically obtained by
Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax
this constraint, but they still often rely on decent initial poses which they
can refine. Here we aim at removing the requirement for pose initialization. We
present Incremental CONfidence (ICON), an optimization procedure for training
NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate
an initial guess for poses. Further, ICON introduces "confidence": an adaptive
measure of model quality used to dynamically reweight gradients. ICON relies on
high-confidence poses to learn NeRF, and high-confidence 3D structure (as
encoded by NeRF) to learn poses. We show that ICON, without prior pose
initialization, achieves superior performance in both CO3D and HO3D versus
methods which use SfM pose. |
This paper proposes ICON (Incremental CONfidence), an optimization procedure for training NeRFs from 2D video frames without requiring pose initialization by leveraging smooth camera motion and confidence-guided optimization. |
Existing methods for 3D object reconstruction from monocular video either rely on depth information or accurate camera poses, limiting their applicability in real-world scenarios where depth is unavailable and pose estimation is challenging. |
ICON incrementally registers video frames by leveraging motion smoothness and introduces a Neural Confidence Field to measure confidence in pose and 3D structure, using it to reweight gradients during optimization and escape local minima. |
ICON achieves superior performance on CO3D compared to methods requiring SfM pose initialization, even surpassing NeRF trained with COLMAP poses.
On HO3D, ICON achieves comparable pose tracking accuracy to state-of-the-art RGB-D methods while using only RGB input.
Ablation studies demonstrate the importance of incremental registration, confidence-based optimization, and restarts for handling challenging scenarios. |
ICON heavily relies on photometric consistency across viewpoints, limiting its performance in scenes with significant lighting variations, reflections, or transparency.
The reliance on gradient-based optimization through NeRF makes training computationally expensive. |
neural radiance fields, pose estimation, 3d reconstruction, confidence-based optimization, incremental registration |
2401.08930
Report |
3D Human Pose Analysis via Diffusion Synthesis |
Haorui Ji, Hongdong Li |
Diffusion models have demonstrated remarkable success in generative modeling.
In this paper, we propose PADS (Pose Analysis by Diffusion Synthesis), a novel
framework designed to address various challenges in 3D human pose analysis
through a unified pipeline. Central to PADS are two distinctive strategies: i)
learning a task-agnostic pose prior using a diffusion synthesis process to
effectively capture the kinematic constraints in human pose data, and ii)
unifying multiple pose analysis tasks like estimation, completion, denoising,
etc., as instances of inverse problems. The learned pose prior is treated
as a regularizer imposed on task-specific constraints, guiding the
optimization process through a series of conditional denoising steps. PADS
represents the first diffusion-based framework for tackling general 3D human
pose analysis within the inverse problem framework. Its performance has been
validated on different benchmarks, signaling the adaptability and robustness of
this pipeline. |
This paper presents PADS, a novel framework that tackles various 3D human pose analysis problems in a unified diffusion-based pipeline, formulating them as instances of inverse problems. |
Addresses limitations of current 3D human pose analysis methods that rely on large paired datasets and are limited in application scope. |
1) Learns a task-agnostic pose prior using a diffusion synthesis process to capture kinematic constraints in human pose data. 2) Unifies multiple pose analysis tasks as inverse problems, using the learned pose prior as regularization during optimization through conditional denoising steps. |
Achieves state-of-the-art performance on H36M for 3D human pose estimation, outperforming existing unsupervised methods.
Demonstrates effective pose denoising capabilities, handling various noise types and intensities.
Shows promising results in pose completion, successfully reconstructing missing parts of human skeletons. |
Currently validated only on pose-based representations, with potential for generalization to mesh or implicit function representations.
Utilizes an image domain inverse problem solver (DPS); a tailored solver for human pose analysis could enhance performance. |
3d human pose analysis, diffusion models, inverse problems, pose prior, zero-shot learning |
2401.08815
Report |
Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive |
Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva |
Despite the recent advances in large-scale diffusion models, little progress
has been made on the layout-to-image (L2I) synthesis task. Current L2I models
either suffer from poor editability via text or weak alignment between the
generated image and the input layout. This limits their usability in practice.
To mitigate this, we propose to integrate adversarial supervision into the
conventional training pipeline of L2I diffusion models (ALDM). Specifically, we
employ a segmentation-based discriminator which provides explicit feedback to
the diffusion generator on the pixel-level alignment between the denoised image
and the input layout. To encourage consistent adherence to the input layout
over the sampling steps, we further introduce the multistep unrolling strategy.
Instead of looking at a single timestep, we unroll a few steps recursively to
imitate the inference process, and ask the discriminator to assess the
alignment of denoised images with the layout over a certain time window. Our
experiments show that ALDM enables layout faithfulness of the generated images,
while allowing broad editability via text prompts. Moreover, we showcase its
usefulness for practical applications: by synthesizing target distribution
samples via text control, we improve domain generalization of semantic
segmentation models by a large margin (~12 mIoU points). |
The paper introduces adversarial supervision and multistep unrolling strategy for layout-to-image (L2I) diffusion models to improve layout faithfulness without sacrificing text editability. |
Current L2I models struggle to balance adherence to layout conditions and flexibility in text-based editing, limiting their practical applications. |
The authors employ a segmentation-based discriminator to provide explicit feedback on layout alignment and introduce multistep unrolling to enforce consistent adherence to the layout over sampling steps. |
Adversarial supervision and multistep unrolling consistently improve layout faithfulness across different L2I diffusion models.
The proposed ALDM model achieves a balance between layout faithfulness (high mIoU) and text editability (high TIFA score).
Synthetic data augmentation using ALDM significantly improves domain generalization performance for semantic segmentation. |
Attribute editing can leak to unintended objects.
Perfect alignment with the layout map is not always achieved, especially with rare text prompts. |
layout-to-image synthesis, diffusion models, adversarial training, text controllability, domain generalization |
2401.08742
Report |
Fast Dynamic 3D Object Generation from a Single-view Video |
Zijie Pan, Zeyu Yang, Xiatian Zhu, Li Zhang |
Generating a dynamic 3D object from a single-view video is challenging due to
the lack of 4D labeled data. Extending image-to-3D pipelines by transferring
off-the-shelf image generation models such as score distillation sampling,
existing methods tend to be slow and expensive to scale due to the need for
back-propagating the information-limited supervision signals through a large
pretrained model. To address this, we propose an efficient video-to-4D object
generation framework called Efficient4D. It generates high-quality
spacetime-consistent images under different camera views, and then uses them as
labeled data to directly train a novel 4D Gaussian splatting model with
explicit point cloud geometry, enabling real-time rendering under continuous
camera trajectories. Extensive experiments on synthetic and real videos show
that Efficient4D offers a remarkable 20-fold increase in speed when compared to
prior art alternatives while preserving the quality of novel view synthesis.
For example, Efficient4D takes only 6 mins to model a dynamic object, vs 120
mins by Consistent4D. |
This paper proposes "Efficient4D", an efficient two-stage pipeline for generating dynamic 3D objects from single-view videos. |
Existing methods for 4D object generation are computationally expensive and slow, hindering practical applications. Efficient4D addresses this efficiency challenge while maintaining high-quality novel view synthesis. |
The first stage generates temporally consistent multi-view images using a modified SyncDreamer with time-synchronous spatial volumes and frame interpolation. The second stage reconstructs the dynamic object using a novel 4D Gaussian splatting model optimized with a confidence-aware loss. |
Efficient4D achieves a 20x speedup compared to prior art (Consistent4D).
It demonstrates superior novel view synthesis quality both qualitatively and quantitatively.
The method is robust even with sparse input, generating smooth dynamics from as few as two frames. |
The local smoothing approach in image generation struggles with long videos. Future work could explore learnable attention layers for global receptive fields.
Handling long videos might require significant GPU memory. Utilizing multi-GPU or CPU solutions could alleviate this issue at the cost of processing time. |
4d generation, gaussian splatting, video, efficiency, novel view synthesis |
2401.08741
Report |
Fixed Point Diffusion Models |
Xingjian Bai, Luke Melas-Kyriazi |
We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to
image generation that integrates the concept of fixed point solving into the
framework of diffusion-based generative modeling. Our approach embeds an
implicit fixed point solving layer into the denoising network of a diffusion
model, transforming the diffusion process into a sequence of closely-related
fixed point problems. Combined with a new stochastic training method, this
approach significantly reduces model size, reduces memory usage, and
accelerates training. Moreover, it enables the development of two new
techniques to improve sampling efficiency: reallocating computation across
timesteps and reusing fixed point solutions between timesteps. We conduct
extensive experiments with state-of-the-art models on ImageNet, FFHQ,
CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in
performance and efficiency. Compared to the state-of-the-art DiT model, FPDM
contains 87% fewer parameters, consumes 60% less memory during training, and
improves image generation quality in situations where sampling computation or
time is limited. Our code and pretrained models are available at
https://lukemelas.github.io/fixed-point-diffusion-models. |
The paper introduces Fixed Point Diffusion Model (FPDM), a novel image generation approach integrating fixed point solving into diffusion models for reduced model size, memory usage, and improved sampling efficiency. |
Diffusion models are computationally expensive, posing challenges for deployment, especially on resource-constrained devices. FPDM addresses this by significantly reducing resource requirements while maintaining or enhancing image generation quality. |
FPDM incorporates an implicit fixed point layer within a denoising diffusion model, transforming the diffusion process into a sequence of fixed point problems. This allows for flexible computation allocation across timesteps and reuse of solutions between timesteps. A new training method, Stochastic Jacobian-Free Backpropagation (S-JFB), enables efficient training of the implicit layer. |
FPDM achieves superior image generation quality compared to DiT with significantly fewer parameters (87% reduction) and lower memory usage (60% reduction) when sampling computation is limited.
Smoothing computation across multiple timesteps in FPDM proves more effective than using fewer timesteps with more iterations per step, as is necessary in standard diffusion models.
Reusing fixed point solutions from previous timesteps significantly accelerates convergence, especially when the number of iterations per timestep is limited. |
FPDM's performance is slightly worse than DiT when sampling computation is not constrained, suggesting further optimization is needed for high-compute regimes.
The paper focuses on image generation, leaving exploration of FPDM's applicability to other domains, such as video or audio generation, for future work. |
diffusion models, image generation, implicit neural networks, fixed point solving, efficient sampling |
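The idea of reusing fixed point solutions between timesteps can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the FPDM implementation: the scalar map `f`, the helper `solve`, and the `schedule` of conditioning values are hypothetical stand-ins for the implicit layer and for neighboring timesteps.

```python
import math

def f(x, c):
    # Toy contraction in x (|df/dx| <= 0.5). Hypothetical stand-in for an
    # implicit layer conditioned on a timestep-dependent input c.
    return 0.5 * math.cos(x) + c

def solve(c, x0, tol=1e-8, max_iter=1000):
    # Plain fixed point iteration x <- f(x, c).
    # Returns the approximate solution and the number of iterations used.
    x = x0
    for i in range(1, max_iter + 1):
        x_new = f(x, c)
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    return x, max_iter

# Neighboring "timesteps" are modeled as slowly varying values of c (assumption).
schedule = [1.0 - 0.01 * k for k in range(10)]

cold_iters = 0  # always restart from the same initial guess
warm_iters = 0  # start from the previous timestep's solution
x_prev = 0.0
for c in schedule:
    _, n_cold = solve(c, 0.0)
    x_prev, n_warm = solve(c, x_prev)
    cold_iters += n_cold
    warm_iters += n_warm

print("cold start iterations:", cold_iters)
print("warm start iterations:", warm_iters)
```

Because consecutive problems in this toy setup are closely related, the warm-started solver begins near the new solution and needs fewer iterations in total than the cold-started one.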
2401.08740
Report |
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers |
Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie |
We present Scalable Interpolant Transformers (SiT), a family of generative
models built on the backbone of Diffusion Transformers (DiT). The interpolant
framework, which allows for connecting two distributions in a more flexible way
than standard diffusion models, makes possible a modular study of various
design choices impacting generative models built on dynamical transport: using
discrete vs. continuous time learning, deciding the objective for the model to
learn, choosing the interpolant connecting the distributions, and deploying a
deterministic or stochastic sampler. By carefully introducing the above
ingredients, SiT surpasses DiT uniformly across model sizes on the conditional
ImageNet 256x256 benchmark using the exact same backbone, number of parameters,
and GFLOPs. By exploring various diffusion coefficients, which can be tuned
separately from learning, SiT achieves an FID-50K score of 2.06. |
Presents Scalable Interpolant Transformers (SiT), a family of generative models built on Diffusion Transformers (DiT) that surpasses DiT's performance by leveraging a flexible interpolant framework. |
To explore design choices impacting generative models built on dynamical transport and to simplify the learning problem for improved performance. |
Gradually transitions from a denoising diffusion model to an interpolant model, exploring choices of: discrete vs. continuous time learning, predicting velocity vs. score, various interpolants, and deterministic or stochastic sampling. |
SiT surpasses DiT across all model sizes on ImageNet 256x256 benchmark using identical backbones and training compute.
Learning a velocity model with a weighted score objective significantly improves performance over learning a score model.
SDE sampling generally outperforms ODE sampling, with tunable diffusion coefficients further enhancing results. |
The DDIM and Heun samplers are not directly comparable because they use different orders of discretization.
Exploration of higher-order solvers did not yield performance improvements. |
generative models, diffusion models, transformers, image generation, stochastic interpolants |
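As a rough illustration of what "an interpolant connecting two distributions" and "a deterministic sampler" mean, the sketch below uses a linear interpolant between two scalar samples. The linear form, the function names, and the use of the exact endpoint-conditioned velocity (rather than a learned model) are assumptions made for illustration and are not taken from the SiT paper.

```python
def interpolant(x0, x1, t):
    # Linear interpolant: equals x0 at t = 0 and x1 at t = 1 (assumed form).
    return (1.0 - t) * x0 + t * x1

def velocity(x0, x1, t):
    # Time derivative of the linear interpolant; constant in t for this choice.
    # A trained model would approximate an averaged version of this quantity.
    return x1 - x0

def euler_transport(x0, x1, steps=10):
    # Deterministic (ODE-style) sampler: Euler integration of dx/dt = velocity
    # from t = 0 to t = 1.
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x0, x1, k * dt)
    return x

midpoint = interpolant(-2.0, 3.0, 0.5)
end = euler_transport(-2.0, 3.0)
print(midpoint, end)
```

A stochastic sampler would add a noise term with a tunable diffusion coefficient to each step; that part is omitted here.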
2401.08725
Report |
Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks |
Chenyu Zhang, Lanjun Wang, Anan Liu |
Recent developments in text-to-image models, particularly Stable Diffusion,
have marked significant achievements in various applications. With these
advancements, there are growing safety concerns about the vulnerability of the
model that malicious entities exploit to generate targeted harmful images.
However, existing studies of the model's vulnerability mainly evaluate
the alignment between the prompt and generated images, but fall short in
revealing the vulnerability associated with targeted image generation. In this
study, we formulate the problem of targeted adversarial attack on Stable
Diffusion and propose a framework to generate adversarial prompts.
Specifically, we design a gradient-based embedding optimization method to craft
reliable adversarial prompts that guide Stable Diffusion to generate specific
images. Furthermore, after obtaining successful adversarial prompts, we reveal
the mechanisms that cause the vulnerability of the model. Extensive experiments
on two targeted attack tasks demonstrate the effectiveness of our method in
targeted attacks. The code can be obtained at
https://github.com/datar001/Revealing-Vulnerabilities-in-Stable-Diffusion-via-Targeted-Attacks. |
This paper proposes a targeted adversarial attack framework for Stable Diffusion to generate images of specific categories (objects or styles) from seemingly unrelated prompts. |
This work addresses the growing safety concerns regarding the vulnerability of text-to-image models like Stable Diffusion to malicious manipulation for generating harmful content. |
The framework uses two perturbation strategies (word substitution, suffix addition), a gradient-based embedding optimization method, and leverages image-text matching similarity to guide adversarial prompt generation. It also includes techniques to enhance prompt stealthiness and maintain semantic consistency. |
The proposed method significantly outperforms existing attack methods in terms of attack success rate and generated image quality.
The study reveals that verbs and prepositions play a crucial role in manipulating image generation, while longer suffixes increase the likelihood of successful attacks.
The analysis of successful attacks reveals vulnerabilities in Stable Diffusion related to the use of culturally diverse lexicon, hidden semantic connections, and the influence of early denoising steps on image generation. |
The negative correlation between attack success rate and semantic consistency in style attacks needs further investigation.
Future work will focus on exploring more sophisticated perturbation strategies and delving deeper into the model's vulnerability mechanisms. |
adversarial attack, stable diffusion, text-to-image generation, model vulnerability, prompt engineering |
2401.08570
Report |
RoHM: Robust Human Motion Reconstruction via Diffusion |
Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo |
We propose RoHM, an approach for robust 3D human motion reconstruction from
monocular RGB(-D) videos in the presence of noise and occlusions. Most previous
approaches either train neural networks to directly regress motion in 3D or
learn data-driven motion priors and combine them with optimization at test
time. The former do not recover globally coherent motion and fail under
occlusions; the latter are time-consuming, prone to local minima, and require
manual tuning. To overcome these shortcomings, we exploit the iterative,
denoising nature of diffusion models. RoHM is a novel diffusion-based motion
model that, conditioned on noisy and occluded input data, reconstructs
complete, plausible motions in consistent global coordinates. Given the
complexity of the problem -- requiring one to address different tasks
(denoising and infilling) in different solution spaces (local and global
motion) -- we decompose it into two sub-tasks and learn two models, one for
global trajectory and one for local motion. To capture the correlations between
the two, we then introduce a novel conditioning module, combining it with an
iterative inference scheme. We apply RoHM to a variety of tasks -- from motion
reconstruction and denoising to spatial and temporal infilling. Extensive
experiments on three popular datasets show that our method outperforms
state-of-the-art approaches qualitatively and quantitatively, while being
faster at test time. The code is available at
https://sanweiliti.github.io/ROHM/ROHM.html. |
This paper introduces RoHM, a novel diffusion-based approach for robust 3D human motion reconstruction from monocular RGB(-D) videos, effectively handling noise and occlusions. |
Reconstructing plausible 3D human motion from monocular videos is crucial for various applications, but existing methods often struggle with noise and occlusions, especially over extended periods. RoHM addresses these challenges by leveraging the iterative and generative nature of diffusion models. |
RoHM employs two diffusion models: TrajNet for global trajectory reconstruction and PoseNet for local pose estimation. It introduces TrajControl, a flexible conditioning module, to capture correlations between global and local motion. The model is trained on a large-scale motion capture dataset with synthetic noise and occlusions, and employs an iterative inference scheme for motion refinement. To further enhance realism, score-guided sampling is used with physics-based and image-based scores. |
RoHM significantly outperforms state-of-the-art optimization-based methods in terms of accuracy and physical plausibility, as evidenced by experiments on AMASS, PROX, and EgoBody datasets.
The method exhibits robustness to high levels of noise and varying occlusion patterns, demonstrating its ability to recover realistic motion dynamics even from significantly corrupted input.
RoHM achieves a substantial speedup of 30 times compared to optimization-based counterparts during inference, making it a promising approach for real-time applications. |
One limitation is its current offline processing nature, hindering real-time performance.
The method relies on 3D scene information and 2D joint detections for occlusion handling, posing challenges when these inputs are unreliable or unavailable. |
3d human motion reconstruction, diffusion models, motion denoising and in-filling, robust motion estimation, monocular rgb(-d) videos |
2401.08559
Report |
Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation |
Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, Davis Rempe |
Recent advances in generative modeling have led to promising progress on
synthesizing 3D human motion from text, with methods that can generate
character animations from short prompts and specified durations. However, using
a single text prompt as input lacks the fine-grained control needed by
animators, such as composing multiple actions and defining precise durations
for parts of the motion. To address this, we introduce the new problem of
timeline control for text-driven motion synthesis, which provides an intuitive,
yet fine-grained, input interface for users. Instead of a single prompt, users
can specify a multi-track timeline of multiple prompts organized in temporal
intervals that may overlap. This enables specifying the exact timings of each
action and composing multiple actions in sequence or at overlapping intervals.
To generate composite animations from a multi-track timeline, we propose a new
test-time denoising method. This method can be integrated with any pre-trained
motion diffusion model to synthesize realistic motions that accurately reflect
the timeline. At every step of denoising, our method processes each timeline
interval (text prompt) individually, subsequently aggregating the predictions
with consideration for the specific body parts engaged in each action.
Experimental comparisons and ablations validate that our method produces
realistic motions that respect the semantics and timing of given text prompts.
Our code and models are publicly available at https://mathis.petrovich.fr/stmc. |
This paper introduces "multi-track timeline control" for text-driven 3D human motion synthesis, allowing users to specify complex actions with precise timing using a timeline interface. |
Current text-to-motion synthesis methods lack fine-grained control, making it difficult to compose multiple actions and define precise durations. This new method offers an intuitive solution for animators. |
The authors propose "Spatio-Temporal Motion Collage" (STMC), a test-time denoising method that leverages pre-trained motion diffusion models. STMC processes individual timeline intervals independently, stitching them together spatially and temporally for coherent motion. |
STMC outperforms baselines adapted from existing methods in both semantic correctness and realism, as demonstrated by quantitative metrics and perceptual studies.
The method effectively handles spatial and temporal composition, accurately reflecting the semantics and timing of text prompts within the timeline.
The authors also introduce an improved motion diffusion model with SMPL support, resulting in faster sampling and direct SMPL pose generation. |
The method's performance depends on the underlying pre-trained diffusion model, inheriting its limitations in handling complex compositions.
Current implementation restricts overlapping motions to compatible body parts, similar to the SINC method. |
3d human motion synthesis, text-driven animation, timeline control, motion diffusion models, spatio-temporal motion collage |
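The multi-track timeline input described above (several prompts placed on temporal intervals that may overlap) can be pictured as a small data structure. This is only an illustrative sketch: the `Interval` class, its field names, the use of seconds, and the `active_prompts` helper are hypothetical and are not taken from the STMC code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    prompt: str   # text prompt describing one action
    start: float  # start time in seconds (inclusive)
    end: float    # end time in seconds (exclusive)

def active_prompts(timeline, t):
    # Return the prompts of all intervals, across all tracks, that cover time t.
    return [iv.prompt for track in timeline for iv in track
            if iv.start <= t < iv.end]

# Two tracks: actions in sequence on the first, an overlapping action on the second.
timeline = [
    [Interval("walk forward", 0.0, 4.0), Interval("sit down", 4.0, 6.0)],
    [Interval("wave with the right hand", 2.0, 5.0)],
]

print(active_prompts(timeline, 3.0))
print(active_prompts(timeline, 4.5))
```

A test-time method of the kind summarized above would process each interval's prompt separately at every denoising step and then merge the predictions; that merging step depends on the underlying motion model and is not sketched here.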
2401.08541
Report |
Scalable Pre-training of Large Autoregressive Image Models |
Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin |
This paper introduces AIM, a collection of vision models pre-trained with an
autoregressive objective. These models are inspired by their textual
counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling
properties. Specifically, we highlight two key findings: (1) the performance of
the visual features scales with both the model capacity and the quantity of
data, (2) the value of the objective function correlates with the performance
of the model on downstream tasks. We illustrate the practical implication of
these findings by pre-training a 7 billion parameter AIM on 2 billion images,
which achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at
this scale, we observe no sign of saturation in performance, suggesting that
AIM potentially represents a new frontier for training large-scale vision
models. The pre-training of AIM is similar to the pre-training of LLMs, and
does not require any image-specific strategy to stabilize the training at
scale. |
The paper introduces Autoregressive Image Models (AIM), a collection of vision models pre-trained with an autoregressive objective, achieving competitive performance with scaling properties similar to Large Language Models (LLMs). |
The paper explores whether LLMs' success in scaling transformers with an autoregressive objective generalizes to the vision domain. |
The paper utilizes a prefix attention mechanism, heavily parameterized token-level prediction head, and pixel-level regression loss to train ViT models on a large dataset of uncurated web images (DFN-2B) with an autoregressive objective. |
AIM performance scales with both model capacity and data quantity.
The autoregressive objective function value correlates with downstream task performance.
A 7 billion parameter AIM pre-trained on 2 billion images achieves 84.0% accuracy on ImageNet-1k with a frozen trunk, outperforming prior generative methods and nearing joint embedding method performance. |
Other methods like MAE show higher sample efficiency and lower risk of overfitting with smaller datasets.
Contrastive methods achieve better performance for a given model size but face scalability and loss tractability challenges. |
autoregressive models, vision transformers, self-supervised learning, pre-training at scale, generative pre-training |
2401.08472
Report |
Instilling Multi-round Thinking to Text-guided Image Generation |
Lidong Zeng, Zhedong Zheng, Yinwei Wei, Tat-seng Chua |
This paper delves into the text-guided image editing task, focusing on
modifying a reference image according to user-specified textual feedback to
embody specific attributes. Despite recent advancements, a persistent challenge
remains that the single-round generation often overlooks crucial details,
particularly in the realm of fine-grained changes like shoes or sleeves. This
issue compounds over multiple rounds of interaction, severely limiting
customization quality. In an attempt to address this challenge, we introduce a
new self-supervised regularization, i.e., multi-round regularization, which is
compatible with existing methods. Specifically, the multi-round regularization
encourages the model to maintain consistency across different modification
orders. It builds upon the observation that the modification order generally
should not affect the final result. Different from traditional one-round
generation, the mechanism underpinning the proposed method is the error
amplification of initially minor inaccuracies in capturing intricate details.
Qualitative and quantitative experiments affirm that the proposed method
achieves high-fidelity editing quality, especially for local modifications, in
both single-round and multiple-round generation, while also showcasing robust
generalization to irregular text inputs. The effectiveness of our semantic
alignment with textual feedback is further substantiated by the retrieval
improvements on FashionIQ and Fashion200k. |
This paper proposes a novel self-supervised regularization method for text-guided image editing, enhancing the consistency and accuracy of multi-round generation, particularly for fine-grained details. |
Existing single-round generation methods often miss crucial details, especially in multi-round interactions, limiting the quality of customization. |
The proposed method encourages consistency across different modification orders by optimizing error accumulation through a novel multi-round regularization loss. The approach leverages a pre-trained diffusion model and CLIP encoders for text and image representations. |
The method demonstrates superior performance on FashionIQ and Fashion200k datasets in terms of both visual quality (FID) and semantic alignment (CLIP Score, Recall@K).
It exhibits robust generalization to ill-formed text inputs, including swapped sentence order, rotated word order, and masked words.
The proposed approach effectively captures fine-grained details and maintains consistency across multiple rounds of generation. |
The model relies on pre-trained encoders and diffusion models, potentially limiting its flexibility in handling novel concepts.
The current work primarily focuses on two-round consistency; future work could explore extending it to a greater number of rounds. |
image editing, text guidance, multi-round thinking, self-supervised learning, fine-grained generation |
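The observation that the modification order should not affect the final result suggests a simple consistency penalty. The sketch below is a toy version under stated assumptions: `edit` is a hypothetical stand-in for one round of editing (an imperfect update on a feature vector), and `order_consistency_loss` is an illustrative name, not the loss defined in the paper.

```python
def edit(x, delta, gain=0.9):
    # Hypothetical one-round edit: slightly distorts the current state
    # (gain != 1) before applying the requested change delta.
    return [gain * xi + di for xi, di in zip(x, delta)]

def order_consistency_loss(x, a, b, gain=0.9):
    # Apply the two modifications in both orders and penalize the
    # mean squared difference between the two final results.
    ab = edit(edit(x, a, gain), b, gain)
    ba = edit(edit(x, b, gain), a, gain)
    return sum((p - q) ** 2 for p, q in zip(ab, ba)) / len(x)

x = [0.0, 1.0]  # toy reference features
a = [1.0, 0.0]  # first modification
b = [0.0, 2.0]  # second modification

imperfect = order_consistency_loss(x, a, b, gain=0.9)
ideal = order_consistency_loss(x, a, b, gain=1.0)
print(imperfect, ideal)
```

With an ideal editor (`gain=1.0`) the two orders agree and the penalty vanishes; with an imperfect one, small per-round errors show up as a nonzero penalty that training could reduce.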
2401.08392
Report |
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) |
Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang |
Recent LLM-driven visual agents mainly focus on solving image-based tasks,
which limits their ability to understand dynamic scenes, keeping them far from
real-life applications like guiding students in laboratory experiments and
identifying their mistakes. Hence, this paper explores DoraemonGPT, a
comprehensive and conceptually elegant system driven by LLMs to understand
dynamic scenes. Considering the video modality better reflects the
ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a
video agent. Given a video with a question/task, DoraemonGPT begins by
converting the input video into a symbolic memory that stores task-related
attributes. This structured representation allows for spatial-temporal querying
and reasoning by well-designed sub-task tools, resulting in concise
intermediate results. Recognizing that LLMs have limited internal knowledge
when it comes to specialized domains (e.g., analyzing the scientific principles
underlying experiments), we incorporate plug-and-play tools to access external
knowledge and address tasks across different domains. Moreover, a novel
LLM-driven planner based on Monte Carlo Tree Search is introduced to explore
the large planning space for scheduling various tools. The planner iteratively
finds feasible solutions by backpropagating the result's reward, and multiple
solutions can be summarized into an improved final answer. We extensively
evaluate DoraemonGPT's effectiveness on three benchmarks and several
in-the-wild scenarios. The code will be released at
https://github.com/z-x-yang/DoraemonGPT. |
This paper presents DoraemonGPT, an LLM-driven system for understanding dynamic scenes, exemplified as a video agent that can understand and reason about videos. |
Understanding dynamic scenes is crucial for real-life applications of AI, such as guiding students in lab experiments or analyzing surveillance footage, which current image-based LLM agents struggle with. |
DoraemonGPT converts videos into a symbolic memory (space-dominant and time-dominant), utilizes sub-task tools for spatial-temporal reasoning, incorporates external knowledge tools, and employs an LLM-driven MCTS planner to explore solutions. |
DoraemonGPT outperforms state-of-the-art LLM-driven agents on video question answering (NExT-QA, TVQA+) and referring object segmentation (Ref-YouTube-VOS).
The MCTS planner effectively explores the solution space, leading to more accurate and comprehensive answers compared to greedy search methods.
DoraemonGPT demonstrates its ability to handle complex, in-the-wild scenarios, including checking experimental operations, video understanding, and video editing. |
The current design of memory types relies on heuristics and lacks an automated approach.
The performance of DoraemonGPT is inherently tied to the capabilities and limitations of the foundation models it employs. |
large language models, video understanding, dynamic scene understanding, visual reasoning, monte carlo tree search |
2401.08276
Report |
AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception |
Yipo Huang, Quan Yuan, Xiangfei Sheng, Zhichao Yang, Haoning Wu, Pengfei Chen, Yuzhe Yang, Leida Li, Weisi Lin |
With collective endeavors, multimodal large language models (MLLMs) are
undergoing a flourishing development. However, their performance on image
aesthetics perception, a capability highly desired in real-world applications,
remains indeterminate. An obvious obstacle lies in the absence of a specific
benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This
blind groping may impede the further development of more advanced MLLMs with
aesthetic perception capacity. To address this dilemma, we propose AesBench, an
expert benchmark aiming to comprehensively evaluate the aesthetic perception
capacities of MLLMs through elaborate design across dual facets. (1) We
construct an Expert-labeled Aesthetics Perception Database (EAPD), which
features diversified image contents and high-quality annotations provided by
professional aesthetic experts. (2) We propose a set of integrative criteria to
measure the aesthetic perception abilities of MLLMs from four perspectives,
including Perception (AesP), Empathy (AesE), Assessment (AesA) and
Interpretation (AesI). Extensive experimental results underscore that the
current MLLMs only possess rudimentary aesthetic perception ability, and there
is still a significant gap between MLLMs and humans. We hope this work can
inspire the community to engage in deeper explorations on the aesthetic
potentials of MLLMs. Source data will be available at
https://github.com/yipoh/AesBench. |
This paper introduces AesBench, an expert-designed benchmark to comprehensively evaluate the aesthetic perception abilities of Multimodal Large Language Models (MLLMs). |
The effectiveness of MLLMs on image aesthetics perception, a crucial aspect in various real-world applications, remains underexplored. This benchmark aims to systematically evaluate and potentially guide the development of MLLMs with enhanced aesthetic perception capabilities. |
AesBench encompasses two key components: (1) EAPD, a high-quality dataset with diverse images and expert annotations covering aesthetic attributes, emotional responses, quality assessments, and interpretations. (2) A four-dimensional evaluation framework based on Perception, Empathy, Assessment, and Interpretation, with criteria designed to assess MLLMs' understanding and reasoning about image aesthetics. |
Current MLLMs demonstrate limited aesthetic perception abilities, showing a significant gap compared to human performance.
AesBench effectively differentiates the aesthetic perception capabilities across various MLLMs, with Q-Instruct, GPT-4V, and Gemini Pro Vision showcasing relatively better performance.
MLLMs struggle particularly with aesthetic interpretation, often exhibiting hallucinations and lacking precision in their reasoning. |
The study is limited by the reliance on GPT-assisted evaluation for certain tasks due to the open-ended nature of responses.
Future work can explore expanding the dataset with more diverse image styles and cultural contexts to enhance the generalizability of the benchmark. |
multimodal large language models, image aesthetics perception, benchmarking, expert annotations, aesthetic interpretation |
2401.08100
Report |
KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain |
Anh-Cuong Pham, Van-Quang Nguyen, Thi-Hong Vuong, Quang-Thuy Ha |
Image captioning is a crucial task with applications in a wide range of
domains, including healthcare and education. Despite extensive research on
English image captioning datasets, the availability of such datasets for
Vietnamese remains limited, with only two existing datasets. In this study, we
introduce KTVIC, a comprehensive Vietnamese Image Captioning dataset focused on
the life domain, covering a wide range of daily activities. This dataset
comprises 4,327 images and 21,635 Vietnamese captions, serving as a valuable
resource for advancing image captioning in the Vietnamese language. We conduct
experiments using various deep neural networks as the baselines on our dataset,
evaluating them using the standard image captioning metrics, including BLEU,
METEOR, CIDEr, and ROUGE. Our findings underscore the effectiveness of the
proposed dataset and its potential contributions to the field of image
captioning in the Vietnamese context. |
This paper introduces KTVIC, a novel Vietnamese image captioning dataset focused on daily life activities, featuring 4,327 images with 5 captions each (totaling 21,635 captions). |
Existing Vietnamese image captioning datasets are limited, hindering research in this domain. KTVIC addresses this gap by providing a comprehensive resource for advancing Vietnamese image captioning. |
KTVIC leverages images from the UIT-EVJVQA dataset, annotating each with 5 captions following established guidelines. Three baseline models (CNN-LSTM, ViT-Transformer, GRIT) are evaluated on the dataset. |
KTVIC proves effective, enabling all baseline models to generate meaningful Vietnamese captions.
Transformer-based models (ViT-Transformer, GRIT) outperform the CNN-LSTM model, highlighting the strength of Transformers in this task.
GRIT, utilizing both grid and region features, achieves the best performance, demonstrating the effectiveness of this approach for Vietnamese image captioning. |
All baselines are fine-tuned using cross-entropy loss without further optimization techniques like CIDEr-D.
Future work can explore more advanced architectures and optimization strategies to further enhance Vietnamese image captioning performance. |
vietnamese image captioning, image captioning dataset, deep neural networks, computer vision, natural language processing |
2401.08053
Report |
SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation |
Zhixuan Liu, Peter Schaldenbrand, Beverley-Claire Okogwu, Wenxuan Peng, Youngsik Yun, Andrew Hundt, Jihie Kim, Jean Oh |
Accurate representation in media is known to improve the well-being of the
people who consume it. Generative image models trained on large web-crawled
datasets such as LAION are known to produce images with harmful stereotypes and
misrepresentations of cultures. We improve inclusive representation in
generated images by (1) engaging with communities to collect a culturally
representative dataset that we call the Cross-Cultural Understanding Benchmark
(CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method
that leverages the model's known biases to self-improve. SCoFT is designed to
prevent overfitting on small datasets, encode only high-level information from
the data, and shift the generated distribution away from misrepresentations
encoded in a pretrained model. Our user study conducted on 51 participants from
5 different countries based on their self-selected national cultural
affiliation shows that fine-tuning on CCUB consistently generates images with
higher cultural relevance and fewer stereotypes when compared to the Stable
Diffusion baseline, which is further improved with our SCoFT technique. |
This paper introduces SCoFT, a novel fine-tuning method for pre-trained text-to-image models to improve the cultural representation in generated images and reduce harmful stereotypes. |
Accurate representation in media is crucial for well-being and understanding of diverse cultures. Existing models trained on large, unfiltered datasets often perpetuate harmful stereotypes and misrepresent cultures. |
The authors collected CCUB, a culturally representative dataset with images and captions. They then proposed SCoFT, a self-contrastive fine-tuning technique that leverages the pre-trained model's biases by using its generated images as negative examples and CCUB images as positive examples during training. |
Fine-tuning on CCUB significantly reduces offensiveness and increases cultural relevance in generated images compared to the baseline Stable Diffusion model.
SCoFT further enhances these improvements by leveraging perceptual loss and a novel self-contrastive approach.
User studies with participants from diverse cultures confirm SCoFT’s effectiveness in generating more culturally representative and less stereotypical images. |
The current approach primarily focuses on generating accurate images within a specific cultural context, with future work exploring the generation of diverse images for generic prompts.
While CCUB was curated by cultural experts, more rigorous verification methods could be employed to further enhance the dataset's quality. |
culturally-aware image synthesis, text-to-image generation, stereotype mitigation, fine-tuning, contrastive learning |
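The self-contrastive idea in the SCoFT record above (curated CCUB images as positives, the pretrained model's own outputs as negatives) can be illustrated with a triplet-style loss on feature vectors. This is a minimal sketch, not the authors' formulation: the function name, the Euclidean distance, the margin value, and the use of plain feature vectors are all assumptions made for illustration.

```python
import numpy as np

def self_contrastive_loss(generated, positive, negatives, margin=0.5):
    """Hypothetical triplet-style loss (illustrative only).

    generated: (d,) features of an image from the model being fine-tuned
    positive:  (d,) features of a curated (CCUB-like) image
    negatives: (n, d) features of images sampled from the frozen pretrained model
    """
    d_pos = np.linalg.norm(generated - positive)                # pull towards the curated data
    d_neg = np.linalg.norm(negatives - generated, axis=1).mean()  # push away from biased samples
    return float(max(0.0, d_pos - d_neg + margin))

positive = np.array([0.0, 0.0])
negatives = np.array([[3.0, 4.0]])
loss_near_positive = self_contrastive_loss(np.array([0.0, 0.0]), positive, negatives)
loss_near_negative = self_contrastive_loss(np.array([3.0, 4.0]), positive, negatives)
```

The loss is zero when the generated features sit on the positive and far from the negatives, and grows as they drift towards the pretrained model's own outputs.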
2401.07781
Report |
Towards A Better Metric for Text-to-Video Generation |
Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou |
Generative models have demonstrated remarkable capability in synthesizing
high-quality text, images, and videos. For video generation, contemporary
text-to-video models exhibit impressive capabilities, crafting visually
stunning videos. Nonetheless, evaluating such videos poses significant
challenges. Current research predominantly employs automated metrics such as
FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis,
particularly in the temporal assessment of video content, thus rendering them
unreliable indicators of true video quality. Furthermore, while user studies
have the potential to reflect human perception accurately, they are hampered by
their time-intensive and laborious nature, with outcomes that are often tainted
by subjective bias. In this paper, we investigate the limitations inherent in
existing metrics and introduce a novel evaluation pipeline, the Text-to-Video
Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video
Alignment, which scrutinizes the fidelity of the video in representing the
given text description, and (2) Video Quality, which evaluates the video's
overall production caliber with a mixture of experts. Moreover, to evaluate the
proposed metrics and facilitate future improvements on them, we present the
TVGE dataset, collecting human judgements of 2,543 text-to-video generated
videos on the two criteria. Experiments on the TVGE dataset demonstrate the
superiority of the proposed T2VScore in offering a better metric for
text-to-video generation. |
This paper introduces T2VScore, a novel automatic evaluation metric for text-to-video generation, assessing both text-video alignment and video quality. |
Existing automated metrics for evaluating text-to-video models fall short in capturing temporal aspects and often misalign with human perception. |
T2VScore comprises two metrics: T2VScore-A, using vision-language models for text-video alignment assessment via question answering, and T2VScore-Q, employing a mix-of-experts approach combining technical and semantic quality evaluation for video quality assessment. The authors further present the TVGE dataset with human judgments on alignment and quality for benchmarking. |
T2VScore demonstrates superior correlation with human judgments compared to baseline metrics on the TVGE dataset.
Auxiliary trajectory information significantly enhances temporal understanding for evaluating text-video alignment.
The proposed adaptation strategy effectively generalizes T2VScore-Q to unseen text-to-video models. |
T2VScore-A's performance relies on the capabilities of multimodal large language models, which are still under development.
The TVGE dataset will be continuously expanded with more open-source text-to-video models. |
text-to-video generation, evaluation metric, video quality assessment, text-video alignment, multimodal large language models |
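The T2VScore record above reports correlation with human judgments. One common way to measure such agreement is a rank correlation between metric scores and human ratings. The sketch below computes a Spearman-style coefficient with NumPy. It assumes there are no tied scores, and the record does not state which correlation coefficients the paper uses, so treat the choice as an assumption.

```python
import numpy as np

def ranks(values):
    """Return 0-based ranks of the values (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(metric_scores, human_scores):
    """Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(metric_scores), ranks(human_scores))[0, 1])

# Hypothetical scores for three videos: the metric orders them like the raters do.
agree = spearman([0.1, 0.4, 0.9], [1, 3, 5])
# The metric orders them in exactly the opposite way.
disagree = spearman([0.9, 0.4, 0.1], [1, 3, 5])
```

A coefficient near 1 means the metric ranks videos the way human raters do; a value near -1 means it ranks them in reverse.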
2401.07770
Report |
Seeing the Unseen: Visual Common Sense for Semantic Placement |
Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs |
Computer vision tasks typically involve describing what is present in an
image (e.g. classification, detection, segmentation, and captioning). We study
a visual common sense task that requires understanding what is not present.
Specifically, given an image (e.g. of a living room) and name of an object
("cushion"), a vision system is asked to predict semantically-meaningful
regions (masks or bounding boxes) in the image where that object could be
placed or is likely to be placed by humans (e.g. on the sofa). We call this task
Semantic Placement (SP) and believe that such common-sense visual understanding
is critical for assistive robots (tidying a house) and AR devices
(automatically rendering an object in the user's space). Studying the invisible
is hard. Datasets for image description are typically constructed by curating
relevant images and asking humans to annotate the contents of the image;
neither of those two steps is straightforward for objects not present in the
image. We overcome this challenge by operating in the opposite direction: we
start with an image of an object in context from the web, and then remove that
object from the image via inpainting. This automated pipeline converts
unstructured web data into a dataset comprising pairs of images with/without
the object. Using this, we collect a novel dataset, with ~1.3M images
across 9 object categories, and train an SP prediction model called CLIP-UNet.
CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors
with object detectors on real-world and simulated images. In our user studies,
we find that the SP masks predicted by CLIP-UNet are favored 43.7% and
31.3% of the time when compared against the 4 SP baselines on real and
simulated images, respectively. In addition, we demonstrate that leveraging SP mask predictions
from CLIP-UNet enables downstream applications like building tidying robots in
indoor environments. |
This paper introduces Semantic Placement (SP), a novel task where a vision system predicts a binary mask highlighting semantically meaningful regions for placing a given object in an image. |
SP is crucial for applications like assistive robots, AR devices, and visually-grounded chatbots, requiring common-sense visual understanding of plausible object placements. |
The authors propose an automated data pipeline leveraging inpainting and object detection to generate a large-scale dataset of images with and without objects. They then train a CLIP-UNet model, combining a CLIP backbone with a language-conditioned UNet decoder, to predict SP masks. |
CLIP-UNet outperforms baselines combining LLMs with object detectors and VLM baselines (LLaVA, GPT4V) on SP prediction.
Human studies show strong preference for CLIP-UNet's SP mask predictions over other baselines.
The predicted SP masks enable a robot to perform an Embodied Semantic Placement (ESP) task in a simulated environment, demonstrating downstream applicability. |
The approach is limited by the performance of open-vocabulary detectors, segmentation models, and inpainting models used in data generation, which can introduce biases.
Zero-shot deployment for tasks like ESP can result in predictions not feasible for the robot's physical capabilities, necessitating embodiment-aware finetuning. |
semantic placement, computer vision, vision and language, robotics, common sense reasoning |
2401.07727
Report |
HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation |
Antoine Mercier, Ramin Nakhli, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger |
Despite the latest remarkable advances in generative modeling, efficient
generation of high-quality 3D assets from textual prompts remains a difficult
task. A key challenge lies in data scarcity: the most extensive 3D datasets
encompass merely millions of assets, while their 2D counterparts contain
billions of text-image pairs. To address this, we propose a novel approach
which harnesses the power of large, pretrained 2D diffusion models. More
specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image
model to jointly predict 6 orthographic projections and the corresponding
latent triplane. We then decode these latents to generate a textured mesh.
HexaGen3D does not require per-sample optimization, and can infer high-quality
and diverse objects from textual prompts in 7 seconds, offering significantly
better quality-to-latency trade-offs compared to existing approaches.
Furthermore, HexaGen3D demonstrates strong generalization to new objects or
compositions. |
HexaGen3D is a novel text-to-3D model that generates textured meshes from text prompts in 7 seconds, leveraging pretrained text-to-image diffusion models. |
Efficient generation of high-quality 3D assets from text is crucial for various industries but remains challenging due to data scarcity. Existing methods are either slow or lack quality/diversity. |
HexaGen3D finetunes a pretrained text-to-image model to predict six orthographic projections (hexaview), then maps these to a triplanar latent representation, finally decoded into a textured mesh. It introduces 'orthographic hexaview guidance' for 3D consistency and uses a novel layout converter for hexaview-to-triplane mapping. |
HexaGen3D achieves competitive quality to state-of-the-art methods while being significantly faster (7 seconds vs. 20 minutes to 3 hours).
It demonstrates superior diversity across generated samples compared to methods like DreamFusion and MVDream.
The approach shows strong generalization to unseen objects and compositions. |
Generated meshes can occasionally exhibit box artifacts or struggle with intricate structures.
Future work will focus on refining the VAE pipeline and exploring the impact of larger 3D datasets. |
text-to-3d, diffusion models, generative models, 3d asset creation, multi-view synthesis |
2401.07709
Report |
Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks |
Siyu Zou, Jiji Tang, Yiyi Zhou, Jing He, Chaoyi Zhao, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun |
Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which
often applies a semantic mask to control the target area for diffusion-based
editing. However, most existing solutions obtain these masks via manual
operations or off-line processing, greatly reducing their efficiency. In this
paper, we propose a novel and efficient image editing method for Text-to-Image
(T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In
particular, InstDiffEdit aims to employ the cross-modal attention ability of
existing diffusion models to achieve instant mask guidance during the diffusion
steps. To reduce the noise of attention maps and achieve full automation,
we equip InstDiffEdit with a training-free refinement scheme to adaptively
aggregate the attention distributions for the automatic yet accurate mask
generation. Meanwhile, to supplement the existing evaluations of DIE, we
propose a new benchmark called Editing-Mask to examine the mask accuracy and
local editing ability of existing methods. To validate InstDiffEdit, we also
conduct extensive experiments on ImageNet and Imagen, and compare it with
several SOTA methods. The experimental results show that InstDiffEdit not
only outperforms the SOTA methods in both image quality and editing results,
but also has a much faster inference speed, i.e., 5 to 6 times faster. |
This paper proposes InstDiffEdit, a novel and efficient image editing method for text-to-image diffusion models that uses cross-modal attention for instant mask guidance during diffusion steps. |
Existing diffusion-based image editing methods often rely on manual or offline mask generation, limiting their efficiency. InstDiffEdit aims to automate this process and improve speed. |
InstDiffEdit leverages the cross-modal attention maps within diffusion models to generate masks instantly. It incorporates a training-free refinement scheme to reduce noise and adaptively aggregate attention distributions for accurate mask generation. Finally, it uses the generated mask for inpainting-based editing, ensuring global semantic consistency. |
InstDiffEdit achieves state-of-the-art performance on ImageNet and Imagen datasets, demonstrating a superior trade-off between computation efficiency and generation quality.
Compared to the current leading method, DiffEdit, InstDiffEdit achieves 5 to 6 times faster inference speed while producing better masks and editing results.
A new benchmark called Editing-Mask, containing 200 images with human-labeled masks, is introduced to evaluate the local editing ability and mask accuracy of different methods, further confirming the superiority of InstDiffEdit in background preservation. |
The performance of InstDiffEdit might be affected by the complexity and quality of input images and text prompts.
Further exploration of more sophisticated refinement techniques for attention maps could potentially lead to even better mask accuracy and editing results. Future work could investigate extending InstDiffEdit to other diffusion model architectures beyond Stable Diffusion. |
image editing, diffusion models, text-to-image synthesis, cross-modal attention, semantic image manipulation |
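The InstDiffEdit record above describes turning cross-modal attention maps into an editing mask. A minimal sketch of that general idea is to average the attention maps for the edited token, normalize them, and threshold. The fixed averaging and the fixed threshold here are stand-in assumptions; the paper's training-free refinement scheme is adaptive and is not reproduced.

```python
import numpy as np

def attention_to_mask(attn_maps, threshold=0.5):
    """Hypothetical helper: collapse (k, h, w) attention maps into a boolean mask."""
    avg = attn_maps.mean(axis=0)                # aggregate over heads/steps
    lo, hi = avg.min(), avg.max()
    norm = (avg - lo) / (hi - lo + 1e-8)        # rescale to roughly [0, 1]
    return norm >= threshold                    # True where the edit should apply

# Toy input: two 2x2 maps that both attend only to the top-left location.
attn = np.zeros((2, 2, 2))
attn[:, 0, 0] = 1.0
mask = attention_to_mask(attn)
```

In an inpainting-style edit, pixels where the mask is True would be regenerated and the remaining pixels would be kept from the input image.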
2401.07519
Report |
InstantID: Zero-shot Identity-Preserving Generation in Seconds |
Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu |
There has been significant progress in personalized image synthesis with
methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world
applicability is hindered by high storage demands, lengthy fine-tuning
processes, and the need for multiple reference images. Conversely, existing ID
embedding-based methods, while requiring only a single forward inference, face
challenges: they either necessitate extensive fine-tuning across numerous model
parameters, lack compatibility with community pre-trained models, or fail to
maintain high face fidelity. Addressing these limitations, we introduce
InstantID, a powerful diffusion model-based solution. Our plug-and-play module
adeptly handles image personalization in various styles using just a single
facial image, while ensuring high fidelity. To achieve this, we design a novel
IdentityNet by imposing strong semantic and weak spatial conditions,
integrating facial and landmark images with textual prompts to steer the image
generation. InstantID demonstrates exceptional performance and efficiency,
proving highly beneficial in real-world applications where identity
preservation is paramount. Moreover, our work seamlessly integrates with
popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving
as an adaptable plugin. Our codes and pre-trained checkpoints will be available
at https://github.com/InstantID/InstantID. |
Introduces InstantID, a plug-and-play module for pre-trained text-to-image diffusion models, enabling zero-shot identity-preserving image generation using a single facial image. |
Addresses limitations of existing methods that are either resource-intensive, require multiple reference images, or lack fidelity in preserving facial details. |
Combines an ID embedding from a pre-trained face model with a lightweight image adapter and a novel IdentityNet for encoding detailed facial features with spatial control. |
Achieves high fidelity in ID preservation with a single reference image, surpassing or matching the performance of training-based methods.
Maintains text editing capabilities of the original diffusion model, allowing for style variations while preserving identity.
Demonstrates compatibility with existing ControlNet models for spatial control and seamlessly integrates with pre-trained models like SD1.5 and SDXL. |
Facial attribute features are highly coupled in the ID embedding, limiting flexibility in face editing.
Potential biases from the pre-trained face model might impact the generated images. |
image synthesis, identity preservation, diffusion models, zero-shot learning, image customization |
2401.06994
Report |
UniVision: A Unified Framework for Vision-Centric 3D Perception |
Yu Hong, Qian Liu, Huayuan Cheng, Danjiao Ma, Hang Dai, Yu Wang, Guangzhi Cao, Yong Ding |
The past few years have witnessed the rapid development of vision-centric 3D
perception in autonomous driving. Although the 3D perception models share many
structural and conceptual similarities, there still exist gaps in their feature
representations, data formats, and objectives, posing challenges for unified
and efficient 3D perception framework design. In this paper, we present
UniVision, a simple and efficient framework that unifies two major tasks in
vision-centric 3D perception, i.e., occupancy prediction and object detection.
Specifically, we propose an explicit-implicit view transform module for
complementary 2D-3D feature transformation. We propose a local-global feature
extraction and fusion module for efficient and adaptive voxel and BEV feature
extraction, enhancement, and interaction. Further, we propose a joint
occupancy-detection data augmentation strategy and a progressive loss weight
adjustment strategy which enables the efficiency and stability of the
multi-task framework training. We conduct extensive experiments for different
perception tasks on four public benchmarks, including nuScenes LiDAR
segmentation, nuScenes detection, OpenOccupancy, and Occ3D. UniVision achieves
state-of-the-art results with +1.5 mIoU, +1.8 NDS, +1.5 mIoU, and +1.8 mIoU
gains on each benchmark, respectively. We believe that the UniVision framework
can serve as a high-performance baseline for the unified vision-centric 3D
perception task. The code will be available at
https://github.com/Cc-Hy/UniVision. |
This paper presents UniVision, a simple and efficient framework unifying 3D object detection and occupancy prediction for vision-centric autonomous driving. |
Existing 3D perception models have gaps in feature representations, data formats, and objectives, making unified framework design challenging. |
UniVision utilizes an explicit-implicit view transform module, local-global feature extraction and fusion, joint occupancy-detection augmentation, and progressive loss weight adjustment. |
Achieves state-of-the-art results on nuScenes LiDAR segmentation, surpassing previous best by +1.5 mIoU.
Outperforms state-of-the-art methods on nuScenes detection, with a significant +1.8 NDS gain.
Sets new records on OpenOccupancy and Occ3D benchmarks with +1.5 mIoU and +1.8 mIoU gains respectively. |
Current version of UniVision does not incorporate temporal information.
Joint augmentation strategy currently relies on sampling and interpolation, which might introduce artifacts. |
3d object detection, occupancy prediction, autonomous driving, vision-centric perception, multi-task learning |
2401.06805
Report |
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning |
Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang |
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence
(AGI) with abstract reasoning ability is the goal of next-generation AI. Recent
advancements in Large Language Models (LLMs), along with the emerging field of
Multimodal Large Language Models (MLLMs), have demonstrated impressive
capabilities across a wide range of multimodal tasks and applications.
Particularly, various MLLMs, each with distinct model architectures, training
data, and training stages, have been evaluated across a broad range of MLLM
benchmarks. These studies have, to varying degrees, revealed different aspects
of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs
have not been systematically investigated. In this survey, we comprehensively
review the existing evaluation protocols of multimodal reasoning, categorize
and illustrate the frontiers of MLLMs, introduce recent trends in applications
of MLLMs on reasoning-intensive tasks, and finally discuss current practices
and future directions. We believe our survey establishes a solid base and sheds
light on this important topic, multimodal reasoning. |
This paper presents a comprehensive survey of Multimodal Large Language Models (MLLMs) focusing on their reasoning capabilities. |
Reasoning is a key aspect of intelligence and crucial for developing AGI, making the investigation of MLLMs' reasoning abilities essential. |
The paper reviews definitions and protocols for evaluating reasoning, different types of reasoning tasks, MLLM architectures, and the role of instruction tuning in facilitating reasoning. It analyzes the performance of various MLLMs on established benchmarks. |
MLLMs struggle with complex reasoning tasks requiring multi-step inference and domain knowledge.
Instruction tuning significantly improves MLLMs' reasoning capabilities but faces challenges in multimodal prompting.
Top-performing open-source MLLMs often employ three-stage training, leverage multi-task supervised learning, and benefit from improved visual representations. |
The analysis primarily focuses on top-performing models and a subset of reasoning-focused benchmarks, limiting the generalizability of findings.
Future research should address limitations in MLLM architectures, develop efficient training methods, and create more comprehensive evaluation benchmarks, particularly for long-context and multi-round conversational scenarios. |
multimodal reasoning, multimodal large language models, instruction tuning, benchmark analysis, future directions |
2401.06704
Report |
Scalable 3D Panoptic Segmentation As Superpoint Graph Clustering |
Damien Robert, Hugo Raguet, Loic Landrieu |
We introduce a highly efficient method for panoptic segmentation of large 3D
point clouds by redefining this task as a scalable graph clustering problem.
This approach can be trained using only local auxiliary tasks, thereby
eliminating the resource-intensive instance-matching step during training.
Moreover, our formulation can easily be adapted to the superpoint paradigm,
further increasing its efficiency. This allows our model to process scenes with
millions of points and thousands of objects in a single inference. Our method,
called SuperCluster, achieves a new state-of-the-art panoptic segmentation
performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS
Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first
state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and
DALES. With only 209k parameters, our model is over 30 times smaller than
the best-competing method and trains up to 15 times faster. Our code and
pretrained models are available at
https://github.com/drprojects/superpoint_transformer. |
SuperCluster, a novel method for efficient and scalable 3D panoptic segmentation of large point clouds, redefines the task as a graph clustering problem. |
Large-scale 3D environment understanding is crucial for various applications like 'digital twins' and city digitization, requiring scalable models to process massive point clouds and identify objects. |
The method uses a neural network to predict semantic classes and object agreement for points (or superpoints). These predictions are used as parameters in a graph clustering problem, grouping points into object instances. Crucially, the model is trained using only local auxiliary tasks, eliminating computationally expensive instance matching during training. |
SuperCluster achieves state-of-the-art panoptic segmentation on S3DIS and ScanNet, significantly outperforming previous methods.
It sets the first panoptic segmentation benchmark for large-scale datasets KITTI-360 and DALES.
SuperCluster is extremely efficient, using a small network and training up to 15 times faster than competitors. |
The graph clustering function is non-differentiable, preventing end-to-end learning.
Superpoint partitioning can be sensitive to low point density in sparse scans. |
3d panoptic segmentation, graph clustering, point cloud processing, large-scale 3d, superpoints |
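The SuperCluster record above groups superpoints into instances using predicted semantic classes and pairwise object agreement. The sketch below is a deliberately simplified stand-in: it merges adjacent superpoints with union-find when they share a class and their agreement exceeds a threshold. The paper instead uses the predictions as parameters of a graph clustering problem, so the greedy thresholding, the function name, and the threshold value are assumptions.

```python
def cluster_superpoints(semantic, edges, agreement, threshold=0.5):
    """Hypothetical greedy grouping of superpoints into object instances.

    semantic:  predicted class per superpoint
    edges:     list of (u, v) adjacency pairs
    agreement: predicted probability that u and v belong to the same object
    """
    parent = list(range(len(semantic)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (u, v), a in zip(edges, agreement):
        if a >= threshold and semantic[u] == semantic[v]:
            parent[find(u)] = find(v)      # merge the two components
    return [find(i) for i in range(len(semantic))]

# Toy chain of four superpoints: only the first pair should be merged.
labels = cluster_superpoints(
    semantic=[0, 0, 1, 1],
    edges=[(0, 1), (1, 2), (2, 3)],
    agreement=[0.9, 0.9, 0.2],
)
```

Superpoints 0 and 1 end up in one instance; 1 and 2 stay apart because their classes differ, and 2 and 3 stay apart because their agreement is low.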
2401.06637
Report |
Adversarial Examples are Misaligned in Diffusion Model Manifolds |
Peter Lorenz, Ricard Durall, Janis Keuper |
In recent years, diffusion models (DMs) have drawn significant attention for
their success in approximating data distributions, yielding state-of-the-art
generative results. Nevertheless, the versatility of these models extends
beyond their generative capabilities to encompass various vision applications,
such as image inpainting, segmentation, and adversarial robustness, among others.
This study is dedicated to the investigation of adversarial attacks through the
lens of diffusion models. However, our objective does not involve enhancing the
adversarial robustness of image classifiers. Instead, our focus lies in
utilizing the diffusion model to detect and analyze the anomalies introduced by
these attacks on images. To that end, we systematically examine the alignment
of the distributions of adversarial examples when subjected to the process of
transformation using diffusion models. The efficacy of this approach is
assessed across CIFAR-10 and ImageNet datasets, including varying image sizes
in the latter. The results demonstrate a notable capacity to discriminate
effectively between benign and attacked images, providing compelling evidence
that adversarial instances do not align with the learned manifold of the DMs. |
This paper presents a novel method for detecting adversarial examples in images using diffusion models (DMs). The key idea is to leverage the DM's ability to learn the manifold of natural images and exploit the fact that adversarial examples often lie outside this manifold. This results in distinct patterns in transformed adversarial images, which can be learned by a simple binary classifier. |
Detecting adversarial examples is crucial for deploying deep learning models in security-sensitive applications, as these examples can lead to misclassifications and system vulnerabilities. Existing defense mechanisms often struggle with high-resolution images and adaptive attacks. This method offers a new approach to address this challenge by using the transformative capabilities of DMs. |
The method involves the following steps: 1) Apply a pre-trained DM to transform both benign and adversarial images using the inversion and reverse process. 2) Train a binary classifier (ResNet-50 or ResNet-18) on the transformed images to distinguish between adversarial and benign samples. |
The method achieves high detection accuracy (AUC, ACC > 95%) across various white-box and black-box attacks on CIFAR-10 and ImageNet datasets, including high-resolution images (512x512 pixels).
Analysis of the transformed images suggests that adversarial perturbations introduce detectable patterns in the DM's reverse process, even after multiple transformations.
While the method shows promising results, its transferability to unseen attacks is limited, suggesting it acts as a complementary defense mechanism rather than a standalone solution. |
The method's reliance on pre-trained DMs limits its effectiveness against adaptive attacks that modify their strategies during test time.
The transferability of the learned patterns to unseen attacks is limited, necessitating further research on improving generalization. |
adversarial examples, diffusion models, adversarial detection, image classification, deep learning |
2401.06614
Report |
Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking |
Wei Cao, Chang Luo, Biao Zhang, Matthias Nießner, Jiapeng Tang |
We introduce Motion2VecSets, a 4D diffusion model for dynamic surface
reconstruction from point cloud sequences. While existing state-of-the-art
methods have demonstrated success in reconstructing non-rigid objects using
neural field representations, conventional feed-forward networks encounter
challenges with ambiguous observations from noisy, partial, or sparse point
clouds. To address these challenges, we introduce a diffusion model that
explicitly learns the shape and motion distribution of non-rigid objects
through an iterative denoising process of compressed latent representations.
The diffusion-based priors enable more plausible and probabilistic
reconstructions when handling ambiguous inputs. We parameterize 4D dynamics
with latent sets instead of using global latent codes. This novel 4D
representation allows us to learn local shape and deformation patterns, leading
to more accurate non-linear motion capture and significantly improving
generalizability to unseen motions and identities. For more temporally-coherent
object tracking, we synchronously denoise deformation latent sets and exchange
information across multiple frames. To avoid computational overhead, we
designed an interleaved space and time attention block to alternately aggregate
deformation latents along spatial and temporal domains. Extensive comparisons
against state-of-the-art methods demonstrate the superiority of our
Motion2VecSets in 4D reconstruction from various imperfect observations. More
detailed information can be found at
https://vveicao.github.io/projects/Motion2VecSets/. |
Introduces Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from sparse, noisy, or partial point cloud sequences. |
Existing feed-forward networks struggle with ambiguous observations from imperfect point clouds, and conventional 4D representations fail to capture accurate shape and motion priors. |
The model leverages a two-stage approach, first learning shape and deformation priors with autoencoders and then utilizing these priors in a diffusion model to reconstruct 4D surfaces. It employs latent sets for shape and deformation, enabling local representation, and an interleaved spatio-temporal attention mechanism for efficient and temporally consistent diffusion. |
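The interleaved spatio-temporal attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention here has no learned projections, and the names `self_attention` and `interleaved_block` and the shapes `T`, `M`, `C` are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention over the second-to-last axis.

    Illustrative only: queries, keys and values are the inputs themselves,
    with no learned weights.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def interleaved_block(latents):
    """latents: (T, M, C) = frames, latent vectors per set, channels."""
    # Spatial step: within each frame, the M latents attend to each other.
    x = latents + self_attention(latents)      # batch axis is T
    # Temporal step: each latent index attends across the T frames.
    xt = np.swapaxes(x, 0, 1)                  # (M, T, C)
    xt = xt + self_attention(xt)               # batch axis is M
    return np.swapaxes(xt, 0, 1)               # back to (T, M, C)

T, M, C = 4, 8, 16
rng = np.random.default_rng(0)
out = interleaved_block(rng.normal(size=(T, M, C)))
```

Alternating the two steps computes T score matrices of size M x M and M of size T x T, rather than one joint matrix of size (T·M) x (T·M), which is where the saving over full space-time attention comes from.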
Motion2VecSets reconstructs more plausible and accurate surfaces compared to previous state-of-the-art methods, especially in challenging scenarios with sparse or partial inputs.
The model exhibits superior generalization ability to unseen motions and object identities, thanks to the local representation power of latent sets.
Synchronized diffusion of deformation latent sets, facilitated by the interleaved spatio-temporal attention mechanism, ensures robust temporal coherence in reconstructed 4D surfaces. |
The current implementation has a relatively long inference time, limiting real-time applications.
The model's focus on single-view reconstruction could be extended to multi-view scenarios for more comprehensive 4D capture. |
4d reconstruction, diffusion model, dynamic surface, point cloud sequences, latent sets |
2401.06578
Report |
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model |
Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, Jian Zhang |
Panorama video recently attracts more interest in both study and application,
courtesy of its immersive experience. Due to the expensive cost of capturing
360-degree panoramic videos, generating desirable panorama videos by prompts is
urgently required. Lately, the emerging text-to-video (T2V) diffusion methods
demonstrate notable effectiveness in standard video generation. However, due to
the significant gap in content and motion patterns between panoramic and
standard videos, these methods encounter challenges in yielding satisfactory
360-degree panoramic videos. In this paper, we propose a pipeline named
360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic
videos based on the given prompts and motion conditions. Specifically, we
introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques
to transform pre-trained T2V models for panorama video generation. We further
propose a new panorama dataset named WEB360 consisting of panoramic video-text
pairs for training 360DVD, addressing the absence of captioned panoramic video
datasets. Extensive experiments demonstrate the superiority and effectiveness
of 360DVD for panorama video generation. Our project page is at
https://akaneqwq.github.io/360DVD/. |
Introduces 360DVD, a controllable 360-degree panorama video generation diffusion model, by adapting a standard T2V model with a lightweight 360-Adapter. |
Existing text-to-video (T2V) diffusion models struggle to generate satisfactory 360-degree panoramic videos due to the distinct content and motion patterns compared to standard videos. This necessitates a dedicated approach. |
Leverages a pre-trained denoising U-Net with a trainable 360-Adapter to capture panoramic characteristics. Employs 360 Enhancement Techniques, including a latitude-aware loss and mechanisms for continuity, to enhance quality. Introduces WEB360, a new dataset of panoramic videos with detailed captions using a GPT-based 360 Text Fusion module. |
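To make the idea of a latitude-aware loss concrete, here is a minimal sketch assuming a cosine-of-latitude weighting, a common choice for equirectangular frames because rows near the poles are stretched. The weighting and the function names `latitude_weights` and `latitude_aware_mse` are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def latitude_weights(height, width):
    # Map row centres to latitude in (-pi/2, pi/2); the equator is mid-height.
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    w = np.cos(lat)                      # largest at the equator, near 0 at the poles
    return np.repeat(w[:, None], width, axis=1)

def latitude_aware_mse(pred, target):
    """Weighted MSE over arrays whose last two axes are (height, width)."""
    w = latitude_weights(*pred.shape[-2:])
    err = (pred - target) ** 2
    return float(np.mean(w * err) / np.mean(w))

H, W = 8, 16
target = np.zeros((H, W))
pole_error = target.copy()
pole_error[0, :] = 1.0                   # error confined to the top row
equator_error = target.copy()
equator_error[H // 2, :] = 1.0           # same error near the equator
```

With this weighting, the same per-pixel error is penalised more near the equator than near the poles.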
Generates text-aligned and coherent 360-degree panorama videos with high quality and diverse styles.
Successfully incorporates motion guidance, enabling control over video dynamics.
Outperforms baseline methods in terms of graphics quality, frame consistency, and adherence to panoramic video characteristics based on user studies. |
Performance relies on the underlying T2V model, limiting capabilities due to frozen parameters during training.
Reliance on predicted motion conditions from a panoramic optical flow estimator introduces limitations due to the estimator's performance.
Future work includes exploring the use of other motion conditions such as depth maps and expanding control beyond optical flow. |
panorama video generation, text-to-video synthesis, diffusion models, 360-degree videos, motion guidance |
2401.06442
Report |
RotationDrag: Point-based Image Editing with Rotated Diffusion Features |
Minxing Luo, Wentao Cheng, Jian Yang |
A precise and user-friendly manipulation of image content while preserving
image fidelity has always been crucial to the field of image editing. Thanks to
the power of generative models, recent point-based image editing methods allow
users to interactively change the image content with high generalizability by
clicking several control points. But the above mentioned editing process is
usually based on the assumption that features stay constant in the motion
supervision step from initial to target points. In this work, we conduct a
comprehensive investigation in the feature space of diffusion models, and find
that features change acutely under in-plane rotation. Based on this, we propose
a novel approach named RotationDrag, which significantly improves point-based
image editing performance when users intend to in-plane rotate the image
content. Our method tracks handle points more precisely by utilizing the
feature map of the rotated images, thus ensuring precise optimization and high
image fidelity. Furthermore, we build an in-plane rotation focused benchmark
called RotateBench, the first benchmark to evaluate the performance of
point-based image editing method under in-plane rotation scenario on both real
images and generated images. A thorough user study demonstrates the superior
capability in accomplishing in-plane rotation that users intend to achieve,
comparing the DragDiffusion baseline and other existing diffusion-based
methods. See the project page https://github.com/Tony-Lowe/RotationDrag for
code and experiment results. |
Introduces RotationDrag, a point-based image editing method that uses rotated diffusion features to improve editing accuracy, particularly under in-plane rotation. |
Existing point-based editing methods assume feature constancy during motion supervision, leading to inaccurate edits, especially during rotations which are common in user edits. RotationDrag addresses this by using features from rotated images for precise point tracking. |
RotationDrag calculates rotation angles between initial and current handle points during optimization. It then rotates the input image accordingly and utilizes the feature map of this rotated image for accurate point tracking and motion supervision. |
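The angle computation described above can be sketched as follows. This is a minimal illustration, assuming the rotation is measured about a known center point; the summary does not say how that center is chosen, and the helper names `rotation_angle` and `rotate_point` are hypothetical.

```python
import math

def rotation_angle(center, initial, current):
    """Signed angle in radians that rotates `initial` onto `current` about `center`."""
    ax, ay = initial[0] - center[0], initial[1] - center[1]
    bx, by = current[0] - center[0], current[1] - center[1]
    # atan2(cross, dot) gives the signed angle between the two offset vectors.
    return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

def rotate_point(center, point, angle):
    """Rotate `point` about `center` by `angle` radians."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + dx * c - dy * s, center[1] + dx * s + dy * c)

angle = rotation_angle((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
moved = rotate_point((0.0, 0.0), (1.0, 0.0), angle)
```

In image coordinates, where the y axis points down, a positive angle from this formula appears clockwise on screen, so the sign convention has to match whatever routine rotates the image.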
RotationDrag demonstrates superior performance in rotating and dragging image content compared to DragDiffusion, FreeDrag, and SDE-Drag.
A user study confirms RotationDrag's significantly better performance in achieving desired in-plane rotations.
The paper introduces RotateBench, a new benchmark dataset focused on evaluating in-plane rotation in image editing. |
RotationDrag's reliance on repeated inversions during point tracking impacts its speed compared to DragDiffusion.
Future work will explore improving Stable Diffusion's handling of rotations to potentially enhance speed. |
point-based image editing, stable diffusion, diffusion models, rotation invariance, image manipulation |
2401.06345
Report |
Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering |
Chang Yu, Junran Peng, Xiangyu Zhu, Zhaoxiang Zhang, Qi Tian, Zhen Lei |
The text-to-image synthesis by diffusion models has recently shown remarkable
performance in generating high-quality images. Although they perform well for
simple texts, the models may get confused when faced with complex texts that
contain multiple objects or spatial relationships. To get the desired images, a
feasible way is to manually adjust the textual descriptions, i.e., narrating
the texts or adding some words, which is labor-consuming. In this paper, we
propose a framework to learn the proper textual descriptions for diffusion
models through prompt learning. By utilizing the quality guidance and the
semantic guidance derived from the pre-trained diffusion model, our method can
effectively learn the prompts to improve the matches between the input text and
the generated images. Extensive experiments and analyses have validated the
effectiveness of the proposed method. |
This paper introduces a novel framework leveraging prompt engineering to enhance text-to-image synthesis in diffusion models, specifically targeting improved accuracy for complex textual descriptions. |
Existing diffusion models often struggle to accurately synthesize images from complex text descriptions containing multiple objects or spatial relationships. This work addresses this limitation by learning appropriate textual prompts that guide the model to generate more accurate images. |
The proposed two-stage framework utilizes a pre-trained diffusion model. It first generates coarse and fine images from the input text. It then learns text-specific prompts guided by minimizing the difference between text/image embeddings and promoting consistency between the generated images and the input text, as well as sparsity in the learned prompts. |
The method effectively learns prompts that improve text-image matching and reduce artifacts in synthesized images.
It outperforms existing methods like Composable Diffusion and Structure Diffusion in synthesizing images from both composable and relational text descriptions.
Visualizations of cross-attention maps demonstrate that the learned prompts help the model focus on previously neglected objects or relationships in the text, leading to more accurate image generation. |
The method relies on pre-trained diffusion models and doesn't involve fine-tuning the model itself, which could potentially limit the extent of improvement.
The paper primarily focuses on generating images from complex texts, and further investigation is needed to evaluate its efficacy in other text-to-image synthesis scenarios. |
text-to-image synthesis, diffusion models, prompt engineering, composable diffusion, relational text |
2401.06341
Report |
AffordanceLLM: Grounding Affordance from Vision Language Models |
Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li |
Affordance grounding refers to the task of finding the area of an object with
which one can interact. It is a fundamental but challenging task, as a
successful solution requires the comprehensive understanding of a scene in
multiple aspects including detection, localization, and recognition of objects
with their parts, of geo-spatial configuration/layout of the scene, of 3D
shapes and physics, as well as of the functionality and potential interaction
of the objects and humans. Much of the knowledge is hidden and beyond the image
content with the supervised labels from a limited training set. In this paper,
we make an attempt to improve the generalization capability of the current
affordance grounding by taking advantage of the rich world, abstract, and
human-object-interaction knowledge from pretrained large-scale vision language
models. Under the AGD20K benchmark, our proposed model demonstrates a
significant performance gain over the competing methods for in-the-wild object
affordance grounding. We further demonstrate it can ground affordance for
objects from random Internet images, even if both objects and actions are
unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/ |
Introduces AffordanceLLM, a novel affordance grounding approach leveraging world knowledge from pretrained Vision Language Models (VLMs) to improve generalization to unseen objects. |
Affordance grounding, crucial for embodied AI tasks like robot manipulation, struggles to generalize to novel objects unseen during training. |
Extends a VLM (LLaVA) with a mask decoder to predict affordance maps from images and text prompts. Incorporates pseudodepth maps as input to enhance 3D understanding. |
Significantly outperforms state-of-the-art baselines on AGD20K, especially on a newly proposed 'Hard' split designed to test generalization.
Demonstrates successful affordance grounding on random Internet images with novel objects and actions.
Shows the importance of both appropriate text prompts and the visual grounding capability of the image encoder. |
Can struggle with ambiguous situations or scenes with multiple objects.
Reliance on pretrained VLMs introduces potential biases and limitations. |
affordance grounding, vision language models, generalization, 3d understanding, robot manipulation |
2401.06310
Report |
ViSAGe: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation |
Akshita Jha, Vinodkumar Prabhakaran, Remi Denton, Sarah Laszlo, Shachi Dave, Rida Qadri, Chandan K. Reddy, Sunipa Dev |
Recent studies have shown that Text-to-Image (T2I) model generations can
reflect social stereotypes present in the real world. However, existing
approaches for evaluating stereotypes have a noticeable lack of coverage of
global identity groups and their associated stereotypes. To address this gap,
we introduce the ViSAGe (Visual Stereotypes Around the Globe) dataset to enable
the evaluation of known nationality-based stereotypes in T2I models, across 135
nationalities. We enrich an existing textual stereotype resource by
distinguishing between stereotypical associations that are more likely to have
visual depictions, such as 'sombrero', from those that are less visually
concrete, such as 'attractive'. We demonstrate ViSAGe's utility through a
multi-faceted evaluation of T2I generations. First, we show that stereotypical
attributes in ViSAGe are thrice as likely to be present in generated images of
corresponding identities as compared to other attributes, and that the
offensiveness of these depictions is especially higher for identities from
Africa, South America, and South East Asia. Second, we assess the stereotypical
pull of visual depictions of identity groups, which reveals how the 'default'
representations of all identity groups in ViSAGe have a pull towards
stereotypical depictions, and that this pull is even more prominent for
identity groups from the Global South. CONTENT WARNING: Some examples contain
offensive stereotypes. |
This paper introduces ViSAGe, a dataset for evaluating nationality-based stereotypes in Text-to-Image models, covering 135 nationalities, by distinguishing visually depicted stereotypes from those less visually concrete. |
Existing approaches lack global coverage in evaluating social stereotypes in T2I models, making it crucial to develop methods to assess and mitigate potential harm, particularly for marginalized groups. |
The authors enriched an existing textual stereotype resource by identifying visually depictable stereotypes. They conducted large-scale human annotations and explored automated methods using CLIP to detect stereotypes in images generated by Stable Diffusion. |
Stereotypical attributes are three times more likely to be present in generated images compared to non-stereotypical attributes.
Offensive depictions are particularly high for identities from Africa, South America, and Southeast Asia.
T2I models exhibit a 'stereotypical pull', generating images aligning with stereotypes even when prompted otherwise, especially for Global South identities. |
Annotation of visual stereotypes can be subjective, potentially missing nuances.
Evaluation is limited by stereotypes present in the initial textual resource (SeeGULL), necessitating inclusion from other sources. |
stereotype evaluation, text-to-image models, visage dataset, global stereotypes, bias in ai |
2401.06209
Report |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs |
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie |
Is vision good enough for language? Recent advancements in multimodal models
primarily stem from the powerful reasoning abilities of large language models
(LLMs). However, the visual component typically depends only on the
instance-level contrastive language-image pre-training (CLIP). Our research
reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still
exhibit systematic shortcomings. To understand the roots of these errors, we
explore the gap between the visual embedding space of CLIP and vision-only
self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP
perceives as similar despite their clear visual differences. With these pairs,
we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes
areas where state-of-the-art systems, including GPT-4V, struggle with
straightforward questions across nine basic visual patterns, often providing
incorrect answers and hallucinated explanations. We further evaluate various
CLIP-based vision-and-language models and find a notable correlation between
visual patterns that challenge CLIP models and those problematic for multimodal
LLMs. As an initial effort to address these issues, we propose a Mixture of
Features (MoF) approach, demonstrating that integrating vision self-supervised
learning features with MLLMs can significantly enhance their visual grounding
capabilities. Together, our research suggests visual representation learning
remains an open challenge, and accurate visual grounding is crucial for future
successful multimodal systems. |
This paper introduces the Multimodal Visual Patterns (MMVP) benchmark to expose the systematic visual shortcomings of Multimodal Large Language Models (MLLMs) like GPT-4V, particularly in visual grounding. |
Despite advancements in MLLMs, their visual component often relies on instance-level contrastive learning (e.g., CLIP), leading to fundamental visual understanding errors. |
The authors identify "CLIP-blind pairs" - images visually different but perceived as similar by CLIP. Using these pairs, they construct the MMVP benchmark with straightforward VQA questions targeting these visual discrepancies. They also analyze systematic failure patterns in CLIP across various model scales and correlate them to MLLM errors. Finally, they propose Mixture-of-Features (MoF) approaches to enhance MLLM visual grounding. |
Human evaluation confirms the MMVP benchmark questions are straightforward, achieving 95.7% accuracy, while MLLMs, even GPT-4V, struggle significantly.
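The selection of CLIP-blind pairs can be sketched as a filter over two embedding spaces: keep image pairs that CLIP scores as highly similar but a vision-only self-supervised encoder scores as dissimilar. The function names and the threshold values `clip_min` and `ssl_max` below are illustrative assumptions, and the toy embeddings are made up for the example.

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarity for row-wise embeddings of shape (N, D)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

def clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Return index pairs (i, j), i < j, that look alike to CLIP but not to the SSL encoder.

    clip_min and ssl_max are assumed thresholds, not values taken from the summary.
    """
    clip_sim = cosine_matrix(clip_emb)
    ssl_sim = cosine_matrix(ssl_emb)
    i, j = np.triu_indices(len(clip_emb), k=1)   # each unordered pair once
    keep = (clip_sim[i, j] > clip_min) & (ssl_sim[i, j] < ssl_max)
    return list(zip(i[keep].tolist(), j[keep].tolist()))

# Toy data: images 0 and 1 are near-duplicates for "CLIP" but orthogonal for "SSL".
clip_emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
ssl_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pairs = clip_blind_pairs(clip_emb, ssl_emb)
```

In practice the two embedding matrices would come from running the same image set through a CLIP image encoder and a self-supervised vision encoder.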
Scaling up CLIP model size and data only marginally improves performance on two out of nine identified visual patterns.
A strong correlation exists between CLIP's visual pattern recognition errors and the performance of MLLMs, indicating CLIP as a bottleneck. |
The study primarily focuses on CLIP-based MLLMs, potentially limiting generalizability to other architectures.
While MoF shows promise, further exploration is needed to optimize feature integration and balance visual grounding with other capabilities. |
multimodal learning, visual grounding, large language models, benchmarking, visual representation learning |
2401.06197
Report |
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications |
Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai |
We introduce Deformable Convolution v4 (DCNv4), a highly efficient and
effective operator designed for a broad spectrum of vision applications. DCNv4
addresses the limitations of its predecessor, DCNv3, with two key enhancements:
1. removing softmax normalization in spatial aggregation to enhance its dynamic
property and expressive power and 2. optimizing memory access to minimize
redundant operations for speedup. These improvements result in a significantly
faster convergence compared to DCNv3 and a substantial increase in processing
speed, with DCNv4 achieving more than three times the forward speed. DCNv4
demonstrates exceptional performance across various tasks, including image
classification, instance and semantic segmentation, and notably, image
generation. When integrated into generative models like U-Net in the latent
diffusion model, DCNv4 outperforms its baseline, underscoring its possibility
to enhance generative models. In practical applications, replacing DCNv3 with
DCNv4 in the InternImage model to create FlashInternImage results in up to 80%
speed increase and further performance improvement without further
modifications. The advancements in speed and efficiency of DCNv4, combined with
its robust performance across diverse vision tasks, show its potential as a
foundational building block for future vision models. |
This paper introduces Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. |
Despite the advantages of DCN, it is not the go-to solution for vision backbone models due to its slow speed and counter-intuitive slower convergence compared to global attention at the initial backbone training phase. This work aims to address these limitations. |
The authors improve on DCNv3 by 1) removing the softmax normalization in spatial aggregation and 2) optimizing memory access to minimize redundant operations for speedup. |
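The effect of dropping softmax in spatial aggregation can be shown with a small sketch. It models only the weighted sum over K sampled feature vectors; the offset prediction and bilinear sampling of a real deformable convolution are omitted, and the function name `aggregate` is hypothetical.

```python
import numpy as np

def aggregate(values, logits, use_softmax):
    """Combine K sampled feature vectors into one output vector.

    values: (K, C) features sampled at the deformed locations.
    logits: (K,) predicted aggregation scalars.
    """
    if use_softmax:
        # Softmax-normalised weights: non-negative and summing to 1,
        # so the output is a convex combination of the samples.
        w = np.exp(logits - logits.max())
        w = w / w.sum()
    else:
        # No normalisation: weights are unbounded and may be negative.
        w = logits
    return w @ values

values = np.array([[1.0], [2.0], [3.0]])
logits = np.array([2.0, -1.0, 3.0])
bounded = aggregate(values, logits, use_softmax=True)
unbounded = aggregate(values, logits, use_softmax=False)
```

With softmax the result always lies between the smallest and largest sampled value, while without it the output can leave that range, as in an ordinary convolution.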
DCNv4 converges significantly faster than DCNv3 (its predecessor).
It accelerates forward speed by more than 3 times.
DCNv4 achieves performance improvement in various tasks, including image classification, instance and semantic segmentation, and image generation. |
The task heads in some experiments (e.g., BEVFormer v2 for 3D object detection) are under-optimized.
The architecture/hyperparameters might not be optimal for DCNv4 in some cases. |
deformable convolution, vision backbones, operator optimization, image classification, object detection, image generation |
2401.06191
Report |
TriNeRFLet: A Wavelet Based Multiscale Triplane NeRF Representation |
Rajaei Khatib, Raja Giryes |
In recent years, the neural radiance field (NeRF) model has gained popularity
due to its ability to recover complex 3D scenes. Following its success, many
approaches proposed different NeRF representations in order to further improve
both runtime and performance. One such example is Triplane, in which NeRF is
represented using three 2D feature planes. This enables easily using existing
2D neural networks in this framework, e.g., to generate the three planes.
Despite its advantage, the triplane representation lagged behind in its 3D
recovery quality compared to NeRF solutions. In this work, we propose
TriNeRFLet, a 2D wavelet-based multiscale triplane representation for NeRF,
which closes the 3D recovery performance gap and is competitive with current
state-of-the-art methods. Building upon the triplane framework, we also propose
a novel super-resolution (SR) technique that combines a diffusion model with
TriNeRFLet for improving NeRF resolution. |
This paper introduces TriNeRFLet, a novel NeRF representation based on a 2D wavelet multiscale triplane structure. It also proposes a super-resolution (SR) technique that combines a diffusion model with TriNeRFLet for improving NeRF resolution. |
Triplane, while efficient due to its 2D structure, lagged in 3D recovery quality compared to other methods. TriNeRFLet aims to close this gap. |
TriNeRFLet represents NeRF using multiscale 2D wavelet features, regularizing them to be sparse. It utilizes a coarse-to-fine training strategy. For SR, it leverages the multiscale structure to combine a pre-trained diffusion model with a low-resolution TriNeRFLet to generate high-resolution novel views. |
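A multiscale wavelet plane with a sparsity penalty can be sketched as below. This is a toy illustration that assumes a Haar wavelet and plain NumPy arrays in place of learned coefficients; the summary does not state which wavelet is used, and the names `reconstruct_plane` and `l1_sparsity` are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of a (2n, 2n) array into four (n, n) bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: four (n, n) bands back to a (2n, 2n) array."""
    n = ll.shape[0]
    out = np.empty((2 * n, 2 * n))
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def reconstruct_plane(base, details):
    """Build a fine plane from a coarse base and per-level detail bands (coarse to fine)."""
    plane = base
    for lh, hl, hh in details:
        plane = haar_idwt2(plane, lh, hl, hh)
    return plane

def l1_sparsity(details):
    """L1 penalty that pushes the detail coefficients towards zero."""
    return sum(np.abs(band).sum() for level in details for band in level)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
roundtrip = haar_idwt2(*haar_dwt2(x))
base = rng.normal(size=(4, 4))
details = [tuple(np.zeros((4, 4)) for _ in range(3)),
           tuple(np.zeros((8, 8)) for _ in range(3))]
plane = reconstruct_plane(base, details)
```

Starting from a small base and adding detail bands level by level mirrors a coarse-to-fine schedule: finer levels can stay at zero until they are needed.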
TriNeRFLet closes the performance gap of Triplane, achieving competitive 3D reconstruction quality compared to state-of-the-art methods like INGP and 3D Gaussian Splatting.
The proposed SR technique outperforms other 2D supervised NeRF SR methods in most experiments on the Blender dataset.
For LLFF dataset, TriNeRFLet SR achieves comparable or better results than state-of-the-art methods, demonstrating its effectiveness on real-world scenes. |
Training TriNeRFLet is more time-consuming than INGP due to the wavelet reconstruction step.
The diffusion-based SR model currently used only supports specific upscale factors, requiring padding or cropping for other resolutions. |
nerf, neural radiance field, triplane, wavelet, super-resolution |
2401.06129
Report |
Distilling Vision-Language Models on Millions of Videos |
Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan |
The recent advance in vision-language models is largely attributed to the
abundance of image-text data. We aim to replicate this success for
video-language models, but there simply is not enough human-curated video-text
data available. We thus resort to fine-tuning a video-language model from a
strong image-language baseline with synthesized instructional data. The
resulting video model by video-instruction-tuning (VIIT) is then used to
auto-label millions of videos to generate high-quality captions. We show the
adapted video-language model performs well on a wide range of video-language
benchmarks. For instance, it surpasses the best prior result on open-ended
NExT-QA by 2.8%. Besides, our model generates detailed descriptions for
previously unseen videos, which provide better textual supervision than
existing methods. Experiments show that a video-language dual-encoder model
contrastively trained on these auto-generated captions is 3.8% better than the
strongest baseline that also leverages vision-language models. Our best model
outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video
retrieval by 6%. As a side product, we generate the largest video caption
dataset to date. |
This paper proposes a method for adapting image-based vision-language models (VLMs) to the video domain and uses the adapted VLM to generate high-quality captions for a large-scale video dataset. |
This is important because there is a lack of high-quality, large-scale video-text data which is crucial for training effective video-language models. |
The adaptation is done in two stages: (1) visual adaptation by fine-tuning the visual encoder on video captioning data, and (2) language adaptation by fine-tuning the language model on instruction-following data. The adapted VLM is then used to generate captions for a large-scale web-scraped video dataset. |
The adapted VLM achieves state-of-the-art zero-shot performance on various video-language benchmarks, including video question answering and captioning.
The generated captions are of high quality and lead to significant improvements when used to train a video-language dual-encoder model.
The approach demonstrates a positive scaling effect, with performance increasing as more pseudo-captioned video data is used. |
One limitation is the reliance on existing video-text datasets for adaptation, which are still limited in scale and diversity compared to image-text datasets.
Further improvements might be achieved by exploring alternative methods for generating instruction-following data and by developing more sophisticated techniques for self-training with pseudo-captioned videos. |
video-language models, captioning, instruction tuning, pseudo-labeling, zero-shot learning |
2401.06105
Report |
PALP: Prompt Aligned Personalization of Text-to-Image Models |
Moab Arar, Andrey Voynov, Amir Hertz, Omri Avrahami, Shlomi Fruchter, Yael Pritch, Daniel Cohen-Or, Ariel Shamir |
Content creators often aim to create personalized images using personal
subjects that go beyond the capabilities of conventional text-to-image models.
Additionally, they may want the resulting image to encompass a specific
location, style, ambiance, and more. Existing personalization methods may
compromise personalization ability or the alignment to complex textual prompts.
This trade-off can impede the fulfillment of user prompts and subject fidelity.
We propose a new approach focusing on personalization methods for a
single prompt to address this issue. We term our approach prompt-aligned
personalization. While this may seem restrictive, our method excels in
improving text alignment, enabling the creation of images with complex and
intricate prompts, which may pose a challenge for current techniques. In
particular, our method keeps the personalized model aligned with a target
prompt using an additional score distillation sampling term. We demonstrate the
versatility of our method in multi- and single-shot settings and further show
that it can compose multiple subjects or use inspiration from reference images,
such as artworks. We compare our approach quantitatively and qualitatively with
existing baselines and state-of-the-art techniques. |
The paper introduces PALP, a novel personalization method for text-to-image diffusion models that excels in aligning generated images with complex user prompts. |
Existing personalization methods often struggle to balance subject fidelity and adherence to intricate prompts, limiting their ability to fulfill user demands for creative image generation. |
PALP employs a two-pronged approach: fine-tuning a pre-trained model to learn the subject's unique features and using score distillation sampling to guide the model's noise predictions towards the target prompt. |
PALP demonstrates superior prompt alignment compared to existing methods while preserving high subject fidelity.
The method proves effective in both multi-shot and single-shot settings, enabling personalization even with a single reference image.
PALP allows for multi-subject personalization, enabling the creation of coherent scenes with multiple subjects or artistic compositions inspired by a single artwork. |
The current approach requires personalization for each specific prompt, limiting its real-time applicability.
Future work could explore prompt-aligned adapters for instant personalization on specific prompts or extend the method to excel on subsets of prompts for specialized applications. |
text-to-image synthesis, personalization, prompt alignment, diffusion models, score distillation sampling |
2401.06104
Report |
Transformers are Multi-State RNNs |
Matanel Oren, Michael Hassid, Yossi Adi, Roy Schwartz |
Transformers are considered conceptually different compared to the previous
generation of state-of-the-art NLP models - recurrent neural networks (RNNs).
In this work, we demonstrate that decoder-only transformers can in fact be
conceptualized as infinite multi-state RNNs - an RNN variant with unlimited
hidden state size. We further show that pretrained transformers can be
converted into finite multi-state RNNs by fixing the size of their
hidden state. We observe that several existing transformer cache compression
techniques can be framed as such conversion policies, and introduce a novel
policy, TOVA, which is simpler compared to these policies. Our experiments with
several long range tasks indicate that TOVA outperforms all other baseline
policies, while being nearly on par with the full (infinite) model, and using
in some cases only $\frac{1}{8}$ of the original cache size. Our results
indicate that transformer decoder LLMs often behave in practice as RNNs. They
also lay out the option of mitigating one of their most painful computational
bottlenecks - the size of their cache memory. We publicly release our code at
https://github.com/schwartz-lab-NLP/TOVA. |
This paper redefines decoder-only transformers as infinite multi-state RNNs and proposes a new compression method, TOVA, to convert them into finite multi-state RNNs. |
This work is important because it provides a new perspective on the relationship between transformers and RNNs, and proposes a practical method for reducing the memory footprint of LLMs during inference. |
The authors formally define multi-state RNNs and demonstrate how transformers can be conceptualized as a special case. They then propose TOVA, a compression policy that leverages attention scores to determine which tokens to keep in the multi-state. |
TOVA outperforms other compression policies on language modeling, achieving comparable perplexity to the full model using only 1/8 - 1/4 of the context.
On long-range understanding tasks, TOVA consistently outperforms baselines and achieves near-topline performance with a reduced multi-state size.
For text generation, TOVA enables using smaller multi-state sizes with minimal impact on story quality compared to the full model. |
Evaluating long text generation is computationally expensive and relies on GPT-4 for comparison, which has its own limitations.
The evaluation is focused on English, and the findings might not directly transfer to languages with different word order characteristics. |
transformers, rnns, language models, memory compression, long-range dependencies |
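The summary above says only that TOVA uses attention scores to decide which tokens stay in the multi-state. A minimal sketch of one such policy follows, assuming (for illustration) that when the state is full the single lowest-scoring token is evicted, and that `attn_weights` are head-averaged attention weights from the current query. The function name `tova_step` and its arguments are hypothetical, not taken from the released code.

```python
def tova_step(cache, new_token, attn_weights, max_size):
    """Return the token ids kept after one decoding step.

    cache        -- token ids currently held in the multi-state
    new_token    -- id of the token just processed
    attn_weights -- one attention weight per entry of cache + [new_token]
                    (assumed head-averaged; an illustrative simplification)
    max_size     -- fixed multi-state size
    """
    candidates = cache + [new_token]
    if len(candidates) <= max_size:
        return candidates  # still room: nothing is evicted
    # Evict the candidate that received the least attention.
    drop = min(range(len(candidates)), key=lambda i: attn_weights[i])
    return candidates[:drop] + candidates[drop + 1:]


kept = tova_step([0, 1, 2], 3, [0.40, 0.05, 0.25, 0.30], max_size=3)
print(kept)  # token 1 has the lowest weight and is dropped
```

Because eviction depends only on the current attention weights, a policy of this kind keeps no fixed window and no special treatment of early tokens; whether that matches the authors' exact rule should be checked against their repository.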
2401.06071
Report |
GroundingGPT: Language Enhanced Multi-modal Grounding Model |
Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang |
Multi-modal large language models have demonstrated impressive performance
across various tasks in different modalities. However, existing multi-modal
models primarily emphasize capturing global information within each modality
while neglecting the importance of perceiving local information across
modalities. Consequently, these models lack the ability to effectively
understand the fine-grained details of input data, limiting their performance
in tasks that require a more nuanced understanding. To address this limitation,
there is a compelling need to develop models that enable fine-grained
understanding across multiple modalities, thereby enhancing their applicability
to a wide range of tasks. In this paper, we propose GroundingGPT, a language
enhanced multi-modal grounding model. Beyond capturing global information like
other multi-modal models, our proposed model excels at tasks demanding a
detailed understanding of local information within the input. It demonstrates
precise identification and localization of specific regions in images or
moments in videos. To achieve this objective, we design a diversified dataset
construction pipeline, resulting in a multi-modal, multi-granularity dataset
for model training. The code, dataset, and demo of our model can be found at
https://github.com/lzw-lzw/GroundingGPT. |
This paper proposes GroundingGPT, an end-to-end multi-modal grounding model for fine-grained understanding and grounding tasks across image, video, and audio. |
Existing multi-modal large language models (MLLMs) often prioritize global information, neglecting fine-grained details crucial for grounding tasks. |
The paper uses modality-specific adapters to map features to LLM embedding space, represents coordinates/timestamps textually, and employs a three-stage coarse-to-fine training strategy with a diversified multi-granularity dataset. |
GroundingGPT achieves impressive results in multi-modal grounding tasks like referring expression comprehension and temporal video grounding.
The model maintains or improves multi-modal understanding abilities, excelling in tasks like visual question answering and video question answering.
GroundingGPT effectively suppresses object hallucination, indicating enhanced local detail comprehension. |
The sampling strategy for video and audio might lead to information loss.
Current training predominantly focuses on single-modal inputs, limiting performance on simultaneous multi-modal grounding tasks. |
multi-modal grounding, large language models, fine-grained understanding, coarse-to-fine training, object hallucination |
2401.06035
Report |
RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks |
Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf |
We present a novel unconditional video generative model designed to address
long-term spatial and temporal dependencies. To capture these dependencies, our
approach incorporates a hybrid explicit-implicit tri-plane representation
inspired by 3D-aware generative frameworks developed for three-dimensional
object representation and employs a singular latent code to model an entire
video sequence. Individual video frames are then synthesized from an
intermediate tri-plane representation, which itself is derived from the primary
latent code. This novel strategy reduces computational complexity by a factor
of $2$ as measured in FLOPs. Consequently, our approach facilitates the
efficient and temporally coherent generation of videos. Moreover, our joint
frame modeling approach, in contrast to autoregressive methods, mitigates the
generation of visual artifacts. We further enhance the model's capabilities by
integrating an optical flow-based module within our Generative Adversarial
Network (GAN) based generator architecture, thereby compensating for the
constraints imposed by a smaller generator size. As a result, our model is
capable of synthesizing high-fidelity video clips at a resolution of
$256\times256$ pixels, with durations extending to more than $5$ seconds at a
frame rate of 30 fps. The efficacy and versatility of our approach are
empirically validated through qualitative and quantitative assessments across
three different datasets comprising both synthetic and real video clips. |
Introduces a novel unconditional video generation model using a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks, enabling efficient and temporally coherent video generation. |
Addresses limitations of autoregressive models in unconditional video generation, particularly the accumulation of errors and challenges in capturing long-term spatial and temporal dependencies. |
Adapts the tri-plane representation from 3D object modeling to video data, organizing features into three planar grids aligned with spatial and temporal axes. Employs a StyleGAN-T backbone to generate tri-plane features and incorporates optical flow for explicit motion modeling, enhancing feature consistency over time. Utilizes double discrimination with separate discriminators for individual frames and the entire video to enhance training effectiveness. |
Generates high-fidelity video clips at 256x256 resolution with durations exceeding 5 seconds at 30 fps.
Demonstrates superior performance in capturing long-range spatial and temporal dependencies compared to state-of-the-art GAN-based approaches (StyleGAN-V, MoCoGAN).
Exhibits significant computational efficiency, requiring less than half the FLOPs of other SOTA models for generating a 160-frame video sample. |
Performance heavily reliant on the capacity of the generative backbone network, with limitations observed when using less expansive StyleGAN versions.
Current implementation lacks explicit disentanglement of objects within generated scenes, limiting control over individual elements. |
video generation, tri-plane representation, optical flow, generative adversarial networks (gans), long-term temporal dependencies |
2401.06003
Report |
TRIPS: Trilinear Point Splatting for Real-Time Radiance Field Rendering |
Linus Franke, Darius Rückert, Laura Fink, Marc Stamminger |
Point-based radiance field rendering has demonstrated impressive results for
novel view synthesis, offering a compelling blend of rendering quality and
computational efficiency. However, even the latest approaches in this domain are
not without their shortcomings. 3D Gaussian Splatting [Kerbl and Kopanas et al.
2023] struggles when tasked with rendering highly detailed scenes, due to
blurring and cloudy artifacts. On the other hand, ADOP [Rückert et al. 2022]
can accommodate crisper images, but its neural reconstruction network decreases
performance, it grapples with temporal instability, and it is unable to
effectively address large gaps in the point cloud.
In this paper, we present TRIPS (Trilinear Point Splatting), an approach that
combines ideas from both Gaussian Splatting and ADOP. The fundamental concept
behind our novel technique involves rasterizing points into a screen-space
image pyramid, with the selection of the pyramid layer determined by the
projected point size. This approach allows rendering arbitrarily large points
using a single trilinear write. A lightweight neural network is then used to
reconstruct a hole-free image including detail beyond splat resolution.
Importantly, our render pipeline is entirely differentiable, allowing for
automatic optimization of both point sizes and positions.
Our evaluation demonstrates that TRIPS surpasses existing state-of-the-art
methods in terms of rendering quality while maintaining a real-time frame rate
of 60 frames per second on readily available hardware. This performance extends
to challenging scenarios, such as scenes featuring intricate geometry,
expansive landscapes, and auto-exposed footage.
The project page is located at: https://lfranke.github.io/trips/ |
Presents TRIPS, a novel point-based radiance field rendering method that uses trilinear splatting into an image pyramid to achieve real-time performance and high visual quality. |
Existing point-based radiance field rendering methods either struggle with details (3D Gaussian Splatting) or temporal instability and hole filling (ADOP). |
Points are splatted trilinearly into an image pyramid based on projected size. A lightweight neural network then reconstructs a hole-free, detailed image from the pyramid. |
TRIPS achieves superior visual quality compared to 3D Gaussian Splatting, particularly in detail rendering.
Outperforms ADOP in filling large gaps and maintaining temporal consistency.
Maintains real-time rendering (60 FPS) on a single RTX 4090, even with large point clouds. |
Requires a dense initial point cloud reconstruction.
Lacks anisotropic splatting, leading to potential artifacts with thin structures. |
neural rendering, point-based rendering, radiance fields, novel view synthesis, real-time rendering |
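The trilinear write described above (pyramid layer chosen from the projected point size, then a single write) can be sketched as computing eight weights: a linear blend between two adjacent pyramid layers, times a bilinear 2x2 footprint inside each. The mapping `level = log2(size)`, the pixel-centre convention, and the name `trilinear_weights` are assumptions made for illustration; the source does not specify them.

```python
import math


def trilinear_weights(x, y, size, num_layers):
    """Return {(layer, px, py): weight} for one projected point.

    x, y       -- screen-space position in full-resolution pixels
    size       -- projected point size in pixels (assumed >= 1 maps to layer 0+)
    num_layers -- number of pyramid layers; layer k has 1/2**k resolution
    """
    # Continuous pyramid level, clamped to the available layers (assumed mapping).
    level = min(max(math.log2(max(size, 1.0)), 0.0), num_layers - 1)
    lo = int(math.floor(level))
    hi = min(lo + 1, num_layers - 1)
    t = level - lo  # blend factor between the two layers

    weights = {}
    for layer, w_layer in ((lo, 1.0 - t), (hi, t)):
        if w_layer == 0.0:
            continue
        scale = 2 ** layer
        # Position in this layer's pixel grid, with pixel centres at +0.5.
        fx, fy = x / scale - 0.5, y / scale - 0.5
        x0, y0 = math.floor(fx), math.floor(fy)
        ax, ay = fx - x0, fy - y0
        for dx, wx in ((0, 1.0 - ax), (1, ax)):
            for dy, wy in ((0, 1.0 - ay), (1, ay)):
                key = (layer, x0 + dx, y0 + dy)
                weights[key] = weights.get(key, 0.0) + w_layer * wx * wy
    return weights


w = trilinear_weights(10.3, 7.8, size=3.0, num_layers=4)
print(len(w), round(sum(w.values()), 6))
```

The sketch does no bounds checking, so pixel indices near the image border may fall outside the layer. The weights always sum to one, which is why a large point costs the same number of writes as a small one.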
2401.05925
Report |
CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion |
Bin Dou, Tianyu Zhang, Yongjia Ma, Zhaohui Wang, Zejian Yuan |
We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a
method for compact 3D-consistent scene segmentation at fast rendering speed
with only RGB images input. Previous NeRF-based segmentation methods have
relied on time-consuming neural scene optimization. While recent 3D Gaussian
Splatting has notably improved speed, existing Gaussian-based segmentation
methods struggle to produce compact masks, especially in zero-shot
segmentation. This issue probably stems from their straightforward assignment
of learnable parameters to each Gaussian, resulting in a lack of robustness
against cross-view inconsistent 2D machine-generated labels. Our method aims to
address this problem by employing Dual Feature Fusion Network as Gaussians'
segmentation field. Specifically, we first optimize 3D Gaussians under RGB
supervision. After Gaussian Locating, DINO features extracted from images are
applied through explicit unprojection, which are further incorporated with
spatial features from the efficient point cloud processing network. Feature
aggregation is utilized to fuse them in a global-to-local strategy for compact
segmentation features. Experimental results show that our model outperforms
baselines on both semantic and panoptic zero-shot segmentation tasks, while
consuming less than 10% of the inference time of NeRF-based methods. Code and
more results will be available at https://David-Dou.github.io/CoSSegGaussians |
This paper proposes CoSSegGaussians, a method for achieving compact and fast 3D scene segmentation using only RGB images as input. |
Existing Gaussian-based scene segmentation methods struggle to produce compact masks, especially in zero-shot scenarios due to inconsistencies in 2D machine-generated labels. |
The method leverages 3D Gaussian Splatting for scene representation and employs a Dual Feature Fusion Network. It unprojects multi-scale DINO features onto 3D Gaussians and combines them with spatial features extracted using RandLA-Net. A global-to-local aggregation module then generates compact segmentation logits. |
CoSSegGaussians outperforms baselines on both semantic and panoptic zero-shot segmentation tasks.
It achieves significantly faster rendering speed than NeRF-based segmentation methods.
The method produces more compact segmentation masks compared to previous Gaussian-based methods. |
High GPU occupancy during training due to the large number of Gaussian points.
Lack of explicit structural constraints during training. |
scene segmentation, zero-shot learning, 3d gaussian splatting, dino features, spatial feature aggregation |
2401.05907
Report |
Efficient Image Deblurring Networks based on Diffusion Models |
Kang Chen, Yuanjie Liu |
This article introduces a sliding window model for defocus deblurring that
achieves the best performance to date with extremely low memory usage. Named
Swintormer, the method utilizes a diffusion model to generate latent prior
features that assist in restoring more detailed images. It also extends the
sliding window strategy to specialized Transformer blocks for efficient
inference. Additionally, we have further optimized Multiply-Accumulate
operations (MACs). Compared to the currently top-performing GRL method, our
Swintormer model drastically reduces computational complexity from 140.35 GMACs
to 8.02 GMACs, while also improving the Signal-to-Noise Ratio (SNR) for defocus
deblurring from 27.04 dB to 27.07 dB. This new method allows for the processing
of higher resolution images on devices with limited memory, significantly
expanding potential application scenarios. The article concludes with an
ablation study that provides an in-depth analysis of the impact of each network
module on final performance. The source code and model will be available at the
following website: https://github.com/bnm6900030/swintormer. |
This paper introduces Swintormer, a sliding window Transformer model for image deblurring that integrates a diffusion model to generate latent prior features, improving deblurring quality with low memory usage. |
Existing supervised image deblurring methods require large labeled datasets and often lack generalization ability. This paper leverages the power of pre-trained diffusion models to address these limitations. |
The method employs a pre-trained diffusion model fine-tuned for the deblurring task to generate latent image features. These features are then used as input along with the blurry image to train a memory-efficient sliding window Transformer model. |
Swintormer achieves state-of-the-art performance on defocus deblurring benchmarks like DPDD, outperforming previous methods in PSNR and LPIPS.
The use of latent features from the diffusion model significantly improves deblurring quality, particularly in challenging outdoor scenes.
The proposed sliding window approach with shifted windows and mixed attention mechanism allows for efficient inference on high-resolution images with low computational complexity (MACs). |
The paper focuses primarily on defocus and motion deblurring; further exploration is needed for other deblurring types.
Future work will investigate more efficient architectures and training strategies for the diffusion model to further reduce computational cost. |
image deblurring, diffusion models, transformer, sliding window, low memory |
2401.05750
Report |
GO-NeRF: Generating Virtual Objects in Neural Radiance Fields |
Peng Dai, Feitong Tan, Xin Yu, Yinda Zhang, Xiaojuan Qi |
Despite advances in 3D generation, the direct creation of 3D objects within
an existing 3D scene represented as NeRF remains underexplored. This process
requires not only high-quality 3D object generation but also seamless
composition of the generated 3D content into the existing NeRF. To this end, we
propose a new method, GO-NeRF, capable of utilizing scene context for
high-quality and harmonious 3D object generation within an existing NeRF. Our
method employs a compositional rendering formulation that allows the generated
3D objects to be seamlessly composited into the scene utilizing learned
3D-aware opacity maps without introducing unintended scene modification.
Moreover, we also develop tailored optimization objectives and training
strategies to enhance the model's ability to exploit scene context and mitigate
artifacts, such as floaters, originating from 3D object generation within a
scene. Extensive experiments on both feed-forward and $360^\circ$ scenes show the
superior performance of our proposed GO-NeRF in generating objects harmoniously
composited with surrounding scenes and synthesizing high-quality novel view
images. Project page at https://daipengwa.github.io/GO-NeRF/. |
Presents GO-NeRF, a novel pipeline that generates context-aware 3D virtual objects from text prompts and seamlessly integrates them into pre-trained NeRF scenes. |
Enables novel scene creation and editing by harmoniously compositing generated 3D objects into existing environments, enhancing immersion in applications like VR. |
Uses a compositional rendering formulation with a separate object NeRF and 3D-aware opacity maps for seamless composition. Employs context-aware learning objectives, including inpainting priors and saturation regularization, for high-quality, scene-consistent object generation. |
Generates high-quality, context-aware 3D objects within existing scenes, as demonstrated on feed-forward and 360° datasets.
Preserves unchanged scene content beyond the designated editing region, ensuring minimal unintended modifications.
Maintains compatibility with various NeRF representations, allowing for flexible integration with existing scene models. |
The method's ability to modify regions outside the user-specified 3D bounding box, such as reflections, is limited.
Reliance on SDS loss may introduce limitations inherent to that technique, such as the Janus problem. |
neural radiance fields, 3d object generation, scene editing, compositional rendering, text-to-3d |
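The role of the opacity map in the compositional rendering described above can be illustrated with a plain per-pixel alpha blend. This is a 2D simplification assumed for illustration only: the paper's formulation is 3D-aware and is not reproduced here, and the name `composite` is hypothetical.

```python
def composite(scene_rgb, object_rgb, opacity):
    """Blend per pixel: opacity * object + (1 - opacity) * scene.

    scene_rgb, object_rgb -- lists of (r, g, b) tuples with values in [0, 1]
    opacity               -- list of per-pixel object opacities in [0, 1]
    """
    return [
        tuple(a * o + (1.0 - a) * s for o, s in zip(obj_px, scene_px))
        for scene_px, obj_px, a in zip(scene_rgb, object_rgb, opacity)
    ]


scene = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
obj = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(composite(scene, obj, [0.0, 1.0]))
```

Where the opacity is zero the scene pixel passes through unchanged, which is the property that keeps content outside the edited region untouched.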
2401.05735
Report |
Object-Centric Diffusion for Efficient Video Editing |
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian |
Diffusion-based video editing methods have reached impressive quality and can
transform the global style, local structure, and attributes of given
video inputs, following textual edit prompts. However, such solutions typically
incur heavy memory and computational costs to generate temporally-coherent
frames, in the form of diffusion inversion and/or cross-frame attention.
In this paper, we conduct an analysis of such inefficiencies, and suggest
simple yet effective modifications that allow significant speed-ups whilst
maintaining quality. Moreover, we introduce Object-Centric Diffusion, coined as
OCD, to further reduce latency by allocating computations more towards
foreground edited regions that are arguably more important for perceptual
quality. We achieve this by two novel proposals: i) Object-Centric Sampling,
decoupling the diffusion steps spent on salient regions or background,
allocating most of the model capacity to the former, and ii) Object-Centric 3D
Token Merging, which reduces cost of cross-frame attention by fusing redundant
tokens in unimportant background regions. Both techniques are readily
applicable to a given video editing model without retraining, and can
drastically reduce its memory and computational cost. We evaluate our proposals
on inversion-based and control-signal-based editing pipelines, and show a
latency reduction up to 10x for a comparable synthesis quality. |
This paper introduces Object-Centric Diffusion (OCD), a set of techniques for speeding up diffusion-based video editing by focusing computation on edited foreground objects. |
Diffusion-based video editing models are computationally expensive, especially those using inversion or cross-frame attention for temporal consistency. |
The authors propose (1) Object-Centric Sampling: separating diffusion sampling for foreground and background, with fewer steps on the latter, and (2) Object-Centric 3D Token Merging: reducing cross-frame attention tokens by fusing redundant ones predominantly in background regions. |
OCD speeds up inversion-based editing by 10x and ControlNet-based editing by 6x without sacrificing fidelity.
Object-Centric Sampling is especially effective for smaller objects, achieving up to 2x additional speed-up.
OCD reduces memory consumption for attention maps by 17x. |
OCD is less effective for global video editing tasks.
Hyperparameter tuning is still required per-sequence. |
video editing, diffusion models, efficiency, object-centric, token merging |
2401.05675
Report |
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation |
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang |
Recent works demonstrate that using reinforcement learning (RL) with quality
rewards can enhance the quality of generated images in text-to-image (T2I)
generation. However, a simple aggregation of multiple rewards may cause
over-optimization in certain metrics and degradation in others, and it is
challenging to manually find the optimal weights. An effective strategy to
jointly optimize multiple rewards in RL for T2I generation is highly desirable.
This paper introduces Parrot, a novel multi-reward RL framework for T2I
generation. Through the use of the batch-wise Pareto optimal selection, Parrot
automatically identifies the optimal trade-off among different rewards during
the RL optimization of the T2I generation. Additionally, Parrot employs a joint
optimization approach for the T2I model and the prompt expansion network,
facilitating the generation of quality-aware text prompts, thus further
enhancing the final image quality. To counteract the potential catastrophic
forgetting of the original user prompt due to prompt expansion, we introduce
original prompt centered guidance at inference time, ensuring that the
generated image remains faithful to the user input. Extensive experiments and a
user study demonstrate that Parrot outperforms several baseline methods across
various quality criteria, including aesthetics, human preference, image
sentiment, and text-image alignment. |
Presents Parrot, a novel framework for improving text-to-image generation using multi-reward reinforcement learning, enabling joint optimization of image quality and prompt expansion. |
Existing text-to-image models struggle to consistently produce high-quality images across various aspects, such as aesthetics, alignment with user input, and emotional impact. Manually balancing these factors is challenging, necessitating a more efficient optimization approach. |
Parrot employs batch-wise Pareto-optimal selection to identify and leverage samples that achieve the best trade-offs between multiple quality rewards, enabling simultaneous optimization across aesthetics, human preference, image sentiment, and text-image alignment. It also jointly optimizes the text-to-image model and a prompt expansion network for generating quality-aware prompts. To preserve faithfulness to the original user prompt, it incorporates original prompt-centered guidance during inference. |
Parrot consistently outperforms baseline methods in generating images with improved aesthetics, human preference alignment, image sentiment, and text-image alignment.
Joint optimization of the prompt expansion network and the text-to-image model proves superior to individually fine-tuning either component.
Original prompt-centered guidance effectively mitigates the risk of catastrophic forgetting, ensuring generated images remain faithful to the initial user input while incorporating expanded details. |
The effectiveness of Parrot relies on the quality and comprehensiveness of the image quality metrics used.
Further exploration of additional rewards and improved quality metrics can enhance Parrot's capabilities. |
text-to-image generation, reinforcement learning, multi-objective optimization, prompt expansion, image quality assessment |
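The batch-wise Pareto-optimal selection mentioned above amounts to keeping the non-dominated samples in each batch. A minimal sketch follows, assuming every reward is "higher is better" and that each sample carries one score per reward. The function names are illustrative, and how Parrot then uses the selected samples in its RL update is not shown.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards):
    """Return indices of samples not dominated by any other sample in the batch."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]


# Each tuple is (aesthetics, text-image alignment) for one generated image;
# the numbers are made up for the example.
batch = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
print(pareto_front(batch))  # sample 2 is dominated by sample 1
```

Selecting by dominance avoids choosing reward weights by hand: a sample survives if it is the best available trade-off in at least one direction, whatever the relative scale of the rewards.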
2401.05583
Report |
Diffusion Priors for Dynamic View Synthesis from Monocular Videos |
Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, Sergey Tulyakov |
Dynamic novel view synthesis aims to capture the temporal evolution of visual
content within videos. Existing methods struggle to distinguish between
motion and structure, particularly in scenarios where camera poses are either
unknown or constrained compared to object motion. Furthermore, with information
solely from reference images, it is extremely challenging to hallucinate unseen
regions that are occluded or partially observed in the given videos. To address
these issues, we first finetune a pretrained RGB-D diffusion model on the video
frames using a customization technique. Subsequently, we distill the knowledge
from the finetuned model to a 4D representation encompassing both dynamic and
static Neural Radiance Fields (NeRF) components. The proposed pipeline achieves
geometric consistency while preserving the scene identity. We perform thorough
experiments to evaluate the efficacy of the proposed method qualitatively and
quantitatively. Our results demonstrate the robustness and utility of our
approach in challenging cases, further advancing dynamic novel view synthesis. |
This paper presents a novel method for dynamic novel view synthesis from monocular videos, leveraging the power of 2D diffusion priors to address the limitations of hand-crafted priors used in previous works. |
Existing methods for dynamic novel view synthesis struggle to handle self-occlusions, out-of-view details, and complex motions, especially when relying solely on information from reference views. This work explores the use of 2D diffusion priors to overcome these limitations and improve the quality of dynamic scene reconstruction. |
The proposed method represents a 4D scene using separate NeRFs for static and dynamic regions. It employs a combination of reconstruction losses on existing views and an SDS loss with RGB-D diffusion priors for novel views. Additionally, Dreambooth fine-tuning is used to personalize the diffusion model and preserve scene identity. |
The method generates visually superior novel views compared to existing state-of-the-art methods, particularly in handling complex object motions and hallucinating unseen regions.
Quantitative evaluation on the iPhone dataset demonstrates competitive performance in terms of mLPIPS and mSSIM scores, although these metrics are found to not fully capture the perceived visual quality.
User studies confirm that the proposed method produces more realistic and visually pleasing results compared to baselines, highlighting the benefits of using 2D diffusion priors. |
The method is computationally expensive, requiring high-end GPUs for training and limiting the achievable rendering resolution. Future work could explore more efficient representations and lighter diffusion models.
Temporal smoothness relies on the multi-level design of instant-NGP, which might be insufficient for complex scenarios. Exploring stronger video diffusion models for temporal consistency is an area for future research. |
novel view synthesis, dynamic scene reconstruction, diffusion models, neural radiance fields (nerf), dreambooth |
2401.05516
Report |
FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields |
GeonU Kim, Kim Youwang, Tae-Hyun Oh |
We present FPRF, a feed-forward photorealistic style transfer method for
large-scale 3D neural radiance fields. FPRF stylizes large-scale 3D scenes with
arbitrary, multiple style reference images without additional optimization
while preserving multi-view appearance consistency. Prior art required tedious
per-style/-scene optimization and were limited to small-scale 3D scenes. FPRF
efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D
neural radiance field, which inherits AdaIN's feed-forward stylization
machinery, supporting arbitrary style reference images. Furthermore, FPRF
supports multi-reference stylization with the semantic correspondence matching
and local AdaIN, which adds diverse user control for 3D scene styles. FPRF also
preserves multi-view consistency by applying semantic matching and style
transfer processes directly onto queried features in 3D space. In experiments,
we demonstrate that FPRF achieves favorable photorealistic quality 3D scene
stylization for large-scale scenes with diverse reference images. Project page:
https://kim-geonu.github.io/FPRF/ |
Presents FPRF, a feed-forward photorealistic style transfer method for large-scale 3D neural radiance fields using adaptive instance normalization (AdaIN). |
Existing 3D scene style transfer methods are computationally expensive, requiring per-style or per-scene optimization, and don't scale well to large scenes. |
Trains a stylizable radiance field comprised of a scene content field for geometry and appearance and a scene semantic field for local style matching. Employs a pre-trained MLP color decoder for generalization. Uses a style dictionary of clustered style reference images for efficient semantic matching and local AdaIN style transfer. |
Achieves multi-view consistent style transfer on large-scale scenes, unlike 2D methods.
Outperforms competing 3D style transfer methods on small-scale scenes in terms of quality and efficiency.
Successfully transfers styles from multiple reference images based on semantic correspondence with the scene. |
Semantic matching performance is limited by the DINO semantic encoder.
Future work includes exploring more advanced semantic encoders. |
style transfer, neural radiance fields, 3d scene stylization, adaptive instance normalization, semantic matching |
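The AdaIN operation that FPRF inherits can be sketched in a few lines. This is a minimal NumPy illustration of standard adaptive instance normalization on generic feature arrays, not the authors' implementation; the `(C, N)` layout, the function name `adain`, and the random inputs are assumptions made for the example.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization on (C, N) feature arrays.

    Each row is one feature channel; columns are spatial (or 3D point) samples.
    The content features are normalized per channel, then rescaled and shifted
    to match the per-channel statistics of the style features.
    """
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True) + eps  # eps avoids division by zero
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical inputs: 4 channels, different numbers of samples per source.
rng = np.random.default_rng(0)
content_feats = rng.normal(0.0, 1.0, size=(4, 256))
style_feats = rng.normal(3.0, 2.0, size=(4, 128))
stylized = adain(content_feats, style_feats)
```

After the call, each channel of `stylized` carries the mean and (up to `eps`) the standard deviation of the corresponding style channel, while the relative structure of the content features is preserved. The summary's "local AdaIN" would apply the same operation per semantically matched region rather than globally.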
2401.05293
Report |
Score Distillation Sampling with Learned Manifold Corrective |
Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu |
Score Distillation Sampling (SDS) is a recent but already widely popular
method that relies on an image diffusion model to control optimization problems
using text prompts. In this paper, we conduct an in-depth analysis of the SDS
loss function, identify an inherent problem with its formulation, and propose a
surprisingly easy but effective fix. Specifically, we decompose the loss into
different factors and isolate the component responsible for noisy gradients. In
the original formulation, high text guidance is used to account for the noise,
leading to unwanted side effects. Instead, we train a shallow network mimicking
the timestep-dependent denoising deficiency of the image diffusion model in
order to effectively factor it out. We demonstrate the versatility and the
effectiveness of our novel loss formulation through several qualitative and
quantitative experiments, including optimization-based image synthesis and
editing, zero-shot image translation network training, and text-to-3D
synthesis. |
This paper provides an in-depth analysis of the Score Distillation Sampling (SDS) loss function, identifying a noise issue and proposing a solution called LMC-SDS (Score Distillation Sampling with Learned Manifold Corrective) to provide better gradients for improved image fidelity. |
SDS, while popular for controlling optimization problems with text prompts, suffers from issues like blurry results at low guidance and artifacts at high guidance. This limits its effectiveness and applicability. |
The authors decompose the SDS loss, pinpoint the component responsible for noisy gradients, and introduce LMC-SDS. This involves training a shallow network to approximate the diffusion model's denoising deficiencies and using it to achieve cleaner gradients. |
LMC-SDS generates higher quality images with balanced colors compared to the original SDS, especially at lower guidance levels.
It excels in optimization-based image editing, preserving image structure while effectively aligning with target prompts.
LMC-SDS proves beneficial in training image-to-image translation networks and enhancing text-to-3D models like DreamFusion. |
LMC-SDS relies on the diffusion model's understanding of prompts, limiting its effectiveness for ambiguous prompts.
The method may struggle with optimization states that deviate significantly from the natural image manifold. |
image diffusion models, score distillation sampling, image synthesis, image editing, text-to-3d synthesis |
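For context, the standard SDS gradient that this analysis starts from can be written as below. This is the widely used formulation from the original SDS work with one common convention for classifier-free guidance, not the LMC-SDS loss itself, whose exact form is not given in this summary. The symbols are assumptions of the sketch: $\theta$ are the optimized parameters, $x$ the rendered image, $z_t$ its noised version at timestep $t$, $\epsilon$ the injected noise, $y$ the text prompt, $w(t)$ a timestep weighting, and $\omega$ the guidance scale.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi^{\,\omega}(z_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
\hat{\epsilon}_\phi^{\,\omega}(z_t; y, t)
  = \hat{\epsilon}_\phi(z_t; t)
  + \omega\,\bigl(\hat{\epsilon}_\phi(z_t; y, t) - \hat{\epsilon}_\phi(z_t; t)\bigr)
```

The residual $\hat{\epsilon}_\phi^{\,\omega} - \epsilon$ mixes a text-driven direction with the denoiser's own prediction error. The summary above attributes the noisy gradients to one such component, which the original formulation compensates for with a large $\omega$.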
2401.05252
Report |
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models |
Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li |
This technical report introduces PIXART-δ, a text-to-image synthesis
framework that integrates the Latent Consistency Model (LCM) and ControlNet
into the advanced PIXART-α model. PIXART-α is recognized for its
ability to generate high-quality images of 1024px resolution through a
remarkably efficient training process. The integration of LCM in
PIXART-δ significantly accelerates the inference speed, enabling the
production of high-quality images in just 2-4 steps. Notably, PIXART-δ
achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images,
marking a 7x improvement over the PIXART-α. Additionally,
PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs
within a single day. With its 8-bit inference capability (von Platen et al.,
2023), PIXART-δ can synthesize 1024px images within 8GB GPU memory
constraints, greatly enhancing its usability and accessibility. Furthermore,
incorporating a ControlNet-like module enables fine-grained control over
text-to-image diffusion models. We introduce a novel ControlNet-Transformer
architecture, specifically tailored for Transformers, achieving explicit
controllability alongside high-quality image generation. As a state-of-the-art,
open-source image generation model, PIXART-δ offers a promising
alternative to the Stable Diffusion family of models, contributing
significantly to text-to-image synthesis. |
PIXART-δ is a novel text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and a novel ControlNet-Transformer architecture into the PIXART-α model to achieve accelerated inference speed and fine-grained controllability. |
This integration enables the generation of high-quality, controllable images at a 1024px resolution in a mere 0.5 seconds, a significant improvement over existing methods. |
The authors incorporate LCM into PIXART-δ for faster inference and propose a ControlNet-Transformer architecture tailored for the PIXART-α model to enhance controllability over the generated images. The model is trained on a 120K internal image-text dataset and ablations are conducted on the network architecture and training hyperparameters. |
PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024 × 1024 pixel images, marking a 7× improvement over PIXART-α.
The ControlNet-Transformer architecture effectively controls the generation process while maintaining high image quality.
The model can be trained on 32GB V100 GPUs within a single day and performs inference with 8-bit precision using only 8GB of GPU memory. |
The ControlNet module is only explored with HED edge maps as a conditioning input.
Exploration of larger batch sizes and alternative sampling methods is left for future work. |
text-to-image synthesis, latent consistency model, controlnet, transformer, fast inference |
2401.05224
Report |
Do Vision and Language Encoders Represent the World Similarly? |
Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor |
Aligned text-image encoders such as CLIP have become the de facto model for
vision-language tasks. Furthermore, modality-specific encoders achieve
impressive performances in their respective domains. This raises a central
question: does an alignment exist between uni-modal vision and language
encoders since they fundamentally represent the same physical world? Analyzing
the latent spaces structure of vision and language models on image-caption
benchmarks using the Centered Kernel Alignment (CKA), we find that the
representation spaces of unaligned and aligned encoders are semantically
similar. In the absence of statistical similarity in aligned encoders like
CLIP, we show that a possible matching of unaligned encoders exists without any
training. We frame this as a seeded graph-matching problem exploiting the
semantic similarity between graphs and propose two methods - a Fast Quadratic
Assignment Problem optimization, and a novel localized CKA metric-based
matching/retrieval. We demonstrate the effectiveness of this on several
downstream tasks including cross-lingual, cross-domain caption matching and
image classification. Code available at github.com/mayug/0-shot-llm-vision. |
This paper investigates the inherent alignment between unaligned vision and language encoders and leverages it for downstream cross-modal tasks in a training-free manner using Centered Kernel Alignment (CKA). |
Aligned text-image encoders are the standard for vision-language tasks, but require extensive training. This work explores whether this training is necessary due to the inherent alignment stemming from both modalities representing the same world. |
The authors leverage CKA similarity between vision and language encoders and propose two methods: 1) Quadratic Assignment Problem (QAP) optimization to maximize CKA for matching, and 2) A novel localized CKA metric for retrieval tasks. |
Unaligned encoders exhibit surprisingly high semantic similarity, comparable to aligned encoders, especially when trained on large, diverse datasets.
The proposed QAP matching and localized CKA retrieval methods outperform baseline methods like linear regression and relative representations on cross-domain and cross-lingual caption matching/retrieval tasks.
The methods are shown to be effective for image classification on ImageNet-100 and cross-lingual image retrieval using multilingual sentence transformers. |
The computational complexity of QAP matching and local CKA retrieval is higher than baseline methods, although the authors propose potential optimizations.
The study primarily focuses on global image-caption alignment and could be extended to explore finer-grained alignments. |
vision-language models, zero-shot learning, centered kernel alignment, cross-modal retrieval, cross-lingual retrieval |
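Centered Kernel Alignment, the similarity measure this record relies on, can be illustrated with its linear variant. This is a minimal NumPy sketch under stated assumptions: the paper may use a different kernel, and the names `linear_cka`, `image_emb`, and `caption_emb` as well as the synthetic embeddings are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) representation matrices.

    Rows of x and y must describe the same samples in the same order,
    e.g. an image and its caption. The two dims may differ.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 64))

# A rotated and rescaled copy stands in for a second encoder that
# represents the same samples in a different coordinate system.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
caption_emb = 2.5 * image_emb @ q

# Shuffling the rows breaks the image-caption correspondence.
shuffled = caption_emb[rng.permutation(200)]

cka_matched = linear_cka(image_emb, caption_emb)
cka_shuffled = linear_cka(image_emb, shuffled)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `cka_matched` is close to 1 while `cka_shuffled` is much lower. That gap is what makes CKA usable as an objective for matching: a permutation of the captions that maximizes CKA is a candidate for the correct image-caption assignment.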
2401.05097
Report |
Any-Way Meta Learning |
Junhoo Lee, Yearim Kim, Hyunho Lee, Nojun Kwak |
Although meta-learning shows promising performance in the realm of rapid
adaptability, it is constrained by fixed cardinality. When faced with tasks of
varying cardinalities that were unseen during training, the model loses this
ability. In this paper, we address and resolve this challenge by harnessing
the "label equivalence" that emerges from stochastic numeric label assignments during
episodic task sampling. Questioning what defines "true" meta-learning, we
introduce the "any-way" learning paradigm, an innovative model training
approach that liberates the model from fixed cardinality constraints. Surprisingly,
this model not only matches but often outperforms traditional fixed-way models
in terms of performance, convergence speed, and stability. This disrupts
established notions about domain generalization. Furthermore, we argue that the
inherent label equivalence naturally lacks semantic information. To bridge this
semantic information gap arising from label equivalence, we further propose a
mechanism for infusing semantic class information into the model. This would
enhance the model's comprehension and functionality. Experiments conducted on
renowned architectures like MAML and ProtoNet affirm the effectiveness of our
method. |
This paper proposes "any-way" few-shot learning, overcoming the limitation of fixed cardinality in conventional meta-learning by leveraging "label equivalence". |
Current meta-learning models struggle to adapt to new tasks with varying numbers of classes (different "ways"), hindering their application in real-world scenarios requiring flexibility. |
The authors utilize the concept of "label equivalence" arising from stochastic numeric label assignments during episodic task sampling. They propose an "any-way" learning method, allowing the model to handle tasks with any cardinality. They further introduce a mechanism to inject semantic class information into the model. |
The proposed "any-way" meta-learning model matches or outperforms traditional fixed-way models in performance, convergence speed, and stability.
The model exhibits strong domain generalization capabilities, adapting well to unseen task cardinalities and datasets.
Injecting semantic class information improves performance, particularly for fine-grained datasets, and allows incorporating techniques like Mixup from supervised learning. |
Performance degradation can occur when applying Mixup in "same" scenarios due to the trade-off between generality and specificity.
Further research on advanced algorithms to exploit label equivalence and enhance ensemble techniques is needed. |
meta-learning, few-shot learning, label equivalence, domain generalization, semantic class information |
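The stochastic numeric label assignment behind "label equivalence" can be illustrated with a small episode sampler. This is a sketch using only the Python standard library. The function `sample_episode`, the toy dataset, and the choice to draw the cardinality uniformly per episode are assumptions for illustration, not the authors' training procedure.

```python
import random

def sample_episode(class_to_items, n_way, k_shot, rng):
    """Sample an N-way K-shot episode with randomly assigned numeric labels."""
    # rng.sample returns the chosen classes in random order, so the numeric
    # label a class receives (its position) changes from episode to episode.
    classes = rng.sample(sorted(class_to_items), n_way)
    episode = []
    for label, cls in enumerate(classes):
        for item in rng.sample(class_to_items[cls], k_shot):
            episode.append((item, label, cls))
    return episode

rng = random.Random(0)
names = ["cat", "dog", "fox", "owl", "bee", "ant"]
dataset = {name: [f"{name}_{i}" for i in range(10)] for name in names}

# Assumed "any-way" setup: the cardinality itself varies between episodes.
episodes = []
for _ in range(5):
    n_way = rng.randint(2, 6)
    episodes.append(sample_episode(dataset, n_way, k_shot=2, rng=rng))
```

Because a class such as `cat` may be label 0 in one episode and label 3 in the next, the numeric labels carry no class semantics. This is the gap the record's semantic-information mechanism is meant to address.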
2401.05011
Report |
Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection |
Yucheng Han, Na Zhao, Weiling Chen, Keng Teck Ma, Hanwang Zhang |
Semi-supervised 3D object detection is a promising yet under-explored
direction to reduce data annotation costs, especially for cluttered indoor
scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this
task by utilizing a teacher model to generate pseudo-labels for unlabeled
samples. However, the availability of unlabeled samples in the 3D domain is
relatively limited compared to its 2D counterpart due to the greater effort
required to collect 3D data. Moreover, the loose consistency regularization in
SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to
either low-quality supervision or a limited amount of pseudo labels. To address
these issues, we present a novel Dual-Perspective Knowledge Enrichment approach
named DPKE for semi-supervised 3D object detection. Our DPKE enriches the
knowledge of limited training data, particularly unlabeled data, from two
perspectives: data-perspective and feature-perspective. Specifically, from the
data-perspective, we propose a class-probabilistic data augmentation method
that augments the input data with additional instances based on the varying
distribution of class probabilities. Our DPKE achieves feature-perspective
knowledge enrichment by designing a geometry-aware feature matching method that
regularizes feature-level similarity between object proposals from the student
and teacher models. Extensive experiments on the two benchmark datasets
demonstrate that our DPKE achieves superior performance over existing
state-of-the-art approaches under various label ratio conditions. The source
code will be made available to the public. |
This paper proposes DPKE, a Dual-Perspective Knowledge Enrichment approach for semi-supervised 3D object detection, addressing the challenges of limited data diversity and effective pseudo-label utilization in cluttered indoor scenes. |
Annotating 3D data, especially for cluttered indoor scenes, is expensive and time-consuming. Semi-supervised methods aim to alleviate this by learning from both labeled and unlabeled data. However, existing methods suffer from limited data diversity and low-quality or low-recall pseudo labels. |
DPKE enriches knowledge from two perspectives: 1) **Data-perspective:** employs a class-probabilistic data augmentation method to diversify training data by inserting instances from a proposal bank into scenes based on class probabilities. 2) **Feature-perspective:** utilizes a geometry-aware feature matching method to regularize feature-level similarity between student and teacher model proposals, focusing on potential foreground proposals based on geometry similarity. |
DPKE achieves superior performance over existing state-of-the-art methods on ScanNet and SUN RGB-D datasets under various label ratios.
Class-probabilistic data augmentation effectively handles limited diversity by increasing the presence of less-learned categories.
Geometry-aware feature matching improves pseudo-label recall by leveraging feature-level similarity with geometry constraints. |
The improvement on SUN RGB-D is less significant than ScanNet potentially due to lower point cloud quality and ground truth proposal accuracy.
Future work could explore alternative data augmentation techniques or other feature matching strategies for further performance improvement. |
semi-supervised learning, 3d object detection, data augmentation, feature matching, knowledge distillation |
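The data-perspective idea of inserting more instances of less-learned classes can be sketched as weighted sampling from a proposal bank. The summary does not give the weighting formula, so the rule used here (one minus the mean predicted confidence per class) is purely an illustrative assumption, as are the function names, the toy bank, and the omission of placement and collision checks.

```python
import random
from collections import Counter

def sample_classes_to_insert(mean_confidence, num_instances, rng):
    """Draw class names, favoring classes the detector is less confident about."""
    classes = sorted(mean_confidence)
    # Assumed weighting rule (not from the paper): 1 - mean predicted confidence.
    weights = [1.0 - mean_confidence[c] for c in classes]
    return rng.choices(classes, weights=weights, k=num_instances)

def augment_scene(scene, proposal_bank, mean_confidence, num_instances, rng):
    """Return a copy of `scene` with extra instances drawn from the bank."""
    augmented = list(scene)
    for cls in sample_classes_to_insert(mean_confidence, num_instances, rng):
        augmented.append(rng.choice(proposal_bank[cls]))
    return augmented

rng = random.Random(0)
# Hypothetical per-class confidences: "bathtub" is the least-learned class.
mean_confidence = {"chair": 0.9, "table": 0.5, "bathtub": 0.2}
proposal_bank = {c: [f"{c}_instance_{i}" for i in range(3)] for c in mean_confidence}
scene = ["chair_instance_0", "table_instance_1"]

augmented = augment_scene(scene, proposal_bank, mean_confidence, num_instances=3, rng=rng)
draws = Counter(sample_classes_to_insert(mean_confidence, 2000, rng))
```

Over many draws, the low-confidence class is inserted most often, which matches the stated goal of increasing the presence of less-learned categories. A real pipeline would also need to place each inserted point-cloud instance so that it does not overlap existing objects.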
2401.04861
Report |
CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video |
Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Yang Long, Yefeng Zheng |
The goal of our work is to generate high-quality novel views from monocular
videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have
shown impressive performance by leveraging time-varying dynamic radiance
fields. However, these methods have limitations when it comes to accurately
modeling the motion of complex objects, which can lead to inaccurate and blurry
renderings of details. To address this limitation, we propose a novel approach
that builds upon a recent generalizable NeRF, which aggregates nearby views
onto new viewpoints. However, such methods are typically only effective for
static scenes. To overcome this challenge, we introduce a module that operates
in both the time and frequency domains to aggregate the features of object
motion. This allows us to learn the relationship between frames and generate
higher-quality images. Our experiments demonstrate significant improvements
over state-of-the-art methods on dynamic scene datasets. Specifically, our
approach outperforms existing methods in terms of both the accuracy and visual
quality of the synthesized views. |
This paper proposes CTNeRF, a novel dynamic neural radiance field method for synthesizing high-quality novel views from monocular videos of dynamic scenes by aggregating multi-view features. |
Existing methods struggle to accurately model complex object motion in dynamic scenes, leading to blurry or inaccurate renderings. |
The method uses a ray-based cross-time (RBCT) aggregation module to capture temporal relationships between features and a global spatio-temporal filter (GSTF) to model motion in the frequency domain. |
CTNeRF achieves state-of-the-art results on the Nvidia Dynamic Scene Dataset, outperforming existing methods in most tested scenarios.
The RBCT and GSTF modules are shown to be crucial for improving the quality of synthesized views, enhancing detail and reducing artifacts.
The method shows comparable performance to existing techniques on the iPhone dataset and even surpasses them in some cases. |
The method may not perform optimally when rendering novel views for long-sequence videos due to limited aggregation view length.
Fine details might be lost during feature aggregation, particularly in scenes with small non-rigid deformations. |
dynamic neural radiance field, monocular video, novel view synthesis, scene flow, transformer |
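Filtering features in the frequency domain, as the GSTF module is described as doing, can be sketched as follows. The summary says only that motion is modeled in the frequency domain. The specific design below (a 2D FFT over the time and sample axes, element-wise multiplication by a filter, then an inverse FFT) is an assumption, and the array shapes and names are hypothetical.

```python
import numpy as np

def global_frequency_filter(features, freq_filter):
    """Filter (T, N, C) features in the joint time/sample frequency domain.

    features:    real array, T frames x N samples x C channels
    freq_filter: array broadcastable to (T, N, C); learnable in a real model
    """
    spectrum = np.fft.fft2(features, axes=(0, 1))
    filtered = spectrum * freq_filter
    # Keeping only the real part is a simplification for real-valued features.
    return np.fft.ifft2(filtered, axes=(0, 1)).real

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16, 4))  # 8 frames, 16 ray samples, 4 channels

# An all-ones filter leaves the features unchanged.
identity = np.ones((8, 16, 1))
# Keeping only the zero-frequency bin averages over time and samples.
dc_only = np.zeros((8, 16, 1))
dc_only[0, 0, 0] = 1.0

unchanged = global_frequency_filter(features, identity)
smoothed = global_frequency_filter(features, dc_only)
```

Because every frequency bin depends on all frames and samples, a single element-wise multiplication in this domain mixes information globally. This is one reason such filters are used as a cheap alternative to attention over long sequences.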
2401.04728
Report |
Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation |
Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang |
Recent advances in generative diffusion models have enabled the previously
unfeasible capability of generating 3D assets from a single input image or a
text prompt. In this work, we aim to enhance the quality and functionality of
these models for the task of creating controllable, photorealistic human
avatars. We achieve this by integrating a 3D morphable model into the
state-of-the-art multi-view-consistent diffusion approach. We demonstrate that
accurate conditioning of a generative pipeline on the articulated 3D model
enhances the baseline model performance on the task of novel view synthesis
from a single image. More importantly, this integration facilitates a seamless
and accurate incorporation of facial expression and body pose control into the
generation process. To the best of our knowledge, our proposed framework is the
first diffusion model to enable the creation of fully 3D-consistent,
animatable, and photorealistic human avatars from a single image of an unseen
subject; extensive quantitative and qualitative evaluations demonstrate the
advantages of our approach over existing state-of-the-art avatar creation
models on both novel view and novel expression synthesis tasks. The code for
our project is publicly available. |
This paper introduces a novel morphable diffusion model for controllable, photorealistic human avatar creation from a single image, integrating a 3D morphable model with multi-view consistent diffusion for improved novel view synthesis and facial expression control. |
Existing methods for generating photorealistic avatars often require extensive visual input, lack controllability, or struggle with 3D consistency, limiting their use in creating animatable and realistic avatars from minimal input. |
The method combines a 3D morphable model with a multi-view diffusion model. A 3D morphable model unprojects noisy image features to 3D space, guiding the diffusion process for improved reconstruction and animation. A shuffled training scheme disentangles reconstruction and animation, enabling novel expression synthesis from a single image. |
The model outperforms baselines in novel view synthesis of faces and bodies, achieving higher scores on LPIPS, SSIM, FID, and PCK metrics.
It effectively synthesizes novel facial expressions from a single image, demonstrating superior quality and controllability compared to existing methods.
Quantitative and qualitative evaluations highlight the model's ability to generate high-fidelity, animatable avatars with improved 3D consistency. |
The model's generalizability is limited by the training data's ethnic and hairstyle diversity, primarily featuring Asian subjects with a specific cap.
Generalization to out-of-distribution camera parameters remains a challenge, requiring external methods for comprehensive 3D reconstruction and free-view synthesis. |
diffusion models, 3d morphable models, avatar creation, novel view synthesis, facial expression control |
2401.04716
Report |
Low-Resource Vision Challenges for Foundation Models |
Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek |
Low-resource settings are well-established in natural language processing,
where many languages lack sufficient data for deep learning at scale. However,
low-resource problems are under-explored in computer vision. In this paper, we
address this gap and explore the challenges of low-resource image tasks with
vision foundation models. We first collect a benchmark of genuinely
low-resource image data, covering historic maps, circuit diagrams, and
mechanical drawings. These low-resource settings all share three challenges:
data scarcity, fine-grained differences, and the distribution shift from
natural images to the specialized domain of interest. While existing foundation
models have shown impressive generalizability, we find they cannot transfer
well to our low-resource tasks. To begin to tackle the challenges of
low-resource vision, we introduce one simple baseline per challenge.
Specifically, we i) enlarge the data space by generative models, ii) adopt the
best sub-kernels to encode local regions for fine-grained difference discovery
and iii) learn attention for specialized domains. Experiments on our three
low-resource tasks demonstrate our proposals already provide a better baseline
than transfer learning, data augmentation, and fine-grained methods. This
highlights the unique characteristics and challenges of low-resource vision for
foundation models that warrant further investigation. Project page:
https://xiaobai1217.github.io/Low-Resource-Vision/. |
This paper investigates the challenges of low-resource image recognition and presents a benchmark covering historic maps, circuit diagrams, and mechanical drawings. |
Low-resource scenarios are common in computer vision, but under-explored, making it crucial to understand these challenges and adapt existing methods. |
The authors collect a benchmark of low-resource image data, analyze its challenges, and propose three baselines to address data scarcity, fine-grained details, and specialized domain shift. |
Existing foundation models struggle to generalize to the specialized domains of low-resource vision tasks.
Simple transformations and existing fine-grained recognition methods fail to handle the limited data and domain shift.
The proposed baselines, especially generated data augmentation, improve performance over zero-shot transfer and existing transfer learning methods. |
The proposed baselines are an initial step and struggle to fully address the complex relationships and rare image styles in low-resource vision.
Future work should explore more diverse generated data, consider inter-region relationships, and adapt foundation models to non-natural images. |
low-resource vision, foundation models, transfer learning, data augmentation, fine-grained recognition |
2401.04651
Report |
Learning to Prompt Segment Anything Models |
Jiaxing Huang, Kai Jiang, Jingyi Zhang, Han Qiu, Lewei Lu, Shijian Lu, Eric Xing |
Segment Anything Models (SAMs) like SEEM and SAM have demonstrated great
potential in learning to segment anything. The core design of SAMs lies with
Promptable Segmentation, which takes a handcrafted prompt as input and returns
the expected segmentation mask. SAMs work with two types of prompts including
spatial prompts (e.g., points) and semantic prompts (e.g., texts), which work
together to prompt SAMs to segment anything on downstream datasets. Despite the
important role of prompts, how to acquire suitable prompts for SAMs is largely
under-explored. In this work, we examine the architecture of SAMs and identify
two challenges for learning effective prompts for SAMs. To this end, we propose
spatial-semantic prompt learning (SSPrompt) that learns effective semantic and
spatial prompts for better SAMs. Specifically, SSPrompt introduces spatial
prompt learning and semantic prompt learning, which optimize spatial prompts
and semantic prompts directly over the embedding space and selectively leverage
the knowledge encoded in pre-trained prompt encoders. Extensive experiments
show that SSPrompt achieves superior image segmentation performance
consistently across multiple widely adopted datasets. |
This paper proposes spatial-semantic prompt learning (SSPrompt), a novel prompt learning technique for Segment Anything Models (SAMs) that enhances segmentation performance on downstream datasets by directly optimizing spatial and semantic prompts in the embedding space. |
Existing SAMs often underperform when using default prompts on downstream datasets. Optimizing prompts for these models is crucial to unlocking their full potential, especially in few-shot learning scenarios. |
SSPrompt leverages two key components: 1) spatial prompt learning (SpaPrompt), which optimizes spatial prompts in a high-dimensional embedding space to overcome limitations of the 2D coordinate system, and 2) semantic prompt learning (SemPrompt), which efficiently optimizes semantic prompts in the embedding space and selectively utilizes knowledge from pretrained text encoders. |
SSPrompt consistently outperforms state-of-the-art prompt learning methods across various image segmentation datasets, including Cityscapes, BDD100K, Mapillary, ADE20K, PASCAL Context, and ACDC.
Ablation studies highlight the effectiveness of optimizing prompts in the embedding space and selectively leveraging knowledge from pretrained prompt encoders.
SSPrompt demonstrates robustness to varying training data sizes, effectively improving performance even with limited data. |
The paper primarily focuses on SEEM due to the unavailability of open-sourced SAM versions with text prompt encoders, potentially limiting the generalizability of the findings to different SAM architectures.
Future work could explore prompt learning techniques for other recently released semantic-aware SAMs, such as Semantic SAM and SAM-CLIP. |
prompt learning, segment anything model (sam), image segmentation, few-shot learning, computer vision |
2401.04608
Report |
EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models |
Jingyuan Yang, Jiawei Feng, Hui Huang |
Recent years have witnessed remarkable progress in image generation,
where users can create visually astonishing, high-quality images. However,
existing text-to-image diffusion models are proficient in generating concrete
concepts (dogs) but encounter challenges with more abstract ones (emotions).
Several efforts have been made to modify image emotions with color and style
adjustments, facing limitations in effectively conveying emotions with fixed
image contents. In this work, we introduce Emotional Image Content Generation
(EICG), a new task to generate semantic-clear and emotion-faithful images given
emotion categories. Specifically, we propose an emotion space and construct a
mapping network to align it with the powerful Contrastive Language-Image
Pre-training (CLIP) space, providing a concrete interpretation of abstract
emotions. Attribute loss and emotion confidence are further proposed to ensure
the semantic diversity and emotion fidelity of the generated images. Our method
outperforms the state-of-the-art text-to-image approaches both quantitatively
and qualitatively, where we derive three custom metrics, i.e., emotion
accuracy, semantic clarity and semantic diversity. In addition to generation,
our method can help emotion understanding and inspire emotional art design. |
Introduces Emotional Image Content Generation (EICG), a novel task to generate images with clear semantics that evoke specific emotions, addressing the limitations of text-to-image models in handling abstract concepts like emotions. |
Current text-to-image models excel at concrete concepts but struggle with abstract ones like emotions. Existing emotion modification methods are limited by fixed image content. EICG aims to bridge this gap by generating images that are both semantically meaningful and emotionally evocative. |
Proposes a mapping network to align a learned emotion space with the CLIP space, leveraging an attribute loss based on EmoSet's annotations and an emotion confidence mechanism to ensure semantic clarity, diversity, and emotion fidelity. |
Outperforms state-of-the-art text-to-image generation methods in generating images with higher fidelity, diversity, semantic clarity, and emotion accuracy.
User study confirms the superiority of the proposed method in terms of image fidelity, emotion faithfulness, and semantic diversity.
Demonstrates potential applications in emotion decomposition, emotion transfer for image editing and design, and emotion fusion for creating complex emotional experiences. |
Current work focuses on content and could be enhanced by incorporating other visual elements like color and style.
The emotion-content relationship is simplified as binary, neglecting the nuanced emotional associations of certain objects/scenes. |
emotion generation, text-to-image synthesis, visual emotion analysis, contrastive language-image pretraining (clip), diffusion models |
2401.04468
Report |
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation |
Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, Daquan Zhou, Jiashi Feng |
The growing demand for high-fidelity video generation from textual
descriptions has catalyzed significant research in this field. In this work, we
introduce MagicVideo-V2 that integrates the text-to-image model, video motion
generator, reference image embedding module and frame interpolation module into
an end-to-end video generation pipeline. Benefiting from these architecture
designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution
video with remarkable fidelity and smoothness. It demonstrates superior
performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph,
Moon Valley and Stable Video Diffusion model via user evaluation at large
scale. |
Presents MagicVideo-V2, a novel multi-stage Text-to-Video (T2V) framework that generates high-fidelity and smooth videos from text descriptions. |
Addresses the growing demand for high-fidelity video generation from textual descriptions. |
Integrates Text-to-Image (T2I), Image-to-Video (I2V), Video-to-Video (V2V), and Video Frame Interpolation (VFI) modules into an end-to-end pipeline. |
Generates aesthetically pleasing, high-resolution videos with remarkable fidelity and smoothness.
Outperforms leading T2V systems like Runway, Pika 1.0, Morph, Moon Valley, and Stable Video Diffusion in large-scale user evaluations.
Demonstrates superior performance in generating smooth and high-aesthetic videos through qualitative examples. |
Limited diversity and volume in video training datasets.
Reliance on human evaluation for performance assessment. |
text-to-video generation, video generation, diffusion models, video frame interpolation, high-fidelity video synthesis |
2401.04463
Report |
D3AD: Dynamic Denoising Diffusion Probabilistic Model for Anomaly Detection |
Justin Tebbe, Jawad Tayyub |
Diffusion models have found valuable applications in anomaly detection by
capturing the nominal data distribution and identifying anomalies via
reconstruction. Despite their merits, they struggle to localize anomalies of
varying scales, especially larger anomalies like entire missing components.
Addressing this, we present a novel framework that enhances the capability of
diffusion models by extending the previously introduced implicit conditioning
approach of Meng et al. (2022) in three significant ways. First, we incorporate a
dynamic step size computation that allows for variable noising steps in the
forward process guided by an initial anomaly prediction. Second, we demonstrate
that denoising an input that is only scaled, without any added noise, outperforms
the conventional denoising process. Third, we project images into a latent space to
abstract away from fine details that interfere with reconstruction of large
missing components. Additionally, we propose a fine-tuning mechanism that
facilitates the model to effectively grasp the nuances of the target domain.
Our method undergoes rigorous evaluation on two prominent anomaly detection
datasets, VisA and BTAD, yielding state-of-the-art performance. Importantly, our
framework effectively localizes anomalies regardless of their scale, marking a
pivotal advancement in diffusion-based anomaly detection. |
This paper proposes D3AD, a novel diffusion model-based anomaly detection framework that enhances anomaly localization by introducing dynamic implicit conditioning, using a noiseless scaled input, and leveraging a latent diffusion model. |
Existing diffusion models struggle to localize anomalies of varying scales, especially large ones. This work aims to overcome this limitation and improve the accuracy of anomaly detection in industrial settings where accurate localization is crucial. |
The proposed D3AD method uses a dynamic implicit conditioning mechanism to determine the level of perturbation based on an initial anomaly estimate using KNN distances of domain-adapted features. It avoids initial noising and instead uses a scaled input for improved anomaly segmentation. A latent diffusion model is used to improve efficiency and handle large anomalies. |
D3AD achieves state-of-the-art anomaly segmentation performance on the VisA benchmark, outperforming previous methods by a significant margin (2.7% higher PRO score).
The dynamic implicit conditioning mechanism effectively identifies large anomalies without compromising performance on smaller ones.
Ablation studies confirm the individual contributions of domain adaptation, noiseless scaling, and dynamic implicit conditioning to D3AD's performance. |
The inference speed of D3AD is slower than some existing methods, requiring further optimization for real-time applications.
Future work could explore precomputed features and more efficient approximations for anomaly severity to enhance inference speed. |
anomaly detection, diffusion models, dynamic implicit conditioning, unsupervised learning, computer vision |
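The dynamic conditioning and noiseless scaling described in the D3AD entry above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear mapping from anomaly score to step count, the bounds `t_min` and `t_max`, the linear beta schedule, and all function names are assumptions made for illustration. The only parts taken from the entry are that the perturbation level follows an initial KNN-based anomaly estimate and that the input is scaled without adding noise.

```python
import math

def knn_score(feature, memory, k=3):
    """Mean Euclidean distance from `feature` to its k nearest nominal features."""
    dists = sorted(math.dist(feature, m) for m in memory)
    return sum(dists[:k]) / k

def dynamic_steps(score, score_max, t_min=50, t_max=250):
    """Map an anomaly score to a number of forward steps (assumed linear mapping)."""
    frac = min(max(score / score_max, 0.0), 1.0)  # clamp to [0, 1]
    return round(t_min + frac * (t_max - t_min))

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_i) for an assumed linear beta schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def scaled_input(x0, t):
    """Noiseless perturbation: keep only the sqrt(alpha_bar_t) * x0 term of the DDPM forward process."""
    s = math.sqrt(alpha_bar(t))
    return [s * v for v in x0]

# Usage: a larger initial anomaly estimate leads to more perturbation steps.
memory = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 10.0]]
score = knn_score([0.0, 0.0], memory, k=3)
steps = dynamic_steps(score, score_max=4.0)
perturbed = scaled_input([1.0, 2.0], steps)
```

The clamp keeps the step count inside the assumed range even when a test image scores higher than anything seen during calibration.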
2401.04339
Report |
Memory-Efficient Personalization using Quantized Diffusion Model |
Hyogon Ryu, Seohyun Lim, Hyunjung Shim |
The rise of billion-parameter diffusion models like Stable Diffusion XL,
Imagen, and Dall-E3 markedly advances the field of generative AI. However,
their large-scale nature poses challenges in fine-tuning and deployment due to
high resource demands and slow inference speed. This paper ventures into the
relatively unexplored yet promising realm of fine-tuning quantized diffusion
models. We establish a strong baseline by customizing three models: PEQA for
fine-tuning quantization parameters, Q-Diffusion for post-training
quantization, and DreamBooth for personalization. Our analysis reveals a
notable trade-off between subject and prompt fidelity within the baseline
model. To address these issues, we introduce two strategies, inspired by the
distinct roles of different timesteps in diffusion models: S1 optimizing a
single set of fine-tuning parameters exclusively at selected intervals, and S2
creating multiple fine-tuning parameter sets, each specialized for different
timestep intervals. Our approach not only enhances personalization but also
upholds prompt fidelity and image quality, significantly outperforming the
baseline qualitatively and quantitatively. The code will be made publicly
available. |
This paper addresses the challenge of fine-tuning large, quantized diffusion models for personalization, proposing two novel strategies to improve efficiency and performance. |
Fine-tuning large diffusion models like Stable Diffusion XL is computationally expensive. This work enables efficient personalization of these models by using quantized (low-precision) weights, saving memory and computation. |
The authors first establish a baseline by combining Q-Diffusion (for post-training quantization), DreamBooth (for personalization), and PEQA (for fine-tuning quantization parameters). Then, they introduce two strategies: (S1) selective fine-tuning at specific timesteps crucial for learning the target subject and (S2) specialized fine-tuning with multiple parameter sets tailored to different timestep intervals. |
Both S1 and S2 outperform the baseline in terms of subject fidelity, prompt fidelity, and image quality.
S2 generally shows better performance than S1, but requires three times more computation for fine-tuning.
The proposed methods achieve comparable performance to full-precision fine-tuning while using quantized weights. |
Quantization can sometimes lead to unwanted artifacts like cast shadows in generated images.
The current method does not support Low-Rank Adaptation (LoRA), which could further enhance model versatility. |
diffusion models, quantization, personalization, fine-tuning, computer vision |
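The two timestep strategies in the entry above can be sketched as simple routing logic. This is a hedged illustration only: the interval boundaries, the use of three parameter sets, and the function names are assumptions, and the actual fine-tuning of quantization parameters is not shown.

```python
from bisect import bisect_right

def s1_should_update(t, interval):
    """S1: a single parameter set is trained only when t falls in the selected interval."""
    lo, hi = interval
    return lo <= t < hi

def s2_param_index(t, boundaries):
    """S2: pick which specialized parameter set handles timestep t.

    `boundaries` is a sorted list of split points, so len(boundaries) + 1
    parameter sets cover the whole timestep range.
    """
    return bisect_right(boundaries, t)

# Usage with hypothetical values: 1000 timesteps split into three intervals.
boundaries = [333, 667]
param_sets = ["set_low", "set_mid", "set_high"]  # placeholders for learned parameters
active = param_sets[s2_param_index(500, boundaries)]
train_now = s1_should_update(500, (400, 600))
```

Under S2 the same lookup would run at both fine-tuning and sampling time, so each denoising step uses the parameters specialized for its interval.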
2401.04247
Report |
Robust Image Watermarking using Stable Diffusion |
Lijun Zhang, Xiao Liu, Antoni Viros Martin, Cindy Xiong Bearfield, Yuriy Brun, Hui Guan |
Watermarking images is critical for tracking image provenance and claiming
ownership. With the advent of generative models, such as stable diffusion, able
to create fake but realistic images, watermarking has become particularly
important, e.g., to make generated images reliably identifiable. Unfortunately,
the very same stable diffusion technology can remove watermarks injected using
existing methods. To address this problem, we present ZoDiac, which uses a
pre-trained stable diffusion model to inject a watermark into the trainable
latent space, resulting in watermarks that can be reliably detected in the
latent vector, even when attacked. We evaluate ZoDiac on three benchmarks,
MS-COCO, DiffusionDB, and WikiArt, and find that ZoDiac is robust against
state-of-the-art watermark attacks, with a watermark detection rate over 98%
and a false positive rate below 6.4%, outperforming state-of-the-art
watermarking methods. Our research demonstrates that stable diffusion is a
promising approach to robust watermarking, able to withstand even
stable-diffusion-based attacks. |
ZoDiac is a novel zero-shot watermarking framework based on stable diffusion that embeds invisible watermarks in the latent space of images, making it robust even to attacks utilizing stable diffusion. |
Watermarking images is crucial for proving ownership and tracking provenance, especially with the rise of AI-generated content. Existing methods are vulnerable to attacks that leverage generative AI, particularly stable diffusion, to remove watermarks. |
ZoDiac initializes a latent vector from an image using DDIM inversion, encodes a ring-like watermark in the vector's Fourier space, and optimizes the watermarked vector to generate a perceptually similar image. It then adaptively mixes the watermarked and original images to further improve visual quality. Watermark detection is performed by applying DDIM inversion, Fourier transformation, and statistical testing on the latent vector. |
ZoDiac achieves high watermark detection rates (above 98%) and low false positive rates (below 6.4%) even against state-of-the-art attacks, outperforming existing methods, especially against stable diffusion-based removal attacks and combined attacks.
It maintains high image quality with PSNR > 30dB and SSIM > 0.9, exceeding the quality of the most robust existing method.
ZoDiac is flexible and can be applied with different pre-trained stable diffusion backbones while maintaining its effectiveness. |
The current implementation of ZoDiac is limited to zero-bit watermarking, meaning it can only embed a mark and detect its presence but cannot encode meaningful messages.
While ZoDiac shows robustness against most attacks, it is vulnerable to rotation attacks. A proposed solution involves automatically correcting the image orientation before detection, but this increases the false positive rate, necessitating further exploration for a better trade-off. |
watermarking, stable diffusion, generative ai, robustness, zero-shot |
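The Fourier-space step of the entry above (a ring-like watermark written into the latent's spectrum and later checked there) can be shown on a toy array. The sketch below is an assumption-laden illustration, not ZoDiac itself: it uses a naive pure-Python 2D DFT on an 8 x 8 array, a constant real key value, and a plain mean-distance check in place of the statistical test. DDIM inversion, the diffusion model, and the latent optimization are omitted, and every function name is hypothetical.

```python
import cmath
import math
import random

def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of a square array (O(n^4), toy sizes only)."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = n * n if inverse else 1
    out = []
    for u in range(n):
        row = []
        for v in range(n):
            acc = 0j
            for r in range(n):
                for c in range(n):
                    acc += x[r][c] * cmath.exp(sign * 2j * math.pi * (u * r + v * c) / n)
            row.append(acc / scale)
        out.append(row)
    return out

def ring_mask(n, r_in, r_out):
    """Boolean mask selecting frequencies whose radius lies in [r_in, r_out]."""
    mask = []
    for u in range(n):
        fu = min(u, n - u)  # wrap-around frequency index
        mask.append([r_in <= math.hypot(fu, min(v, n - v)) <= r_out for v in range(n)])
    return mask

def embed(latent, mask, key):
    """Overwrite the ring frequencies with a real constant and transform back."""
    spec = dft2(latent)
    n = len(latent)
    for u in range(n):
        for v in range(n):
            if mask[u][v]:
                spec[u][v] = complex(key)
    # The mask is symmetric and the key is real, so the result stays real-valued.
    return [[z.real for z in row] for row in dft2(spec, inverse=True)]

def ring_distance(latent, mask, key):
    """Mean distance between the ring frequencies and the key (small means watermarked)."""
    spec = dft2(latent)
    n = len(latent)
    vals = [abs(spec[u][v] - key) for u in range(n) for v in range(n) if mask[u][v]]
    return sum(vals) / len(vals)

random.seed(0)
n = 8
latent = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
mask = ring_mask(n, 2, 3)
watermarked = embed(latent, mask, 5.0)
```

A detector would compare `ring_distance` on a candidate latent against a threshold; the entry describes a statistical test for this decision, which the sketch replaces with a raw distance.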
2401.04136
Report |
The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline |
Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, Kenji Kawaguchi |
The commercialization of text-to-image diffusion models (DMs) brings forth
potential copyright concerns. Despite numerous attempts to protect DMs from
copyright issues, the vulnerabilities of these solutions are underexplored. In
this study, we formalized the Copyright Infringement Attack on generative AI
models and proposed a backdoor attack method, SilentBadDiffusion, to induce
copyright infringement without requiring access to or control over training
processes. Our method strategically embeds connections between pieces of
copyrighted information and text references in poisoning data while carefully
dispersing that information, making the poisoning data inconspicuous when
integrated into a clean dataset. Our experiments show the stealth and efficacy
of the poisoning data. When given specific text prompts, DMs trained with a
poisoning ratio of 0.20% can produce copyrighted images. Additionally, the
results reveal that the more sophisticated the DMs are, the easier the success
of the attack becomes. These findings underline potential pitfalls in the
prevailing copyright protection strategies and underscore the necessity for
increased scrutiny to prevent the misuse of DMs. |
This paper proposes SilentBadDiffusion, a novel backdoor attack to induce copyright infringement in text-to-image diffusion models by poisoning the training data. |
This work exposes vulnerabilities in current copyright protection strategies relying on access restriction and highlights the need for more robust methods. |
SilentBadDiffusion dissects copyrighted images into elements, generates non-infringing images with those elements, and trains the model on this poisoned dataset, embedding connections that are triggered by specific prompts. |
Diffusion models trained on poisoned datasets with a small poisoning ratio (e.g., 0.20%) can generate copyrighted images when prompted with specific triggers.
The poisoning data seamlessly blends with clean data, making detection difficult.
More advanced diffusion models, with stronger composition abilities, are more susceptible to this attack. |
The current attack assumes decomposable copyrighted images; future work can explore broader target types.
Future research can explore defenses against this attack and investigate theoretical foundations of memorization and generalization in diffusion models. |
generative ai, diffusion model, data poisoning attack, copyright infringement attack, memorization |
2401.04099
Report |
AGG: Amortized Generative 3D Gaussians for Single Image to 3D |
Dejia Xu, Ye Yuan, Morteza Mardani, Sifei Liu, Jiaming Song, Zhangyang Wang, Arash Vahdat |
Given the growing need for automatic 3D content creation pipelines, various
3D representations have been studied to generate 3D objects from a single
image. Due to its superior rendering efficiency, 3D Gaussian splatting-based
models have recently excelled in both 3D reconstruction and generation. 3D
Gaussian splatting approaches for image to 3D generation are often
optimization-based, requiring many computationally expensive score-distillation
steps. To overcome these challenges, we introduce an Amortized Generative 3D
Gaussian framework (AGG) that instantly produces 3D Gaussians from a single
image, eliminating the need for per-instance optimization. Utilizing an
intermediate hybrid representation, AGG decomposes the generation of 3D
Gaussian locations and other appearance attributes for joint optimization.
Moreover, we propose a cascaded pipeline that first generates a coarse
representation of the 3D data and later upsamples it with a 3D Gaussian
super-resolution module. Our method is evaluated against existing
optimization-based 3D Gaussian frameworks and sampling-based pipelines
utilizing other 3D representations, where AGG showcases competitive generation
abilities both qualitatively and quantitatively while being several orders of
magnitude faster. Project page: https://ir1d.github.io/AGG/ |
Introduces AGG, a novel cascaded generative framework that produces 3D Gaussian-based objects from a single image without per-instance optimization. |
Addresses the growing need for automatic 3D content creation pipelines and the limitations of optimization-based 3D Gaussian generation approaches. |
Utilizes a hybrid generator for coarse Gaussian prediction, followed by a UNet-based super-resolution module for refinement; decomposes geometry and texture generation for joint optimization. |
AGG demonstrates competitive generation quality compared to existing optimization-based 3D Gaussian pipelines and sampling-based frameworks.
AGG achieves significantly faster inference speeds (several orders of magnitude) compared to baselines.
Ablation studies confirm the effectiveness of the proposed hybrid generator and super-resolution module. |
The number of generated 3D Gaussians is limited for representing highly complex geometry.
Future work will focus on extending AGG to handle multiple objects and occlusions. |
3d gaussian splatting, image-to-3d generation, amortized generation, hybrid representation, super-resolution |
2401.04092
Report |
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation |
Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein |
Despite recent advances in text-to-3D generative methods, there is a notable
absence of reliable evaluation metrics. Existing metrics usually focus on a
single criterion each, such as how well the asset is aligned with the input text.
These metrics lack the flexibility to generalize to different evaluation
criteria and might not align well with human preferences. Conducting user
preference studies is an alternative that offers both adaptability and
human-aligned results. User studies, however, can be very expensive to scale.
This paper presents an automatic, versatile, and human-aligned evaluation
metric for text-to-3D generative models. To this end, we first develop a prompt
generator using GPT-4V to generate evaluating prompts, which serve as input to
compare text-to-3D models. We further design a method instructing GPT-4V to
compare two 3D assets according to user-defined criteria. Finally, we use these
pairwise comparison results to assign these models Elo ratings. Experimental
results suggest our metric strongly aligns with human preference across
different evaluation criteria. |
This paper introduces an automatic evaluation metric for text-to-3D generative models using GPT-4V, aiming for versatility and human-alignment. |
Existing metrics often lack flexibility for diverse evaluation criteria and may not align well with human judgment, hindering progress in text-to-3D generation. |
The method involves a prompt generator creating diverse input prompts and a 3D assets evaluator using GPT-4V to compare generated 3D shapes based on user-defined criteria, ultimately assigning Elo ratings to models. |
The proposed metric exhibits stronger alignment with human judgment across various evaluation criteria compared to existing metrics like CLIP similarity and PickScore.
The method allows for holistic evaluation, revealing relative strengths and weaknesses among different text-to-3D models.
The framework can be extended to assess other criteria, such as the diversity of generated 3D assets. |
The study's scale is limited due to resource constraints, necessitating larger-scale verification.
The reliance on GPT-4V introduces potential biases and limitations, requiring mitigation strategies and further investigation. |
text-to-3d generation, evaluation metrics, gpt-4v, human alignment, 3d shape comparison |
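The final step of the entry above, turning pairwise comparison outcomes into Elo ratings, can be sketched with the standard Elo update rule. This is a minimal illustration under stated assumptions: the sequential online update, the initial rating of 1000, and the K-factor of 32 are conventional defaults chosen here, not values taken from the paper, and the comparison outcomes would come from GPT-4V in the described pipeline.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, a, b, score_a, k=32.0):
    """Update both ratings in place; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

def rate(models, comparisons, initial=1000.0, k=32.0):
    """Run through (model_a, model_b, score_a) tuples and return the final ratings."""
    ratings = {m: initial for m in models}
    for a, b, score_a in comparisons:
        update_elo(ratings, a, b, score_a, k)
    return ratings

# Usage with hypothetical model names and outcomes.
ratings = rate(["model_a", "model_b"], [("model_a", "model_b", 1.0)])
# After one win between equally rated models: model_a = 1016.0, model_b = 984.0
```

Sequential updates depend on the order of comparisons; fitting all ratings jointly by maximum likelihood is an order-independent alternative.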
2401.03890
Report |
A Survey on 3D Gaussian Splatting |
Guikun Chen, Wenguan Wang |
3D Gaussian splatting (GS) has recently emerged as a transformative technique
in the realm of explicit radiance field and computer graphics. This innovative
approach, characterized by the utilization of millions of learnable 3D
Gaussians, represents a significant departure from mainstream neural radiance
field approaches, which predominantly use implicit, coordinate-based models to
map spatial coordinates to pixel values. 3D GS, with its explicit scene
representation and differentiable rendering algorithm, not only promises
real-time rendering capability but also introduces unprecedented levels of
editability. This positions 3D GS as a potential game-changer for the next
generation of 3D reconstruction and representation. In the present paper, we
provide the first systematic overview of the recent developments and critical
contributions in the domain of 3D GS. We begin with a detailed exploration of
the underlying principles and the driving forces behind the emergence of 3D GS,
laying the groundwork for understanding its significance. A focal point of our
discussion is the practical applicability of 3D GS. By enabling unprecedented
rendering speed, 3D GS opens up a plethora of applications, ranging from
virtual reality to interactive media and beyond. This is complemented by a
comparative analysis of leading 3D GS models, evaluated across various
benchmark tasks to highlight their performance and practical utility. The
survey concludes by identifying current challenges and suggesting potential
avenues for future research in this domain. Through this survey, we aim to
provide a valuable resource for both newcomers and seasoned researchers,
fostering further exploration and advancement in applicable and explicit
radiance field representation. |
This paper presents the first comprehensive survey of 3D Gaussian splatting (3D GS), a novel technique for scene representation and rendering that utilizes millions of learnable 3D Gaussians. |
3D GS represents a paradigm shift from implicit neural radiance field methods like NeRF, offering advantages such as real-time rendering capabilities and unprecedented editability. |
The paper discusses the principles of 3D GS, including its forward process (splatting, rendering) and optimization process (parameter optimization, density control). It also explores various extensions of 3D GS, such as data-efficient and memory-efficient approaches, as well as applications in robotics, dynamic scene reconstruction, and AI-generated content. |
3D GS-based methods achieve state-of-the-art performance in various tasks, including localization, rendering quality (static and dynamic scenes), human avatar modeling, and surgical 3D reconstruction.
3D GS demonstrates significant advantages in terms of both accuracy and speed compared to NeRF-based methods, particularly for applications requiring real-time performance.
The explicit representation of 3D GS enables easier manipulation and editing of scenes, opening up new possibilities in content creation and scene understanding. |
Current 3D GS techniques face challenges in modeling internal structures of objects and handling large-scale scene reconstruction.
Further research is needed to explore the full potential of 3D GS in robotics, simulation, and other emerging applications. |
3d gaussian splatting, explicit radiance field, real-time rendering, scene understanding, neural rendering |
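The rendering half of the forward process mentioned in the entry above blends depth-sorted splats per pixel with front-to-back alpha compositing, C = sum_i c_i * a_i * prod_{j<i} (1 - a_j). The sketch below shows only that blending rule for one pixel. It assumes each splat's per-pixel alpha has already been computed (in 3D GS it comes from the learned opacity times the projected 2D Gaussian's value at the pixel), and the function name and tuple layout are illustrative choices.

```python
def composite(splats):
    """Blend (depth, alpha, (r, g, b)) splats front to back for a single pixel.

    Returns the accumulated color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for _, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        for ch in range(3):
            color[ch] += weight * rgb[ch]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance

# Usage: a half-transparent red splat in front of an opaque blue one.
pixel, remaining = composite([
    (2.0, 1.0, (0.0, 0.0, 1.0)),  # far, opaque blue
    (1.0, 0.5, (1.0, 0.0, 0.0)),  # near, half-transparent red
])
# pixel == (0.5, 0.0, 0.5), remaining == 0.0
```

Because every operation is a product or a sum, the blend is differentiable with respect to colors and alphas, which is what allows the Gaussians' parameters to be optimized from rendered images.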
2401.03854
Report |
TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment |
Jiquan Yuan, Xinyan Cao, Jinming Che, Qinyuan Wang, Sen Liang, Wei Ren, Jinlong Lin, Xixin Cao |
Recently, AIGC image quality assessment (AIGCIQA), which aims to assess the
quality of AI-generated images (AIGIs) from a human perception perspective, has
emerged as a new topic in computer vision. Unlike common image quality
assessment tasks where images are derived from original ones distorted by
noise, blur, compression, etc., in AIGCIQA tasks, images are
typically generated by generative models using text prompts. Considerable
efforts have been made in the past years to advance AIGCIQA. However, most
existing AIGCIQA methods regress predicted scores directly from individual
generated images, overlooking the information contained in the text prompts of
these images. This oversight partially limits the performance of these AIGCIQA
methods. To address this issue, we propose a text-image encoder-based
regression (TIER) framework. Specifically, we process the generated images and
their corresponding text prompts as inputs, utilizing a text encoder and an
image encoder to extract features from these text prompts and generated images,
respectively. To demonstrate the effectiveness of our proposed TIER method, we
conduct extensive experiments on several mainstream AIGCIQA databases,
including AGIQA-1K, AGIQA-3K, and AIGCIQA2023. The experimental results
indicate that our proposed TIER method generally demonstrates superior
performance compared to the baseline in most cases. |
This paper introduces TIER, a text-image encoder-based regression framework for AIGC image quality assessment that leverages information from both generated images and their text prompts. |
Existing AIGCIQA methods often fail to consider the valuable information present in the text prompts, limiting their assessment accuracy. |
TIER utilizes a text encoder (BERT) to extract features from text prompts and an image encoder (ResNet, InceptionV4) to extract features from generated images. These features are concatenated and fed into a regression network to predict the quality score. |
TIER generally outperforms baseline methods that ignore text prompt information on AGIQA-1K, AGIQA-3K, and AIGCIQA2023 datasets.
The framework shows particular promise in predicting quality and authenticity scores.
Performance improvement for correspondence scores is not always guaranteed, suggesting a need for better understanding of the image-text relationship in certain cases. |
The method's performance in predicting correspondence scores can be inconsistent.
Future work could explore more sophisticated methods for fusing text and image features. |
aigc, aigciqa, image quality assessment, text encoder, image encoder |
2401.03707
Report |
FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring |
Geunhyuk Youk, Jihyong Oh, Munchurl Kim |
We present a joint learning scheme of video super-resolution and deblurring,
called VSRDB, to restore clean high-resolution (HR) videos from blurry
low-resolution (LR) ones. This joint restoration problem has drawn much less
attention compared to single restoration problems. In this paper, we propose a
novel flow-guided dynamic filtering (FGDF) and iterative feature refinement
with multi-attention (FRMA), which constitutes our VSRDB framework, denoted as
FMA-Net. Specifically, our proposed FGDF enables precise estimation of both
spatio-temporally-variant degradation and restoration kernels that are aware of
motion trajectories through sophisticated motion representation learning.
Compared to conventional dynamic filtering, the FGDF enables the FMA-Net to
effectively handle large motions in VSRDB. Additionally, the stacked FRMA
blocks trained with our novel temporal anchor (TA) loss, which temporally
anchors and sharpens features, refine features in a coarse-to-fine manner
through iterative updates. Extensive experiments demonstrate the superiority of
the proposed FMA-Net over state-of-the-art methods in terms of both
quantitative and qualitative quality. Codes and pre-trained models are
available at: https://kaist-viclab.github.io/fmanet-site |
This paper proposes FMA-Net, a novel framework for Video Super-Resolution and Deblurring (VSRDB) that effectively handles small-to-large motion and restores clean, high-resolution videos from blurry, low-resolution inputs. |
VSRDB is crucial for enhancing video quality in real-world scenarios where videos often suffer from blur due to camera shake or object motion. Existing methods struggle to effectively address spatio-temporally variant degradations, limiting their performance. |
FMA-Net leverages Flow-Guided Dynamic Filtering (FGDF) and Iterative Feature Refinement with Multi-Attention (FRMA) to learn motion-aware degradation kernels and iteratively refine features for joint restoration. It employs a two-stage training strategy, pre-training a degradation learning network followed by joint training with a restoration network. |
FMA-Net significantly outperforms state-of-the-art VSR and deblurring methods, achieving notable PSNR, SSIM, and tOF improvements on REDS4, GoPro, and YouTube datasets.
The proposed FGDF mechanism proves highly effective in handling large motions, leading to significant performance gains over conventional dynamic filtering.
Ablation studies confirm the contribution of each component in FMA-Net, highlighting the effectiveness of multi-flow-mask pairs, temporal anchor loss, and the multi-attention module. |
FMA-Net currently uses a two-stage training strategy, which requires additional training time compared to an end-to-end approach.
Handling extreme conditions like object rotation remains challenging due to the difficulty in predicting accurate optical flow in such scenarios. Future work could explore incorporating learnable homography parameters or quaternion representations to address this. |
video super-resolution, video deblurring, joint restoration, dynamic filtering, optical flow |
2401.03433
Report |
SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing |
Songyan Chen, Jiancheng Huang |
Text-conditional image editing based on large diffusion generative models has
attracted the attention of both the industry and the research community. Most
existing methods are non-reference editing, with the user only able to provide
a source image and text prompt. However, this restricts the user's control over the
characteristics of the editing outcome. To increase user freedom, we propose a new
task called Specific Reference Condition Real Image Editing, which allows the user
to provide a reference image to further control the outcome, such as replacing
an object with a particular one. To accomplish this, we propose a fast baseline
method named SpecRef. Specifically, we design a Specific Reference Attention
Controller to incorporate features from the reference image, and adopt a mask
mechanism to prevent interference between editing and non-editing regions. We
evaluate SpecRef on typical editing tasks and show that it can achieve
satisfactory performance. The source code is available at
https://github.com/jingjiqinggong/specp2p. |
This paper introduces "Specific Reference Condition Real Image Editing," a new image editing task that allows users to control editing outcomes by providing a reference image, and proposes SpecRef, a fast, training-free baseline method for this task. |
Existing non-reference editing methods limit user control as they only allow inputting a source image and text prompts, restricting users from specifying the desired characteristics of the edited output. |
SpecRef extracts reference features from the self-attention layers of a pre-trained Stable Diffusion model during the inversion of the reference image. It then incorporates these features into the editing process using a Specific Reference Attention Layer (SR-attn) that blends features from the source and reference images based on attention masks, guiding the generation towards the reference while preserving the unedited parts of the source image. |
SpecRef successfully edits images based on both text prompts and specific reference images.
It effectively addresses the limitations of non-reference editing by allowing users to specify the desired appearance of edited objects or regions.
Experiments demonstrate SpecRef's ability to perform various editing tasks like object replacement, clothing replacement, and scene replacement with promising results. |
SpecRef may fail when the reference image region significantly differs in size or shape from the source image's editing region, leading to unnatural results.
The reliance on cross-attention for transferring features from the reference to the edited image can cause issues when the spatial relationship between objects in both images is dissimilar. |
aigc, image editing, diffusion models, stable diffusion, reference image editing |
2401.03257
Report |
RustNeRF: Robust Neural Radiance Field with Low-Quality Images |
Mengfei Li, Ming Lu, Xiaofang Li, Shanghang Zhang |
Recent work on Neural Radiance Fields (NeRF) exploits multi-view 3D
consistency, achieving impressive results in 3D scene modeling and
high-fidelity novel-view synthesis. However, there are limitations. First,
existing methods assume enough high-quality images are available for training
the NeRF model, ignoring real-world image degradation. Second, previous methods
struggle with ambiguity in the training set due to unmodeled inconsistencies
among different views. In this work, we present RustNeRF for real-world
high-quality NeRF. To improve NeRF's robustness under real-world inputs, we
train a 3D-aware preprocessing network that incorporates real-world degradation
modeling. We propose a novel implicit multi-view guidance to address
information loss during image degradation and restoration. Extensive
experiments demonstrate RustNeRF's advantages over existing approaches under
real-world degradation. The code will be released. |
This paper introduces RustNeRF, a robust Neural Radiance Field (NeRF) framework designed to handle low-quality, degraded images for high-fidelity novel view synthesis. |
Existing NeRF methods struggle with real-world image degradations, leading to unsatisfactory novel views. RustNeRF aims to improve the robustness of NeRF in real-world scenarios with degraded image sets. |
RustNeRF utilizes a 3D-aware preprocessing network trained on a synthetic dataset with simulated real-world degradations. It employs a view selection mechanism to gather relevant information from neighboring views for restoring the target view. Additionally, it introduces an implicit multi-view guidance technique that casts multiple rays within a pixel to leverage information from different views, further enhancing details in the reconstructed scene. |
RustNeRF demonstrates superior performance compared to baseline NeRF models, particularly DVGO and Instant-NGP, on benchmark datasets like Blender and LLFF, exhibiting significant improvements in PSNR, SSIM, and LPIPS metrics.
The proposed 3D-aware restoration network effectively reduces artifacts and improves the overall quality of the reconstructed scene compared to using single-view restoration or off-the-shelf solutions like Real-ESRGAN.
Implicit multi-view guidance, coupled with quadtree acceleration to manage computational cost, further enhances details and reduces noise in the rendered views, especially in regions with high-frequency information. |
The current implementation of RustNeRF does not incorporate bundle adjustment to handle potential camera pose inaccuracies caused by degraded input images.
The degradation model used for training the restoration network relies on a combination of classical degradation models and could benefit from further exploration and refinement to better simulate complex real-world degradation processes. |
neural radiance fields, nerf, image restoration, novel view synthesis, 3d scene reconstruction |
2401.03253
Report |
VLLaVO: Mitigating Visual Gap through LLMs |
Shuhao Chen, Yulong Zhang, Weisen Jiang, Jiangang Lu, Yu Zhang |
Recent advances achieved by deep learning models rely on the independent and
identically distributed assumption, hindering their applications in real-world
scenarios with domain shifts. To tackle this issue, cross-domain learning aims
at extracting domain-invariant knowledge to reduce the domain shift between
training and testing data. However, in visual cross-domain learning,
traditional methods concentrate solely on the image modality, disregarding the
potential benefits of incorporating the text modality. In this work, we propose
VLLaVO, combining Vision language models and Large Language models as Visual
cross-dOmain learners. VLLaVO uses vision-language models to convert images
into detailed textual descriptions. A large language model is then finetuned on
textual descriptions of the source/target domain generated by a designed
instruction template. Extensive experimental results under domain
generalization and unsupervised domain adaptation settings demonstrate the
effectiveness of the proposed method. |
This paper proposes VLLaVO, a novel approach that integrates Vision Language Models (VLMs) and Large Language Models (LLMs) to address visual domain shifts in cross-domain learning. |
Existing cross-domain learning methods often rely solely on the image modality, neglecting the potential of the text modality. This paper explores leveraging the power of LLMs for improved domain-invariant feature learning in visual tasks. |
VLLaVO first utilizes VLMs to transform images into textual descriptions (tags, attributes, captions). Subsequently, an LLM is fine-tuned with a designed instruction template using these descriptions paired with image labels, enabling it to perform classification based on textual input. |
VLLaVO consistently achieves state-of-the-art performance on benchmark datasets for both Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) tasks.
The method demonstrates superior zero-shot learning capability, outperforming zero-shot LLM baselines in cross-dataset evaluations.
Analysis reveals that VLLaVO effectively learns domain-invariant features, as evidenced by t-SNE visualizations and sensitivity analysis, focusing on relevant keywords while mitigating domain-specific biases. |
The quality of extracted textual descriptions depends on the VLM's capabilities and can be further improved.
The current work focuses on visual classification, limiting its applicability to other visual tasks. Future research should explore extending VLLaVO to address domain shifts in tasks like segmentation or depth estimation. |
domain generalization, unsupervised domain adaptation, large language models, vision language models, cross-domain learning |
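The two-stage VLLaVO pipeline summarized above (VLM-generated tags, attributes, and captions, followed by an instruction template used for LLM fine-tuning) can be pictured with a small sketch. The template wording, the `ImageDescription` fields, and the function name `build_instruction` are hypothetical illustrations, not the authors' actual template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageDescription:
    """Text produced by a VLM for one image (field names are assumptions)."""
    tags: List[str]
    attributes: List[str]
    caption: str


def build_instruction(desc: ImageDescription, class_names: List[str]) -> str:
    """Turn a textual image description into a classification prompt."""
    lines = [
        "Classify the image described below into one of the candidate categories.",
        "Tags: " + ", ".join(desc.tags),
        "Attributes: " + ", ".join(desc.attributes),
        "Caption: " + desc.caption,
        "Candidates: " + ", ".join(class_names),
        "Answer:",
    ]
    return "\n".join(lines)


example = ImageDescription(
    tags=["dog", "grass"],
    attributes=["brown fur", "sketch style"],
    caption="A sketch of a dog on grass.",
)
prompt = build_instruction(example, ["dog", "elephant", "giraffe"])
```

During fine-tuning, the ground-truth label would be appended after `Answer:` as the target text; at test time the LLM completes it.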
2401.03201
Report |
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding |
Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu |
The remarkable potential of multi-modal large language models (MLLMs) in
comprehending both vision and language information has been widely
acknowledged. However, the scarcity of 3D scenes-language pairs in comparison
to their 2D counterparts, coupled with the inadequacy of existing approaches in
understanding of 3D scenes by LLMs, poses a significant challenge. In response,
we collect and construct an extensive dataset comprising 75K
instruction-response pairs tailored for 3D scenes. This dataset addresses tasks
related to 3D VQA, 3D grounding, and 3D conversation. To further enhance the
integration of 3D spatial information into LLMs, we introduce a novel and
efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment
stage between 3D scenes and language and extends the instruction prompt with
the 3D modality information including the entire scene and segmented objects.
We evaluate the effectiveness of our method across diverse tasks in the 3D
scene domain and find that our approach serves as a strategic means to enrich
LLMs' comprehension of the 3D world. Our code is available at
https://github.com/staymylove/3DMIT. |
This paper proposes 3DMIT, an efficient 3D multi-modal instruction tuning framework designed to train LLMs in understanding 3D scenes by leveraging global scene and fine-grained object information, without requiring an alignment stage. |
Existing methods for 3D scene understanding with LLMs are limited by the scarcity of 3D scene-language data and the inefficiency of aligning 3D data with text. |
The authors construct a 75K 3D scene-language instruction dataset and propose 3DMIT, which directly incorporates 3D scene and object features extracted by pre-trained encoders into the instruction prompt for LLM fine-tuning. |
3DMIT outperforms 3D-LLMs without alignment and achieves comparable results to 3D-LLMs with alignment on 3D VQA.
3DMIT demonstrates promising performance on 3D visual grounding, outperforming methods without alignment.
The ablation study shows the benefits of using pre-trained 3D object encoders and incorporating multi-view image tokens for MLLMs. |
LLMs still face challenges in numerical and computational tasks, limiting their performance on tasks requiring precise spatial understanding.
Further research is needed to explore how to effectively incorporate spatial location information into LLMs for improved 3D grounding. |
multi-modal, 3d-llms, 3d scene understanding, instruction tuning, prompt engineering |
2401.03140
Report |
Fair Sampling in Diffusion Models through Switching Mechanism |
Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saeroom Park |
Diffusion models have shown their effectiveness in generation tasks by
well-approximating the underlying probability distribution. However, diffusion
models are known to suffer from an amplified inherent bias from the training
data in terms of fairness. While the sampling process of diffusion models can
be controlled by conditional guidance, previous works have attempted to find
empirical guidance to achieve quantitative fairness. To address this
limitation, we propose a fairness-aware sampling method called
"attribute switching" mechanism for diffusion models. Without additional
training, the proposed sampling can obfuscate sensitive attributes in generated
data without relying on classifiers. We mathematically prove and experimentally
demonstrate the effectiveness of the proposed method on two key aspects: (i)
the generation of fair data and (ii) the preservation of the utility of the
generated data. |
This paper proposes "attribute switching," a sampling method for diffusion models that enhances fairness without requiring additional training or classifiers. |
Diffusion models, while effective, can amplify biases present in training data. Existing fairness solutions often rely on classifiers or are computationally expensive. This method addresses these limitations by aiming for distributional fairness, ensuring generated data is independent of sensitive attributes like race or gender. |
The method leverages the finding that diffusion models learn features at different sampling stages. It switches the sensitive attribute condition at a specific transition point during sampling, transferring high-level features from one attribute to another. A theoretical condition for finding this optimal transition point is provided and validated empirically. |
The method successfully generates data satisfying epsilon-fairness, showing comparable performance to true data on fairness benchmarks.
It preserves data utility, exhibiting similar FID scores to standard diffusion model sampling, indicating the generation of high-quality samples.
The approach is effective across various pre-trained diffusion models, including those conditioned on text prompts, showcasing its versatility. |
While the method preserves high-level image features, elements with strong contextual connections might be unintentionally removed, requiring further investigation.
The study primarily focuses on distributional fairness. Exploring fairness in a broader context, considering various factors beyond distribution, is crucial for a holistic understanding. |
diffusion models, fairness, generative models, sampling methods, attribute switching |
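The attribute switching idea summarized above (condition on one sensitive attribute for the early sampling steps, then switch to another at a transition point) can be sketched as a plain sampling loop. The denoiser here is a toy scalar update, and `sample_with_attribute_switch`, `toy_denoise_step`, and the choice of `switch_step` are assumptions for illustration; the paper derives its own condition for the transition point.

```python
from typing import Callable, List, Tuple


def sample_with_attribute_switch(
    denoise_step: Callable[[float, int, str], float],
    x_init: float,
    num_steps: int,
    switch_step: int,
    attr_start: str,
    attr_end: str,
) -> Tuple[float, List[str]]:
    """Run reverse sampling from t = num_steps - 1 down to 0, switching the condition once."""
    x = x_init
    used: List[str] = []
    for t in range(num_steps - 1, -1, -1):
        # Noisy early steps (t > switch_step) use attr_start; later steps use attr_end.
        attr = attr_start if t > switch_step else attr_end
        x = denoise_step(x, t, attr)
        used.append(attr)
    return x, used


def toy_denoise_step(x: float, t: int, attr: str) -> float:
    """Stand-in for a conditional denoiser: pull x toward an attribute-specific target."""
    target = 1.0 if attr == "a1" else -1.0
    return x + 0.1 * (target - x)


x_final, schedule = sample_with_attribute_switch(
    toy_denoise_step, x_init=0.0, num_steps=10, switch_step=4,
    attr_start="a0", attr_end="a1",
)
```

No retraining or classifier appears in the loop: only the conditioning argument passed to an already trained denoiser changes.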
2401.03048
Report |
Latte: Latent Diffusion Transformer for Video Generation |
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao |
We propose a novel Latent Diffusion Transformer, namely Latte, for video
generation. Latte first extracts spatio-temporal tokens from input videos and
then adopts a series of Transformer blocks to model video distribution in the
latent space. In order to model a substantial number of tokens extracted from
videos, four efficient variants are introduced from the perspective of
decomposing the spatial and temporal dimensions of input videos. To improve the
quality of generated videos, we determine the best practices of Latte through
rigorous experimental analysis, including video clip patch embedding, model
variants, timestep-class information injection, temporal positional embedding,
and learning strategies. Our comprehensive evaluation demonstrates that Latte
achieves state-of-the-art performance across four standard video generation
datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In
addition, we extend Latte to the text-to-video generation (T2V) task, where Latte
achieves results comparable to recent T2V models. We strongly believe
that Latte provides valuable insights for future research on incorporating
Transformers into diffusion models for video generation. |
This paper introduces Latte, a novel Latent Diffusion Transformer for video generation, featuring a video Transformer backbone and four efficient model variants for capturing spatio-temporal video distribution. |
Generating high-quality videos is challenging due to their complex spatio-temporal information and high dimensionality. Latte explores the potential of Transformer-based latent diffusion models for realistic video generation. |
Latte leverages a pre-trained VAE for encoding videos into latent space tokens, processed by Transformer blocks. Four variants explore decomposing spatial and temporal dimensions for efficient information capture. Extensive empirical analysis identifies best practices for Latte's components. |
Latte achieves state-of-the-art performance on four video generation benchmarks, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD.
Comprehensive ablation studies reveal optimal design choices for Transformer-based video diffusion models, such as uniform frame patch embedding, S-AdaLN for timestep/class information injection, and absolute temporal positional embedding.
Latte demonstrates promising results for text-to-video generation, comparable to existing methods like VideoFusion and VideoLDM. |
Exploring the impact of different pre-trained video Transformers on Latte's performance.
Investigating alternative methods for temporal information injection within the Transformer architecture. |
video generation, diffusion models, transformers, latent space, text-to-video generation |
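The motivation for decomposing spatial and temporal dimensions in Latte can be illustrated with a generic sketch of factorized attention grouping. This is not one of the paper's four variants specifically; the helper names and the pair-counting cost model are assumptions used only to show why factorization reduces the number of token interactions.

```python
from typing import List, Sequence, Tuple

Token = Tuple[int, int]  # (frame index, patch index)


def spatial_groups(tokens: Sequence[Sequence[Token]]) -> List[List[Token]]:
    """tokens[f][p] is patch p of frame f; each group holds one frame's patches."""
    return [list(frame) for frame in tokens]


def temporal_groups(tokens: Sequence[Sequence[Token]]) -> List[List[Token]]:
    """Each group holds one patch position tracked across all frames."""
    return [list(track) for track in zip(*tokens)]


def joint_attention_pairs(num_frames: int, patches_per_frame: int) -> int:
    """Token pairs when every token attends to every other token."""
    n = num_frames * patches_per_frame
    return n * n


def factorized_attention_pairs(num_frames: int, patches_per_frame: int) -> int:
    """Token pairs when attention runs within frames, then within patch tracks."""
    spatial = num_frames * patches_per_frame ** 2
    temporal = patches_per_frame * num_frames ** 2
    return spatial + temporal


frames, patches = 4, 6
tokens = [[(f, p) for p in range(patches)] for f in range(frames)]
```

For 4 frames of 6 patches, joint attention touches 576 pairs while the factorized scheme touches 240, and the gap widens as the clip grows.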
2401.02957
Report |
Denoising Vision Transformers |
Jiawei Yang, Katie Z Luo, Jiefeng Li, Kilian Q Weinberger, Yonglong Tian, Yue Wang |
We delve into a nuanced but significant challenge inherent to Vision
Transformers (ViTs): feature maps of these models exhibit grid-like artifacts,
which hurt the performance of ViTs in downstream tasks. Our
investigations trace this fundamental issue down to the positional embeddings
at the input stage. To address this, we propose a novel noise model, which is
universally applicable to all ViTs. Specifically, the noise model dissects ViT
outputs into three components: a semantics term free from noise artifacts and
two artifact-related terms that are conditioned on pixel locations. Such a
decomposition is achieved by enforcing cross-view feature consistency with
neural fields on a per-image basis. This per-image optimization process
extracts artifact-free features from raw ViT outputs, providing clean features
for offline applications. Expanding the scope of our solution to support online
functionality, we introduce a learnable denoiser to predict artifact-free
features directly from unprocessed ViT outputs, which shows remarkable
generalization capabilities to novel data without the need for per-image
optimization. Our two-stage approach, termed Denoising Vision Transformers
(DVT), does not require re-training existing pre-trained ViTs and is
immediately applicable to any Transformer-based architecture. We evaluate our
method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP,
DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT
consistently and significantly improves existing state-of-the-art
general-purpose models in semantic and geometric tasks across multiple datasets
(e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT
design, especially regarding the naive use of positional embeddings. |
Identifies and addresses the issue of noise artifacts in Vision Transformer (ViT) features, attributing them to positional embeddings and proposing a two-stage denoising approach called Denoising Vision Transformers (DVT). |
These artifacts hinder feature interpretability, disrupt semantic coherence, and negatively impact the performance of ViTs in downstream tasks. |
A novel noise model decomposes ViT outputs into semantic and artifact components. A per-image denoising technique using neural fields extracts artifact-free features, and a generalizable denoiser network is trained for real-time inference. |
Noise artifacts are prevalent in ViT features across various training algorithms.
DVT effectively removes artifacts, leading to visually cleaner feature maps and enhanced semantic coherence.
Significant performance improvements are observed in downstream tasks like semantic segmentation and depth prediction after denoising. |
The fundamental reason for the existence of these artifacts and their variation across training algorithms is not fully understood.
Exploring alternative positional embedding approaches and ViT architectures could further mitigate artifacts. |
vision transformers, feature denoising, positional embeddings, neural fields, dense prediction tasks |
2401.02955
Report |
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively |
Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy |
The CLIP and Segment Anything Model (SAM) are remarkable vision foundation
models (VFMs). SAM excels in segmentation tasks across diverse domains, while
CLIP is renowned for its zero-shot recognition capabilities. This paper
presents an in-depth exploration of integrating these two models into a unified
framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired
model designed for simultaneous interactive segmentation and recognition,
leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The
former adapts SAM's knowledge into the CLIP via distillation and learnable
transformer adapters, while the latter transfers CLIP knowledge into SAM,
enhancing its recognition capabilities. Extensive experiments on various
datasets and detectors show the effectiveness of Open-Vocabulary SAM in both
segmentation and recognition tasks, significantly outperforming the naive
baselines of simply combining SAM and CLIP. Furthermore, aided with image
classification data training, our method can segment and recognize
approximately 22,000 classes. |
This paper introduces Open-Vocabulary SAM, a model that unifies the segmentation prowess of SAM with the zero-shot recognition capabilities of CLIP for simultaneous interactive segmentation and object recognition. |
Existing methods for combining SAM and CLIP are computationally expensive and struggle to recognize small objects. Open-Vocabulary SAM aims to address these limitations with a unified architecture and knowledge transfer modules. |
The paper proposes a unified encoder-decoder framework with two novel modules: SAM2CLIP (distills knowledge from SAM encoder to CLIP encoder) and CLIP2SAM (transfers CLIP knowledge to the SAM decoder for recognition). |
Open-Vocabulary SAM outperforms naive baselines, achieving over 2% improvement in IoU and 3% in mAP on COCO.
The method demonstrates significant gains in recognizing small objects, achieving over 20% accuracy improvement on LVIS.
Trained on a large dataset, Open-Vocabulary SAM can segment and recognize approximately 22,000 classes, acting as an effective interactive annotation tool. |
The model's performance slightly decreases when used with less robust detectors.
Future work will explore using coarse masks or language descriptions as interactive prompts. |
open-vocabulary learning, interactive segmentation, object recognition, vision foundation models, knowledge distillation |
2401.02739
Report |
Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors |
Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov |
We propose denoising diffusion variational inference (DDVI), an approximate
inference algorithm for latent variable models which relies on diffusion models
as flexible variational posteriors. Specifically, our method introduces an
expressive class of approximate posteriors with auxiliary latent variables that
perform diffusion in latent space by reversing a user-specified noising
process. We fit these models by optimizing a lower bound on the marginal
likelihood inspired by the wake-sleep algorithm. Our method is easy to
implement (it fits a regularized extension of the ELBO), is compatible with
black-box variational inference, and outperforms alternative classes of
approximate posteriors based on normalizing flows or adversarial networks. It
increases the expressivity of flow-based methods via non-invertible deep
recurrent architectures and avoids the instability of adversarial methods. We
use DDVI on a motivating task in biology -- inferring latent ancestry from
human genomes -- and we find that it outperforms strong baselines on the
Thousand Genomes dataset. |
The paper proposes Denoising Diffusion Variational Inference (DDVI), a new variational inference algorithm that uses diffusion models as flexible variational posteriors. |
DDVI enhances variational inference by introducing auxiliary latent variables and a user-specified noising process, leading to more expressive approximate posteriors and tighter bounds on the marginal likelihood. |
DDVI leverages a wake-sleep inspired lower bound, optimizing via alternating ELBO optimization and 'sleep' steps to reverse the noising process and fit the approximate posterior. |
DDVI outperforms baselines like AEVB, AEVB-IAF, and AAE in unsupervised learning tasks on MNIST and CIFAR-10 with various complex priors.
In semi-supervised settings, DDVI achieves strong performance in classification accuracy and ELBO on both MNIST and CIFAR-10.
Applied to genotype analysis on the 1000 Genomes dataset, DDVI demonstrates superior clustering performance compared to baselines. |
While showing promise in dimensionality reduction and visualization, the paper acknowledges potential limitations in likelihood estimation compared to traditional methods.
Further exploration of architectural improvements is needed to enhance performance in density estimation and sample quality tasks. |
variational inference, diffusion models, expressive posteriors, wake-sleep algorithm, genotype analysis |
2401.02677
Report |
Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss |
Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick Von Platen |
Stable Diffusion XL (SDXL) has become the best open source text-to-image
model (T2I) for its versatility and top-notch image quality. Efficiently
addressing the computational demands of SDXL models is crucial for wider reach
and applicability. In this work, we introduce two scaled-down variants, Segmind
Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter
UNets, respectively, achieved through progressive removal using layer-level
losses focusing on reducing the model size while preserving generative quality.
We release the model weights at https://hf.co/Segmind. Our methodology
involves the elimination of residual networks and transformer blocks from the
U-Net structure of SDXL, resulting in significant reductions in parameters and
latency. Our compact models effectively emulate the original SDXL by
capitalizing on transferred knowledge, achieving competitive results against
larger multi-billion parameter SDXL. Our work underscores the efficacy of
knowledge distillation coupled with layer-level losses in reducing model size
while preserving the high-quality generative capabilities of SDXL, thus
facilitating more accessible deployment in resource-constrained environments. |
Introduces SSD-1B and Segmind-Vega, scaled-down variants of Stable Diffusion XL (SDXL) with 1.3B and 0.74B parameter UNets respectively, achieved through progressive layer removal and knowledge distillation. |
Addresses the computational demands of SDXL, making it more accessible for wider reach and applicability in resource-constrained environments. |
Employs progressive layer removal from the SDXL U-Net, guided by layer-level losses and knowledge distillation from multiple teacher models (SDXL base, Zavychroma-XL, Juggernaut-XL). |
SSD-1B and Segmind-Vega achieve competitive image generation quality compared to the full SDXL model.
Inference speedup of up to 60% for SSD-1B and 100% for Segmind-Vega.
Human preference study shows SSD-1B is marginally preferred over SDXL despite its smaller size. |
Limitations in generating specific image elements like text, hands, and full-body shots.
Future work includes exploring the technique on other large models like LLMs and MLMs. |
stable diffusion, sdxl, model compression, knowledge distillation, text-to-image synthesis |
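The combination of knowledge distillation with layer-level losses described for SSD-1B and Segmind-Vega can be sketched as a weighted sum of three terms: a task loss against the training target, an output-level distillation loss against the teacher, and a feature-level loss summed over matched layers. The function names, the use of mean squared error for every term, and the weights are assumptions for illustration rather than the authors' exact formulation.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    target: Sequence[float],
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    w_task: float = 1.0,
    w_out: float = 1.0,
    w_feat: float = 1.0,
) -> float:
    """Task loss + output-level distillation + layer-level feature distillation."""
    task = mse(student_out, target)
    out_kd = mse(student_out, teacher_out)
    # One term per retained student layer, matched to its teacher counterpart.
    feat_kd = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return w_task * task + w_out * out_kd + w_feat * feat_kd


loss = distillation_loss(
    student_out=[1.0, 2.0],
    teacher_out=[1.0, 4.0],
    target=[1.0, 2.0],
    student_feats=[[0.0, 0.0]],
    teacher_feats=[[1.0, 1.0]],
)
```

In a progressive scheme, blocks would be removed in stages, and the feature term would be recomputed over whichever layers remain in the student after each stage.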
2401.02620
Report |
Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human |
Song Bai, Jie Li |
While AI-generated text and 2D images continue to expand their territory, 3D
generation has gradually emerged as a trend that cannot be ignored. Since
2023, a large number of research papers have emerged in the domain of 3D
generation. This growth encompasses not just the creation of 3D objects, but
also the rapid development of 3D character and motion generation. Several key
factors contribute to this progress. The enhanced fidelity in stable diffusion,
coupled with control methods that ensure multi-view consistency, and realistic
human models like SMPL-X, contribute synergistically to the production of 3D
models with remarkable consistency and near-realistic appearances. The
advancements in neural network-based 3D storing and rendering models, such as
Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have
accelerated the efficiency and realism of neural rendered models. Furthermore,
the multimodality capabilities of large language models have enabled language
inputs to transcend into human motion outputs. This paper aims to provide a
comprehensive overview and summary of the relevant papers published mostly
during the latter half of 2023. It will begin by discussing
AI-generated object models in 3D, followed by the generated 3D human models, and
finally, the generated 3D human motions, culminating in a conclusive summary
and a vision for the future. |
This paper presents a comprehensive overview of recent advancements in AI-powered 3D content generation, focusing on object and human model creation, and human motion synthesis. |
The rapid progress in AI-generated 3D content is transforming various fields, including gaming, entertainment, and education, by enabling faster and more efficient creation of realistic 3D assets. |
The paper reviews various techniques, including diffusion models, neural radiance fields (NeRF), 3D Gaussian Splatting (3DGS), and large language models, highlighting their applications and limitations. |
Recent models achieve high-fidelity 3D generation with resolutions up to 8K, and some methods can generate models in seconds.
AI-powered human model generation leverages models like SMPL-X for realistic results, with advancements in both iterative and single-pass generation methods.
Human motion synthesis has seen progress in generating complex movements from text and interacting with objects, though challenges remain in achieving perfect realism. |
The lack of large, diverse 3D datasets compared to 2D image datasets limits the generalization capabilities of some models.
Precise control and realism in human-object interaction animations are still areas for improvement. |
aigc, generative ai, text-to-3d, 3d generation, metaverse |
2401.02473
Report |
VASE: Object-Centric Appearance and Shape Manipulation of Real Videos |
Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, Nicu Sebe |
Recently, several works tackled the video editing task fostered by the
success of large-scale text-to-image generative models. However, most of these
methods holistically edit the frame using the text, exploiting the prior given
by foundation diffusion models and focusing on improving the temporal
consistency across frames. In this work, we introduce a framework that is
object-centric and is designed to control both the object's appearance and,
notably, to execute precise and explicit structural modifications on the
object. We build our framework on a pre-trained image-conditioned diffusion
model, integrate layers to handle the temporal dimension, and propose training
strategies and architectural modifications to enable shape control. We evaluate
our method on the image-driven video editing task showing similar performance
to the state-of-the-art, and showcasing novel shape-editing capabilities.
Further details, code and examples are available on our project page:
https://helia95.github.io/vase-website/ |
Introduces VASE, a framework for object-centric video editing that enables both appearance and structural modifications to objects in real videos using a single keyframe. |
Existing video editing methods often lack the granularity for object-centric edits, struggle to capture precise nuances from text prompts, and rarely offer explicit control over object structure. |
Leverages a pre-trained image-conditioned diffusion model with temporal layers, a ControlNet for motion and structure guidance, a Joint Flow-Structure Augmentation pipeline, a Flow-Completion Network, and an Auxiliary Segmentation Head. |
Achieves high-quality appearance editing comparable to state-of-the-art methods.
Demonstrates precise and user-controlled shape editing capabilities.
Maintains temporal consistency while allowing for efficient editing without per-video training or complex video decomposition. |
Performance can be affected by strong occlusions or significant perspective changes.
Maintaining consistent edits in very long videos remains a challenge. |
video editing, diffusion models, object-centric, shape editing, appearance editing |
2401.02436
Report |
Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis |
Simon Niedermayr, Josef Stumpfegger, Rüdiger Westermann |
Recently, high-fidelity scene reconstruction with an optimized 3D Gaussian
splat representation has been introduced for novel view synthesis from sparse
image sets. Making such representations suitable for applications like network
streaming and rendering on low-power devices requires significantly reduced
memory consumption as well as improved rendering efficiency. We propose a
compressed 3D Gaussian splat representation that utilizes sensitivity-aware
vector clustering with quantization-aware training to compress directional
colors and Gaussian parameters. The learned codebooks have low bitrates and
achieve a compression rate of up to 31x on real-world scenes with only
minimal degradation of visual quality. We demonstrate that the compressed splat
representation can be efficiently rendered with hardware rasterization on
lightweight GPUs at up to 4x higher framerates than reported via an
optimized GPU compute pipeline. Extensive experiments across multiple datasets
demonstrate the robustness and rendering speed of the proposed approach. |
This paper introduces a novel compression and rendering pipeline for 3D Gaussian splat representations used in novel view synthesis, significantly reducing memory consumption and improving rendering efficiency. |
High-fidelity scene reconstruction methods often demand extensive memory, making them impractical for applications like network streaming and mobile rendering. This work addresses this limitation, enabling wider adoption of such representations. |
The pipeline employs sensitivity-aware vector clustering to compress directional colors and Gaussian parameters into compact codebooks. Quantization-aware training refines the scene at reduced bit-rates, and entropy encoding exploits spatial coherence for further compression. Rendering is optimized using GPU sorting and hardware rasterization. |
Achieves up to 31x compression on real-world scenes with minimal quality loss.
Demonstrates up to 4x faster rendering speeds compared to prior compute pipeline approaches.
Compressed scenes are suitable for low-power devices and network streaming applications. |
Aggressively compressing Gaussian positions without significant error remains a challenge.
Future work aims to reduce memory footprint during training and explore volumetric scene reconstruction. |
novel view synthesis, 3d gaussian splatting, scene compression, quantization-aware training, gpu rasterization |
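The codebook compression step summarized above (sensitivity-aware vector clustering of Gaussian parameters) can be approximated by a weighted k-means, where each parameter vector carries a sensitivity weight that pulls codebook entries toward the vectors that matter most for image quality. This is a minimal sketch under stated assumptions: the weights are given rather than computed from rendering gradients, the codebook is initialized from the first `k` vectors, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v: Sequence[float], codebook: List[List[float]]) -> int:
    """Index of the codebook entry closest to v."""
    return min(range(len(codebook)), key=lambda j: sq_dist(v, codebook[j]))


def weighted_kmeans(
    vectors: List[List[float]],
    weights: List[float],
    k: int,
    iters: int = 10,
) -> Tuple[List[List[float]], List[int]]:
    """Cluster vectors into k codebook entries using weighted means."""
    dim = len(vectors[0])
    # Simplification: seed the codebook with the first k vectors.
    codebook = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        assign = [nearest(v, codebook) for v in vectors]
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            total = sum(weights[i] for i in members)
            if total > 0:
                codebook[j] = [
                    sum(weights[i] * vectors[i][d] for i in members) / total
                    for d in range(dim)
                ]
    # Final indices are what would be stored per Gaussian instead of raw parameters.
    return codebook, [nearest(v, codebook) for v in vectors]


vectors = [[0.0, 0.0], [10.0, 10.0], [0.2, 0.0], [9.8, 10.0]]
weights = [1.0, 1.0, 3.0, 1.0]  # higher weight = more sensitive parameter vector
codebook, indices = weighted_kmeans(vectors, weights, k=2)
```

Each Gaussian then stores a small integer index into the codebook rather than its full parameter vector, which is where the memory savings come from.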
2401.02418
Report |
Learning to Prompt with Text Only Supervision for Vision-Language Models |
Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari |
Foundational vision-language models such as CLIP are becoming a new paradigm
in vision, due to their excellent generalization abilities. However, adapting
these models for downstream tasks while maintaining their generalization
remains a challenge. In literature, one branch of methods adapts CLIP by
learning prompts using visual information. While effective, most of these works
require labeled data which is not practical, and often struggle to generalize
towards new datasets due to over-fitting on the source data. An alternative
approach resorts to training-free methods by generating class descriptions from
large language models (LLMs) and perform prompt ensembling. However, these
methods often generate class specific prompts that cannot be transferred to
other classes, which incur higher costs by generating LLM descriptions for each
class separately. In this work, we propose to combine the strengths of both
streams of methods by learning prompts using only text data derived from
LLMs. As supervised training of prompts is not trivial due to the absence of
images, we develop a training approach that allows prompts to extract rich
contextual knowledge from LLM data. Moreover, with LLM contextual data mapped
within the learned prompts, it enables zero-shot transfer of prompts to new
classes and datasets potentially cutting the LLM prompt engineering cost. To
the best of our knowledge, this is the first work that learns generalized
prompts using text only data. We perform extensive evaluations on 4 benchmarks
where our method improves over prior ensembling works while being competitive
to those utilizing labeled images. Our code and pre-trained models are
available at https://github.com/muzairkhattak/ProText. |
ProText, a novel approach to adapt CLIP for downstream visual recognition tasks, leverages text-only supervision from Large Language Models (LLMs) to learn generalized and transferable prompts. |
Existing methods for adapting CLIP either rely on labeled image data, which can be impractical, or employ class-specific LLM prompts that lack transferability to new classes and datasets. ProText addresses both limitations. |
ProText curates text-to-text data from LLMs by pairing class-name templates with corresponding descriptions. It then trains learnable prompts to map these templates to rich contextual features aligned with LLM descriptions, effectively embedding LLM knowledge within the prompts. |
In cross-dataset transfer, ProText outperforms CLIP and CuPL by +2.1% on average, demonstrating its generalization ability without using any visual samples.
ProText surpasses prior image-supervised prompt learning methods in base-to-novel class generalization, achieving a higher average novel class accuracy of 76.98%.
ProText consistently outperforms CuPL and WaffleCLIP in text-only supervised setting across diverse image datasets, indicating its effectiveness in utilizing LLM data for prompt learning. |
The performance of ProText is dependent on the quality and size of LLM-generated text data, with potential for further improvement as text data quality increases.
Exploring alternative techniques for contextual mapping, beyond prompt learning, could be a potential direction for future work. |
prompt learning, clip, zero-shot learning, vision-language models, large language models |
2401.02416
Report |
ODIN: A Single Model for 2D and 3D Segmentation |
Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki |
State-of-the-art models on contemporary 3D segmentation benchmarks like
ScanNet consume and label dataset-provided 3D point clouds, obtained through
post processing of sensed multiview RGB-D images. They are typically trained
in-domain, forgo large-scale 2D pre-training and outperform alternatives that
featurize the posed RGB-D multiview images instead. The gap in performance
between methods that consume posed images versus post-processed 3D point clouds
has fueled the belief that 2D and 3D perception require distinct model
architectures. In this paper, we challenge this view and propose ODIN
(Omni-Dimensional INstance segmentation), a model that can segment and label
both 2D RGB images and 3D point clouds, using a transformer architecture that
alternates between 2D within-view and 3D cross-view information fusion. Our
model differentiates 2D and 3D feature operations through the positional
encodings of the tokens involved, which capture pixel coordinates for 2D patch
tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art
performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation
benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It
outperforms all previous works by a wide margin when the sensed 3D point cloud
is used in place of the point cloud sampled from 3D mesh. When used as the 3D
perception engine in an instructable embodied agent architecture, it sets a new
state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and
checkpoints can be found at the project website (https://odin-seg.github.io). |
This paper proposes ODIN, a novel model for 2D and 3D instance segmentation that effectively leverages pre-trained 2D backbones and operates directly on posed RGB-D images, achieving state-of-the-art performance in various benchmarks. |
Existing 3D segmentation models typically rely on pre-processed 3D point clouds, limiting their applicability to real-world scenarios where raw sensor data is prevalent. This work bridges the gap between 2D and 3D perception by unifying them into a single model that can directly process sensor data. |
ODIN alternates between 2D within-view fusion and 3D cross-view attention layers. It unprojects 2D features to 3D for cross-view contextualization and then projects them back to 2D. The model shares most of its parameters across 2D and 3D inputs, effectively leveraging pre-trained 2D backbones. |
ODIN sets new state-of-the-art performance on ScanNet200, Matterport3D, and AI2THOR 3D instance segmentation benchmarks, outperforming previous methods that use mesh-sampled point clouds.
The model also achieves competitive results on ScanNet and S3DIS benchmarks, demonstrating its effectiveness in handling real-world sensor data with misalignments.
When used as the 3D perception engine in an instructable embodied agent architecture, ODIN sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. |
ODIN's performance depends on the accuracy of depth and camera pose estimations.
Further research can explore scaling up 3D learning by jointly training on diverse 2D and 3D datasets for improved generalization. |
3d instance segmentation, 2d-3d perception, rgb-d processing, embodied vision, transformer networks |
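The unproject-then-reproject step in the method description relies on the standard pinhole camera model. The sketch below shows that geometry for a single pixel; the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative, and ODIN's actual feature lifting operates on whole feature maps rather than single points.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a sensed depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def cam_to_world(point, R, t):
    """Apply a camera pose (3x3 rotation R, translation t) to a camera-frame point."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )


# Round trip for one pixel with made-up intrinsics.
p_cam = unproject(100.0, 50.0, 3.0, 500.0, 500.0, 320.0, 240.0)
u, v = project(p_cam, 500.0, 500.0, 320.0, 240.0)
```

Because the 3D coordinates come directly from sensed depth and camera pose, errors in either one shift where a feature lands in 3D, which is consistent with the limitation noted above.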
2401.02414
Report |
Bring Metric Functions into Diffusion Models |
Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo |
We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising
Diffusion Probabilistic Model (DDPM) by effectively incorporating additional
metric functions in training. Metric functions such as the LPIPS loss have been
proven highly effective in consistency models derived from score matching.
However, for the diffusion counterparts, the methodology and efficacy of adding
extra metric functions remain unclear. One major challenge is the mismatch
between the noise predicted by a DDPM at each step and the desired clean image
that the metric function works well on. To address this problem, we propose
Cas-DM, a network architecture that cascades two network modules to effectively
apply metric functions to the diffusion model training. The first module,
similar to a standard DDPM, learns to predict the added noise and is unaffected
by the metric function. The second cascaded module learns to predict the clean
image, thereby facilitating the metric function computation. Experiment results
show that the proposed diffusion model backbone enables the effective use of
the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on
various established benchmarks. |
This paper introduces a Cascaded Diffusion Model (Cas-DM) that enhances Denoising Diffusion Probabilistic Models (DDPMs) by incorporating additional metric functions, such as the LPIPS loss, during training. |
Metric functions like LPIPS have shown significant improvements in consistency models, but their application to diffusion models remained unclear due to the challenge of aligning multi-step noise prediction with single-step metric computation. |
Cas-DM employs two cascaded network modules. The first module predicts added noise like a standard DDPM. The second module refines the clean image prediction, facilitating effective metric function computation. This design isolates the noise prediction from the metric function's influence. |
Cas-DM with LPIPS loss achieves state-of-the-art image quality (FID, sFID, IS) on various benchmarks.
The architecture consistently improves performance across datasets, demonstrating the effectiveness of incorporating metric functions in diffusion models.
The LPIPS loss enhances the diversity and distribution alignment of generated images, potentially due to its semantic awareness from the VGG backbone. |
Exploration of more effective metric functions beyond LPIPS.
Investigation into further architectural improvements for the clean image prediction module. |
diffusion models, generative models, image generation, lpips loss, metric learning |
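The mismatch between noise prediction and clean-image metrics can be made concrete with the standard DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The sketch below shows how a clean-image estimate follows from a noise estimate under that relation. It illustrates only the textbook identity, not the learned second module of Cas-DM, and the function names are assumptions.

```python
import math


def add_noise(x0, eps, alpha_bar):
    """Standard DDPM forward step: mix the clean signal with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def estimate_x0(xt, eps_pred, alpha_bar):
    """Invert the forward step given a (predicted) noise vector."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, eps_pred)]


x0 = [0.5, -0.25, 1.0]
eps = [0.1, -0.3, 0.7]
xt = add_noise(x0, eps, alpha_bar=0.6)
x0_hat = estimate_x0(xt, eps, alpha_bar=0.6)  # recovers x0 when eps is exact
```

A perceptual metric such as LPIPS is defined on images, so it would be applied to a clean-image estimate like `x0_hat` rather than to the noise prediction itself.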
2401.02402
Report |
3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation |
Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng |
3D panoptic segmentation is a challenging perception task, especially in
autonomous driving. It aims to predict both semantic and instance annotations
for 3D points in a scene. Although prior 3D panoptic segmentation approaches
have achieved great performance on closed-set benchmarks, generalizing these
approaches to unseen things and unseen stuff categories remains an open
problem. For unseen object categories, 2D open-vocabulary segmentation has
achieved promising results that solely rely on frozen CLIP backbones and
ensembling multiple classification outputs. However, we find that simply
extending these 2D models to 3D does not guarantee good performance due to poor
per-mask classification quality, especially for novel stuff categories. In this
paper, we propose the first method to tackle 3D open-vocabulary panoptic
segmentation. Our model takes advantage of the fusion between learnable LiDAR
features and dense frozen vision CLIP features, using a single classification
head to make predictions for both base and novel classes. To further improve
the classification performance on novel classes and leverage the CLIP model, we
propose two novel loss functions: object-level distillation loss and
voxel-level distillation loss. Our experiments on the nuScenes and
SemanticKITTI datasets show that our method outperforms the strong baseline by
a large margin. |
This paper proposes the first method for 3D open-vocabulary panoptic segmentation, aiming to segment both unseen "things" and unseen "stuff" objects in autonomous driving scenarios. |
Existing 3D panoptic segmentation models struggle to generalize to unseen object categories, limiting their real-world applicability in fields like autonomous driving. |
The method fuses learned LiDAR features with frozen CLIP vision features, utilizing a single classification head for base and novel classes. Two novel distillation losses, object-level and voxel-level, improve classification performance on novel classes by leveraging CLIP's capabilities. |
The method significantly outperforms the baseline on nuScenes and SemanticKITTI datasets.
The voxel-level distillation loss is particularly effective for novel "stuff" categories.
The fusion of LiDAR and CLIP features improves performance for novel "things" classes. |
The model is evaluated on benchmarks with a limited number of categories, necessitating larger datasets for comprehensive evaluation.
Future work could explore combining this method with approaches like RegionPLC for enhanced point-level discriminative features. |
autonomous driving, 3d panoptic segmentation, open vocabulary, vision-language, clip |
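A distillation loss of the kind described above can be sketched as a distance between learned per-voxel features and the frozen CLIP features they should imitate. The cosine-distance form used here is an assumption chosen for illustration; the record does not state the exact loss, and `distillation_loss` is a hypothetical name.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distillation_loss(student_feats, clip_feats):
    """Mean cosine distance over paired (student, frozen CLIP) feature vectors."""
    pairs = list(zip(student_feats, clip_feats))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)


student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]
loss = distillation_loss(student, teacher)  # first pair aligned, second orthogonal
```

The same helper could be applied at object level by first pooling features over each predicted mask, which is again an illustrative choice rather than a detail confirmed by the source.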
2401.02400
Report |
Learning the 3D Fauna of the Web |
Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, Jiajun Wu |
Learning 3D models of all animals on the Earth requires massively scaling up
existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an
approach that learns a pan-category deformable 3D animal model for more than
100 animal species jointly. One crucial bottleneck of modeling animals is the
limited availability of training data, which we overcome by simply learning
from 2D Internet images. We show that prior category-specific attempts fail to
generalize to rare species with limited training images. We address this
challenge by introducing the Semantic Bank of Skinned Models (SBSM), which
automatically discovers a small set of base animal shapes by combining
geometric inductive priors with semantic knowledge implicitly captured by an
off-the-shelf self-supervised feature extractor. To train such a model, we also
contribute a new large-scale dataset of diverse animal species. At inference
time, given a single image of any quadruped animal, our model reconstructs an
articulated 3D mesh in a feed-forward fashion within seconds. |
This paper presents 3D-Fauna, a method that learns a pan-category deformable 3D animal model for more than 100 quadruped animal species jointly from 2D internet images. |
Existing 3D animal reconstruction methods are limited to one or a few specific species. This work aims to achieve a more scalable solution by learning a single model for all animal species from readily available internet images. |
The paper proposes the Semantic Bank of Skinned Models (SBSM), which learns a low-dimensional base shape bank using unsupervised image features and interpolates between them to model diverse shapes. It also introduces a mask discriminator to prevent viewpoint collapse. |
The method reconstructs accurate articulated 3D animal shapes from single images across diverse species.
Quantitative evaluations show significant improvements over existing single-category methods on keypoint transfer tasks.
The Semantic Bank is shown to be crucial in preventing overfitting and capturing inter-species shape similarities. |
The method is currently limited to quadruped animals with similar skeletal structures.
Reconstructing accurate shapes for highly deformable animals, such as cats, remains challenging. |
3d reconstruction, animal modeling, deformable models, single-view reconstruction, unsupervised learning |
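The idea of interpolating between a small set of base shapes can be sketched as a soft lookup: an image feature is compared against learned keys, and the resulting softmax weights blend the corresponding shape embeddings. The key/value layout and the name `query_bank` are assumptions made for this sketch, not the confirmed SBSM design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_bank(feature, keys, values):
    """Blend base-shape embeddings (values) by feature-to-key similarity."""
    scores = [sum(f * k for f, k in zip(feature, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    code = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return code, weights


keys = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical learned keys
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # hypothetical base-shape embeddings
code, weights = query_bank([10.0, 0.0], keys, values)
```

Because every species shares the same few bank entries, a rare species with few images can still reuse shape knowledge learned from visually similar, better-represented ones.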
2401.02361
Report |
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection |
Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang |
Grounding-DINO is a state-of-the-art open-set detection model that tackles
multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase
Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness
has led to its widespread adoption as a mainstream architecture for various
downstream applications. However, despite its significance, the original
Grounding-DINO model lacks comprehensive public technical details due to the
unavailability of its training code. To bridge this gap, we present
MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline,
which is built with the MMDetection toolbox. It adopts abundant vision datasets
for pre-training and various detection and grounding datasets for fine-tuning.
We give a comprehensive analysis of each reported result and detailed settings
for reproduction. The extensive experiments on the benchmarks mentioned
demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny
baseline. We release all our models to the research community. Codes and
trained models are released at
https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino. |
This paper introduces MM-Grounding-DINO, an open-source and comprehensive pipeline for open-vocabulary object detection, phrase grounding, and referring expression comprehension built on the MMDetection toolbox. |
Grounding-DINO, while achieving state-of-the-art performance in these tasks, lacks publicly available training code, limiting reproducibility and further research. This work aims to fill this gap. |
The authors rebuilt Grounding-DINO using MMDetection, retaining the core architecture while adding a bias initialization to the contrastive embedding module. They pre-trained the model on a large dataset comprising COCO, Objects365, GRIT, V3Det, and referring expression datasets. |
MM-Grounding-DINO-Tiny achieves superior zero-shot performance compared to Grounding-DINO-Tiny on COCO (50.6 mAP), LVIS (41.4 mAP), ODinW benchmarks, and comparable results on RefCOCO, gRefCOCO.
The paper provides an extensive benchmark of results on OVD, PG, and REC tasks using a variety of datasets, offering a valuable resource for future research.
Fine-tuning experiments demonstrate MM-Grounding-DINO's strong generalizability across various downstream tasks like object detection in haze, underwater, and in paintings. |
The paper identifies limitations in the GRIT dataset, used as a substitute for the unavailable Cap4M, particularly the presence of noisy annotations and abstract phrases.
Future work could explore more robust evaluation metrics for REC tasks and address the model's limitations in understanding relational terms and detailed descriptions. |
open-vocabulary detection, phrase grounding, referring expression comprehension, mmdetection, zero-shot learning |
2401.02347
Report |
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training |
Longtian Qiu, Shan Ning, Xuming He |
Image captioning aims at generating descriptive and meaningful textual
descriptions of images, enabling a broad range of vision-language applications.
Prior works have demonstrated that harnessing the power of Contrastive Image
Language Pre-training (CLIP) offers a promising approach to achieving zero-shot
captioning, eliminating the need for expensive caption annotations. However,
the widely observed modality gap in the latent space of CLIP harms the
performance of zero-shot captioning by breaking the alignment between paired
image-text features. To address this issue, we conduct an analysis on the CLIP
latent space which leads to two findings. Firstly, we observe that CLIP's
visual features of image subregions can achieve closer proximity to the paired
caption due to the inherent information loss in text descriptions. In addition,
we show that the modality gap between a paired image and text can be empirically
modeled as a zero-mean Gaussian distribution. Motivated by the findings, we
propose a novel zero-shot image captioning framework with text-only training to
reduce the modality gap. In particular, we introduce a subregion feature
aggregation to leverage local region information, which produces a compact
visual representation for matching text representation. Moreover, we
incorporate a noise injection and CLIP reranking strategy to boost captioning
performance. We also extend our framework to build a zero-shot VQA pipeline,
demonstrating its generality. Through extensive experiments on common
captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that
our method achieves remarkable performance improvements. Code is available at
https://github.com/Artanic30/MacCap. |
This paper proposes MacCap, a novel zero-shot image captioning framework that reduces the modality gap in CLIP's latent space by leveraging subregion image features and text-only training with noise injection and CLIP reranking. |
Zero-shot image captioning with text-only training eliminates the need for expensive caption annotations and enables efficient development of vision-language applications, particularly for LLMs. |
MacCap analyzes CLIP's latent space, revealing closer proximity of subregion image features to paired captions and a Gaussian distribution for the modality gap. It introduces region noise injection during training, subregion feature aggregation during inference, and a multiple sampling and filtering strategy with CLIP reranking. |
MacCap outperforms previous zero-shot captioning methods in cross-domain and in-domain settings.
Subregion feature aggregation effectively reduces the modality gap in CLIP.
Noise injection and CLIP reranking further improve captioning quality. |
The impact of sampling and filtering on semantic comprehension is limited.
Further exploration is needed to apply MacCap to other vision-language tasks. |
image captioning, zero-shot learning, clip, modality gap, vision-language models |
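Since the modality gap is modeled as zero-mean Gaussian, noise injection during text-only training can be sketched as perturbing a text feature with Gaussian noise so that it resembles an image-side feature. The re-normalization to unit length and the name `inject_noise` are assumptions for this sketch; the record does not specify them.

```python
import math
import random


def inject_noise(text_feature, sigma, rng):
    """Add zero-mean Gaussian noise to a feature, then L2-normalize it."""
    noisy = [x + rng.gauss(0.0, sigma) for x in text_feature]
    norm = math.sqrt(sum(x * x for x in noisy))
    return [x / norm for x in noisy]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_feature = inject_noise([1.0, 0.0, 0.0], sigma=0.1, rng=rng)
```

A decoder trained on such perturbed text features is less sensitive to the offset between text and image embeddings, which is the motivation stated in the method description.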
2401.02330
Report |
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model |
Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang |
In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient
multi-modal assistant that harnesses the power of the recently advanced small
language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a
notable advancement in the realm of compact multi-modal models. It demonstrates
that even smaller language models, with as few as 2.7B parameters, can
effectively engage in intricate dialogues that integrate both textual and
visual elements, provided they are trained with high-quality corpora. Our model
delivers commendable performance on publicly available benchmarks that
encompass visual comprehension, reasoning, and knowledge-based perception.
Beyond its remarkable performance in multi-modal dialogue tasks, our model
opens new avenues for applications in time-sensitive environments and systems
that require real-time interaction, such as embodied agents. It highlights the
potential of smaller language models to achieve sophisticated levels of
understanding and interaction, while maintaining greater resource
efficiency. The project is available at https://github.com/zhuyiche/llava-phi. |
The paper introduces LLaVA-Phi, an efficient multi-modal assistant that leverages the compact language model Phi-2 for multi-modal dialogues, demonstrating the capabilities of smaller language models in visual-language tasks. |
This work addresses the limitations of large vision-language models, such as high computational costs, by exploring the effectiveness of smaller, more efficient models for real-time applications on edge devices. |
LLaVA-Phi is built by fine-tuning Phi-2 with a high-quality dataset and then training it with the LLaVA pipeline, which includes pre-training and visual instruction tuning. |
LLaVA-Phi achieves performance comparable to or surpassing larger multi-modal models on various benchmarks, including VQA-v2, VizWizQA, and ScienceQA.
It demonstrates strong generalization ability in handling complex questions, generating code from visual input, and solving mathematical problems.
The model outperforms other efficient vision-language models like MobileVLM on multiple benchmarks. |
The current architecture of LLaVA-Phi is limited to English instructions due to the codegen-mono tokenizer used by Phi-2.
Future work will focus on exploring the impact of visual encoder size, refining training strategies (e.g., direct preference optimization, RLHF), and further reducing model size while maintaining or improving performance. |
multi-modal learning, vision-language models, small language models, efficient ai, real-time applications |
2401.02317
Report |
BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model |
Yiran Song, Qianyu Zhou, Xiangtai Li, Deng-Ping Fan, Xuequan Lu, Lizhuang Ma |
In this paper, we address the challenge of image resolution variation for the
Segment Anything Model (SAM). SAM, known for its zero-shot generalizability,
exhibits a performance degradation when faced with datasets with varying image
sizes. Previous approaches tend to resize the image to a fixed size or adopt
structure modifications, hindering the preservation of SAM's rich prior
knowledge. Besides, such task-specific tuning necessitates a complete
retraining of the model, which is costly and unacceptable for
deployment in the downstream tasks. In this paper, we reformulate this issue as
a length extrapolation problem, where token sequence length varies while
maintaining a consistent patch size for images of different sizes. To this end,
we propose Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's
adaptability to varying image resolutions while eliminating the need for
structure modifications. Firstly, we introduce a new scaling factor to ensure
consistent magnitude in the attention layer's dot product values when the token
sequence length changes. Secondly, we present a bias-mode attention mask that
allows each token to prioritize neighboring information, mitigating the impact
of untrained distant information. Our BA-SAM demonstrates efficacy in two
scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets,
including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to
significantly mitigate performance degradation in the zero-shot setting and
achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we
propose a generalized model and benchmark, showcasing BA-SAM's generalizability
across all four datasets simultaneously. Code is available at
https://github.com/zongzi13545329/BA-SAM |
This paper proposes BA-SAM, a Scalable Bias-Mode Attention Mask, to enhance the Segment Anything Model's (SAM) adaptability to varying image resolutions without structural modifications. |
SAM, despite its zero-shot generalizability, suffers performance degradation with datasets of varying image sizes, limiting its application in downstream tasks. |
The paper introduces: 1) a new scaling factor to maintain consistent magnitude in the attention layer's dot product values across varying token sequence lengths and 2) a bias-mode attention mask to prioritize neighboring information for each token, mitigating the impact of untrained distant information. |
BA-SAM significantly mitigates performance degradation in zero-shot settings when inferring on higher resolutions.
BA-SAM achieves state-of-the-art accuracy on various segmentation tasks with minimal fine-tuning.
A proposed generalized BA-SAM model demonstrates strong generalizability across four datasets simultaneously. |
The paper primarily focuses on enhancing SAM's performance on resolution variations, with other factors potentially influencing its performance.
Future work could investigate extending BA-SAM to other vision transformer architectures beyond SAM. |
segment anything model, resolution variation, attention mechanism, zero-shot learning, computer vision |
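The effect of a bias-mode attention mask can be sketched by subtracting a distance-dependent penalty from the attention logits before the softmax, so that nearby tokens receive more weight. The linear penalty `slope * |i - j|` over 1D token indices is an assumption for illustration; the paper's exact bias and its new scaling factor are not given in this record, and SAM's tokens actually lie on a 2D grid.

```python
import math


def biased_attention_weights(logits, slope):
    """Row-wise softmax of logits minus a penalty that grows with token distance."""
    n = len(logits)
    weights = []
    for i in range(n):
        row = [logits[i][j] - slope * abs(i - j) for j in range(n)]
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With identical logits, the bias alone makes attention decay with distance.
uniform = [[0.0] * 4 for _ in range(4)]
w = biased_attention_weights(uniform, slope=1.0)
```

Because the penalty depends only on relative distance, it extends to sequence lengths not seen during training, which is the length-extrapolation framing used in the abstract.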
2401.02281
Report |
PEGASUS: Physically Enhanced Gaussian Splatting Simulation System for 6DOF Object Pose Dataset Generation |
Lukas Meyer, Floris Erich, Yusuke Yoshiyasu, Marc Stamminger, Noriaki Ando, Yukiyasu Domae |
We introduce Physically Enhanced Gaussian Splatting Simulation System
(PEGASUS) for 6DOF object pose dataset generation, a versatile dataset
generator based on 3D Gaussian Splatting. Environment and object
representations can be easily obtained using commodity cameras to reconstruct
with Gaussian Splatting. PEGASUS allows the composition of new scenes by
merging the respective underlying Gaussian Splatting point cloud of an
environment with one or multiple objects. Leveraging a physics engine enables
the simulation of natural object placement within a scene through interaction
between meshes extracted for the objects and the environment. Consequently, an
extensive amount of new scenes - static or dynamic - can be created by
combining different environments and objects. By rendering scenes from various
perspectives, diverse data points such as RGB images, depth maps, semantic
masks, and 6DoF object poses can be extracted. Our study demonstrates that
training on data generated by PEGASUS enables pose estimation networks to
successfully transfer from synthetic data to real-world data. Moreover, we
introduce the Ramen dataset, comprising 30 Japanese cup noodle items. This
dataset includes spherical scans that capture images from both object
hemispheres and the Gaussian Splatting reconstructions, making them compatible
with PEGASUS. |
Introduces PEGASUS, a dataset generation tool for 6DoF object pose estimation that uses 3D Gaussian Splatting and physics simulations to create photorealistic scenes with accurate object poses. |
Addresses the need for domain-specific datasets for robotics in service sectors, particularly for tasks like object pose estimation in convenience stores, where existing datasets are limited. |
Combines 3D Gaussian Splatting reconstructions of environments and objects. Uses a physics engine (PyBullet) to realistically place objects within the environment. Renders novel views of the scene, extracting RGB images, depth maps, segmentation masks, and object poses. |
Training on PEGASUS-generated data enables pose estimation networks (specifically DOPE) to successfully transfer from synthetic to real-world data.
Introduces the 'Ramen' dataset, containing 30 Japanese cup noodle products with spherical scans and 3D Gaussian Splatting reconstructions.
Demonstrates the effectiveness of PEGASUS by successfully training DOPE for a real-world grasping task using a UR5 robot. |
Lacks realistic shadow rendering; incorporating shadow maps or screen space ambient occlusion is planned.
Scanning texture-less environments can lead to noisy Gaussian Splatting reconstructions, causing visual artifacts. |
dataset generation, robotics, radiance fields, sim2real, gaussian splatting |
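Scene composition by merging point clouds can be sketched as applying each object's pose to its points and appending them to the environment, while tracking which object each point belongs to. In PEGASUS the pose would come from the physics engine; here it is passed in by hand, only a yaw rotation is supported, and only point centers are handled (a full Gaussian also carries orientation, scale, and color terms). All names are illustrative.

```python
import math


def place_object(points, yaw, translation):
    """Rotate points about the z-axis by yaw (radians), then translate them."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]


def compose_scene(env_points, objects):
    """Merge an environment with posed objects.

    objects is a list of (points, yaw, translation) tuples.
    Returns the merged points and a per-point label (0 = environment).
    """
    scene = list(env_points)
    labels = [0] * len(env_points)
    for obj_id, (points, yaw, translation) in enumerate(objects, start=1):
        placed = place_object(points, yaw, translation)
        scene.extend(placed)
        labels.extend([obj_id] * len(placed))
    return scene, labels


env = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cup = [(1.0, 0.0, 0.0)]
scene, labels = compose_scene(env, [(cup, math.pi / 2, (0.0, 0.0, 1.0))])
```

Keeping a label per point is what lets semantic masks and object poses be read off after rendering, since the pose applied to each object is known exactly.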
2401.02142
Report |
GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation |
Xuehao Gao, Yang Yang, Zhenyu Xie, Shaoyi Du, Zhongqian Sun, Yang Wu |
In this paper, we propose a novel cascaded diffusion-based generative
framework for text-driven human motion synthesis, which exploits a strategy
named GradUally Enriching SyntheSis (GUESS as its abbreviation). The strategy
sets up generation objectives by grouping body joints of detailed skeletons in
close semantic proximity together and then replacing each such joint group
with a single body-part node. Such an operation recursively abstracts a human
pose to coarser and coarser skeletons at multiple granularity levels. As the
abstraction level gradually increases, human motion becomes more and more
concise and stable, significantly benefiting the cross-modal motion synthesis
task. The whole text-driven human motion synthesis problem is then divided into
multiple abstraction levels and solved with a multi-stage generation framework
with a cascaded latent diffusion model: an initial generator first generates
the coarsest human motion guess from a given text description; then, a series
of successive generators gradually enrich the motion details based on the
textual description and the previous synthesized results. Notably, we further
integrate GUESS with the proposed dynamic multi-condition fusion mechanism to
dynamically balance the cooperative effects of the given textual condition and
synthesized coarse motion prompt in different generation stages. Extensive
experiments on large-scale datasets verify that GUESS outperforms existing
state-of-the-art methods by large margins in terms of accuracy, realism,
and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS. |
This paper presents GUESS (Gradually Enriching Synthesis), a novel cascaded diffusion-based framework for text-driven human motion generation. |
Existing methods struggle with the large discrepancy between text and motion modalities. GUESS addresses this by mimicking the human brain's coarse-to-fine imagination process, progressively generating motion from abstract body part levels to detailed skeletons. |
GUESS uses multi-scale skeletal representation to abstract human poses. It employs a variational autoencoder for motion embedding and a cascaded latent diffusion model for generating motion, guided by text descriptions and coarser motion guesses. It also introduces a dynamic multi-condition fusion mechanism to adaptively balance text and motion cues during generation. |
GUESS significantly outperforms state-of-the-art methods on HumanML3D and KIT-ML datasets in terms of fidelity, text-motion consistency, and diversity.
The proposed multi-scale and cascaded generation significantly reduces body-joint jittering and improves motion trajectory adherence to text descriptions.
Dynamic multi-condition fusion effectively balances text and motion cues, leading to better generation quality. |
The current multi-stage scheme uses a fixed number of inference stages, which could be made adaptive to different text inputs.
Motion guess can be extended from spatial to temporal dimensions for generating sequences with increasing temporal resolution. |
human motion synthesis, text-driven generation, cascaded diffusion model, multi-scale representation, dynamic multi-condition fusion |
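The multi-scale skeletal abstraction described in the GUESS report above can be illustrated with a minimal sketch. It assumes that each body-part node is the mean of the joints it replaces; the joint grouping, the averaging rule, and the name `abstract_pose` are hypothetical and are not taken from the GUESS code.

```python
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, float, float]


def abstract_pose(joints: Sequence[Joint], groups: Dict[str, List[int]]) -> Dict[str, Joint]:
    """Collapse each group of joint indices into one body-part node.

    Averaging is an assumption made for illustration; the paper may
    use a different pooling rule.
    """
    coarse = {}
    for part, indices in groups.items():
        n = len(indices)
        # Mean position of the grouped joints along x, y, and z.
        coarse[part] = tuple(
            sum(joints[i][axis] for i in indices) / n for axis in range(3)
        )
    return coarse


# Hypothetical 4-joint pose: two arm joints and two leg joints.
pose = [(0.0, 1.0, 0.0), (2.0, 1.0, 0.0), (0.0, -1.0, 0.0), (0.0, -3.0, 0.0)]
groups = {"arm": [0, 1], "leg": [2, 3]}
coarse_pose = abstract_pose(pose, groups)
```

Applying the same function again to the coarse nodes, with a coarser grouping, would give the next abstraction level.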
2401.02126
Report |
Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance |
Jiacheng Wang, Ping Liu, Wei Xu |
Existing text-to-image editing methods tend to excel either in rigid or
non-rigid editing but encounter challenges when combining both, resulting in
misaligned outputs with the provided text prompts. In addition, integrating
reference images for control remains challenging. To address these issues, we
present a versatile image editing framework capable of executing both rigid and
non-rigid edits, guided by either textual prompts or reference images. We
leverage a dual-path injection scheme to handle diverse editing scenarios and
introduce an integrated self-attention mechanism for fusion of appearance and
structural information. To mitigate potential visual artifacts, we further
employ latent fusion techniques to adjust intermediate latents. Compared to
previous work, our approach represents a significant advance in achieving
precise and versatile image editing. Comprehensive experiments validate the
efficacy of our method, showcasing competitive or superior results in
text-based editing and appearance transfer tasks, encompassing both rigid and
non-rigid settings. |
This paper proposes a versatile image editing framework capable of handling both rigid (e.g., color change) and non-rigid (e.g., shape change) edits, guided by either text prompts or reference images. |
Existing text-to-image editing methods struggle to effectively perform both rigid and non-rigid edits simultaneously, often resulting in misaligned outputs or limited control. This new framework addresses these limitations and enables more versatile and precise image editing. |
The framework leverages a dual-path injection scheme to handle different editing scenarios and introduces a unified self-attention mechanism to fuse appearance and structural information. Additionally, latent fusion techniques are employed to refine intermediate representations and mitigate visual artifacts. |
The method achieves competitive or superior results compared to existing text-based editing methods, demonstrating improved alignment with target prompts and better handling of both rigid and non-rigid edits.
It outperforms state-of-the-art appearance transfer methods, exhibiting superior preservation of both structural and appearance details.
Ablation studies confirm the effectiveness of the proposed dual-path injection scheme, unified self-attention mechanism, and latent fusion techniques. |
The method relies on pre-trained Stable Diffusion models, which might limit its generalizability to unseen domains or concepts.
Further exploration is needed to improve fine-grained control over the degree of rigid and non-rigid transformations. |
image editing, text-guided image manipulation, appearance transfer, diffusion models, self-attention |
2401.02032
Report |
DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection |
Yunfan Ye, Kai Xu, Yuhang Huang, Renjiao Yi, Zhiping Cai |
Limited by the encoder-decoder architecture, learning-based edge detectors
usually have difficulty predicting edge maps that satisfy both correctness and
crispness. With the recent success of the diffusion probabilistic model (DPM),
we found it is especially suitable for accurate and crisp edge detection since
the denoising process is directly applied to the original image size.
Therefore, we propose the first diffusion model for the task of general edge
detection, which we call DiffusionEdge. To avoid expensive computational
resources while retaining the final performance, we apply DPM in the latent
space and enable the classic cross-entropy loss, which is uncertainty-aware at
the pixel level, to directly optimize the parameters in latent space in a
distillation manner. We also adopt a decoupled architecture to speed up the
denoising process and propose a corresponding adaptive Fourier filter to adjust
the latent features of specific frequencies. With all the technical designs,
DiffusionEdge can be stably trained with limited resources, predicting crisp
and accurate edge maps with far fewer augmentation strategies. Extensive
experiments on four edge detection benchmarks demonstrate the superiority of
DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset,
compared to the second best, we increase the ODS, OIS (without post-processing)
and AC by 30.2%, 28.1% and 65.1%, respectively. Code:
https://github.com/GuHuangAI/DiffusionEdge. |
This paper proposes DiffusionEdge, the first diffusion model for edge detection, which generates accurate and crisp edge maps without post-processing. |
Existing learning-based edge detectors struggle to achieve both correctness and crispness simultaneously, often relying on post-processing steps. |
DiffusionEdge utilizes a decoupled diffusion architecture in latent space, employing an adaptive Fourier filter for frequency parsing and uncertainty distillation to maintain pixel-level uncertainty information from annotations. |
DiffusionEdge achieves state-of-the-art performance on BSDS, NYUDv2, Multicue, and BIPED datasets.
The model significantly outperforms other methods in crispness, as demonstrated by the Average Crispness (AC) metric.
DiffusionEdge generates high-quality edge maps with minimal noise, even in challenging scenarios with complex backgrounds and textures. |
The inference speed of DiffusionEdge can be further improved.
Exploring the application of DiffusionEdge to downstream tasks in an end-to-end manner is a promising direction. |
edge detection, diffusion model, computer vision, deep learning, image processing |
2401.02015
Report |
Improving Diffusion-Based Image Synthesis with Context Prediction |
Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui |
Diffusion models are a new class of generative models, and have dramatically
promoted image generation with unprecedented quality and diversity. Existing
diffusion models mainly try to reconstruct the input image from a corrupted one
with a pixel-wise or feature-wise constraint along spatial axes. However, such
point-based reconstruction may fail to make each predicted pixel/feature fully
preserve its neighborhood context, impairing diffusion-based image synthesis.
As a powerful source of automatic supervisory signal, context has been well
studied for learning representations. Inspired by this, we for the first time
propose ConPreDiff to improve diffusion-based image synthesis with context
prediction. We explicitly reinforce each point to predict its neighborhood
context (i.e., multi-stride features/tokens/pixels) with a context decoder at
the end of diffusion denoising blocks in training stage, and remove the decoder
for inference. In this way, each point can better reconstruct itself by
preserving its semantic connections with neighborhood context. This new
paradigm of ConPreDiff can generalize to arbitrary discrete and continuous
diffusion backbones without introducing extra parameters in sampling procedure.
Extensive experiments are conducted on unconditional image generation,
text-to-image generation and image inpainting tasks. Our ConPreDiff
consistently outperforms previous methods and achieves new SOTA text-to-image
generation results on MS-COCO, with a zero-shot FID score of 6.21. |
This paper proposes ConPreDiff, a novel method that enhances diffusion-based image synthesis by explicitly predicting neighborhood context. |
Existing diffusion models primarily focus on point-wise reconstruction, neglecting the preservation of local context, which is crucial for generating high-fidelity images. |
ConPreDiff introduces a context decoder to predict neighborhood distributions during training. This decoder is removed during inference, ensuring efficiency. An optimal transport loss based on the Wasserstein distance is employed to optimize the context prediction. |
ConPreDiff achieves state-of-the-art results on text-to-image generation benchmarks, surpassing previous diffusion and non-diffusion models.
The method significantly improves image inpainting performance across various mask distributions.
ConPreDiff consistently enhances unconditional image synthesis, demonstrating superior perceptual quality and data distribution coverage. |
Despite not adding inference parameters, ConPreDiff models have more trainable parameters than GANs.
ConPreDiff inherits the long sampling times of diffusion models compared to single-step generative approaches. |
diffusion models, image generation, context prediction, wasserstein distance, text-to-image synthesis |
2401.01970
Report |
FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding |
Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li |
Precisely perceiving the geometric and semantic properties of real-world 3D
objects is crucial for the continued evolution of augmented reality and robotic
applications. To this end, we present Foundation Model Embedded Gaussian
Splatting (FMGS), which incorporates vision-language embeddings of foundation
models into 3D Gaussian Splatting (GS). The key contribution of this work is an
efficient method to reconstruct and represent 3D vision-language models. This
is achieved by distilling feature maps generated from image-based foundation
models into those rendered from our 3D model. To ensure high-quality rendering
and fast training, we introduce a novel scene representation by integrating
strengths from both GS and multi-resolution hash encodings (MHE). Our effective
training procedure also introduces a pixel alignment loss that makes the
rendered feature distance of the same semantic entities close, following the
pixel-level semantic boundaries. Our results demonstrate remarkable multi-view
semantic consistency, facilitating diverse downstream tasks, beating
state-of-the-art methods by 10.2 percent on open-vocabulary language-based
object detection, while being 851X faster at inference. This research
explores the intersection of vision, language, and 3D scene representation,
paving the way for enhanced scene understanding in uncontrolled real-world
environments. We plan to release the code on the project page. |
FMGS embeds vision-language information from foundation models, such as CLIP, into a 3D Gaussian Splatting (GS) based scene representation for holistic 3D scene understanding. |
Existing 3D scene understanding methods are limited to either geometric understanding or closed-set object detection. This work explores open-vocabulary 3D scene understanding by leveraging the success of vision-language foundation models. |
The method distills CLIP and DINO embeddings into a multi-resolution hash encoding (MHE) field built upon 3D Gaussians generated by GS. A novel training procedure using a hybrid CLIP feature map and a pixel alignment loss ensures multi-view consistency and spatial accuracy. |
Achieves state-of-the-art performance on open-vocabulary 3D object detection, surpassing previous methods by a significant margin.
Demonstrates strong performance on semantic segmentation tasks, highlighting the quality of the learned feature embedding.
Exhibits superior inference speed compared to NeRF-based methods, enabling real-time open-vocabulary queries. |
Reliance on high-quality calibrated input images, limiting applicability in uncontrolled settings.
Performance is limited by the quality of the base foundation models used for training. |
3d gaussian splatting, vision-language embeddings, foundation models, open-vocabulary scene understanding, semantic segmentation |
2401.01952
Report |
Instruct-Imagen: Image Generation with Multi-modal Instruction |
Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia |
This paper presents instruct-imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce *multi-modal
instruction* for image generation, a task representation articulating a range
of generation intents with precision. It uses natural language to amalgamate
disparate modalities (e.g., text, edge, style, subject, etc.), such that
abundant generation intents can be standardized in a uniform format.
We then build instruct-imagen by fine-tuning a pre-trained text-to-image
diffusion model with a two-stage framework. First, we adapt the model using
retrieval-augmented training to enhance the model's ability to ground its
generation on external multimodal context. Subsequently, we fine-tune the
adapted model on diverse image generation tasks that require vision-language
understanding (e.g., subject-driven generation, etc.), each paired with a
multi-modal instruction encapsulating the task's essence. Human evaluation on
various image generation datasets reveals that instruct-imagen matches or
surpasses prior task-specific models in-domain and demonstrates promising
generalization to unseen and more complex tasks. |
Introduces Instruct-Imagen, an image generation model that leverages multi-modal instructions to perform various visual generation tasks, generalizing to unseen and complex tasks. |
Addresses the limitations of existing image generation models that often specialize in specific modalities and struggle with complex, multi-modal instructions. |
Employs a two-stage training approach: 1) Retrieval-augmented training to enhance multi-modal context processing. 2) Multi-modal instruction-tuning on diverse image generation tasks paired with multi-modal instructions. |
Achieves comparable or superior performance to task-specific models in in-domain evaluation.
Demonstrates strong generalization ability in zero-shot settings, effectively handling unseen and complex multi-modal instructions.
Outperforms baselines in instruction following and output quality, highlighting the importance of multi-modal instruction tuning. |
Limited ability to handle image editing tasks in a zero-shot manner due to challenges in pixel-level consistency.
Reliance on a cascaded diffusion model hinders access to high-resolution input details, leading to artifacts in generated images. |
image generation, multi-modal learning, instruction tuning, zero-shot learning, diffusion models |
2401.01862
Report |
A Vision Check-up for Language Models |
Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba |
What does learning to model relationships between strings teach large
language models (LLMs) about the visual world? We systematically evaluate LLMs'
abilities to generate and recognize an assortment of visual concepts of
increasing complexity and then demonstrate how a preliminary visual
representation learning system can be trained using models of text. As language
models lack the ability to consume or output visual information as pixels, we
use code to represent images in our study. Although LLM-generated images do not
look like natural images, results on image generation and the ability of models
to correct these generated images indicate that precise modeling of strings can
teach language models about numerous aspects of the visual world. Furthermore,
experiments on self-supervised visual representation learning, utilizing images
generated with text models, highlight the potential to train vision models
capable of making semantic assessments of natural images using just LLMs. |
This paper investigates the visual knowledge acquired by Large Language Models (LLMs) through learning relationships between strings, particularly in generating and recognizing visual concepts using code as a proxy for images. |
The work is important because it explores the potential of leveraging LLMs, trained solely on text data, to understand and represent the visual world, potentially opening new avenues for vision-related tasks. |
The authors introduce a hierarchical dataset of visual concepts and evaluate LLMs on three tasks: (1) Generating code that renders visual concepts. (2) Recognizing visual concepts from code. (3) Improving generated code through text-based self-feedback. Additionally, they investigate if LLM-generated images can be used for training a vision system for natural images. |
LLMs can generate code representing complex visual scenes, but struggle with details like texture and object interactions.
LLMs struggle to recognize human-drawn images represented as code, indicating limitations in spatial reasoning and generalization beyond memorized prototypes.
Images generated by LLMs can be used to train a vision system for natural images, achieving state-of-the-art performance when combined with datasets that offer textural diversity. |
The study relies on code as an intermediary representation for images, which may not fully encapsulate the richness of the visual world.
Future work can explore using larger and more diverse code datasets, as well as more complex feedback mechanisms to further improve LLMs' visual understanding. |
large language models, visual knowledge, image generation, code representation, self-supervised learning |
2401.01827
Report |
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions |
David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo |
Most existing video diffusion models (VDMs) are limited to mere text
conditions. As a result, they usually lack control over the visual appearance
and geometric structure of the generated videos. This work presents Moonshot, a
new video generation model that conditions simultaneously on multimodal inputs
of image and text. The model builds upon a core module, called multimodal video
block (MVB), which consists of conventional spatial-temporal layers for
representing video features, and a decoupled cross-attention layer to address
image and text inputs for appearance conditioning. In addition, we carefully
design the model architecture such that it can optionally integrate with
pre-trained image ControlNet modules for geometry visual conditions, without
extra training overhead, as opposed to prior methods. Experiments
show that with versatile multimodal conditioning mechanisms, Moonshot
demonstrates significant improvement on visual quality and temporal consistency
compared to existing models. In addition, the model can be easily repurposed
for a variety of generative applications, such as personalized video
generation, image animation and video editing, unveiling its potential to serve
as a fundamental architecture for controllable video generation. Models will be
made public on https://github.com/salesforce/LAVIS. |
This paper introduces Moonshot, a novel video generation model that leverages both image and text inputs for enhanced control over video generation. |
Existing video diffusion models (VDMs) often lack control over visual appearance and geometric structure, relying primarily on text prompts which are insufficient for detailed visual descriptions. |
The paper proposes a multimodal video block (MVB) incorporating decoupled cross-attention layers to simultaneously process image and text conditions. This design allows integration with pre-trained image ControlNet modules for geometric control without extra training. |
Moonshot demonstrates superior performance in subject-customized video generation, outperforming text-only models and achieving strong zero-shot customization.
The model excels in image animation, exhibiting better identity preservation, temporal consistency, and text alignment compared to existing methods.
Moonshot shows promising results in video editing, effectively replacing subjects and incorporating text-guided elements while maintaining high temporal consistency. |
The authors acknowledge the potential for generating harmful content and plan to implement safety measures like NSFW detectors before release.
Future work includes exploring additional applications and refining the model for improved performance on complex video generation tasks. |
video generation, diffusion models, multimodal conditioning, controlnet, image animation |
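The decoupled cross-attention layer mentioned in the Moonshot report above can be sketched as two separate attention passes, one over text tokens and one over image tokens, whose outputs are summed. This is a minimal single-query illustration under that assumption; the function names and the `image_scale` weight are hypothetical and do not come from the released model.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def decoupled_cross_attention(
    query: Vector,
    text_kv: Tuple[List[Vector], List[Vector]],
    image_kv: Tuple[List[Vector], List[Vector]],
    image_scale: float = 1.0,
) -> Vector:
    """Attend to text and image tokens separately, then sum the results.

    Summation with a scalar weight is an assumption for illustration.
    """
    text_out = attend(query, text_kv[0], text_kv[1])
    image_out = attend(query, image_kv[0], image_kv[1])
    return [t + image_scale * i for t, i in zip(text_out, image_out)]


# Toy example with one text token and one image token.
out = decoupled_cross_attention(
    query=[1.0, 0.0],
    text_kv=([[1.0, 0.0]], [[2.0, 0.0]]),
    image_kv=([[0.0, 1.0]], [[0.0, 4.0]]),
    image_scale=0.5,
)
```

Keeping the two passes separate means the image branch can be scaled or disabled without touching the text branch.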
2401.01808
Report |
aMUSEd: An Open MUSE Reproduction |
Suraj Patil, William Berman, Robin Rombach, Patrick von Platen |
We present aMUSEd, an open-source, lightweight masked image model (MIM) for
text-to-image generation based on MUSE. With 10 percent of MUSE's parameters,
aMUSEd is focused on fast image generation. We believe MIM is under-explored
compared to latent diffusion, the prevailing approach for text-to-image
generation. Compared to latent diffusion, MIM requires fewer inference steps
and is more interpretable. Additionally, MIM can be fine-tuned to learn
additional styles with only a single image. We hope to encourage further
exploration of MIM by demonstrating its effectiveness on large-scale
text-to-image generation and releasing reproducible training code. We also
release checkpoints for two models which directly produce images at 256x256 and
512x512 resolutions. |
This paper presents aMUSEd, a lightweight, open-source masked image model (MIM) for text-to-image generation based on MUSE, focused on fast image generation. |
The authors argue that MIM is underexplored compared to latent diffusion, despite advantages like fewer inference steps, interpretability, and single-image style transfer capability. |
The paper introduces aMUSEd, an 800M parameter model utilizing a CLIP-L/14 text encoder, SDXL-style micro-conditioning, and a U-ViT backbone, trained on the LAION-2B dataset. |
aMUSEd achieves superior inference speed compared to non-distilled diffusion models and is competitive with distilled few-step models.
It demonstrates competitive CLIP scores but lags in FID and Inception scores compared to some state-of-the-art models.
aMUSEd shows impressive results in zero-shot image variation, in-painting, single-image style transfer with StyleDrop, and is extended for video generation. |
aMUSEd's FID and Inception scores are worse than those of some state-of-the-art models, indicating room for improvement in image quality.
The exploration of interpretability for token prediction-based image models is suggested as a future research direction. |
text-to-image generation, masked image modeling, open-source, fast inference, styledrop |
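The aMUSEd report above notes that masked image modeling needs fewer inference steps than latent diffusion. A minimal sketch of why: at each step all masked tokens are predicted in parallel, and only a shrinking fraction is re-masked. The cosine schedule below is the one commonly associated with MaskGIT-style samplers and is assumed here for illustration; the token count, the step count, and the name `masked_token_schedule` are hypothetical.

```python
import math
from typing import List


def masked_token_schedule(num_tokens: int, num_steps: int) -> List[int]:
    """Number of tokens still masked after each decoding step."""
    schedule = []
    for step in range(1, num_steps + 1):
        # Cosine decay from fully masked (ratio near 1) to unmasked (ratio 0).
        ratio = math.cos(math.pi / 2 * step / num_steps)
        schedule.append(int(math.floor(ratio * num_tokens)))
    # Force the final step to unmask everything despite float rounding.
    schedule[-1] = 0
    return schedule


# Hypothetical setting: 256 image tokens decoded in 4 steps.
schedule = masked_token_schedule(num_tokens=256, num_steps=4)
```

With such a schedule the number of network evaluations equals the number of steps, independent of how many tokens the image contains.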
2401.01730
Report |
STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion |
Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang |
The recovery of 3D human mesh from monocular images has developed
significantly in recent years. However, existing models usually ignore spatial and
temporal information, which might lead to mesh and image misalignment and
temporal discontinuity. For this reason, we propose a novel Spatio-Temporal
Alignment Fusion (STAF) model. As a video-based model, it leverages coherence
clues from human motion by an attention-based Temporal Coherence Fusion Module
(TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local
information through predicted mesh projection on the feature maps. Based on the
spatial features, we further introduce a multi-stage adjacent Spatial Alignment
Fusion Module (SAFM) to enhance the feature representation of the target frame.
In addition to the above, we propose an Average Pooling Module (APM) to allow
the model to focus on the entire input sequence rather than just the target
frame. This method can remarkably improve the smoothness of recovery results
from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the
superiority of STAF. We achieve a state-of-the-art trade-off between precision
and smoothness. Our code and more video results are on the project page
https://yw0208.github.io/staf/ |
This paper introduces STAF, a novel Spatio-Temporal Alignment Fusion model for 3D human mesh recovery from videos, improving both accuracy and smoothness of the reconstruction. |
Existing methods for 3D human mesh recovery from videos often prioritize either accuracy or temporal smoothness, leading to issues like mesh misalignment or jitter. STAF addresses this limitation by effectively leveraging spatial and temporal information. |
STAF employs a multi-stage approach: 1) It uses a feature pyramid and an Average Pooling Module (APM) to capture global context and reduce dependence on individual frames. 2) A Temporal Coherence Fusion Module (TCFM) learns temporal dependencies from features extracted using grid sampling. 3) A Spatial Alignment Fusion Module (SAFM) refines the target frame's features by integrating information from adjacent frames using an attention mechanism based on initial mesh projections. |
STAF achieves state-of-the-art accuracy on 3DPW and MPII3D benchmarks, surpassing previous methods in key metrics like MPJPE and PVE.
The model exhibits high smoothness, as indicated by low acceleration error, exceeding most video-based methods.
STAF demonstrates strong generalization ability, achieving good performance even without in-domain training data on 3DPW. |
The over-smoothing issue may arise in extreme cases where human pose changes abruptly. Using a shorter sequence length mitigates this to an extent.
Future work could explore alternative methods to handle rapid pose transitions without sacrificing overall smoothness. |
3d human mesh recovery, video analysis, temporal coherence, spatial alignment, deep learning |
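The STAF report above measures smoothness through acceleration error. A minimal sketch of that metric follows, assuming the common definition: the mean distance between predicted and ground-truth joint accelerations, where acceleration is the second-order finite difference of joint positions over time. The function names are hypothetical, and the sequences must contain at least three frames.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Frame = Sequence[Point]


def acceleration(seq: Sequence[Frame]) -> List[List[Point]]:
    """Second-order finite difference of joint positions over time."""
    accel = []
    for t in range(1, len(seq) - 1):
        frame = []
        for prev, cur, nxt in zip(seq[t - 1], seq[t], seq[t + 1]):
            frame.append(tuple(p - 2 * c + n for p, c, n in zip(prev, cur, nxt)))
        accel.append(frame)
    return accel


def acceleration_error(pred: Sequence[Frame], gt: Sequence[Frame]) -> float:
    """Mean Euclidean distance between predicted and true accelerations."""
    a_pred, a_gt = acceleration(pred), acceleration(gt)
    dists = [
        math.dist(jp, jg)
        for fp, fg in zip(a_pred, a_gt)
        for jp, jg in zip(fp, fg)
    ]
    return sum(dists) / len(dists)


# One joint over three frames: the ground truth moves in a straight line,
# while the prediction jitters upward in the middle frame.
gt_seq = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
pred_seq = [[(0.0, 0.0, 0.0)], [(1.0, 1.0, 0.0)], [(2.0, 0.0, 0.0)]]
err = acceleration_error(pred_seq, gt_seq)
```

A jittery prediction can have small per-frame position error and still score poorly here, which is why the metric is reported alongside accuracy metrics.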
2401.01702
Report |
Image Sculpting: Precise Object Editing with 3D Geometry Control |
Jiraphon Yenphraphai, Xichen Pan, Sainan Liu, Daniele Panozzo, Saining Xie |
We present Image Sculpting, a new framework for editing 2D images by
incorporating tools from 3D geometry and graphics. This approach differs
markedly from existing methods, which are confined to 2D spaces and typically
rely on textual instructions, leading to ambiguity and limited control. Image
Sculpting converts 2D objects into 3D, enabling direct interaction with their
3D geometry. Post-editing, these objects are re-rendered into 2D, merging into
the original image to produce high-fidelity results through a coarse-to-fine
enhancement process. The framework supports precise, quantifiable, and
physically-plausible editing options such as pose editing, rotation,
translation, 3D composition, carving, and serial addition. It marks an initial
step towards combining the creative freedom of generative models with the
precision of graphics pipelines. |
This paper presents Image Sculpting, a framework for editing 2D images by incorporating tools from 3D geometry and graphics. |
Existing editing methods are confined to 2D spaces and typically rely on textual instructions, which leads to ambiguity and limited control. |
Image Sculpting converts 2D objects into 3D so that their geometry can be edited directly, then re-renders the edited objects into 2D and merges them into the original image through a coarse-to-fine enhancement process. |
The framework supports precise, quantifiable, and physically plausible edits such as pose editing, rotation, translation, 3D composition, carving, and serial addition.
Edited objects are merged back into the original image to produce high-fidelity results. |
The authors describe the work as an initial step toward combining the creative freedom of generative models with the precision of graphics pipelines. |
image editing, 3d geometry control, object editing, generative models, graphics pipelines |
2401.01651
Report |
AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI |
Fanda Fan, Chunjie Luo, Wanling Gao, Jianfeng Zhan |
The burgeoning field of Artificial Intelligence Generated Content (AIGC) is
witnessing rapid advancements, particularly in video generation. This paper
introduces AIGCBench, a pioneering comprehensive and scalable benchmark
designed to evaluate a variety of video generation tasks, with a primary focus
on Image-to-Video (I2V) generation. AIGCBench tackles the limitations of
existing benchmarks, which suffer from a lack of diverse datasets, by including
a varied and open-domain image-text dataset that evaluates different
state-of-the-art algorithms under equivalent conditions. We employ a novel text
combiner and GPT-4 to create rich text prompts, which are then used to generate
images via advanced Text-to-Image models. To establish a unified evaluation
framework for video generation tasks, our benchmark includes 11 metrics
spanning four dimensions to assess algorithm performance. These dimensions are
control-video alignment, motion effects, temporal consistency, and video
quality. The metrics include both reference-video-dependent and reference-video-free measures,
ensuring a comprehensive evaluation strategy. The evaluation standard proposed
correlates well with human judgment, providing insights into the strengths and
weaknesses of current I2V algorithms. The findings from our extensive
experiments aim to stimulate further research and development in the I2V field.
AIGCBench represents a significant step toward creating standardized benchmarks
for the broader AIGC landscape, proposing an adaptable and equitable framework
for future assessments of video generation tasks. We have open-sourced the
dataset and evaluation code on the project website:
https://www.benchcouncil.org/AIGCBench. |
This paper introduces AIGCBench, a comprehensive and scalable benchmark for evaluating Image-to-Video (I2V) generation tasks. |
Existing I2V benchmarks lack diverse, open-domain datasets and standardized evaluation metrics, hindering fair and comprehensive algorithm assessment. |
AIGCBench uses real-world and generated image-text datasets and 11 metrics across four dimensions: control-video alignment, motion effects, temporal consistency, and video quality. |
Closed-source projects (Pika, Gen2) outperform open-source ones (VideoCrafter, I2VGen-XL, SVD) in generating long, high-quality videos.
Current I2V algorithms lack fine-grained control over generated content, limiting precise alignment with textual descriptions.
AIGCBench's evaluation standard correlates well with human judgment, validating its effectiveness in assessing I2V algorithms. |
Limited test cases (3950) due to slow inference speeds and closed-source projects.
Inability to automatically evaluate fine-grained object motion alignment with text descriptions. |
artificial intelligence generated content, video generation, image-to-video benchmark, diffusion model, multimodal ai |
2401.01647
Report |
SIGNeRF: Scene Integrated Generation for Neural Radiance Fields |
Jan-Niklas Dihlmann, Andreas Engelhardt, Hendrik Lensch |
Advances in image diffusion models have recently led to notable improvements
in the generation of high-quality images. In combination with Neural Radiance
Fields (NeRFs), they enabled new opportunities in 3D generation. However, most
generative 3D approaches are object-centric and applying them to editing
existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel
approach for fast and controllable NeRF scene editing and scene-integrated
object generation. A new generative update strategy ensures 3D consistency
across the edited images, without requiring iterative optimization. We find
that depth-conditioned diffusion models inherently possess the capability to
generate 3D consistent views by requesting a grid of images instead of single
views. Based on these insights, we introduce a multi-view reference sheet of
modified images. Our method updates an image collection consistently based on
the reference sheet and refines the original NeRF with the newly generated
image set in one go. By exploiting the depth conditioning mechanism of the
image diffusion model, we gain fine control over the spatial location of the
edit and enforce shape guidance by a selected region or an external mesh. |
SIGNeRF: a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation using a reference-sheet-based assembly and a generative update strategy. |
Simplifies and enhances control over generative NeRF editing, enabling more complex and realistic modifications compared to previous methods. |
Utilizes ControlNet, a depth-conditioned image diffusion model, to generate a multi-view consistent reference sheet of edits. The reference sheet then guides the efficient update of the original NeRF dataset, resulting in a modified 3D scene. |
Achieves superior object generation and editing within complex NeRF scenes with consistent lighting and textures.
Offers precise control over object placement, orientation, size, and appearance using shape selection or proxy mesh guidance.
Provides a preview of the edited scene with the reference sheet before generating the complete dataset, unlike existing methods. |
Image downscaling for reference sheet generation can lead to loss of detail in the edits.
Extended scene modifications are limited due to the focus on a central object in the reference sheet. |
nerf, scene editing, 3d generation, image diffusion, controlnet |
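The reference-sheet idea in the SIGNeRF entry above (requesting a grid of views instead of single views) can be illustrated with a minimal sketch. It only covers the bookkeeping of tiling views into one sheet and cutting the sheet back apart; the depth-conditioned diffusion edit that would run on the sheet in between is omitted. The function names and the toy nested-list image type are assumptions made for this illustration, not the authors' code.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def make_reference_sheet(views: List[Image], cols: int) -> Image:
    """Tile equally sized views row-major into one grid image."""
    if not views or len(views) % cols != 0:
        raise ValueError("number of views must be a positive multiple of cols")
    height = len(views[0])
    sheet: Image = []
    for start in range(0, len(views), cols):
        row_views = views[start:start + cols]
        for y in range(height):
            # Concatenate pixel row y of every view in this grid row.
            sheet.append([px for view in row_views for px in view[y]])
    return sheet


def split_reference_sheet(sheet: Image, rows: int, cols: int) -> List[Image]:
    """Cut a grid image back into its individual views (inverse of tiling)."""
    height = len(sheet) // rows
    width = len(sheet[0]) // cols
    views: List[Image] = []
    for r in range(rows):
        for c in range(cols):
            views.append([
                sheet[r * height + y][c * width:(c + 1) * width]
                for y in range(height)
            ])
    return views
```

Tiling followed by splitting returns the original views, so an edit applied to the sheet maps back to one edited image per camera.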
2401.01520
Report |
S$^{2}$-DMs: Skip-Step Diffusion Models |
Yixuan Wang, Shuangyin Li |
Diffusion models have emerged as powerful generative tools, rivaling GANs in
sample quality and mirroring the likelihood scores of autoregressive models. A
subset of these models, exemplified by DDIMs, exhibit an inherent asymmetry:
they are trained over $T$ steps but only sample from a subset of $T$ during
generation. This selective sampling approach, though optimized for speed,
inadvertently misses out on vital information from the unsampled steps, leading
to potential compromises in sample quality. To address this issue, we present
S$^{2}$-DMs, a new training method that uses an innovative loss,
$L_{skip}$, meticulously designed to reintegrate the information omitted during
the selective sampling phase. The benefits of this approach are manifold: it
notably enhances sample quality, is exceptionally simple to implement, requires
minimal code modifications, and is flexible enough to be compatible with
various sampling algorithms. On the CIFAR10 dataset, models trained using our
algorithm showed an improvement of 3.27% to 14.06% over models trained with
traditional methods across various sampling algorithms (DDIMs, PNDMs, DEIS) and
different numbers of sampling steps (10, 20, ..., 1000). On the CELEBA dataset,
the improvement ranged from 8.97% to 27.08%. The code and additional
resources are available on GitHub. |
This paper introduces Skip-Step Diffusion Models (S$^2$-DMs), a novel method to enhance the performance of diffusion models, particularly those employing accelerated sampling techniques like DDIMs. |
Diffusion models often suffer from slow sampling speed. While methods like DDIMs accelerate this by skipping steps, they introduce a discrepancy between the step-by-step training and skip-step sampling, compromising sample quality. S$^2$-DMs addresses this asymmetry. |
The core of S$^2$-DMs is the introduction of a novel 'skip-step loss' ($L_{skip}$) during training. This loss function encourages the model to learn from the information typically missed during skip-step sampling, thereby improving consistency. |
S$^2$-DMs consistently outperforms baseline models like DDIMs, PNDMs, and DEIS in image generation tasks on CIFAR10 and CelebA datasets, achieving better FID scores with the same number of sampling steps.
The integration of skip-step information leads to higher-quality samples, as demonstrated by visual comparisons. S$^2$-DMs generates sharper images with finer details compared to baselines.
The method is highly efficient and easy to implement. It requires minimal modifications to the training process and doesn't alter the sampling algorithm, making it user-friendly. |
The paper primarily focuses on image generation, and further exploration is needed to evaluate its applicability in other domains.
Future work will investigate the optimal integration of skip-step information into ODEs and explore its potential in non-continuous spaces. |
diffusion models, generative models, image generation, accelerated sampling, skip-step sampling |
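The asymmetry described in the S$^2$-DMs entry above (training over $T$ steps, sampling from a subset) can be made concrete with a small sketch. It assumes a uniformly strided sub-sequence, which is one common choice for accelerated samplers, and simply lists which training timesteps are visited and which are skipped. It does not implement $L_{skip}$, whose exact form is not given in this entry; both function names are hypothetical.

```python
from typing import List


def sampling_steps(num_train_steps: int, num_sample_steps: int) -> List[int]:
    """Uniformly strided subset of training timesteps, highest first."""
    if not 0 < num_sample_steps <= num_train_steps:
        raise ValueError("need 0 < num_sample_steps <= num_train_steps")
    stride = num_train_steps // num_sample_steps
    steps = list(range(0, num_train_steps, stride))[:num_sample_steps]
    return steps[::-1]


def skipped_steps(num_train_steps: int, num_sample_steps: int) -> List[int]:
    """Training timesteps that the sampler never visits."""
    visited = set(sampling_steps(num_train_steps, num_sample_steps))
    return [t for t in range(num_train_steps) if t not in visited]
```

With 1000 training steps and 10 sampling steps, 990 timesteps are never visited at generation time; this is the omitted information the entry refers to.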
2401.01339
Report |
Street Gaussians for Modeling Dynamic Urban Scenes |
Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, Sida Peng |
This paper aims to tackle the problem of modeling dynamic urban street scenes
from monocular videos. Recent methods extend NeRF by incorporating tracked
vehicle poses to animate vehicles, enabling photo-realistic view synthesis of
dynamic urban street scenes. However, significant limitations are their slow
training and rendering speed, coupled with the critical need for high precision
in tracked vehicle poses. We introduce Street Gaussians, a new explicit scene
representation that tackles all these limitations. Specifically, the dynamic
urban street is represented as a set of point clouds equipped with semantic
logits and 3D Gaussians, each associated with either a foreground vehicle or
the background. To model the dynamics of foreground object vehicles, each
object point cloud is optimized with optimizable tracked poses, along with a
dynamic spherical harmonics model for the dynamic appearance. The explicit
representation allows easy composition of object vehicles and background, which
in turn allows for scene editing operations and rendering at 133 FPS
(1066$\times$1600 resolution) within half an hour of training. The proposed
method is evaluated on multiple challenging benchmarks, including KITTI and
Waymo Open datasets. Experiments show that the proposed method consistently
outperforms state-of-the-art methods across all datasets. Furthermore, the
proposed representation delivers performance on par with that achieved using
precise ground-truth poses, despite relying only on poses from an off-the-shelf
tracker. The code is available at https://zju3dv.github.io/street_gaussians/. |
This paper presents Street-Gaussians, a novel explicit scene representation for efficiently reconstructing dynamic 3D street scenes from monocular videos and rendering high-fidelity novel views in real-time. |
Modeling dynamic 3D streets from images has many important applications, such as city simulation, autonomous driving, and gaming. Existing methods suffer from slow training and rendering speeds and rely heavily on accurate tracked vehicle poses. |
Street-Gaussians represents the dynamic urban street as a set of point clouds equipped with semantic logits and 3D Gaussians, each associated with either a foreground vehicle or the background. To model the dynamics, each object point cloud is optimized with optimizable tracked poses, along with a dynamic spherical harmonics model for the dynamic appearance. |
Street-Gaussians consistently outperforms state-of-the-art methods in terms of rendering quality on KITTI and Waymo Open datasets.
The method achieves real-time rendering speed of 133 FPS at a resolution of 1066x1600.
Street-Gaussians delivers performance on par with methods using precise ground-truth poses, despite relying only on poses from an off-the-shelf tracker. |
The method is limited to reconstructing rigid dynamic scenes and cannot handle non-rigid dynamic objects like pedestrians.
The performance is dependent on the recall rate of off-the-shelf trackers. |
3d scene reconstruction, dynamic scene modeling, neural rendering, autonomous driving, point cloud representation |
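The composition step in the Street Gaussians entry above (object point clouds placed into the scene through tracked poses) can be sketched as a rigid transform per vehicle per frame. This is a simplified illustration under stated assumptions: poses are reduced to a yaw angle about the vertical axis plus a translation, only point positions are transformed (a full 3D Gaussian would also carry a covariance and appearance), and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Pose = Tuple[float, Point]  # (yaw in radians, translation)


def object_to_world(points: List[Point], yaw: float, translation: Point) -> List[Point]:
    """Rotate object-local points about the z axis, then translate them."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]


def compose_frame(
    background: List[Point],
    vehicles: Dict[str, List[Point]],
    poses: Dict[str, Pose],
) -> List[Point]:
    """Merge static background points with every vehicle posed for one frame."""
    world = list(background)
    for name, local_points in vehicles.items():
        yaw, translation = poses[name]
        world.extend(object_to_world(local_points, yaw, translation))
    return world
```

Because each vehicle keeps its own local point set, editing a scene (moving, removing, or swapping a vehicle) only means changing the pose table or the vehicle dictionary.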
2401.01256
Report |
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM |
Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei |
The recent innovations and breakthroughs in diffusion models have
significantly expanded the possibilities of generating high-quality videos for
the given prompts. Most existing works tackle the single-scene scenario with
only one video event occurring in a single background. Extending this to
multi-scene video generation is nevertheless not trivial: it requires managing
the logic between scenes while preserving the consistent visual appearance
of key content across video scenes. In this paper, we propose a novel
framework, namely VideoDrafter, for content-consistent multi-scene video
generation. Technically, VideoDrafter leverages Large Language Models (LLM) to
convert the input prompt into a comprehensive multi-scene script that benefits
from the logical knowledge learnt by LLM. The script for each scene includes a
prompt describing the event, the foreground/background entities, as well as
camera movement. VideoDrafter identifies the common entities throughout the
script and asks LLM to detail each entity. The resultant entity description is
then fed into a text-to-image model to generate a reference image for each
entity. Finally, VideoDrafter outputs a multi-scene video by generating each
scene video via a diffusion process that takes the reference images, the
descriptive prompt of the event and camera movement into account. The diffusion
model incorporates the reference images as the condition and alignment to
strengthen the content consistency of multi-scene videos. Extensive experiments
demonstrate that VideoDrafter outperforms the SOTA video generation models in
terms of visual quality, content consistency, and user preference. |
Proposes VideoDrafter, a framework for generating content-consistent multi-scene videos from text prompts. |
Most existing video generation methods focus on single-scene videos, leaving multi-scene generation with consistent content largely unexplored. |
Utilizes a Large Language Model (LLM) to convert prompts into multi-scene scripts and generate descriptions for common entities. Employs a text-to-image model to generate reference images for these entities, ensuring consistency across scenes. Introduces two diffusion models: VideoDrafter-Img for generating scene-reference images based on prompts and entity references, and VideoDrafter-Vid for producing video clips based on scene-reference images, action descriptions, and camera movements. |
VideoDrafter outperforms state-of-the-art video generation models in terms of visual quality (FID, FVD) and content consistency (Scene Consis.).
The use of entity reference images significantly enhances the consistency of entities across scenes.
Human evaluations confirm VideoDrafter's superiority in generating logically coherent and content-consistent multi-scene videos. |
The performance of open-source LLMs in script generation can be unstable, demanding careful prompt engineering and output verification.
The lack of optimization for Stable Diffusion on video frames might lead to suboptimal frame quality. |
video generation, diffusion models, multi-scene video, content consistency, large language models |
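The script structure in the VideoDrafter entry above (per-scene event prompt, foreground/background entities, camera movement, then a pass that identifies common entities) can be sketched as a small data structure. The field names are illustrative, and treating an entity as "common" when it appears in at least two scenes is an assumption of this sketch rather than a rule stated in the source.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scene:
    """One scene of the multi-scene script (field names are illustrative)."""
    event_prompt: str
    foreground: List[str] = field(default_factory=list)
    background: List[str] = field(default_factory=list)
    camera_movement: str = "static"


def common_entities(script: List[Scene], min_scenes: int = 2) -> List[str]:
    """Return entities that occur in at least `min_scenes` scenes, sorted by name."""
    counts: Counter = Counter()
    for scene in script:
        # A set ensures each entity is counted at most once per scene.
        counts.update(set(scene.foreground) | set(scene.background))
    return sorted(name for name, n in counts.items() if n >= min_scenes)
```

Each returned entity would then get one detailed description and one reference image, which is what keeps its appearance consistent across scenes.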
2401.01216
Report |
Noise-NeRF: Hide Information in Neural Radiance Fields using Trainable Noise |
Qinglong Huang, Yong Liao, Yanbin Hao, Pengyuan Zhou |
Neural radiance fields (NeRF) have been proposed as an innovative 3D
representation method. While attracting lots of attention, NeRF faces critical
issues such as information confidentiality and security. Steganography is a
technique used to embed information in another object as a means of protecting
information security. Currently, there are few related studies on NeRF
steganography, facing challenges in low steganography quality, model weight
damage, and a limited amount of steganographic information. This paper proposes
a novel NeRF steganography method based on trainable noise: Noise-NeRF.
Furthermore, we propose the Adaptive Pixel Selection strategy and Pixel
Perturbation strategy to improve the steganography quality and efficiency. The
extensive experiments on open-source datasets show that Noise-NeRF provides
state-of-the-art performances in both steganography quality and rendering
quality, as well as effectiveness in super-resolution image steganography. |
This paper proposes Noise-NeRF, a novel Neural Radiance Fields (NeRF) steganography method that embeds secret information using trainable noise without modifying the model weights, ensuring lossless steganography and preserving rendering quality. |
NeRF steganography, crucial for information confidentiality and model copyright protection, faces challenges in steganography quality, model weight damage, and limited information volume. Noise-NeRF addresses these limitations. |
Noise-NeRF introduces trainable noise to specific viewpoints, optimizing it iteratively using backpropagation to minimize the difference between the rendered steganographic image and the target. It employs Adaptive Pixel Selection and Pixel Perturbation strategies to enhance steganography quality and efficiency. |
Noise-NeRF achieves state-of-the-art steganography quality, achieving over 98% similarity on multiple benchmark datasets.
Unlike existing methods that modify model weights, Noise-NeRF maintains the original rendering quality of NeRF, ensuring lossless steganography.
Noise-NeRF demonstrates effectiveness in super-resolution image steganography, successfully embedding 2K resolution images into NeRF scenes with high fidelity. |
The current implementation of Noise-NeRF focuses on steganography for a single viewpoint; extending it to multiple viewpoints is a promising future direction.
Investigating the robustness of Noise-NeRF against various attacks and developing countermeasures will further enhance its practical applicability. |
neural radiance fields, steganography, implicit neural representation, information security, 3d reconstruction |
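The core mechanism in the Noise-NeRF entry above (optimizing an input noise by backpropagation while the model weights stay untouched) can be shown on a toy problem. The sketch below replaces the NeRF with a frozen linear function, which is purely an assumption for illustration, and runs plain gradient descent on the noise only. The Adaptive Pixel Selection and Pixel Perturbation strategies are not modeled.

```python
from typing import List


def render(weights: List[float], inputs: List[float]) -> float:
    """Frozen stand-in for a renderer: a plain dot product."""
    return sum(w * x for w, x in zip(weights, inputs))


def fit_noise(
    weights: List[float],
    inputs: List[float],
    target: float,
    lr: float = 0.1,
    steps: int = 200,
) -> List[float]:
    """Learn an additive input noise so that render(inputs + noise) hits target."""
    noise = [0.0] * len(inputs)
    for _ in range(steps):
        perturbed = [x + n for x, n in zip(inputs, noise)]
        err = render(weights, perturbed) - target
        # Gradient of 0.5 * err**2 with respect to noise[i] is err * weights[i].
        noise = [n - lr * err * w for n, w in zip(noise, weights)]
    return noise
```

The weights are only read, never updated, so the output for the clean input is unchanged; only the input plus the learned noise produces the hidden target. The default learning rate is chosen for small weight vectors and would need lowering for larger ones.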
2401.01207
Report |
Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation |
Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng |
In human-centric content generation, the pre-trained text-to-image models
struggle to produce user-wanted portrait images, which retain the identity of
individuals while exhibiting diverse expressions. This paper introduces our
efforts towards personalized face generation. To this end, we propose a novel
multi-modal face generation framework, capable of simultaneous
identity-expression control and more fine-grained expression synthesis. Our
expression control is so sophisticated that it can be specialized by the
fine-grained emotional vocabulary. We devise a novel diffusion model that can
undertake the task of simultaneous face swapping and reenactment. Due to the
entanglement of identity and expression, it is nontrivial to control them
separately and precisely in one framework, so this has not been explored yet. To
overcome this, we propose several innovative designs in the conditional
diffusion model, including balancing identity and expression encoders, improved
midpoint sampling, and explicit background conditioning. Extensive
experiments have demonstrated the controllability and scalability of the
proposed framework, in comparison with state-of-the-art text-to-image, face
swapping, and face reenactment methods. |
This paper introduces a novel multi-modal face generation framework that allows simultaneous control over identity, expression, and background, enabling fine-grained expression synthesis. |
Current text-to-image models struggle to generate user-desired portraits that retain individual identity while exhibiting diverse expressions. This framework addresses this limitation by allowing for precise control over these aspects. |
The framework leverages a novel diffusion model called DiffSFSR (Simultaneous Face Swapping and Reenactment) that takes a selfie photo (identity), a text prompt (background), and an expression label as input. It employs techniques like balancing identity and expression encoders, improved midpoint sampling, and explicit background conditioning for enhanced control and quality. |
The framework achieves fine-grained expression synthesis, surpassing state-of-the-art text-to-image methods in generating 135 distinct expressions.
DiffSFSR outperforms hybrid methods (combining separate face-swapping and reenactment techniques) in simultaneous face swapping and reenactment tasks.
User studies confirm the framework's ability to generate high-fidelity portraits with high consistency in identity and expression, exceeding existing methods in realism and image quality. |
The framework's expression synthesis relies on a dataset with potential inconsistencies between expression labels and actual images, which can lead to semantic mismatches.
Ambiguity and overlap between certain expression labels pose a challenge for accurate and distinct synthesis. |
face generation, diffusion models, expression synthesis, face swapping, face reenactment |
2401.01173
Report |
En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data |
Yifang Men, Biwen Lei, Yuan Yao, Miaomiao Cui, Zhouhui Lian, Xuansong Xie |
We present En3D, an enhanced generative scheme for sculpting high-quality 3D
human avatars. Unlike previous works that rely on scarce 3D datasets or limited
2D collections with imbalanced viewing angles and imprecise pose priors, our
approach aims to develop a zero-shot 3D generative scheme capable of producing
visually realistic, geometrically accurate and content-wise diverse 3D humans
without relying on pre-existing 3D or 2D assets. To address this challenge, we
introduce a meticulously crafted workflow that implements accurate physical
modeling to learn the enhanced 3D generative model from synthetic 2D data.
During inference, we integrate optimization modules to bridge the gap between
realistic appearances and coarse 3D shapes. Specifically, En3D comprises three
modules: a 3D generator that accurately models generalizable 3D humans with
realistic appearance from synthesized balanced, diverse, and structured human
images; a geometry sculptor that enhances shape quality using multi-view normal
constraints for intricate human anatomy; and a texturing module that
disentangles explicit texture maps with fidelity and editability, leveraging
semantical UV partitioning and a differentiable rasterizer. Experimental
results show that our approach significantly outperforms prior works in terms
of image quality, geometry accuracy and content diversity. We also showcase the
applicability of our generated avatars for animation and editing, as well as
the scalability of our approach for content-style free adaptation. |
Presents En3D, a zero-shot generative scheme for creating high-quality 3D human avatars from synthetic 2D data, eliminating the need for pre-existing 3D or 2D datasets. |
Addresses limitations of previous methods that relied on scarce 3D datasets or limited 2D collections, resulting in avatars with limited realism, geometric accuracy, and content diversity. |
Employs a three-module pipeline: 1) 3D generative modeling (3DGM) learns from synthetic 2D images with accurate physical parameters. 2) Geometric sculpting (GS) refines shapes using multi-view normal constraints. 3) Explicit texturing (ET) generates UV texture maps via semantic UV partitioning and a differentiable rasterizer. |
Significantly outperforms prior art in generating realistic and diverse 3D humans with high-fidelity geometry.
Demonstrates capabilities for avatar animation, texture editing, and content-style adaptation (e.g., generating portrait heads or Disney-style characters).
Achieves state-of-the-art results in quantitative metrics such as FID, IS-360, and normal accuracy. |
Limited detail in generated hands, sometimes requiring replacement with SMPL-X templates.
Future work could explore higher-resolution synthesis and more complex garment types. |
3d human generation, generative adversarial networks, text-to-3d, avatar animation, 3d shape and texture editing |
2401.01130
Report |
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models |
Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal |
In this paper, we present a novel generative task: joint scene graph - image
generation. While previous works have explored image generation conditioned on
scene graphs or layouts, our task is distinctive and important as it involves
generating scene graphs themselves unconditionally from noise, enabling
efficient and interpretable control for image generation. Our task is
challenging, requiring the generation of plausible scene graphs with
heterogeneous attributes for nodes (objects) and edges (relations among
objects), including continuous object bounding boxes and discrete object and
relation categories. We introduce a novel diffusion model, DiffuseSG, that
jointly models the adjacency matrix along with heterogeneous node and edge
attributes. We explore various types of encodings for the categorical data,
relaxing it into a continuous space. With a graph transformer being the
denoiser, DiffuseSG successively denoises the scene graph representation in a
continuous space and discretizes the final representation to generate the clean
scene graph. Additionally, we introduce an IoU regularization to enhance the
empirical performance. Our model significantly outperforms existing methods in
scene graph generation on the Visual Genome and COCO-Stuff datasets, both on
standard and newly introduced metrics that better capture the problem
complexity. Moreover, we demonstrate the additional benefits of our model in
two downstream applications: 1) excelling in a series of scene graph completion
tasks, and 2) improving scene graph detection models by using extra training
samples generated from DiffuseSG. |
This paper introduces a novel task of joint scene graph and image generation and proposes DiffuseSG, a diffusion-based model, to generate plausible scene graphs with heterogeneous attributes including object bounding boxes, object categories, and relations. |
Generating scene graphs is important as it enables efficient and interpretable control for image generation and can provide synthetic data to augment the training of scene graph prediction models, which traditionally rely on costly annotated data. |
The authors employ a two-step approach: first, they train DiffuseSG to generate scene graphs by modeling the adjacency matrix and node/edge attributes in a continuous space using a graph transformer as the denoiser. Second, a pre-trained layout-to-image model generates images conditioned on the generated scene graphs. |
DiffuseSG significantly outperforms existing methods in scene graph generation on Visual Genome and COCO-Stuff datasets based on standard and newly introduced metrics.
The model shows promising results in scene graph completion tasks, demonstrating its capability to infer missing information.
Using generated scene graph-image pairs as additional training data improves the performance of downstream scene graph detection models. |
The current approach uses a two-step process for scene graph and image generation, which might limit the coherence between the generated outputs.
Future work includes exploring a single unified model for joint generation, improving the handling of the tail relations in scene graphs, and extending the approach to more complex image generation tasks. |
scene graph generation, image generation, diffusion models, graph transformers, generative models |
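The DiffuseSG entry above describes relaxing categorical data into a continuous space, denoising there, and discretizing at the end. A minimal sketch of that round trip follows. It assumes a one-hot encoding, which is only one of the encodings the abstract says were explored, and it uses Gaussian noise as a stand-in for the diffusion process; no denoiser is included, and all function names are hypothetical.

```python
import random
from typing import List


def one_hot(index: int, num_classes: int) -> List[float]:
    """Relax a category label into a continuous vector."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def add_gaussian_noise(vec: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Perturb a continuous encoding, as a forward noising step would."""
    return [v + rng.gauss(0.0, sigma) for v in vec]


def discretize(vec: List[float]) -> int:
    """Map a continuous vector back to a category by taking the argmax."""
    return max(range(len(vec)), key=vec.__getitem__)
```

For small noise the argmax recovers the original label, which is why the final representation can be discretized into a clean scene graph once denoising has brought it close to a valid encoding.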
2401.01128
Report |
SSP: A Simple and Safe automatic Prompt engineering method towards realistic image synthesis on LVM |
Weijin Cheng, Jianzhi Liu, Jiawen Deng, Fuji Ren |
Recently, text-to-image (T2I) synthesis has undergone significant
advancements, particularly with the emergence of Large Language Models (LLM)
and their enhancement in Large Vision Models (LVM), greatly enhancing the
instruction-following capabilities of traditional T2I models. Nevertheless,
previous methods focus on improving generation quality but introduce unsafe
factors into prompts. We find that appending specific camera descriptions to
prompts can enhance safety performance. Consequently, we propose a simple and
safe prompt engineering method (SSP) to improve image generation quality by
providing optimal camera descriptions. Specifically, we build a dataset of
original prompts from multiple datasets. To select the optimal camera, we design
an optimal camera matching approach and implement a classifier that automatically
matches original prompts to cameras. Appending camera descriptions to
original prompts generates optimized prompts for further LVM image generation.
Experiments demonstrate that SSP improves semantic consistency by an average of
16% compared to others and safety metrics by 48.9%. |
This paper introduces SSP, a simple and safe prompt engineering method for Large Vision Models (LVMs) that enhances image generation quality and safety by appending optimal camera descriptions to original prompts. |
Existing prompt engineering methods for LVMs often introduce randomness, which can alter the original semantics, introduce unsafe factors, and raise safety concerns. SSP addresses these issues by providing specific camera descriptions that improve image quality while maintaining safety. |
The authors create a dataset of original prompts from multiple sources and manually select optimal cameras for different image categories based on FID and CLIP Score. They then fine-tune a BERT model to automatically match optimal camera descriptions to new prompts. |
SSP improves semantic consistency by an average of 16% compared to other methods.
SSP enhances safety metrics by 48.9% compared to baselines, demonstrating a significant reduction in unsafe content generation.
Text feature analysis reveals that SSP effectively influences prompt text features, leading to more realistic and visually appealing images. |
The evaluation of image authenticity solely relies on FID and lacks dedicated metrics.
The study is limited by the accessibility of various LVMs, hindering broader comparisons with other models. |
prompt engineering, large vision models, text-to-image synthesis, image generation, safety |
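The SSP entry above boils down to two steps: classify the original prompt, then append the camera description matched to that class. The sketch below shows that flow only. The camera strings, the categories, and the keyword matcher are all hypothetical stand-ins; the source selects cameras per category using FID and CLIP Score and uses a fine-tuned BERT classifier, neither of which is reproduced here.

```python
from typing import Dict, Tuple

# Hypothetical category -> camera description table (not from the paper).
CAMERA_BY_CATEGORY: Dict[str, str] = {
    "portrait": "shot on an 85mm portrait lens",
    "landscape": "shot on a wide-angle lens",
    "default": "shot on a full-frame DSLR",
}

# Hypothetical keyword lists standing in for the fine-tuned classifier.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "portrait": ("person", "face", "woman", "man"),
    "landscape": ("mountain", "lake", "forest", "city"),
}


def classify_prompt(prompt: str) -> str:
    """Pick a category by exact keyword match; fall back to 'default'."""
    words = [word.strip(".,") for word in prompt.lower().split()]
    for category, keywords in KEYWORDS.items():
        if any(word in keywords for word in words):
            return category
    return "default"


def optimize_prompt(prompt: str) -> str:
    """Append the matched camera description to the original prompt."""
    camera = CAMERA_BY_CATEGORY[classify_prompt(prompt)]
    return f"{prompt.rstrip('. ')}, {camera}"
```

The original prompt text is kept intact and only a fixed suffix is added, which matches the entry's point that the method avoids introducing random content into prompts.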
2401.01117
Report |
Q-Refine: A Perceptual Quality Refiner for AI-Generated Image |
Chunyi Li, Haoning Wu, Zicheng Zhang, Hongkun Hao, Kaiwei Zhang, Lei Bai, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai |
With the rapid evolution of Text-to-Image (T2I) models in recent years,
their unsatisfactory generation results have become a challenge. However,
uniformly refining AI-Generated Images (AIGIs) of different qualities not only
limits optimization capabilities for low-quality AIGIs but also brings
negative optimization to high-quality AIGIs. To address this issue, a
quality-aware refiner named Q-Refine is proposed. Based on the preference of
the Human Visual System (HVS), Q-Refine uses the Image Quality Assessment (IQA)
metric to guide the refining process for the first time, and modifies images of
different qualities through three adaptive pipelines. Experiments show that
for mainstream T2I models, Q-Refine can perform effective optimization to AIGIs
of different qualities. It can be a general refiner to optimize AIGIs from both
fidelity and aesthetic quality levels, thus expanding the application of the
T2I generation models. |
Q-Refine, a novel quality-aware refiner for AI-Generated Images (AIGIs), is proposed. It leverages Image Quality Assessment (IQA) metrics to guide the refining process based on the Human Visual System (HVS) preferences. |
Existing AIGI refiners lack quality awareness, leading to insufficient enhancement in low-quality regions and negative optimization in high-quality regions. |
Q-Refine employs an IQA module to predict a quality map and utilizes three adaptive pipelines: Gaussian Noise for low-quality regions, Mask Inpainting for medium-quality regions, and Global Enhancement for high-quality regions. |
Q-Refine outperforms existing refiners on mainstream AIGI quality databases, achieving state-of-the-art results in most quality metrics.
It effectively refines AIGIs of different qualities, demonstrating versatility across low, medium, and high-quality regions.
Q-Refine consistently improves AIGI quality without causing negative optimization, as evidenced by ablation studies. |
The IQA module's computational complexity might affect the efficiency of the refining process.
The selection of optimal thresholds for quality regions could be further investigated. |
ai-generated content, image quality assessment, image restoration, text-to-image synthesis, perceptual quality |
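The routing logic in the Q-Refine entry above (a predicted quality map deciding which of three pipelines handles each region) can be sketched as a simple threshold rule. The thresholds 0.3 and 0.7 and the assumption that scores lie in [0, 1] are illustrative choices, not values from the paper; the pipelines themselves are represented only by their names.

```python
from typing import List


def choose_pipeline(score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map one region's quality score to the pipeline that should refine it."""
    if score < low:
        return "gaussian_noise"      # low-quality regions
    if score < high:
        return "mask_inpainting"     # medium-quality regions
    return "global_enhancement"      # high-quality regions


def route_quality_map(quality_map: List[List[float]]) -> List[List[str]]:
    """Apply the threshold rule to every cell of a 2D quality map."""
    return [[choose_pipeline(score) for score in row] for row in quality_map]
```

The entry lists threshold selection as an open question, so `low` and `high` are exposed as parameters rather than fixed.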
2401.01008
Report |
Fast Inference Through The Reuse Of Attention Maps In Diffusion Models |
Rosco Hunter, Łukasz Dudziak, Mohamed S. Abdelfattah, Abhinav Mehrotra, Sourav Bhattacharya, Hongkai Wen |
Text-to-image diffusion models have demonstrated unprecedented abilities at
flexible and realistic image synthesis. However, the iterative process required
to produce a single image is costly and incurs a high latency, prompting
researchers to further investigate its efficiency. Typically, improvements in
latency have been achieved in two ways: (1) training smaller models through
knowledge distillation (KD); and (2) adopting techniques from ODE-theory to
facilitate larger step sizes. In contrast, we propose a training-free approach
that does not alter the step-size of the sampler. Specifically, we find the
repeated calculation of attention maps to be both costly and redundant;
therefore, we propose a structured reuse of attention maps during sampling. Our
initial reuse policy is motivated by rudimentary ODE-theory, which suggests
that reuse is most suitable late in the sampling procedure. After noting a
number of limitations in this theoretical approach, we empirically search for a
better policy. Unlike methods that rely on KD, our reuse policies can easily be
adapted to a variety of setups in a plug-and-play manner. Furthermore, when
applied to Stable Diffusion-1.5, our reuse policies reduce latency with minimal
repercussions on sample quality. |
This paper introduces training-free reuse policies for attention maps in text-to-image diffusion models, reducing latency without retraining or increasing step size. |
Diffusion models, despite impressive performance, suffer from high latency due to the iterative nature and computational cost of U-Net calls, hindering their real-time applicability. |
The authors analyze attention map redundancy and propose two policies: HURRY, based on Lyapunov exponents suggesting late reuse, and PHAST, a refinement of HURRY through local search for optimal reuse steps. |
PHAST and HURRY significantly outperform random attention reuse policies.
These policies, at comparable latency, produce samples closer to a 20-step DDIM baseline than 13-step DDIM, indicating better fidelity.
Evaluation on MS-COCO shows comparable CLIP-Score and FID to baselines, with marginally worse FID suggesting minor distributional distortion. |
The assumption of binary step-wise policies, while empirically supported, might not be globally optimal.
The memory-latency trade-off, while addressed with reduced precision caching, requires further investigation for memory-constrained systems. |
diffusion models, text-to-image synthesis, latency reduction, attention mechanism, reuse policies |
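The binary step-wise reuse policy in the entry above can be sketched with a toy sampling loop that either recomputes an attention map or reuses the cached one at each step. The late-reuse policy shown follows the entry's statement that reuse is most suitable late in sampling. The function names, the single shared cache, and the callback interface are assumptions of this sketch; it does not reproduce HURRY or PHAST.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def late_reuse_policy(num_steps: int, num_reuse: int) -> List[bool]:
    """True at a step means: reuse the cached map. Reuse only the last steps."""
    return [step >= num_steps - num_reuse for step in range(num_steps)]


def run_sampler(
    policy: List[bool],
    compute_attention: Callable[[int], T],
) -> Tuple[List[T], int]:
    """Run all steps and return the maps used plus the number of recomputations."""
    cache = None
    computed = 0
    used: List[T] = []
    for step, reuse in enumerate(policy):
        if reuse and cache is not None:
            attn = cache  # skip the costly computation
        else:
            attn = compute_attention(step)
            cache = attn
            computed += 1
        used.append(attn)
    return used, computed
```

With 20 steps and reuse on the last 7, attention is computed only 13 times while the sampler still takes all 20 steps, so the step size is unchanged.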
2401.00935
Report |
Boundary Attention: Learning to Localize Boundaries under High Noise |
Mia Gaia Polansky, Charles Herrmann, Junhwa Hur, Deqing Sun, Dor Verbin, Todd Zickler |
We present a differentiable model that infers explicit boundaries, including
curves, corners and junctions, using a mechanism that we call boundary
attention. Boundary attention is a boundary-aware local attention operation
that, when applied densely and repeatedly, progressively refines a field of
variables that specify an unrasterized description of the local boundary
structure in every overlapping patch within an image. It operates in a
bottom-up fashion, similar to classical methods for sub-pixel edge localization
and edge-linking, but with a higher-dimensional description of local boundary
structure, a notion of spatial consistency that is learned instead of designed,
and a sequence of operations that is end-to-end differentiable. We train our
model using simple synthetic data and then evaluate it using photographs that
were captured under low-light conditions with variable amounts of noise. We
find that our method generalizes to natural images corrupted by real sensor
noise, and predicts consistent boundaries under increasingly noisy conditions
where other state-of-the-art methods fail. |
This work introduces Boundary Attention, a novel deep network model designed for robust boundary detection in images, particularly under significant noise. |
Robust boundary detection is crucial for various computer vision tasks but remains challenging, especially in noisy conditions. Existing methods often struggle to balance detail preservation and noise suppression. |
The model utilizes a novel iterative refinement approach. It operates locally and refines boundary estimates within spatial neighborhoods using learned geometric primitives (junctions) and adaptive attention mechanisms. |
The model demonstrates state-of-the-art performance on established boundary detection benchmarks, particularly under high noise levels.
It effectively leverages color information for boundary localization and grouping, even without relying on semantic understanding.
The learned junction representation exhibits a spatially smooth manifold in the model's hidden state, allowing for intuitive interpolation and manipulation of boundary structures. |
The model's reliance on local operations may limit its ability to incorporate global context for boundary detection in some cases.
Future work includes exploring extensions for handling more complex boundary structures and incorporating semantic information for enhanced performance. |
boundary detection, deep learning, iterative refinement, attention mechanisms, noise robustness |
2401.00909
Report |
Taming Mode Collapse in Score Distillation for Text-to-3D Generation |
Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra |
Despite the remarkable performance of score distillation in text-to-3D
generation, such techniques notoriously suffer from view inconsistency issues,
also known as the "Janus" artifact, where the generated objects fake each view with
multiple front faces. Although empirically effective methods have approached
this problem via score debiasing or prompt engineering, a more rigorous
perspective to explain and tackle this problem remains elusive. In this paper,
we reveal that the existing score distillation-based text-to-3D generation
frameworks degenerate to maximal likelihood seeking on each view independently
and thus suffer from the mode collapse problem, manifesting as the Janus
artifact in practice. To tame mode collapse, we improve score distillation by
re-establishing the entropy term in the corresponding variational objective,
which is applied to the distribution of rendered images. Maximizing the entropy
encourages diversity among different views in generated 3D assets, thereby
mitigating the Janus problem. Based on this new objective, we derive a new
update rule for 3D score distillation, dubbed Entropic Score Distillation
(ESD). We theoretically reveal that ESD can be simplified and implemented by
just adopting the classifier-free guidance trick upon variational score
distillation. Although embarrassingly straightforward, our extensive
experiments successfully demonstrate that ESD can be an effective treatment for
Janus artifacts in score distillation. |
This paper proposes Entropic Score Distillation (ESD), a method to address the view inconsistency ("Janus") problem in text-to-3D generation using score distillation. |
Existing score distillation techniques for text-to-3D generation suffer from the "Janus" artifact where generated objects have multiple front faces. This is attributed to the mode collapse problem arising from the optimization degenerating to maximal likelihood seeking on each view independently. |
ESD introduces entropy regularization to the score distillation objective, encouraging diversity among different views of the generated 3D assets. It is implemented by leveraging the Classifier-Free Guidance (CFG) trick upon variational score distillation, mixing conditional and unconditional scores during training. |
ESD effectively mitigates the Janus problem, producing 3D objects with better view consistency.
ESD improves 3D generation quality compared to baseline methods, as demonstrated by qualitative and quantitative evaluations including FID and CLIP score.
The paper introduces Inception Quality (IQ) and Inception Variety (IV) metrics to numerically probe and evaluate mode collapse and view diversity in text-to-3D generation. |
ESD might still be susceptible to mode collapse when the target image distribution is highly concentrated on one mode.
The applicability of ESD to multi-particle VSD or amortized text-to-3D training remains unexplored. |
text-to-3d generation, score distillation, janus problem, mode collapse, entropy regularization |
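The report above describes ESD as mixing conditional and unconditional scores on top of variational score distillation. A minimal sketch of such a mix is given below. The function name `esd_direction`, the weight `lam`, and the convention that `lam` weights the conditional estimate are all assumptions made for illustration; the paper's exact weighting and conditioning may differ.

```python
import numpy as np


def esd_direction(eps_pretrained, eps_var_cond, eps_var_uncond, lam):
    """Toy update direction: pretrained score minus a mixed variational score.

    Assumption: the mix is a convex combination with weight `lam` on the
    conditional estimate; with lam = 1 only the conditional term remains.
    """
    mixed = lam * eps_var_cond + (1.0 - lam) * eps_var_uncond
    return eps_pretrained - mixed


# Hypothetical score estimates for a two-dimensional toy "image".
pre = np.array([1.0, 2.0])
cond = np.array([0.5, 0.5])
uncond = np.array([0.0, 1.0])
print(esd_direction(pre, cond, uncond, lam=0.5))
```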
2401.00896
Report |
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation |
Wan-Duo Kurt Ma, J. P. Lewis, W. Bastiaan Kleijn |
Within recent approaches to text-to-video (T2V) generation, achieving
controllability in the synthesized video is often a challenge. Typically, this
issue is addressed by providing low-level per-frame guidance in the form of
edge maps, depth maps, or an existing video to be altered. However, the process
of obtaining such guidance can be labor-intensive. This paper focuses on
enhancing controllability in video synthesis by employing straightforward
bounding boxes to guide the subject in various ways, all without the need for
neural network training, finetuning, optimization at inference time, or the use
of pre-existing videos. Our algorithm, TrailBlazer, is constructed upon a
pre-trained T2V model and is easy to implement. The subject is directed by a
bounding box through the proposed spatial and temporal attention map editing.
Moreover, we introduce the concept of keyframing, allowing the subject
trajectory and overall appearance to be guided by both a moving bounding box
and corresponding prompts, without the need to provide a detailed mask. The
method is efficient, with negligible additional computation relative to the
underlying pre-trained model. Despite the simplicity of the bounding box
guidance, the resulting motion is surprisingly natural, with emergent effects
including perspective and movement toward the virtual camera as the box size
increases. |
TrailBlazer enhances diffusion-based text-to-video generation by enabling precise control over subject trajectories and appearance through simple bounding box and prompt keyframing. |
Existing text-to-video methods lack fine-grained control over subject motion, relying on labor-intensive frame-by-frame guidance. TrailBlazer provides an intuitive, user-friendly interface for casual users to direct subject motion. |
TrailBlazer leverages pre-trained video diffusion models (ZeroScope) and manipulates spatial and temporal attention maps during the denoising process based on user-defined bounding boxes and prompt keyframes. This guidance steers subject generation without requiring model training or optimization. |
TrailBlazer achieves accurate subject trajectory control, even with complex paths and dynamic bounding box sizes.
The method produces natural motion with emergent perspective effects and object orientation consistent with the specified trajectory.
TrailBlazer enables subject morphing by interpolating prompt embeddings, facilitating smooth transitions between identities within a video clip. |
TrailBlazer inherits limitations from the underlying diffusion model, including potential object deformations and challenges with multi-object generation.
The method's performance relies on consistency between the prompt and the keyframed bounding box trajectory. Extreme motion or unrealistic paths may lead to artifacts. |
text-to-video synthesis, diffusion models, motion control, trajectory guidance, subject morphing |
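The bounding-box keyframing described in the report above needs one box per video frame. A minimal sketch is shown below, assuming plain linear interpolation between user keyframes; the `BBox` layout and the function `interpolate_bboxes` are hypothetical, and the attention-map editing that consumes these boxes is not shown.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized to [0, 1]


def interpolate_bboxes(keyframes: Dict[int, BBox], num_frames: int) -> List[BBox]:
    """Linearly interpolate bounding boxes between keyframes, one box per frame."""
    frames = sorted(keyframes)
    if frames[0] != 0 or frames[-1] != num_frames - 1:
        raise ValueError("keyframes must cover the first and last frame")
    boxes: List[BBox] = []
    for frame in range(num_frames):
        # Find the pair of keyframes that surrounds this frame.
        nxt = min(f for f in frames if f >= frame)
        prev = max(f for f in frames if f <= frame)
        t = 0.0 if nxt == prev else (frame - prev) / (nxt - prev)
        a, b = keyframes[prev], keyframes[nxt]
        boxes.append(tuple((1 - t) * x + t * y for x, y in zip(a, b)))
    return boxes


# The subject moves from the left edge to the right edge over five frames.
boxes = interpolate_bboxes({0: (0.0, 0.3, 0.2, 0.7), 4: (0.8, 0.3, 1.0, 0.7)}, num_frames=5)
for frame, box in enumerate(boxes):
    print(frame, [round(v, 3) for v in box])
```

Growing the box between keyframes instead of only shifting it would correspond to the "movement toward the virtual camera" effect mentioned in the abstract.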
2401.00877
Report |
Improving the Stability of Diffusion Models for Content Consistent Super-Resolution |
Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, Lei Zhang |
The generative priors of pre-trained latent diffusion models have
demonstrated great potential to enhance the perceptual quality of image
super-resolution (SR) results. Unfortunately, the existing diffusion
prior-based SR methods encounter a common problem, i.e., they tend to generate
rather different outputs for the same low-resolution image with different noise
samples. Such stochasticity is desired for text-to-image generation tasks but
problematic for SR tasks, where the image contents are expected to be well
preserved. To improve the stability of diffusion prior-based SR, we propose to
employ the diffusion models to refine image structures, while employing the
generative adversarial training to enhance image fine details. Specifically, we
propose a non-uniform timestep learning strategy to train a compact diffusion
network, which has high efficiency and stability to reproduce the image main
structures, and finetune the pre-trained decoder of variational auto-encoder
(VAE) by adversarial training for detail enhancement. Extensive experiments
show that our proposed method, namely content consistent super-resolution
(CCSR), can significantly reduce the stochasticity of diffusion prior-based SR,
improving the content consistency of SR outputs and speeding up the image
generation process. Codes and models can be found at
https://github.com/csslc/CCSR. |
This paper introduces CCSR, a novel approach for image super-resolution that enhances the stability of diffusion models. |
Existing diffusion prior-based SR methods often produce inconsistent results with varying noise samples, hindering their reliability for preserving image content. |
CCSR employs a two-stage framework: a diffusion stage with a non-uniform timestep sampling strategy to refine image structures, followed by adversarial training of the VAE decoder for detail enhancement. |
CCSR significantly reduces stochasticity in SR outputs, improving content consistency.
It achieves comparable or superior performance to state-of-the-art GAN-based and diffusion-based SR methods.
CCSR exhibits faster inference speeds compared to many diffusion-based methods due to its efficient sampling strategy. |
The paper primarily focuses on visual quality and stability, with limited exploration of fidelity-perceptual trade-offs.
Future work could investigate the impact of different VAE decoder architectures and training strategies on performance. |
image super-resolution, diffusion models, generative adversarial networks, content consistency, stability |
2401.00869
Report |
FlashVideo: A Framework for Swift Inference in Text-to-Video Generation |
Bin Lei, Le Chen, Caiwen Ding |
In the evolving field of machine learning, video generation has witnessed
significant advancements with autoregressive-based transformer models and
diffusion models, known for synthesizing dynamic and realistic scenes. However,
these models often face challenges with prolonged inference times, even for
generating short video clips such as GIFs. This paper introduces FlashVideo, a
novel framework tailored for swift Text-to-Video generation. FlashVideo
represents the first successful adaptation of the RetNet architecture for video
generation, bringing a unique approach to the field. Leveraging the
RetNet-based architecture, FlashVideo reduces the time complexity of inference
from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$,
significantly accelerating inference speed. Additionally, we adopt a
redundant-free frame interpolation method, enhancing the efficiency of frame
interpolation. Our comprehensive experiments demonstrate that FlashVideo
achieves a $9.17\times$ efficiency improvement over a traditional
autoregressive-based transformer model, and its inference speed is of the same
order of magnitude as that of BERT-based transformer models. |
Introduces FlashVideo, a novel text-to-video generation framework leveraging the RetNet architecture for fast inference. |
Existing video generation models, while advanced, suffer from slow inference times, especially for longer sequences. FlashVideo addresses this by significantly improving inference speed. |
Adapts RetNet for video generation with tailored training and inference frameworks. Introduces Serial Number tokens to enhance inter-frame relationship learning. Employs a redundant-free frame interpolation method for efficiency. |
Achieves a 9.17x speed improvement over traditional autoregressive transformer models.
Demonstrates inference speeds comparable to BERT-based transformer models.
Exhibits high-quality video generation capabilities, validated through quantitative metrics (FVD, PSNR, SSIM, LPIPS) and qualitative analysis. |
Limited evaluation on high-resolution video generation.
Further exploration of the trade-off between generation speed and quality. |
video generation, text-to-video, retnet, frame interpolation, deep learning |
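The drop from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ claimed in the report above comes from the recurrent form of retention. The sketch below compares the two forms for a simplified single-head retention layer. It is an illustration, not FlashVideo's code: it omits the position rotation, normalization, and multi-scale decay of the full RetNet layer, and the function names and the decay value `gamma` are assumptions.

```python
import numpy as np


def retention_parallel(q, k, v, gamma):
    """Parallel form: builds an L x L decay mask, so cost grows as O(L^2)."""
    idx = np.arange(q.shape[0])
    diff = idx[:, None] - idx[None, :]
    # decay[n, m] = gamma^(n - m) for m <= n, else 0 (causal mask)
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return (q @ k.T * decay) @ v


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one fixed-size state update per token, O(L) overall."""
    state = np.zeros((k.shape[1], v.shape[1]))
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = gamma * state + np.outer(k_n, v_n)  # decay old content, add new token
        outputs.append(q_n @ state)
    return np.stack(outputs)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out_parallel = retention_parallel(q, k, v, gamma=0.9)
out_recurrent = retention_recurrent(q, k, v, gamma=0.9)
print(np.allclose(out_parallel, out_recurrent))
```

Both functions compute the same outputs; only the recurrent one keeps a constant-size state per generated token, which is what makes autoregressive decoding linear in sequence length.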
2401.00847
Report |
Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera |
Jiye Lee, Hanbyul Joo |
We present a lightweight and affordable motion capture method based on two
smartwatches and a head-mounted camera. In contrast to the existing approaches
that use six or more expert-level IMU devices, our approach is much more
cost-effective and convenient. Our method can make wearable motion capture
accessible to everyone everywhere, enabling 3D full-body motion capture in
diverse environments. As a key idea to overcome the extreme sparsity and
ambiguities of sensor inputs with different modalities, we integrate 6D head
poses obtained from the head-mounted cameras for motion estimation. To enable
capture in expansive indoor and outdoor scenes, we propose an algorithm to
track and update floor level changes to define head poses, coupled with a
multi-stage Transformer-based regression module. We also introduce novel
strategies leveraging visual cues of egocentric images to further enhance the
motion capture quality while reducing ambiguities. We demonstrate the
performance of our method on various challenging scenarios, including complex
outdoor environments and everyday motions including object interactions and
social interactions among multiple individuals. |
This paper introduces a novel motion capture method using two smartwatches and a head-mounted camera, making motion capture accessible and affordable. |
Current motion capture methods rely on expensive and cumbersome equipment, limiting data availability for research in human motion understanding and human-machine interaction. |
The system leverages monocular SLAM for head pose estimation, utilizes a multi-stage Transformer network to regress full-body motion from IMU and head pose data, and employs a motion optimization module with visual cues for refining the captured motion. |
Despite using only upper body sensors, the system achieves comparable or better performance than state-of-the-art methods relying on full-body IMU setups.
The proposed floor level update algorithm enables accurate motion capture in expansive environments with varying ground levels.
The motion optimization module effectively integrates visual cues from the head-mounted camera, enhancing motion capture quality, especially for subtle movements. |
The system depends on off-the-shelf models (e.g., SLAM) which may fail in rare cases.
The current method relies on a mean body shape model and could be improved by explicitly accounting for body shape variations. |
motion capture, wearable sensors, egocentric vision, human motion analysis, transformer networks |
2401.00834
Report |
Deblurring 3D Gaussian Splatting |
Byeonghyeon Lee, Howoong Lee, Xiangyu Sun, Usman Ali, Eunbyung Park |
Recent studies in Radiance Fields have paved a robust way for novel view
synthesis with their photorealistic rendering quality. Nevertheless, they
usually employ neural networks and volumetric rendering, which are costly to
train and impede their broad use in various real-time applications due to the
lengthy rendering time. Lately, a 3D Gaussian splatting-based approach has been
proposed to model the 3D scene, and it achieves remarkable visual quality while
rendering the images in real-time. However, it suffers from severe degradation
in the rendering quality if the training images are blurry. Blurriness commonly
occurs due to the lens defocusing, object motion, and camera shake, and it
inevitably intervenes in clean image acquisition. Several previous studies have
attempted to render clean and sharp images from blurry input images using
neural fields. The majority of those works, however, are designed only for
volumetric rendering-based neural radiance fields and are not straightforwardly
applicable to rasterization-based 3D Gaussian splatting methods. Thus, we
propose a novel real-time deblurring framework, Deblurring 3D Gaussian
Splatting, using a small Multi-Layer Perceptron (MLP) that manipulates the
covariance of each 3D Gaussian to model the scene blurriness. While Deblurring
3D Gaussian Splatting can still enjoy real-time rendering, it can reconstruct
fine and sharp details from blurry images. A variety of experiments have been
conducted on the benchmark, and the results have revealed the effectiveness of
our approach for deblurring. Qualitative results are available at
https://benhenryl.github.io/Deblurring-3D-Gaussian-Splatting/ |
This paper presents Deblurring 3D-GS, the first real-time deblurring framework for 3D Gaussian Splatting (3D-GS), which modifies the covariance of each 3D Gaussian using a small MLP to model scene blurriness. |
Existing neural radiance field methods for deblurring either rely on time-consuming volumetric rendering or address only specific types of blur, hindering real-time applications. |
The method manipulates covariance matrices of 3D Gaussians during training to simulate blur, expanding dispersion for defocus blur and averaging shifted Gaussians for motion blur. At inference, it renders sharp images using unmodified Gaussians without MLP activation. |
Achieves state-of-the-art or competitive rendering quality on real and synthetic datasets with defocus and motion blur.
Significantly faster rendering speed (> 800 FPS) compared to existing deblurring NeRF models.
Proposed techniques for densifying sparse point clouds and depth-based pruning enhance reconstruction of fine details, especially at far plane. |
Extending existing NeRF deblurring methods to rasterization-based 3D-GS is not optimal.
Exploring compatibility with other 3D scene representations beyond 3D Gaussians is a potential future direction. |
neural radiance fields, deblurring, real-time rendering, 3d gaussian splatting, point cloud |
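The covariance manipulation described in the report above can be illustrated with a small sketch. It assumes the usual 3D Gaussian splatting parameterization $\Sigma = R S S^\top R^\top$ and models defocus blur by multiplying the scales with factors of at least 1. In the paper these factors come from a small MLP; here `delta_scale` is a hypothetical constant array, and rotation adjustments are left out.

```python
import numpy as np


def covariance(rotation: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """3D Gaussian covariance Sigma = R S S^T R^T with S = diag(scale)."""
    s = np.diag(scale)
    return rotation @ s @ s.T @ rotation.T


def blurred_covariance(rotation, scale, delta_scale):
    """Training-time covariance: scales are multiplied by factors >= 1."""
    delta = np.maximum(delta_scale, 1.0)   # clamp so the Gaussian can only grow
    return covariance(rotation, scale * delta)


theta = np.deg2rad(30.0)
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0,            0.0,           1.0]])
scale = np.array([0.1, 0.2, 0.05])
delta_scale = np.array([1.5, 1.0, 0.7])    # the 0.7 entry is clamped to 1.0

sharp = covariance(rotation, scale)                         # used at inference
blurred = blurred_covariance(rotation, scale, delta_scale)  # used to match blurry training images
print(np.linalg.det(blurred) / np.linalg.det(sharp))
```

Rendering with `sharp` at test time matches the report's statement that inference uses the unmodified Gaussians without the MLP.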
2401.00825
Report |
Sharp-NeRF: Grid-based Fast Deblurring Neural Radiance Fields Using Sharpness Prior |
Byeonghyeon Lee, Howoong Lee, Usman Ali, Eunbyung Park |
Neural Radiance Fields (NeRF) have shown remarkable performance in neural
rendering-based novel view synthesis. However, NeRF suffers from severe visual
quality degradation when the input images have been captured under imperfect
conditions, such as poor illumination, defocus blurring, and lens aberrations.
Especially, defocus blur is quite common in the images when they are normally
captured using cameras. Although a few recent studies have proposed to render
sharp images of considerably high quality, they still face many key
challenges. In particular, those methods have employed a Multi-Layer Perceptron
(MLP) based NeRF, which requires tremendous computational time. To overcome
these shortcomings, this paper proposes a novel technique Sharp-NeRF -- a
grid-based NeRF that renders clean and sharp images from the input blurry
images within half an hour of training. To do so, we used several grid-based
kernels to accurately model the sharpness/blurriness of the scene. The
sharpness level of the pixels is computed to learn the spatially varying blur
kernels. We have conducted experiments on the benchmarks consisting of blurry
images and have evaluated full-reference and non-reference metrics. The
qualitative and quantitative results have revealed that our approach renders
the sharp novel views with vivid colors and fine details, and it has
considerably faster training time than the previous works. Our project page is
available at https://benhenryl.github.io/SharpNeRF/ |
This paper proposes Sharp-NeRF, a fast grid-based NeRF framework for rendering sharp images from blurry inputs using discrete learnable blur kernels and a sharpness prior. |
Existing NeRF-based deblurring methods suffer from long training times due to their reliance on computationally expensive MLPs. |
Sharp-NeRF leverages a decomposed-grid representation for neural fields and introduces discrete learnable kernels optimized directly without requiring additional networks. A sharpness prior based on pre-computed per-pixel sharpness levels guides the assignment of blur kernels to groups of pixels with similar blurriness. Random patch sampling further accelerates training by reducing the number of rendered rays. |
Sharp-NeRF achieves comparable or better image quality compared to state-of-the-art deblurring NeRF models.
Sharp-NeRF achieves significantly faster training times, completing training in under half an hour.
The use of a sharpness prior and discrete learnable kernels is shown to be crucial for achieving high-quality deblurring results. |
The current implementation of Sharp-NeRF is designed specifically for defocus blur and may not generalize well to other types of blur, such as motion blur.
The sharpness prior is pre-computed and does not account for potential changes in blurriness during training. |
neural radiance fields, deblurring, image restoration, grid-based representations, sharpness prior |
2401.00736
Report |
Diffusion Models, Image Super-Resolution And Everything: A Survey |
Brian B. Moser, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Sebastian Palacio, Andreas Dengel |
Diffusion Models (DMs) have disrupted the image Super-Resolution (SR) field
and further closed the gap between image quality and human perceptual
preferences. They are easy to train and can produce very high-quality samples
that exceed the realism of those produced by previous generative methods.
Despite their promising results, they also come with new challenges that need
further research: high computational demands, comparability, lack of
explainability, color shifts, and more. Unfortunately, entry into this field is
overwhelming because of the abundance of publications. To address this, we
provide a unified recount of the theoretical foundations underlying DMs applied
to image SR and offer a detailed analysis that underscores the unique
characteristics and methodologies within this domain, distinct from broader
existing reviews in the field. This survey articulates a cohesive understanding
of DM principles and explores current research avenues, including alternative
input domains, conditioning techniques, guidance mechanisms, corruption spaces,
and zero-shot learning approaches. By offering a detailed examination of the
evolution and current trends in image SR through the lens of DMs, this survey
sheds light on the existing challenges and charts potential future directions,
aiming to inspire further innovation in this rapidly advancing area. |
This paper presents a comprehensive survey of Diffusion Models (DMs) for image Super-Resolution (SR), summarizing their theoretical foundations and analyzing their unique characteristics within this domain. |
DMs have shown groundbreaking potential in image SR, exceeding the realism of previous generative methods and challenging GAN-based approaches. |
The paper discusses different types of DMs (DDPMs, SGMs, SDEs), their relationship to other generative models, and improvements like efficient sampling techniques. It further explores concrete realizations of DMs in SR, alternative input domains, conditioning and guidance strategies, and zero-shot learning approaches. |
DMs, particularly DDPMs, have become a dominant force in image SR, demonstrating superior perceptual quality.
Alternative input domains like latent space and wavelet domain offer computational advantages and enhance control over image features.
Zero-shot SR methods using pre-trained DMs show promising results, enabling SR without prior image examples. |
The computational cost of DMs remains a significant hurdle for wider adoption and practical applications.
Further research is needed to develop standardized benchmarks and evaluation metrics specifically designed for comparing generative SR models like DMs. |
image super-resolution, diffusion models, generative models, deep learning, computer vision |
2401.00616
Report |
GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields |
Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang |
In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task
which targets synthesizing photo-realistic novel views given only one reference
image per scene. Previous One-shot Generalizable Neural Radiance Fields
(OG-NeRF) methods solve this task in an inference-time finetuning-free manner,
yet suffer from blurry outputs due to the encoder-only architecture that highly
relies on the limited reference image. On the other hand, recent
diffusion-based image-to-3d methods show vivid plausible results via distilling
pre-trained 2D diffusion models into a 3D representation, yet require tedious
per-scene optimization. Targeting these issues, we propose the GD$^2$-NeRF, a
Generative Detail compensation framework via GAN and Diffusion that is both
inference-time finetuning-free and with vivid plausible details. In detail,
following a coarse-to-fine strategy, GD$^2$-NeRF is mainly composed of a
One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer
(Diff3DE). At the coarse stage, OPP first efficiently inserts the GAN model
into the existing OG-NeRF pipeline for primarily relieving the blurry issue
with in-distribution priors captured from the training dataset, achieving a
good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at
the fine stage, Diff3DE further leverages the pre-trained image diffusion
models to complement rich out-distribution details while maintaining decent 3D
consistency. Extensive experiments on both the synthetic and real-world
datasets show that GD$^2$-NeRF noticeably improves the details without
per-scene finetuning. |
GD$^2$-NeRF is a novel coarse-to-fine generative detail compensation framework that hierarchically incorporates GAN and pre-trained diffusion models into OG-NeRF for One-shot Novel View Synthesis (O-NVS). |
Existing OG-NeRF methods for O-NVS, while inference-time finetuning-free, struggle with blurry outputs due to their reliance on limited information from reference images. |
GD$^2$-NeRF consists of two stages: 1) One-stage Parallel Pipeline (OPP) injects a GAN model into the OG-NeRF pipeline to address blurriness using in-distribution priors, and 2) Diffusion-based 3D-consistent Enhancer (Diff3DE) leverages pre-trained image diffusion models to complement rich out-distribution details. |
OPP effectively relieves blurriness while maintaining fidelity, achieving a good balance between sharpness and fidelity.
Diff3DE further enhances details with out-distribution priors while ensuring 3D consistency.
GD$^2$-NeRF significantly improves detail and consistency compared to previous OG-NeRF methods and Zero123-NVS on both synthetic and real-world datasets. |
The denoising process in Diff3DE, like many diffusion-based methods, is computationally inefficient.
Diff3DE primarily enhances existing details and may not correct significant geometry errors in the input. |
one-shot novel view synthesis, generalizable neural radiance fields, 3d reconstruction, generative adversarial networks, diffusion models |
2401.00604
Report |
SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity |
Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra |
Score distillation has emerged as one of the most prevalent approaches for
text-to-3D asset synthesis. Essentially, score distillation updates 3D
parameters by lifting and back-propagating scores averaged over different
views. In this paper, we reveal that the gradient estimation in score
distillation inherently suffers from high variance. Through the lens of variance
reduction, the effectiveness of SDS and VSD can be interpreted as applications
of various control variates to the Monte Carlo estimator of the distilled
score. Motivated by this rethinking and based on Stein's identity, we propose a
more general solution to reduce variance for score distillation, termed Stein
Score Distillation (SSD). SSD incorporates control variates constructed by
Stein identity, allowing for arbitrary baseline functions. This enables us to
include flexible guidance priors and network architectures to explicitly
optimize for variance reduction. In our experiments, the overall pipeline,
dubbed SteinDreamer, is implemented by instantiating the control variate with a
monocular depth estimator. The results suggest that SSD can effectively reduce
the distillation variance and consistently improve visual quality for both
object- and scene-level generation. Moreover, we demonstrate that SteinDreamer
achieves faster convergence than existing methods due to more stable gradient
updates. |
This paper introduces Stein Score Distillation (SSD), a novel variance reduction approach for text-to-3D score distillation, enabling improved quality and faster convergence in 3D asset synthesis. |
Score distillation methods suffer from high variance in gradient estimation due to noisy denoising and small batch sizes, leading to slow convergence and suboptimal 3D generation results. |
SSD leverages Stein's identity to construct flexible control variates, incorporating arbitrary baseline functions (e.g., depth/normal estimators) to reduce variance in score distillation. |
SSD, implemented as SteinDreamer, effectively reduces variance and improves visual quality in both object and scene-level 3D generation compared to DreamFusion and ProlificDreamer.
SteinDreamer generates 3D assets with finer details, smoother geometry, and fewer artifacts like Janus and ghosting.
The method accelerates convergence by 14%-22%, requiring fewer diffusion model calls to achieve text-aligned 3D results. |
Excessive variance reduction may lead to loss of detail in background regions.
Future work includes exploring alternative baseline functions to further enhance SSD's performance. |
text-to-3d generation, score distillation, variance reduction, stein's method, diffusion models |
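The control-variate mechanism behind the report above can be shown on a one-dimensional toy problem. The sketch below is unrelated to 3D generation and is not the authors' code: it only demonstrates that a term built from Stein's identity has zero mean and can lower the variance of a Monte Carlo estimate. The integrand `f` and the baseline `phi(x) = x` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)      # samples from p = N(0, 1)

f = x**2 + np.sin(x)                  # integrand; E[f(x)] = 1 under N(0, 1)

# Stein's identity for p = N(0, 1): E[phi'(x) + phi(x) * d/dx log p(x)] = 0,
# with d/dx log p(x) = -x. The baseline phi(x) = x gives the term 1 - x^2.
phi, dphi, score = x, np.ones_like(x), -x
stein_term = dphi + phi * score       # zero mean by Stein's identity

plain = f
controlled = f + stein_term           # same expectation, different variance

print("plain:     ", plain.mean(), plain.var())
print("controlled:", controlled.mean(), controlled.var())
```

Here the baseline happens to cancel the `x**2` part of the integrand, so the variance drops sharply. In SSD the baseline is a learned or pretrained function (a monocular depth estimator in SteinDreamer), so the reduction depends on how well it correlates with the distilled score.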
2401.00551
Report |
A Generalist FaceX via Learning Unified Facial Representation |
Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, Ying Tai |
This work presents the FaceX framework, a novel facial generalist model capable
of handling diverse facial tasks simultaneously. To achieve this goal, we
initially formulate a unified facial representation for a broad spectrum of
facial editing tasks, which macroscopically decomposes a face into fundamental
identity, intra-personal variation, and environmental factors. Based on this,
we introduce Facial Omni-Representation Decomposing (FORD) for seamless
manipulation of various facial components, microscopically decomposing the core
aspects of most facial editing tasks. Furthermore, by leveraging the prior of a
pretrained StableDiffusion (SD) to enhance generation quality and accelerate
training, we design Facial Omni-Representation Steering (FORS) to first
assemble unified facial representations and then effectively steer the SD-aware
generation process by the efficient Facial Representation Controller (FRC).
Without any additional features, our versatile FaceX achieves competitive
performance compared to elaborate task-specific models on popular facial
editing tasks. Full code and models will be available at
https://github.com/diffusion-facex/FaceX. |
This paper introduces FaceX, the first unified generalist model for diverse facial editing tasks. |
Existing facial editing methods are often task-specific and lack versatility. FaceX aims to address this limitation by providing a single model capable of performing various tasks like face swapping, reenactment, and attribute editing. |
FaceX decomposes facial images into identity, intra-personal variation (motion, texture, hair), and environmental factors. It leverages a pre-trained Stable Diffusion model, guided by assembled facial representations through a novel Facial Representation Controller (FRC). |
FaceX achieves competitive performance on popular tasks like face reenactment and swapping compared to task-specific methods.
It demonstrates strong capabilities in head swapping, outperforming state-of-the-art methods in terms of image quality and efficiency.
The model exhibits versatility by enabling progressive editing across different tasks and extending to animation and inpainting. |
While offering a unified framework, FaceX may be slightly suboptimal for specific tasks compared to dedicated approaches.
The paper acknowledges the potential for misuse and emphasizes the need for parallel development of forgery detection methods. |
facial editing, diffusion models, generalist model, stable diffusion, facial representation learning |
2401.00431
Report |
Wild2Avatar: Rendering Humans Behind Occlusions |
Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli |
Rendering the visual appearance of moving humans from occluded monocular
videos is a challenging task. Most existing research renders 3D humans under
ideal conditions, requiring a clear and unobstructed scene. Those methods
cannot be used to render humans in real-world scenes where obstacles may block
the camera's view and lead to partial occlusions. In this work, we present
Wild2Avatar, a neural rendering approach catered for occluded in-the-wild
monocular videos. We propose occlusion-aware scene parameterization for
decoupling the scene into three parts - occlusion, human, and background.
Additionally, extensive objective functions are designed to help enforce the
decoupling of the human from both the occlusion and the background and to
ensure the completeness of the human model. We verify the effectiveness of our
approach with experiments on in-the-wild videos. |
This paper presents Wild2Avatar, a novel neural rendering method designed for generating high-fidelity 3D human avatars from in-the-wild monocular videos containing occlusions. |
Existing human rendering methods struggle with real-world occlusions due to a lack of ground-truth supervision and limitations in handling occluded 3D points. |
The method utilizes occlusion-aware scene parameterization to decouple the scene into three parts: occlusion, human, and background. It models each part with separate neural radiance fields and employs a combination of photometric, decomposition, occlusion decoupling, and geometry completeness losses for optimization. |
Wild2Avatar effectively decouples occlusions from the human body, enabling complete and high-fidelity human renderings even in the presence of obstacles.
The method demonstrates superior performance compared to state-of-the-art methods like Vid2Avatar, particularly in reconstructing occluded body parts and maintaining geometric consistency.
Quantitative evaluations using metrics such as PSNR, IoU, and a novel LLM-based quality assessment confirm the effectiveness of Wild2Avatar in handling occlusions and generating high-quality renderings. |
The method's reliance on accurate pose estimations can impact rendering quality, particularly for inaccurate priors.
Rendering occlusions increases inference time, leading to a slower optimization process. |
human rendering, neural radiance fields, occlusion handling, monocular video, scene decomposition |
2401.00374
Report |
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling |
Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black |
We propose EMAGE, a framework to generate full-body human gestures from audio
and masked gestures, encompassing facial, local body, hands, and global
movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new
mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with
FLAME head parameters and further refines the modeling of head, neck, and
finger movements, offering a community-standardized, high-quality 3D motion
captured dataset. EMAGE leverages masked body gesture priors during training to
boost inference performance. It involves a Masked Audio Gesture Transformer,
facilitating joint training on audio-to-gesture generation and masked gesture
reconstruction to effectively encode audio and body gesture hints. Encoded body
hints from masked gestures are then separately employed to generate facial and
body movements. Moreover, EMAGE adaptively merges speech features from the
audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance
the results' fidelity and diversity. Experiments demonstrate that EMAGE
generates holistic gestures with state-of-the-art performance and is flexible
in accepting predefined spatial-temporal gesture inputs, generating complete,
audio-synchronized results. Our code and dataset are available at
https://pantomatrix.github.io/EMAGE/ |
Introduces EMAGE, a framework for generating full-body human gestures from audio and masked gestures, and BEAT2, a new mesh-level holistic co-speech gesture dataset. |
Addresses the limitations of existing datasets and models for generating realistic and expressive full-body co-speech gestures, aiming to improve coherence and cross-modal alignment between audio and motion. |
Presents BEAT2, combining SMPL-X body with FLAME head parameters, and EMAGE, using masked body gesture priors and a Masked Audio Gesture Transformer to generate gestures from audio and masked gesture input. |
EMAGE generates holistic gestures with state-of-the-art performance.
EMAGE accepts predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results.
BEAT2 provides a standardized, high-quality 3D motion captured dataset for co-speech gesture generation. |
EMAGE's performance may be influenced by the quality of the input audio and masked gestures.
Future work could explore more sophisticated methods for fusing audio and gesture information, potentially leading to even more expressive and realistic results. |
co-speech gesture generation, masked representation learning, holistic gesture dataset, smpl-x, flame |
2401.00370
Report |
UGPNet: Universal Generative Prior for Image Restoration |
Hwayoon Lee, Kyoungkook Kang, Hyeongmin Lee, Seung-Hwan Baek, Sunghyun Cho |
Recent image restoration methods can be broadly categorized into two classes:
(1) regression methods that recover the rough structure of the original image
without synthesizing high-frequency details and (2) generative methods that
synthesize perceptually-realistic high-frequency details even though the
resulting image deviates from the original structure of the input. While both
directions have been extensively studied in isolation, merging their benefits
within a single framework has rarely been studied. In this paper, we propose
UGPNet, a universal image restoration framework that can effectively achieve
the benefits of both approaches by simply adopting a pair of an existing
regression model and a generative model. UGPNet first restores the image
structure of a degraded input using a regression model and synthesizes a
perceptually-realistic image with a generative model on top of the regressed
output. UGPNet then combines the regressed output and the synthesized output,
resulting in a final result that faithfully reconstructs the structure of the
original image in addition to perceptually-realistic textures. Our extensive
experiments on deblurring, denoising, and super-resolution demonstrate that
UGPNet can successfully exploit both regression and generative methods for
high-fidelity image restoration. |
This paper presents UGPNet, a universal image restoration framework that combines the strengths of regression-based and generative prior-based restoration methods. |
Existing methods either excel at recovering image structure (regression-based) or synthesizing realistic high-frequency details (generative prior-based), but not both. UGPNet aims to bridge this gap, enabling high-fidelity image restoration with realistic textures. |
UGPNet leverages a three-module system: (1) a restoration module (flexible choice of network) recovers the original image structure, (2) a synthesis module (based on GAN inversion) generates high-frequency details, and (3) a fusion module combines the features from both modules to produce the final restored image. |
UGPNet demonstrates the ability to flexibly integrate diverse regression networks.
Compared to solely regression-based or generative prior-based methods, UGPNet achieves superior performance on deblurring, denoising, and super-resolution tasks.
UGPNet shows robustness in restoring out-of-distribution images compared to generative prior-based methods. |
UGPNet's performance depends on the accuracy of the chosen regression method.
While achieving high fidelity, UGPNet's sharpness might be less pronounced than its backbone generative model (StyleGAN2). |
image restoration, generative prior, deep learning, deblurring, denoising, super-resolution |
2401.00254
Report |
Morphing Tokens Draw Strong Masked Image Models |
Taekyung Kim, Byeongho Heo, Dongyoon Han |
Masked image modeling (MIM) is a promising option for training Vision
Transformers among various self-supervised learning (SSL) methods. The essence
of MIM lies in token-wise masked token predictions, with targets tokenized from
images or generated by pre-trained models such as vision-language models. While
tokenizers or pre-trained models are plausible MIM targets, they often offer
spatially inconsistent targets even for neighboring tokens, making it harder
for models to learn unified discriminative representations. Our pilot study confirms that
addressing spatial inconsistencies has the potential to enhance representation
quality. Motivated by the findings, we introduce a novel self-supervision
signal called Dynamic Token Morphing (DTM), which dynamically aggregates
contextually related tokens to yield contextualized targets. DTM is compatible
with various SSL frameworks; we showcase an improved MIM by employing DTM,
barely introducing extra training costs. Our experiments on ImageNet-1K and
ADE20K demonstrate the superiority of our methods compared with
state-of-the-art, complex MIM methods. Furthermore, the comparative evaluation
on the iNaturalist and fine-grained visual classification datasets further
validates the transferability of our method on various downstream tasks. Code
is available at https://github.com/naver-ai/dtm |
This paper introduces Dynamic Token Morphing (DTM), a novel masked image modeling method for Vision Transformers that addresses the spatial inconsistency problem in token-level supervision. |
Pre-trained models often generate spatially inconsistent token representations, which can disrupt representation learning and lead to suboptimal performance. |
DTM dynamically aggregates contextually related tokens using bipartite matching to create diverse and highly contextualized target representations for masked image modeling. |
DTM consistently improves fine-tuning accuracies across various SSL frameworks (MAE, BEiT v2, BYOL) and ViT scales (S/16, B/16, L/16).
The method surpasses state-of-the-art performance on ImageNet-1K and ADE20K datasets, demonstrating its effectiveness for image classification and semantic segmentation.
DTM enhances transferability and tuning robustness, as demonstrated by superior performance on iNaturalist and fine-grained visual classification datasets. |
The paper primarily focuses on ViT architectures and does not explore the application of DTM to other vision models like CNNs.
The study's computational limitations restricted the evaluation of DTM to ViT-L/16, leaving its performance on larger-scale models like ViT-G unexplored. |
masked image modeling, self-supervised learning, vision transformers, token aggregation, spatial inconsistency |
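The token aggregation step described for DTM can be sketched with a simple bipartite matching over token features. This is a minimal sketch under stated assumptions, not the authors' code: it assumes an alternating split of tokens into source and destination sets, cosine similarity as the matching score, and plain averaging within each merged group. The function names and the parameter `r` (number of merges) are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def bipartite_morph(tokens, r):
    # Split tokens into two sets (even indices = sources, odd = destinations).
    src = list(range(0, len(tokens), 2))
    dst = list(range(1, len(tokens), 2))
    # Each source token proposes its most similar destination token.
    edges = []
    for i in src:
        j = max(dst, key=lambda d: cosine(tokens[i], tokens[d]))
        edges.append((cosine(tokens[i], tokens[j]), i, j))
    edges.sort(reverse=True)
    # Keep the r strongest edges and fold each source into its destination.
    groups = {k: [k] for k in range(len(tokens))}
    for _, i, j in edges[:r]:
        groups[j].append(i)
        del groups[i]
    # The morphed token of a group is the mean of its members.
    morphed = {}
    for rep, members in groups.items():
        dim = len(tokens[rep])
        morphed[rep] = [sum(tokens[m][d] for m in members) / len(members) for d in range(dim)]
    return groups, morphed


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups, morphed = bipartite_morph(tokens, r=2)
```

In a masked image modeling setup, the same grouping would be applied to both the predicted tokens and the target tokens so that the loss compares aggregated, contextualized representations rather than individual tokens.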
2401.00208
Report |
Inpaint4DNeRF: Promptable Spatio-Temporal NeRF Inpainting with Generative Diffusion Models |
Han Jiang, Haosen Sun, Ruoxuan Li, Chi-Keung Tang, Yu-Wing Tai |
Current Neural Radiance Fields (NeRF) can generate photorealistic novel
views. For editing 3D scenes represented by NeRF, with the advent of generative
models, this paper proposes Inpaint4DNeRF to capitalize on state-of-the-art
stable diffusion models (e.g., ControlNet) for direct generation of the
underlying completed background content, whether the scene is static or dynamic. The
key advantages of this generative approach for NeRF inpainting are twofold.
First, after rough mask propagation, to complete or fill in previously occluded
content, we can individually generate a small subset of completed images with
plausible content, called seed images, from which simple 3D geometry proxies
can be derived. Second, the remaining problem is thus 3D multiview
consistency among all completed images, now guided by the seed images and their
3D proxies. Without other bells and whistles, our generative Inpaint4DNeRF
baseline framework is general and can be readily extended to 4D dynamic
NeRFs, where temporal consistency can be naturally handled in a similar way as
our multiview consistency. |
Presents Inpaint4DNeRF, a novel framework for text-guided generative inpainting of Neural Radiance Fields (NeRFs), enabling the replacement of existing objects with new, semantically relevant content while maintaining 3D and 4D consistency. |
Addresses the limitations of current NeRF editing techniques that struggle to generate new content consistent with the existing background, bridging the gap between 2D image inpainting and 3D/4D scene manipulation. |
Employs a three-stage approach: 1) pre-processes training images by inpainting a few seed views and propagating them to other views using stable diffusion, 2) fine-tunes the NeRF with iterative dataset update to enforce multiview consistency, and 3) extends to 4D by propagating the inpainted content temporally. |
Generates novel 3D content within existing NeRFs that aligns with user-provided text prompts.
Maintains multiview consistency, ensuring the generated object appears seamless from different viewpoints.
Demonstrates potential for 4D dynamic NeRF inpainting by propagating edits across frames while maintaining temporal consistency. |
Limited capacity to handle complex geometry generation with wide camera angles.
Further improvement needed in consistency and temporal coherence for 4D inpainting. |
generative inpainting, neural radiance fields, nerf editing, diffusion models, 4d dynamic nerfs |
2401.00110
Report |
Diffusion Model with Perceptual Loss |
Shanchuan Lin, Xiao Yang |
Diffusion models trained with mean squared error loss tend to generate
unrealistic samples. Current state-of-the-art models rely on classifier-free
guidance to improve sample quality, yet its surprising effectiveness is not
fully understood. In this paper, we show that the effectiveness of
classifier-free guidance partly originates from it being a form of implicit
perceptual guidance. As a result, we can directly incorporate perceptual loss
in diffusion training to improve sample quality. Since the score matching
objective used in diffusion training strongly resembles the denoising
autoencoder objective used in unsupervised training of perceptual networks, the
diffusion model itself is a perceptual network and can be used to generate
meaningful perceptual loss. We propose a novel self-perceptual objective that
results in diffusion models capable of generating more realistic samples. For
conditional generation, our method only improves sample quality without
entanglement with the conditional input and therefore does not sacrifice sample
diversity. Our method can also improve sample quality for unconditional
generation, which was not possible with classifier-free guidance before. |
This paper proposes a novel self-perceptual objective for diffusion model training that leverages the model itself as a perceptual network to improve the realism of generated samples. |
Diffusion models often produce unrealistic samples when trained with standard mean squared error loss. While classifier-free guidance methods have addressed this, they are limited to conditional generation and can negatively impact sample diversity. |
The authors freeze a pre-trained diffusion model and use it as a perceptual loss network. During training, the online diffusion model predicts the denoised image and noise, which are then used to generate a reconstructed image at a random timestep. The perceptual loss is calculated by comparing the hidden features of the reconstructed and ground truth images from the frozen model. |
The self-perceptual objective improves both the Fréchet Inception Distance (FID) and Inception Score (IS) compared to models trained solely with MSE loss.
Qualitative results demonstrate enhanced sample quality with the proposed method, generating more realistic images than MSE alone.
Unlike classifier-free guidance, the self-perceptual objective can be applied to unconditional image generation, leading to improvements in this domain. |
While the self-perceptual objective improves sample quality, it does not yet surpass the performance of classifier-free guidance combined with MSE loss for text-to-image generation.
Future work involves exploring the combination of the self-perceptual objective with other guidance techniques and investigating its application across various modalities beyond images. |
diffusion models, perceptual loss, image generation, classifier-free guidance, self-supervision |
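The self-perceptual objective summarized above can be written out as a short sketch. The notation is an assumption and is not taken from the paper: \alpha_t and \sigma_t denote the signal and noise scales of the forward process, f_\theta is the online model, h_{\theta^\ast} denotes hidden features of the frozen copy, and t' is the second, randomly sampled timestep. Reusing the same noise \epsilon for the ground-truth sample at t' is also an assumption.

```latex
\begin{aligned}
x_t &= \alpha_t x_0 + \sigma_t \epsilon, \qquad (\hat{x}_0, \hat{\epsilon}) = f_\theta(x_t, t), \\
x_{t'} &= \alpha_{t'} x_0 + \sigma_{t'} \epsilon, \qquad \hat{x}_{t'} = \alpha_{t'} \hat{x}_0 + \sigma_{t'} \hat{\epsilon}, \\
\mathcal{L}_{\mathrm{sp}} &= \bigl\lVert h_{\theta^\ast}(\hat{x}_{t'}, t') - h_{\theta^\ast}(x_{t'}, t') \bigr\rVert_2^2 .
\end{aligned}
```

Under this reading, gradients flow only through the online model's predictions, while the frozen network serves purely as a feature extractor.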
2401.00094
Report |
Generating Enhanced Negatives for Training Language-Based Object Detectors |
Shiyu Zhao, Long Zhao, Vijay Kumar B. G, Yumin Suh, Dimitris N. Metaxas, Manmohan Chandraker, Samuel Schulter |
The recent progress in language-based open-vocabulary object detection can be
largely attributed to finding better ways of leveraging large-scale data with
free-form text annotations. Training such models with a discriminative
objective function has proven successful, but requires good positive and
negative samples. However, the free-form nature and the open vocabulary of
object descriptions make the space of negatives extremely large. Prior works
randomly sample negatives or use rule-based techniques to build them. In
contrast, we propose to leverage the vast knowledge built into modern
generative models to automatically build negatives that are more relevant to
the original data. Specifically, we use large language models to generate
negative text descriptions, and text-to-image diffusion models to also generate
corresponding negative images. Our experimental analysis confirms the relevance
of the generated negative data, and its use in language-based detectors
improves performance on two complex benchmarks. Code is available at
https://github.com/xiaofeng94/Gen-Enhanced-Negs. |
This paper proposes a novel method to automatically generate relevant negative text descriptions and corresponding negative images to improve the training of language-based object detectors. |
Negative samples are crucial for training discriminative models, and existing methods for generating negatives for language-based object detection are limited in relevance and scope. This method addresses the need for better negative samples in this field. |
The methodology involves utilizing large language models (LLMs) to generate negative text descriptions through techniques like concept-foiling and recombination. Furthermore, text-to-image diffusion models are employed to create negative images based on the generated texts, incorporating noise mitigation strategies. |
Adding the generated negative data during training consistently improves the performance of language-based object detectors on OmniLabel and D³ benchmarks.
The analysis shows that LLM-generated negative texts are more diverse and capture more complex relationships than rule-based methods.
Generated negative images, after filtering, provide a complementary training signal, further enhancing the accuracy of language-based object detection, especially on the OmniLabel benchmark. |
The quality of generated negative images depends on the capabilities of current text-to-image generation models, which can still produce noisy or unrealistic outputs.
The current approach focuses on generating negatives for individual object descriptions, and future work could explore generating negatives for a set of descriptions within an image. |
language-based object detection, negative sample generation, large language models, text-to-image synthesis, computer vision |
2401.00027
Report |
Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring |
Xin Gao, Tianheng Qiu, Xinyu Zhang, Hanlin Bai, Kang Liu, Xuan Huang, Hu Wei, Guoying Zhang, Huaping Liu |
Coarse-to-fine schemes are widely used in traditional single-image motion
deblurring; however, in the context of deep learning, existing multi-scale
algorithms not only require the use of complex modules for feature fusion of
low-scale RGB images and deep semantics, but also manually generate
low-resolution pairs of images that do not have sufficient confidence. In this
work, we propose a multi-scale network based on single-input and
multiple-outputs (SIMO) for motion deblurring. This reduces the complexity of
algorithms based on a coarse-to-fine scheme. To alleviate restoration defects
impacting detail information brought about by using a multi-scale architecture,
we combine the characteristics of real-world blurring trajectories with a
learnable wavelet transform module to focus on the directional continuity and
frequency features of the step-by-step transitions between blurred images to
sharp images. In conclusion, we propose a multi-scale network with a learnable
discrete wavelet transform (MLWNet), which exhibits state-of-the-art
performance on multiple real-world deblurring datasets, in terms of both
subjective and objective quality as well as computational efficiency. |
This paper introduces MLWNet, a novel single-input multi-output (SIMO) multi-scale network incorporating a learnable discrete wavelet transform (DWT) for superior motion deblurring in images. |
Existing deep learning deblurring methods, particularly those using coarse-to-fine schemes, often suffer from high complexity, rely on unreliable manually downsampled images, and struggle to restore high-frequency details. MLWNet addresses these limitations, aiming for enhanced efficiency and detail restoration. |
MLWNet employs a SIMO architecture, taking a single image as input and progressively generating sharper outputs at different scales. It features learnable wavelet transform nodes (LWNs) within its structure to effectively capture directional continuity and frequency features for improved detail restoration. The training incorporates a multi-scale loss and a wavelet loss to ensure both pixel-level accuracy and proper wavelet kernel learning. |
MLWNet achieves state-of-the-art performance on real-world deblurring datasets (RealBlur, RSBlur), exceeding previous benchmarks in PSNR and SSIM.
The method demonstrates superior detail restoration, particularly in low-light conditions, compared to competing algorithms.
It exhibits strong generalization ability, evidenced by its performance on unseen real-world blurry images. |
While excelling in realistic blur, MLWNet's performance on synthetic datasets doesn't reach the same level, potentially due to the nature of synthetic blur and its differences from real-world scenarios.
Future exploration could focus on adapting the learnable DWT module for improved handling of noise and high-frequency artifacts in synthetic blur. |
image deblurring, deep learning, multi-scale network, discrete wavelet transform, simo |
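The discrete wavelet transform underlying MLWNet's learnable wavelet nodes can be illustrated in one dimension. The sketch below is not the paper's module: it assumes a single decomposition level, periodic boundary handling, a high-pass filter derived from the low-pass filter by the quadrature-mirror rule, and a simple penalty on two standard low-pass conditions as a stand-in for a wavelet loss. The Haar filter is used as a plausible initialization; all function names are hypothetical.

```python
import math


def qmf_highpass(lo):
    # Derive the high-pass filter from the low-pass one (quadrature-mirror rule).
    n = len(lo)
    return [((-1) ** k) * lo[n - 1 - k] for k in range(n)]


def dwt1d(signal, lo):
    # One-level DWT: filter with periodic extension, then keep every second output.
    hi = qmf_highpass(lo)
    n = len(signal)
    approx, detail = [], []
    for start in range(0, n, 2):
        window = [signal[(start + k) % n] for k in range(len(lo))]
        approx.append(sum(w * c for w, c in zip(window, lo)))
        detail.append(sum(w * c for w, c in zip(window, hi)))
    return approx, detail


def lowpass_penalty(lo):
    # Squared violation of two low-pass conditions: coefficients sum to sqrt(2)
    # and have unit energy. A learnable filter could be regularized toward these.
    return (sum(lo) - math.sqrt(2)) ** 2 + (sum(c * c for c in lo) - 1.0) ** 2


haar = [1 / math.sqrt(2), 1 / math.sqrt(2)]
approx, detail = dwt1d([4.0, 4.0, 2.0, 0.0], haar)
```

With the Haar filter, the approximation band holds scaled pairwise sums and the detail band holds scaled pairwise differences, so flat regions produce zero detail coefficients. A learnable variant would treat the entries of `lo` as trainable parameters and add a penalty of this kind to the training loss.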